Closed Bug 588229 Opened 10 years ago Closed 9 years ago
_setup fails out on non-existent staging repos
The delete_repo step in the repo_setup factory fails (but doesn't halt) if the repo doesn't exist. This means a first-run staging release can often fail. Staging releases would be less annoying to run if this step were smarter about it (checked the URL, maybe? http://hg.mozilla.org/users/stage-ffxbld/foobarbaz gives both "Not found: foobarbaz" and "The specified repository "foobarbaz" is unknown, sorry." in its response page).
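A minimal sketch of the URL check suggested above, assuming the two messages quoted in this comment are what a missing repo's page contains. The function name `looks_like_missing_repo` is hypothetical, and it takes the page body as a string so it makes no claim about how the page is fetched or what HTTP status it returns.

```python
def looks_like_missing_repo(body: str, repo: str) -> bool:
    """Return True if the page body contains either "missing repo" message.

    The two marker strings are the ones quoted in the bug description;
    whether hg.mozilla.org always serves them is an assumption.
    """
    markers = (
        f"Not found: {repo}",
        f'The specified repository "{repo}" is unknown, sorry.',
    )
    return any(marker in body for marker in markers)
```

A caller would skip the delete when this returns True, rather than letting the delete step fail.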
More concerning: a bad configuration didn't cause a failure?

    bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone mobile-browser releases/mobile-browser

I think what happened is it didn't delete the previous clone of mobile-browser due to the command line

    bash -c wget -O- http://hg.mozilla.org/releases/mobile-browser >/dev/null && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit mobile-browser delete YES

The first wget failed since there is no releases/mobile-browser, so it didn't proceed with the delete. Then I think the clone failed silently.

To fix comment 0, the command line could add a 2nd wget -O- :

    bash -c wget -O- http://hg.mozilla.org/mobile-browser >/dev/null && wget -O- http://hg.mozilla.org/users/stage-ffxbld/mobile-browser && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit mobile-browser delete YES

To fix this comment, a) I can be less stupid going forward, but b) gotta think about it some more.
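The two-wget command line proposed above could be assembled along these lines. This is only a sketch: `build_delete_command` and its parameters are hypothetical names, not the buildbotcustom code, and the defaults simply reuse the user, key path, and host from the commands in this comment.

```python
import shlex


def build_delete_command(source_url, user_repo_url, repo_name,
                         ssh_user="stage-ffxbld",
                         ssh_key="~cltbld/.ssh/ffxbld_dsa",
                         host="hg.mozilla.org"):
    """Build a `bash -c` argv that deletes the user repo only if both URLs load.

    Each wget must succeed before the next command runs, because `&&`
    short-circuits on the first non-zero exit status.
    """
    checks = [
        f"wget -O- {shlex.quote(url)} >/dev/null"
        for url in (source_url, user_repo_url)
    ]
    # The key path is left unquoted so that bash still expands the tilde.
    delete = (
        f"ssh -l {ssh_user} -i {ssh_key} {host} "
        f"edit {shlex.quote(repo_name)} delete YES"
    )
    return ["bash", "-c", " && ".join(checks + [delete])]
```

With a wrong source URL the first wget still fails and the delete is skipped, so this guards against a missing user repo but not against the bad configuration described in this comment.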
I think I hit this during a staging run for Firefox 3.6.9 build 1 for a bunch of locales.

    bash -c wget -O- http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/id >/dev/null && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit id delete YES
    <snip env>
    --22:52:43--  http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/id
    Resolving hg.mozilla.org... 10.2.74.67
    Connecting to hg.mozilla.org|10.2.74.67|:80... connected.
    HTTP request sent, awaiting response... 200 Script output follows
    Length: 26397 (26K) [text/html]
    Saving to: `STDOUT'
    0% [ ] 0 --.-K/s
    100%[=======================================>] 26,397 --.-K/s in 0.02s
    22:52:43 (1.65 MB/s) - `-' saved [26397/26397]
    Could not find the repository at /users/stage-ffxbld/id.
    Please check the list at https://hg.mozilla.org/users/stage-ffxbld
I hit this on my staging release. One of the delete_repo steps failed but the job did not go red. I assumed that a green repo_setup job would trigger the tagging builder.

http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l3378

The code leaves haltOnFailure and flunkOnFailure at their default values, which seem to be False.

http://hg.mozilla.org/build/buildbot/file/5a049fbe224b/master/buildbot/process/buildstep.py#l576
> haltOnFailure = False
> flunkOnWarnings = False
> flunkOnFailure = False

Are these default values correct? Shouldn't haltOnFailure be True?

It seems that the value has been False since the import of 0.8.1:
http://hg.mozilla.org/build/buildbot/rev/42babfd9ed35#l301.579

Doesn't this mean that a failing step (without haltOnFailure changed to True) would NOT change the state of the job and NOT abort the job?
Wow, it seems that it has been False by default even in 0.7.12:
https://github.com/buildbot/buildbot/blob/buildbot-0.7.12/buildbot/process/buildstep.py#L575

All my Mozilla life I thought that a failing step turns the job red and aborts it (by default). I just noticed that we set haltOnFailure to True in sooooo many places. Shame on me! It seems that we have to add haltOnFailure after all. Sorry for the noise.
(In reply to comment #4)
> http://hg.mozilla.org/build/buildbot/file/5a049fbe224b/master/buildbot/process/buildstep.py#l576
> > haltOnFailure = False
> > flunkOnWarnings = False
> > flunkOnFailure = False
>
> Are these default values correct?
> Shouldn't haltOnFailure be True?

I assume you're talking about RepositorySetupFactory's steps, not the default. In any case, we do *not* want haltOnFailure for the deletions, because they will "fail" if the repository doesn't exist at all. In that situation, we should be proceeding. I don't think there's much we can do to improve the situation until bug 626641 is fixed, because we can't accurately judge existence of a repository at this point.
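To make the flag semantics discussed above concrete, here is a toy model in Python. It is not Buildbot's API: `Step`, `run_build`, and the snake_case field names are hypothetical, and both flags default to False only because that is what the buildstep.py defaults quoted in this thread show.

```python
from dataclasses import dataclass

SUCCESS, FAILURE = "success", "failure"


@dataclass
class Step:
    name: str
    succeeds: bool
    halt_on_failure: bool = False   # stop running later steps on failure
    flunk_on_failure: bool = False  # mark the whole build failed on failure


def run_build(steps):
    """Run steps in order; return (overall result, names of steps that ran)."""
    overall = SUCCESS
    ran = []
    for step in steps:
        ran.append(step.name)
        if step.succeeds:
            continue
        if step.flunk_on_failure:
            overall = FAILURE
        if step.halt_on_failure:
            break
    return overall, ran
```

In this model a failing delete_repo step with both flags left at False yields a green build that carries on to the next step, which matches the behaviour reported in the comments above.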
I am sorry I went off on a tangent. What you say makes sense. We don't want to stop because the deletion failed; the problem is that the job did not trigger the tag builder.
No longer blocks: 478420
* Add another wget against the users repo (requires another releaseConfig variable :( )
Attachment #528563 - Flags: review?(aki)
Staging tests have been passed.
Comment on attachment 528563: buildbotcustom

Now that we have a check to make sure the user repo exists before trying to delete it, should we haltOnFailure=True?
Attachment #528563 - Flags: review?(aki) → review+
Comment on attachment 528564: configs

Thanks for fixing this, Rail!
Attachment #528564 - Flags: review?(aki) → review+
(In reply to comment #11)
> Now that we have a check to make sure the user repo exists before trying to
> delete it, should we haltOnFailure=True ?

Yeah. Interdiff is just 1 line.
Attachment #528648 - Flags: review?(aki) → review+
Comment on attachment 528564: configs

http://hg.mozilla.org/build/buildbot-configs/rev/7e96916bb262
Attachment #528564 - Flags: checked-in+
Comment on attachment 528648: buildbotcustom

http://hg.mozilla.org/build/buildbotcustom/rev/304da956f42b
Attachment #528648 - Flags: checked-in+
All done here. Closing.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering