repo_setup fails out on non-existent staging repos

RESOLVED FIXED

Status

Release Engineering
General
P5
enhancement
RESOLVED FIXED
8 years ago
5 years ago

People

(Reporter: aki, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [automation][releases])

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

8 years ago
The delete_repo step in the repo_setup factory fails (but doesn't halt) if the repo doesn't exist.

This means a first-run staging release can often fail.

Staging releases would be less annoying to run if the step were smart about this (checking the URL, maybe? http://hg.mozilla.org/users/stage-ffxbld/foobarbaz gives both "Not found: foobarbaz" and "The specified repository "foobarbaz" is unknown, sorry." in its response page).
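As a rough illustration of the "check the URL" idea, the sketch below classifies a response body using the two messages quoted above. The function and constant names are hypothetical and are not part of buildbotcustom; fetching the page is left out so the check itself stays easy to test.

```python
# Markers copied from the response page quoted in the description.
MISSING_MARKERS = ("Not found:", "is unknown, sorry")


def page_says_repo_missing(body: str) -> bool:
    """Return True if the response body reports an unknown repository."""
    return any(marker in body for marker in MISSING_MARKERS)
```

A caller would presumably skip the delete when this returns True and attempt it otherwise.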
(Reporter)

Comment 1

8 years ago
More concerning: a bad configuration didn't cause a failure?

bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone mobile-browser releases/mobile-browser

I think what happened is that it didn't delete the previous clone of mobile-browser because of this command line:

bash -c wget -O- http://hg.mozilla.org/releases/mobile-browser >/dev/null && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit mobile-browser delete YES

The first wget failed since there is no releases/mobile-browser, so it didn't proceed with the delete.  Then I think the clone failed silently.
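The short-circuit behaviour of `&&` described above can be modelled in a few lines. This is only a sketch: the labels and the exit code 8 are illustrative stand-ins for the real wget and ssh invocations, not captured output.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], int]]  # (label, function returning an exit code)


def run_and_chain(steps: List[Step]) -> Tuple[int, List[str]]:
    """Emulate `a && b && c`: stop at the first non-zero exit code."""
    executed: List[str] = []
    status = 0
    for label, action in steps:
        executed.append(label)
        status = action()
        if status != 0:
            break
    return status, executed


# Illustrative exit codes: the wget check fails (non-zero), so the
# ssh delete is never attempted.
status, executed = run_and_chain([
    ("wget releases/mobile-browser", lambda: 8),
    ("ssh edit mobile-browser delete YES", lambda: 0),
])
print(status, executed)  # prints: 8 ['wget releases/mobile-browser']
```

The old clone therefore survives, which is consistent with the later clone step having nothing new to create.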

To fix comment 0, the command line could add a 2nd wget -O- :

bash -c wget -O- http://hg.mozilla.org/mobile-browser >/dev/null && wget -O- http://hg.mozilla.org/users/stage-ffxbld/mobile-browser && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit mobile-browser delete YES

To fix this comment, a) I can be less stupid going forward, but b) gotta think about it some more.
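A minimal sketch of how such a command line might be assembled from configuration values follows. The function name and parameters are hypothetical, and unlike the command line above this version also redirects the second wget to /dev/null.

```python
import shlex


def build_delete_command(public_url, user_url, ssh_user, ssh_key, host, repo_name):
    """Chain two existence checks in front of the ssh delete."""
    checks = [
        "wget -O- %s >/dev/null" % shlex.quote(url)
        for url in (public_url, user_url)
    ]
    # The key path is left unquoted so that a leading ~user still expands.
    delete = "ssh -l %s -i %s %s edit %s delete YES" % (
        ssh_user, ssh_key, host, repo_name)
    return " && ".join(checks + [delete])


command = build_delete_command(
    "http://hg.mozilla.org/mobile-browser",
    "http://hg.mozilla.org/users/stage-ffxbld/mobile-browser",
    "stage-ffxbld",
    "~cltbld/.ssh/ffxbld_dsa",
    "hg.mozilla.org",
    "mobile-browser",
)
print(command)
```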
I think I hit this during a staging run for Firefox 3.6.9 build 1, for a bunch of locales.


bash -c wget -O- http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/id >/dev/null && ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit id delete YES
<snip env>
--22:52:43--  http://hg.mozilla.org/releases/l10n-mozilla-1.9.2/id
Resolving hg.mozilla.org... 10.2.74.67
Connecting to hg.mozilla.org|10.2.74.67|:80... connected.
HTTP request sent, awaiting response... 200 Script output follows
Length: 26397 (26K) [text/html]
Saving to: `STDOUT'


 0% [                                        ] 0           --.-K/s             
100%[=======================================>] 26,397      --.-K/s   in 0.02s  

22:52:43 (1.65 MB/s) - `-' saved [26397/26397]

Could not find the repository at /users/stage-ffxbld/id.
Please check the list at https://hg.mozilla.org/users/stage-ffxbld
(Reporter)

Updated

8 years ago
Whiteboard: [automation][releases]

Updated

8 years ago
Duplicate of this bug: 613683
Depends on: 626641
I hit this on my staging release.

One of the delete_repo steps failed but the job did not go red.
I assumed that a green repo_setup job would trigger the tagging builder.

http://hg.mozilla.org/build/buildbotcustom/file/tip/process/factory.py#l3378
The code shows that haltOnFailure and flunkOnFailure are not set, so they keep their default values, which seem to be False.

http://hg.mozilla.org/build/buildbot/file/5a049fbe224b/master/buildbot/process/buildstep.py#l576
> haltOnFailure = False
> flunkOnWarnings = False
> flunkOnFailure = False

Are these default values correct?
Shouldn't haltOnFailure be True?

It seems that the value is set to False since the import of 0.8.1:
http://hg.mozilla.org/build/buildbot/rev/42babfd9ed35#l301.579

Doesn't this mean that a failing step (without haltOnFailure changed to True) would NOT change the state of the job and would NOT abort it?
Wow, it seems that it has been False by default even in 0.7.12:
https://github.com/buildbot/buildbot/blob/buildbot-0.7.12/buildbot/process/buildstep.py#L575

All my Mozilla life I thought that a failing step turns the job red and aborts it (by default).

I just noticed that we set haltOnFailure to True in sooooo many places.
Shame on me!

It seems that we have to add haltOnFailure after all.
Sorry for the noise.
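To make the effect of those defaults concrete, here is a simplified model of how the two flags influence a build. It is not buildbot's code: the `Step` class and `run_build` function are hypothetical, and only the two attribute names come from the quoted source.

```python
from dataclasses import dataclass
from typing import List, Tuple

SUCCESS, FAILURE = "success", "failure"


@dataclass
class Step:
    name: str
    succeeds: bool
    haltOnFailure: bool = False   # stop the build when this step fails
    flunkOnFailure: bool = False  # mark the whole build failed when this step fails


def run_build(steps: List[Step]) -> Tuple[str, List[str]]:
    """Return the overall result and the names of the steps that ran."""
    overall = SUCCESS
    ran: List[str] = []
    for step in steps:
        ran.append(step.name)
        if step.succeeds:
            continue
        if step.flunkOnFailure:
            overall = FAILURE
        if step.haltOnFailure:
            break
    return overall, ran


# With both flags left at False, a failed delete_repo neither turns the
# build red nor stops the steps that follow it.
print(run_build([Step("delete_repo", False), Step("clone_repo", True)]))
```

In this model, setting both flags on delete_repo would produce a failed build that stops after the first step.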
(In reply to comment #4)
> http://hg.mozilla.org/build/buildbot/file/5a049fbe224b/master/buildbot/process/buildstep.py#l576
> > haltOnFailure = False
> > flunkOnWarnings = False
> > flunkOnFailure = False
> 
> Are these default values correct?
> Shouldn't haltOnFailure be True?

I assume you're talking about RepositorySetupFactory's steps, not the default.

In any case, we do *not* want haltOnFailure for the deletions, because they
will "fail" if the repository doesn't exist at all. In that situation, we
should be proceeding.

I don't think there's much we can do to improve the situation until bug 626641
is fixed, because we can't accurately judge existence of a repository at this
point.
I am sorry I went off on a tangent. What you say makes sense.
We don't want to stop because the deletion failed; the problem is that the job did not trigger the tag builder.
(Assignee)

Updated

7 years ago
Blocks: 627307

Updated

7 years ago
Assignee: nobody → catlee

Updated

7 years ago
Assignee: catlee → rail
(Assignee)

Comment 8

7 years ago
Created attachment 528563 [details] [diff] [review]
buildbotcustom

* Add another wget against the users repo (requires another releaseConfig variable :( )
Attachment #528563 - Flags: review?(aki)
(Assignee)

Comment 9

7 years ago
Created attachment 528564 [details] [diff] [review]
configs
Attachment #528564 - Flags: review?(aki)
(Assignee)

Comment 10

7 years ago
Staging tests have passed.
(Reporter)

Comment 11

7 years ago
Comment on attachment 528563 [details] [diff] [review]
buildbotcustom

Now that we have a check to make sure the user repo exists before trying to delete it, should we set haltOnFailure=True?
Attachment #528563 - Flags: review?(aki) → review+
(Reporter)

Comment 12

7 years ago
Comment on attachment 528564 [details] [diff] [review]
configs

Thanks for fixing this, Rail!
Attachment #528564 - Flags: review?(aki) → review+
(Assignee)

Comment 13

7 years ago
Created attachment 528648 [details] [diff] [review]
buildbotcustom

(In reply to comment #11) 
> Now that we have a check to make sure the user repo exists before trying to
> delete it, should we haltOnFailure=True ?

Yeah. Interdiff is just 1 line.
Attachment #528563 - Attachment is obsolete: true
Attachment #528648 - Flags: review?(aki)
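The reasoning behind enabling haltOnFailure can be sketched as follows. This is an assumption about the intended behaviour, not the contents of the attached patch: once a missing repository is treated as "nothing to do", the only remaining way for the step to fail is a real deletion error, which is worth halting on. The function and parameter names are hypothetical.

```python
from typing import Callable


def delete_if_present(repo_exists: Callable[[], bool],
                      delete: Callable[[], int]) -> int:
    """Return 0 if there is nothing to delete, else the delete's exit code."""
    if not repo_exists():
        return 0  # a missing repository is not an error
    return delete()


# Missing repo: succeeds without attempting the delete.
print(delete_if_present(lambda: False, lambda: 1))  # prints: 0
# Existing repo whose deletion fails: the failure is reported.
print(delete_if_present(lambda: True, lambda: 1))   # prints: 1
```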
(Reporter)

Updated

7 years ago
Attachment #528648 - Flags: review?(aki) → review+
(Assignee)

Comment 16

7 years ago
All done here. Closing.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering