Closed Bug 694809 Opened 13 years ago Closed 13 years ago

RetryingShellCommand getting old malloc leak logs shouldn't cause a RETRY when it fails

Categories

(Release Engineering :: General, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: Callek, Assigned: bhearsum)

References

Details

Attachments

(2 files, 1 obsolete file)

So here's the downside: doing this (Bug 687832) actually causes us to CONSTANTLY fail on new trees with regard to leak/malloc logs.

Before, we would get to http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#1180, and if that download failed it was "oh well"... we just moved on.

Eventually, even on a failed build, we would upload the logs: http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#1271

Now, instead of continuing on, the RetryingShellCommand here fails out completely and triggers a RETRY, instead of uploading our newly created logs.

I hit this with SeaMonkey when deploying the PGO changes that move these logs to ${platform}-debug rather than ${platform}.
Attached patch [buildbotcustom] v1 (obsolete) — Splinter Review
Whichever of you gets to this.

This patch implements version one of my theory here:
* Add a fatal kwarg to RetryingShellCommand and use it to decide whether we pass --no-automation-errmsg, which allows retry.py to fail without causing an immediate halt and restart of the build itself.

An alternative, which is probably cleaner (though it feels *more* hacky to me), is to piggyback on the existing |haltOnFailure| kwarg and pass --no-automation-errmsg when it is False.
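
For illustration, here is a minimal sketch of the fatal-kwarg idea (the retry.py path and invocation details are assumptions for the sketch, not the actual patch; either naming choice boils down to prepending --no-automation-errmsg to the retry.py wrapper):

from buildbot.steps.shell import ShellCommand

class RetryingShellCommand(ShellCommand):
    # Sketch only: run the real command under retry.py so the slave retries
    # it a few times before the step is marked as failed. Assumes `command`
    # is a list and that retry.py takes the wrapped command as trailing args.
    def __init__(self, command, fatal=True, **kwargs):
        wrapper = ['python', 'scripts/retry.py']  # path is illustrative
        if not fatal:
            # Suppress the "Automation Error:" marker that the global log
            # evaluator would otherwise escalate into a build-wide RETRY,
            # so a failure here stays an ordinary step failure.
            wrapper.append('--no-automation-errmsg')
        ShellCommand.__init__(self, command=wrapper + list(command), **kwargs)
        self.addFactoryArguments(fatal=fatal)

With something like this, the leak-log download step could pass fatal=False (or, per the alternative, haltOnFailure=False) so that a miss on a brand-new tree is just a failed step rather than a build-wide RETRY.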

I'll accept whoever wishes to review this.

(In the meantime I backed out the buildbotcustom part of Bug 687832 on the seamonkey-production branch)
Attachment #567318 - Flags: review?(catlee)
Attachment #567318 - Flags: review?(bhearsum)
Following in Nick's footsteps from bug 692240 comment #5, I ran the following to green up SeaMonkey comm*:

for b in comm-central-trunk comm-aurora comm-1.9.1 comm-2.0 comm-release; do
    for p in linux linux64 macosx win32; do
        mkdir -p $b-$p-debug
        cp -pv $b-$p/{malloc.log,sdleak.tree} $b-$p-debug/
    done
done
Comment on attachment 567318 [details] [diff] [review]
[buildbotcustom] v1

Thanks for catching this. The naming of 'fatal' is a bit weird to me; it suggests that it directly impacts the overall status of the build, when all it does is change one of the retry.py arguments. I'm fine with the overall approach, but how about being more explicit and calling it something like 'dontPrintRetryErrmsg'?

I'm open to other suggestions, too; 'fatal' is just really weird because it sounds like it overlaps with haltOnFailure and/or flunkOnFailure.
Should this behaviour be tied to haltOnFailure?  Or maybe call it retryOnFailure, with default True?
(In reply to Chris AtLee [:catlee] from comment #4)
> Should this behaviour be tied to haltOnFailure?  Or maybe call it
> retryOnFailure, with default True?

I was thinking about tying it to haltOnFailure, but doing so isn't going to solve our issue here. The problem is that EXCEPTION and RETRY always halt the build, regardless of the haltOnFailure flag.

I'm not a _huge_ fan of calling it retryOnFailure, because that's not really what it's doing. (The retrying happens because of the global log evaluation. So, if we remove _that_, retryOnFailure will break, possibly mysteriously.)
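
For context, the global log evaluation works roughly like the sketch below (illustrative, not buildbotcustom's actual code): a table of regexes is run over every step's log, and a match can escalate the step's status to RETRY, which stops and re-runs the whole build regardless of per-step flags.

import re
from buildbot.status.builder import SUCCESS, FAILURE, RETRY

# Sketch of a global (regex, status) table applied to every step's stdio log.
global_errors = [
    (re.compile("Automation Error:"), RETRY),
]

def evaluate_log(log_text, command_rc):
    # Start from the command's own exit status...
    status = SUCCESS if command_rc == 0 else FAILURE
    # ...then let any matching global error escalate it. RETRY outranks
    # FAILURE, and a RETRY/EXCEPTION result always halts the build no matter
    # what haltOnFailure or flunkOnFailure say on the individual step.
    for regex, error_status in global_errors:
        if regex.search(log_text):
            status = max(status, error_status)
    return status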
We just chatted about this some on IRC and I'm on board with the current patch, as long as we look at haltOnFailure instead of 'fatal'. After talking and thinking it through some more, that seems like the least bad of the options.
Comment on attachment 567318 [details] [diff] [review]
[buildbotcustom] v1

Obsolete per comment #6
Attachment #567318 - Attachment is obsolete: true
Attachment #567318 - Flags: review?(catlee)
Attachment #567318 - Flags: review?(bhearsum)
Another place I've had trouble with the more aggressive retrying is doing release updates in staging. Lots of things use retry.py, like grabbing shipped-locales or doing cvs checkouts, and if I've not set up everything correctly I'll get many, many purple builds until I interrupt it or fix the issue. The previous behaviour of failing once would be preferable, so perhaps bug 687832 is too broad a fix?
(In reply to Nick Thomas [:nthomas] from comment #8)
> Another place I've had trouble with the more aggressive-retrying is doing
> release updates in staging. Lots of things use retry.py, like grabbing
> shipped-locales or doing cvs checkouts, and if I've not set up everything
> correctly I'll get many many purple builds until I interrupt it or fix the
> issue. The previous behaviour of failing once would be preferable, so
> perhaps bug 687832 is too broad of a fix ?

I'm inclined not to worry about the staging issues unless it's something we think can happen in production, too. The number of different behaviours we have to deal with is already large, even without taking staging-specific issues into account, where human error is often a factor. Or are you thinking that there could be other edge cases we haven't thought of?
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> (In reply to Nick Thomas [:nthomas] from comment #8)
> > Another place I've had trouble with the more aggressive-retrying is doing
> > release updates in staging. Lots of things use retry.py, like grabbing
> > shipped-locales or doing cvs checkouts, and if I've not set up everything
> > correctly I'll get many many purple builds until I interrupt it or fix the
> > issue. The previous behaviour of failing once would be preferable, so
> > perhaps bug 687832 is too broad of a fix ?
> 
> I'm inclined not to worry about the staging issues unless it's something we
> think can happen in production, too. The number of different types of
> behaviour we have to deal with is already numerous, without taking into
> account staging-specific issues, where human error is often a factor. Or are
> you thinking that there could be other edge cases that we haven't thought of?

Reading this over again, I think the staging problem boils down to us retrying when the issue might just be a code/user error and not a real cause for a RETRY, which would mean staging repeating builds over and over, adding extra unnecessary load. We may want to implement a way to disable this entirely (in a different bug, of course) to satisfy that staging concern, but I'll leave that up to you.
IMO, load isn't a concern with unnecessary retries in staging -- we're only talking about a few slaves there instead of tens or hundreds.
My concern boils down to RetryingShellCommand being used in a lot of places in factory.py. AIUI we have changed its behaviour from 'retry this command N times' to 'retry this command N times, and redo the whole job if that failed'.

Have we been through all the use cases and decided that the new behaviour is appropriate? If so, I can pay more attention when using staging. For at least some usages the change makes the error state worse rather than better, e.g. this bug, where we get as many compiles as it takes for someone to notice the problem.
I definitely did not go through all of the cases before bug 687832 landed, and your point is well taken. What do you think of setting --no-automation-errmsg by default, and letting specific steps override that as an opt-in?

Though I'm beginning to think that there are so many different failure conditions that we may need to shunt the decision down to the actual scripts, which could easily detect 40x vs. 50x responses and other such things....
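
To make that concrete, here is a sketch of what a script-side decision could look like (a hypothetical helper, not retry.py itself): fail once on 4xx responses, and only retry on 5xx responses or connection errors.

import time
import urllib2

def fetch_with_retry(url, attempts=5, sleeptime=30):
    # Hypothetical helper: only retry errors that plausibly go away on their
    # own. A 4xx means the request itself is wrong (bad URL, log never
    # generated), so neither retrying nor a build-wide RETRY will help;
    # 5xx responses and connection errors are worth another attempt.
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.HTTPError as e:
            if 400 <= e.code < 500:
                raise          # permanent failure: fail once, loudly
        except urllib2.URLError:
            pass               # transient network trouble: keep trying
        time.sleep(sleeptime)
    raise Exception("giving up on %s after %d attempts" % (url, attempts))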
We chatted about this a bit on IRC today and came up with a few possible solutions.
* Move all of the leak test work to an external script, which can guarantee that uploads will always happen, and also use Automation Error to ask for EXCEPTION or RETRY status when desirable.
* Set alwaysRun=True on the log upload steps, but use doStepIf to only run them when the logs exist (so that we avoid purple/red when builds fail prior to downloading/generating the logs). I'm not 100% sure this is possible, because doStepIf is master-side code (see the sketch after this list)....
* Just backout bug 687832, and improve the RETRY situation later.
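
Here is the sketch referenced in the second option above (Buildbot 0.8-era API names from memory, with the command and property name made up for illustration). The catch shows up in the callable: doStepIf runs on the master, so it can only consult master-side state like build properties, not the slave's filesystem.

from buildbot.steps.shell import ShellCommand

def leak_logs_exist(step):
    # doStepIf is evaluated on the MASTER: the only state available here is
    # master-side, e.g. build properties. There is no direct way to check
    # whether malloc.log exists on the slave, so some earlier slave-side
    # step would have to record that in this (hypothetical) property first,
    # and that step runs into the same ordering problem.
    return bool(step.build.getProperties().getProperty("leak_logs_exist"))

upload_leak_logs = ShellCommand(
    name="upload_leak_logs",
    command=["make", "upload-leak-logs"],  # illustrative command
    alwaysRun=True,            # run even when an earlier step failed...
    doStepIf=leak_logs_exist,  # ...but only if we think the logs exist
    flunkOnFailure=False,
)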

Regardless of what we do to fix the leak tests, I think we should back out the retry.py patch for the reason mentioned in comment #13.

I'm going to look into the feasibility of the first two options.
Assignee: bugspam.Callek → bhearsum
Offhand, I'm not sure there's a good reason we don't download and upload at the same time.... I think it's just something we inherited from Tinderbox, though.
Currently testing these patches in staging, let's see how it goes!
(In reply to Ben Hearsum [:bhearsum] from comment #14)
> We chatted about this a bit on IRC today and come up with a few possible
> solutions.
> * Move all of the leak test work to an external script, which can guarantee
> that uploads will always happen, and also use Automation Error to ask for
> EXCEPTION or RETRY status when desirable.

This plan didn't work out. Moving all of the leak test work outside of Buildbot is far from trivial because of all the log parsing we have to do. Even just moving the download/upload steps out isn't easy, because one of the logs we upload (sdleak.tree) doesn't exist until after some post-processing, which has to happen _after_ downloading some of the logs.

> * Set alwaysRun=True on the log upload steps, but use doStepIf to only run
> them when the logs exist (so that we avoid purple/red when builds fail prior
> to downloading/generating the logs). I'm not 100% sure this is possible,
> because doStepIf is master-side code....

My suspicion about this was correct, too.

> * Just backout bug 687832, and improve the RETRY situation later.

So I'm just going to do this, as we've hit multiple retry loops, and I don't have another immediate fix!
I guess this is INVALID now, since the original bug is backed out?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Product: mozilla.org → Release Engineering