Closed Bug 438324 Opened 16 years ago Closed 16 years ago

unit test boxes should show an error in the tinderbox log if a step times out

Categories

(Release Engineering :: General, defect, P2)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ted, Assigned: coop)

References

()

Details

Attachments

(3 files)

In the log in the URL, a test in |make check| timed out, but there's no visible evidence in the log that that happened, so it's not clear at all why the machine is orange. We should get something in the log that we could pick up with the unittest error parser in tinderbox.
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
Blocks: 443323
Assignee: nobody → ccooper
Priority: P3 → P2
Status: NEW → ASSIGNED
Just want to make sure I'm on the right track here before I go to far (i.e. generating patches for all the various unittest masters), although I already have this working for unittest-staging. See:

http://tinderbox.mozilla.org/showlog.cgi?log=UnitTest/1215195141.1215195153.22519.gz

and:

http://qm-unittest02.mozilla.org:2005/Linux%20qm-stage-centos5-01%20dep%20unit%20test/builds/2253/step-check/2

In short:
- subclass ShellCommand and make all the steps inherit from it so they can propagate timeouts
- when we hit a timeout, add a new timeout log to buildbot that will show on the buildbot waterfall
- the new timeout log contains a string that will get parsed by the tinderbox server error parser *AND* a TinderboxPrint statement that will add a step timeout message to the tinderbox waterfall

Unrelated, but useful:
- shorten the timeouts on compile from 3 hours(!) to 15 minutes. These are timeouts without any output, not static blocks of time. I think if we don't have output after 15m, we're hosed anyway, why wait 3 hours?
- shorten timeout on "make check" from 40 minutes to 5 minutes. Same reasoning as above. Could likely be shortened even more.
Attachment #328158 - Flags: review?(bhearsum)
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox

A couple of questions for consideration:

are these timeouts long enough? Some of them look a little short.

Do we want to report these values to the tinderbox waterfall or should they just be propagated to the logs. I think they're currently being removed as they're not in stdio.
(In reply to comment #2)
> are these timeouts long enough? Some of them look a little short.

Maybe I'm misunderstanding how the timeouts work, but after having delved into the timeout code, I'm pretty sure the timeouts are based on inactivity rather than total command duration.

e.g. a command that takes 1.5 hours to run (such as the Win32 clobber nightly) will not timeout even with a timeout of 15 minutes, provided that the command is still making progress, as evidenced by new output to stdout.

Regardless, this part of the patch is "bonus." Don't let it detract from the real changes in mozbuild.py.

> Do we want to report these values to the tinderbox waterfall or should they
> just be propagated to the logs. I think they're currently being removed as
> they're not in stdio.

I dunno, that's why I'm asking for comments. 

If the tinderbox is orange because of a timeout, I think having that timeout information right in the waterfall is useful vs. forcing developers to drill down to determine the reason for the orange. I'm open to suggestions here though.
(In reply to comment #3)
> (In reply to comment #2)
> > are these timeouts long enough? Some of them look a little short.
> 
> Maybe I'm misunderstanding how the timeouts work, but after having delved into
> the timeout code, I'm pretty sure the timeouts are based on inactivity rather
> than total command duration.

Oh, that could be. I always assumed it was for the full duration of a command. Makes sense though and is easy to test.

> > Do we want to report these values to the tinderbox waterfall or should they
> > just be propagated to the logs. I think they're currently being removed as
> > they're not in stdio.
> 
> I dunno, that's why I'm asking for comments. 
> 
> If the tinderbox is orange because of a timeout, I think having that timeout
> information right in the waterfall is useful vs. forcing developers to drill
> down to determine the reason for the orange. I'm open to suggestions here
> though.

Also makes sense, though if we combined an error parser fix and some meaningful text in the logs it'd make it easy enough to find. We can try it on the tinderbox waterfall and see if people complain about it.
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > are these timeouts long enough? Some of them look a little short.
> > 
> > Maybe I'm misunderstanding how the timeouts work, but after having delved into
> > the timeout code, I'm pretty sure the timeouts are based on inactivity rather
> > than total command duration.
> 
> Oh, that could be. I always assumed it was for the full duration of a command.
> Makes sense though and is easy to test.
> 

It's definitely time-without-output.
FWIW, the PGO build might go for a fair amount of time without output when linking libxul. The total link time is ~30 mins on my computer, but I don't recall how much of that is without output, so 15 mins might be fine.
I filed bug 443854 yesterday about reducing the timeout, without any knowledge of the discussion in this bug (honest!).
(In reply to comment #6)
> FWIW, the PGO build might go for a fair amount of time without output when
> linking libxul. The total link time is ~30 mins on my computer, but I don't
> recall how much of that is without output, so 15 mins might be fine.

Yeah, PGO linking is going to be the worst-case scenario, but the timeout there is 3 hours(!) currently. Even if we drop that single step timeout down to 1 hour, that won't be as painful when we do timeout.

Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox

>Index: testing/unittest/master.cfg
>===================================================================
>@@ -737,11 +737,11 @@ win2k3PGOFactory.addStep(step.Compile, n
> win2k3PGOFactory.addStep(step.Compile, 
>         command=["make", "-f", "client.mk", "profiledbuild"], 
>         workdir='mozilla',
>-        timeout=60*60*3, # 60 Seconds * 60 Minutes * 3 Hours
>+        timeout=60*15,

I really think 15 minutes is too short here. We use 3600 (1h) for m-c PGO builds, IIRC we had timeouts even if the timeout was set to 1800 (30min). Please bump this to at least 45 minutes.



I haven't watched output from the various tests in a long time, so I'll have defer to Rob for that.


>Index: testing/unittest/mozbuild.py
>===================================================================

Everything in this file seems fine.

r=bhearsum provided Rob signs off on the test timeouts.
Attachment #328158 - Flags: review?(rcampbell)
Attachment #328158 - Flags: review?(bhearsum)
Attachment #328158 - Flags: review+
One more comment: If possible, these changes should land in buildbotcustom too. The less disparity we have, the better.
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox

looks good
Attachment #328158 - Flags: review?(rcampbell) → review+
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox

I set the compile timeout on all Windows builds to 1 hour just to be safe.

I also fixed up a naming issue with TinderboxShellCommand.
Attachment #328158 - Attachment description: Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox → [checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
Love to get this in - we still have Oranges on the physical Mac box for Firefox tree that have no error info in them :(
Attachment #328734 - Flags: review?(rcampbell) → review+
Attachment #328736 - Flags: review?(rcampbell) → review+
Attachment #328736 - Attachment description: Use ShellCommandReportTimeout for moz2 unittesting → [checked in] Use ShellCommandReportTimeout for moz2 unittesting
Comment on attachment 328734 [details] [diff] [review]
[checked in] Subclass ShellCommand in buildbotcustom

Planning to push all these patches to production and reconfig the buildbot masters during the maintenance window tonight which starts at 7pm PDT.

The same/similar patches are already running on the unittest-staging and leaktest masters, so I'm not very worried about this change.
Attachment #328734 - Attachment description: Subclass ShellCommand in buildbotcustom → [checked in] Subclass ShellCommand in buildbotcustom
Firefox 3.0 buildbot master has been restarted with the patch.

I'm waiting to restart the Moz2 master until after the boxes involved in bug 444674 go green.
Moz2 master restarted with the patch.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Depends on: 446198
Some of the changes here got reverted. At least
* the 1.9 master on production-master.b.m.o has the timeout's back at 40 minutes
* the m-c master on qm-moz2-unittest01.m.o isn't using ShellCommandReportTimeout (diff is against cvs, but needs to be unrotted and land in hg). The timeouts should probably be shortened too, see also bug 443854.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #19)
> Some of the changes here got reverted. At least
> * the 1.9 master on production-master.b.m.o has the timeout's back at 40
> minutes

These changes exist in CVS tip, but the master isn't running with the most recent version of master.cfg.

> * the m-c master on qm-moz2-unittest01.m.o isn't using
> ShellCommandReportTimeout (diff is against cvs, but needs to be unrotted and
> land in hg). The timeouts should probably be shortened too, see also bug
> 443854.

See attachment 341672 [details] [diff] [review] in bug 443854. We'll handle followup for both issues there.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: