Closed
Bug 438324
Opened 16 years ago
Closed 16 years ago
unit test boxes should show an error in the tinderbox log if a step times out
Categories
(Release Engineering :: General, defect, P2)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ted, Assigned: coop)
References
()
Details
Attachments
(3 files)
12.72 KB,
patch
|
bhearsum
:
review+
rcampbell
:
review+
|
Details | Diff | Splinter Review |
8.82 KB,
patch
|
rcampbell
:
review+
|
Details | Diff | Splinter Review |
5.76 KB,
patch
|
rcampbell
:
review+
|
Details | Diff | Splinter Review |
In the log in the URL, a test in |make check| timed out, but there's no visible evidence in the log that that happened, so it's not clear at all why the machine is orange. We should get something in the log that we could pick up with the unittest error parser in tinderbox.
Updated•16 years ago
|
Component: Release Engineering → Release Engineering: Future
Priority: -- → P3
Assignee | ||
Updated•16 years ago
|
Assignee: nobody → ccooper
Priority: P3 → P2
Assignee | ||
Updated•16 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•16 years ago
|
||
Just want to make sure I'm on the right track here before I go to far (i.e. generating patches for all the various unittest masters), although I already have this working for unittest-staging. See:
http://tinderbox.mozilla.org/showlog.cgi?log=UnitTest/1215195141.1215195153.22519.gz
and:
http://qm-unittest02.mozilla.org:2005/Linux%20qm-stage-centos5-01%20dep%20unit%20test/builds/2253/step-check/2
In short:
- subclass ShellCommand and make all the steps inherit from it so they can propagate timeouts
- when we hit a timeout, add a new timeout log to buildbot that will show on the buildbot waterfall
- the new timeout log contains a string that will get parsed by the tinderbox server error parser *AND* a TinderboxPrint statement that will add a step timeout message to the tinderbox waterfall
Unrelated, but useful:
- shorten the timeouts on compile from 3 hours(!) to 15 minutes. These are timeouts without any output, not static blocks of time. I think if we don't have output after 15m, we're hosed anyway, why wait 3 hours?
- shorten timeout on "make check" from 40 minutes to 5 minutes. Same reasoning as above. Could likely be shortened even more.
Attachment #328158 -
Flags: review?(bhearsum)
Comment 2•16 years ago
|
||
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
A couple of questions for consideration:
are these timeouts long enough? Some of them look a little short.
Do we want to report these values to the tinderbox waterfall or should they just be propagated to the logs. I think they're currently being removed as they're not in stdio.
Assignee | ||
Comment 3•16 years ago
|
||
(In reply to comment #2)
> are these timeouts long enough? Some of them look a little short.
Maybe I'm misunderstanding how the timeouts work, but after having delved into the timeout code, I'm pretty sure the timeouts are based on inactivity rather than total command duration.
e.g. a command that takes 1.5 hours to run (such as the Win32 clobber nightly) will not timeout even with a timeout of 15 minutes, provided that the command is still making progress, as evidenced by new output to stdout.
Regardless, this part of the patch is "bonus." Don't let it detract from the real changes in mozbuild.py.
> Do we want to report these values to the tinderbox waterfall or should they
> just be propagated to the logs. I think they're currently being removed as
> they're not in stdio.
I dunno, that's why I'm asking for comments.
If the tinderbox is orange because of a timeout, I think having that timeout information right in the waterfall is useful vs. forcing developers to drill down to determine the reason for the orange. I'm open to suggestions here though.
Comment 4•16 years ago
|
||
(In reply to comment #3)
> (In reply to comment #2)
> > are these timeouts long enough? Some of them look a little short.
>
> Maybe I'm misunderstanding how the timeouts work, but after having delved into
> the timeout code, I'm pretty sure the timeouts are based on inactivity rather
> than total command duration.
Oh, that could be. I always assumed it was for the full duration of a command. Makes sense though and is easy to test.
> > Do we want to report these values to the tinderbox waterfall or should they
> > just be propagated to the logs. I think they're currently being removed as
> > they're not in stdio.
>
> I dunno, that's why I'm asking for comments.
>
> If the tinderbox is orange because of a timeout, I think having that timeout
> information right in the waterfall is useful vs. forcing developers to drill
> down to determine the reason for the orange. I'm open to suggestions here
> though.
Also makes sense, though if we combined an error parser fix and some meaningful text in the logs it'd make it easy enough to find. We can try it on the tinderbox waterfall and see if people complain about it.
Comment 5•16 years ago
|
||
(In reply to comment #4)
> (In reply to comment #3)
> > (In reply to comment #2)
> > > are these timeouts long enough? Some of them look a little short.
> >
> > Maybe I'm misunderstanding how the timeouts work, but after having delved into
> > the timeout code, I'm pretty sure the timeouts are based on inactivity rather
> > than total command duration.
>
> Oh, that could be. I always assumed it was for the full duration of a command.
> Makes sense though and is easy to test.
>
It's definitely time-without-output.
Reporter | ||
Comment 6•16 years ago
|
||
FWIW, the PGO build might go for a fair amount of time without output when linking libxul. The total link time is ~30 mins on my computer, but I don't recall how much of that is without output, so 15 mins might be fine.
Comment 7•16 years ago
|
||
I filed bug 443854 yesterday about reducing the timeout, without any knowledge of the discussion in this bug (honest!).
Assignee | ||
Comment 8•16 years ago
|
||
(In reply to comment #6)
> FWIW, the PGO build might go for a fair amount of time without output when
> linking libxul. The total link time is ~30 mins on my computer, but I don't
> recall how much of that is without output, so 15 mins might be fine.
Yeah, PGO linking is going to be the worst-case scenario, but the timeout there is 3 hours(!) currently. Even if we drop that single step timeout down to 1 hour, that won't be as painful when we do timeout.
Comment 9•16 years ago
|
||
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
>Index: testing/unittest/master.cfg
>===================================================================
>@@ -737,11 +737,11 @@ win2k3PGOFactory.addStep(step.Compile, n
> win2k3PGOFactory.addStep(step.Compile,
> command=["make", "-f", "client.mk", "profiledbuild"],
> workdir='mozilla',
>- timeout=60*60*3, # 60 Seconds * 60 Minutes * 3 Hours
>+ timeout=60*15,
I really think 15 minutes is too short here. We use 3600 (1h) for m-c PGO builds, IIRC we had timeouts even if the timeout was set to 1800 (30min). Please bump this to at least 45 minutes.
I haven't watched output from the various tests in a long time, so I'll have defer to Rob for that.
>Index: testing/unittest/mozbuild.py
>===================================================================
Everything in this file seems fine.
r=bhearsum provided Rob signs off on the test timeouts.
Attachment #328158 -
Flags: review?(rcampbell)
Attachment #328158 -
Flags: review?(bhearsum)
Attachment #328158 -
Flags: review+
Comment 10•16 years ago
|
||
One more comment: If possible, these changes should land in buildbotcustom too. The less disparity we have, the better.
Comment 11•16 years ago
|
||
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
looks good
Attachment #328158 -
Flags: review?(rcampbell) → review+
Assignee | ||
Comment 12•16 years ago
|
||
Comment on attachment 328158 [details] [diff] [review]
[checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
I set the compile timeout on all Windows builds to 1 hour just to be safe.
I also fixed up a naming issue with TinderboxShellCommand.
Attachment #328158 -
Attachment description: Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox → [checked in] Subclass ShellCommand to always log timeouts in a way that makes them available to tinderbox
Assignee | ||
Comment 13•16 years ago
|
||
Attachment #328734 -
Flags: review?(rcampbell)
Assignee | ||
Comment 14•16 years ago
|
||
Attachment #328736 -
Flags: review?(rcampbell)
Comment 15•16 years ago
|
||
Love to get this in - we still have Oranges on the physical Mac box for Firefox tree that have no error info in them :(
Updated•16 years ago
|
Attachment #328734 -
Flags: review?(rcampbell) → review+
Updated•16 years ago
|
Attachment #328736 -
Flags: review?(rcampbell) → review+
Assignee | ||
Updated•16 years ago
|
Attachment #328736 -
Attachment description: Use ShellCommandReportTimeout for moz2 unittesting → [checked in] Use ShellCommandReportTimeout for moz2 unittesting
Assignee | ||
Comment 16•16 years ago
|
||
Comment on attachment 328734 [details] [diff] [review]
[checked in] Subclass ShellCommand in buildbotcustom
Planning to push all these patches to production and reconfig the buildbot masters during the maintenance window tonight which starts at 7pm PDT.
The same/similar patches are already running on the unittest-staging and leaktest masters, so I'm not very worried about this change.
Attachment #328734 -
Attachment description: Subclass ShellCommand in buildbotcustom → [checked in] Subclass ShellCommand in buildbotcustom
Assignee | ||
Comment 17•16 years ago
|
||
Firefox 3.0 buildbot master has been restarted with the patch.
I'm waiting to restart the Moz2 master until after the boxes involved in bug 444674 go green.
Assignee | ||
Comment 18•16 years ago
|
||
Moz2 master restarted with the patch.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 19•16 years ago
|
||
Some of the changes here got reverted. At least
* the 1.9 master on production-master.b.m.o has the timeout's back at 40 minutes
* the m-c master on qm-moz2-unittest01.m.o isn't using ShellCommandReportTimeout (diff is against cvs, but needs to be unrotted and land in hg). The timeouts should probably be shortened too, see also bug 443854.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 20•16 years ago
|
||
(In reply to comment #19)
> Some of the changes here got reverted. At least
> * the 1.9 master on production-master.b.m.o has the timeout's back at 40
> minutes
These changes exist in CVS tip, but the master isn't running with the most recent version of master.cfg.
> * the m-c master on qm-moz2-unittest01.m.o isn't using
> ShellCommandReportTimeout (diff is against cvs, but needs to be unrotted and
> land in hg). The timeouts should probably be shortened too, see also bug
> 443854.
See attachment 341672 [details] [diff] [review] in bug 443854. We'll handle followup for both issues there.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 21•15 years ago
|
||
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•