Closed
Bug 761500
Opened 13 years ago
Closed 13 years ago
21600 seconds without output is a very long time to wait on Talos runs
Categories: Release Engineering :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People
(Reporter: philor, Assigned: mozilla)
Attachments (1 file)
patch, 749 bytes (Callek: review+, mozilla: checked-in+)
Something happened to the graphserver a bit after 15:00 today, when a whole bunch of Dromaeo runs and a scattering of other flavors were finishing up on merge-related pushes, on both aurora and central. Instead of properly dying from the graphserver unreachable failures, they just sat.
But that isn't what I want to talk about. What I want to talk about is the amount of time they sat: 21600 seconds. https://tbpl.mozilla.org/php/getParsedLog.php?id=12374222&tree=Firefox and at least 34 other slaves spent 6 hours sitting muttering to themselves, when our longest actual runs seem to be around 40 minutes, and some of the flavors that timed out after 6 hours actually only take 5 minutes. We might have needed a 6 hour timeout sometime in the dim dark past, but I don't think we do anymore.
Comment 1 (Assignee) • 13 years ago
How would you feel about an aggressive 3600 seconds? Or should we play it safer with 7200?
Comment 2 • 13 years ago
(In reply to Aki Sasaki [:aki] from comment #1)
> How would you feel about an aggressive 3600 seconds? Or should we play it
> safer with 7200?
I think a 3600-second timeout with a 10800-second *max* time is probably the safest/best choice.
That is, our thinking here is that "1 hour with no output" and "3 hours in total" are each too long. I can't think of anything that should run for an hour with no output, and *every* green [tegra] job I have seen takes <3 hours.
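To make the difference between the two limits concrete, here is a minimal Python sketch of a runner that enforces both a no-output timeout and a total run time cap. It is not the buildbotcustom patch attached to this bug; the function name `run_with_timeouts`, its parameters, and the status strings are all hypothetical, and only the default values (3600 and 10800 seconds) come from the discussion above.

```python
import queue
import subprocess
import sys
import threading
import time


def run_with_timeouts(cmd, idle_timeout=3600, max_time=10800):
    """Run cmd and return (status, output_lines).

    status is "finished", "idle_timeout" (no output for idle_timeout
    seconds) or "max_time" (total run time exceeded max_time seconds).
    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    start = time.monotonic()
    output = []
    while True:
        remaining = max_time - (time.monotonic() - start)
        if remaining <= 0:
            status = "max_time"
            break
        try:
            # Wait for the next line, but never past the overall deadline.
            line = lines.get(timeout=min(idle_timeout, remaining))
        except queue.Empty:
            # Whichever limit was shorter is the one that expired.
            status = "idle_timeout" if idle_timeout <= remaining else "max_time"
            break
        if line is None:
            proc.wait()
            return "finished", output
        output.append(line)

    # One of the limits was hit: stop the process instead of waiting on it.
    proc.kill()
    proc.wait()
    return status, output


if __name__ == "__main__":
    # A child that prints once and then goes silent trips the idle limit.
    child = "import time; print('start', flush=True); time.sleep(5)"
    print(run_with_timeouts([sys.executable, "-c", child],
                            idle_timeout=1, max_time=10))
```

With both limits in place, a job that hangs silently is stopped after the idle limit, while a job that keeps printing but never finishes is still stopped once the total cap is reached.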
Comment 3 (Assignee) • 13 years ago
Attachment #630195 - Flags: review?(bugspam.Callek)
Comment 4 • 13 years ago
Comment on attachment 630195 (as requested)
Looks good. IMO a staging run of this would be useful but not required. Either way, we need to watch the pass/fail ratio and the affected jobs carefully for about 24 hours after this deploys, to be extra sure that we are not causing real failures in legitimately long-running cases.
Attachment #630195 - Flags: review?(bugspam.Callek) → review+
Comment 5 (Assignee) • 13 years ago
Comment on attachment 630195 (as requested)
http://hg.mozilla.org/build/buildbotcustom/rev/035be908a233
Attachment #630195 - Flags: checked-in+
Comment 6 • 13 years ago
The change was merged to the production branch, and the reconfiguration happened at 6:15 AM PDT.
Updated (Assignee) • 13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 13 years ago
Assignee: nobody → aki
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Updated • 7 years ago
Component: General Automation → General