Closed Bug 761500 Opened 13 years ago Closed 13 years ago

21600 seconds without output is a very long time to wait on Talos runs

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: mozilla)

Details

Attachments

(1 file)

Something happened to the graphserver a bit after 15:00 today, when a whole bunch of Dromaeo runs and a scattering of other flavors were finishing up on merge-related pushes, on both aurora and central. Instead of properly dying from the graphserver-unreachable failures, they just sat.

But that isn't what I want to talk about; the amount of time they sat is: 21600 seconds. https://tbpl.mozilla.org/php/getParsedLog.php?id=12374222&tree=Firefox and at least 34 other slaves spent 6 hours sitting muttering to themselves, when our longest actual runs seem to be around 40 minutes, and some of the flavors that timed out after 6 hours actually only take 5 minutes. We might have needed a 6-hour timeout sometime in the dim dark past, but I don't think we do anymore.
How would you feel about an aggressive 3600 seconds? Or should we play it safer with 7200?
(In reply to Aki Sasaki [:aki] from comment #1)
> How would you feel about an aggressive 3600 seconds? Or should we play it
> safer with 7200?

I think a 3600-second timeout with a 10800-second *max* time is probably the safest/best choice. That is, our thinking here is that "1 hour with no output" and "3 hours total" are each too long. I can't think of anything that should run for an hour with no output, and *every* green [tegra] job I have seen takes <3 hours.
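The two limits proposed above behave differently: the 3600-second timeout resets every time the job prints something, while the 10800-second max time counts from the start of the job and never resets. A minimal Python sketch of that distinction follows. It is not the attached patch; the function name `run_with_timeouts` and the parameters `idle_timeout` and `max_time` are hypothetical names chosen for illustration.

```python
import queue
import subprocess
import sys
import threading
import time


def run_with_timeouts(cmd, idle_timeout=3600, max_time=10800):
    """Run cmd, killing it if it is silent for idle_timeout seconds
    or if it has been running for more than max_time seconds in total.

    Hypothetical helper for illustration only.
    Returns (status, lines); status is "finished", "idle_timeout" or "max_time".
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    output = queue.Queue()

    def pump():
        # Forward each output line to the main thread; None marks end of output.
        for line in proc.stdout:
            output.put(line)
        output.put(None)

    threading.Thread(target=pump, daemon=True).start()

    start = time.monotonic()
    last_output = start
    lines = []
    status = "finished"
    while True:
        now = time.monotonic()
        idle_left = idle_timeout - (now - last_output)  # resets on each line
        total_left = max_time - (now - start)           # never resets
        if idle_left <= 0:
            status = "idle_timeout"
            break
        if total_left <= 0:
            status = "max_time"
            break
        try:
            line = output.get(timeout=min(idle_left, total_left))
        except queue.Empty:
            continue  # a limit was reached; the checks above decide which one
        if line is None:
            break  # the process closed its output and is done
        lines.append(line)
        last_output = time.monotonic()

    if status != "finished":
        proc.kill()
    proc.wait()
    return status, lines


if __name__ == "__main__":
    # A child that prints once and then hangs, with short limits for the demo.
    child = [sys.executable, "-c",
             "import time; print('started', flush=True); time.sleep(30)"]
    status, lines = run_with_timeouts(child, idle_timeout=1, max_time=5)
    print(status)
```

With limits like these, a job that hangs silently (as the runs in comment #0 did) would be killed after one hour instead of six, and a job that keeps printing but never finishes would still be cut off at three hours.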
Attached patch as requested
Attachment #630195 - Flags: review?(bugspam.Callek)
Comment on attachment 630195 (as requested)

Looks good. IMO a staging run of this would be useful but not required. Either way, we need to watch the pass/fail ratio and the jobs affected carefully for about 24 hours after this deploys, to be extra safe that we are not causing real failures in legitimately long-running cases.
Attachment #630195 - Flags: review?(bugspam.Callek) → review+
The change was merged to the production branch, and the reconfiguration happened at 6:15 AM PDT.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Assignee: nobody → aki
Product: mozilla.org → Release Engineering
Component: General Automation → General