Closed Bug 761500 Opened 11 years ago Closed 11 years ago

21600 seconds without output is a very long time to wait on Talos runs


(Release Engineering :: General, defect)



(Reporter: philor, Assigned: aki)



(1 file)

Something happened to the graphserver a bit after 15:00 today, when a whole bunch of Dromaeo runs and a scattering of other flavors were finishing up on merge-related pushes, on both aurora and central. Instead of properly dying from the graphserver unreachable failures, they just sat.

But that isn't what I want to talk about; the amount of time they sat is: 21600 seconds. At least 34 slaves spent 6 hours sitting muttering to themselves, when our longest actual runs seem to be around 40 minutes, and some of the flavors that timed out after 6 hours actually only take 5 minutes. We might have needed a 6-hour timeout sometime in the dim dark past, but I don't think we do anymore.
How would you feel about an aggressive 3600 seconds? Or should we play it safer with 7200?
(In reply to Aki Sasaki [:aki] from comment #1)
> How would you feel about an aggressive 3600 seconds? Or should we play it
> safer with 7200?

I think a 3600-second timeout with a 10800-second *max* time is probably the safest/best choice.

In other words, our thresholds here are "1 hour with no output" and "3 hours total is too long". I can't think of anything that should run for an hour with no output, and *every* green [tegra] job I have seen takes <3 hours.
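The two limits proposed above (a no-output timeout plus a cap on total run time) can be illustrated with a small standalone sketch. This is not the attached patch and not the automation's own code: the function name `run_with_timeouts` and its return values are hypothetical, and it only assumes the Python standard library.

```python
import queue
import subprocess
import sys
import threading
import time


def run_with_timeouts(cmd, idle_timeout, max_time):
    """Run cmd, killing it if it prints nothing for idle_timeout seconds
    or if it runs for more than max_time seconds in total.

    Returns (status, lines); status is one of
    "ok", "failed", "idle_timeout", "max_time" (hypothetical labels).
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            lines.put(line)
        lines.put(None)

    threading.Thread(target=pump, daemon=True).start()

    start = time.monotonic()
    output = []
    while True:
        remaining = max_time - (time.monotonic() - start)
        if remaining <= 0:
            proc.kill()
            proc.wait()
            return "max_time", output
        # Wait for the next line, but never past the total-time cap.
        hit_cap_first = remaining <= idle_timeout
        try:
            line = lines.get(timeout=min(idle_timeout, remaining))
        except queue.Empty:
            proc.kill()
            proc.wait()
            return ("max_time" if hit_cap_first else "idle_timeout"), output
        if line is None:
            # Process closed its output; report its exit status.
            return ("ok" if proc.wait() == 0 else "failed"), output
        output.append(line.rstrip("\n"))


if __name__ == "__main__":
    # Values discussed in this bug: 1 hour of silence, 3 hours in total.
    status, out = run_with_timeouts(
        [sys.executable, "-c", "print('done')"],
        idle_timeout=3600,
        max_time=10800,
    )
    print(status, out)
```

With both limits in place, a job that hangs silently is killed after `idle_timeout`, while a job that keeps printing but never finishes is still stopped at `max_time`.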
Attached patch as requested
Attachment #630195 - Flags: review?(bugspam.Callek)
Comment on attachment 630195
as requested

Looks Good. IMO a staging run of this would be useful but not required. Either way we need to watch the pass/fail ratio and jobs-affected carefully for about 24 hours after this deploys, to be extra-safe that we are not causing real failures in legit long-running cases.
Attachment #630195 - Flags: review?(bugspam.Callek) → review+
The change got merged to the production branch and reconfiguration happened at 6:15 AM PDT.
Closed: 11 years ago
Resolution: --- → FIXED
Assignee: nobody → aki
Product: → Release Engineering
Component: General Automation → General