21600 seconds without output is a very long time to wait on Talos runs

Status

RESOLVED FIXED

Product: Release Engineering
Component: General Automation
Reported: 6 years ago
Modified: 5 years ago

People

(Reporter: philor, Assigned: aki)

Tracking

Firefox Tracking Flags: (Not tracked)

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
Something happened to the graphserver a bit after 15:00 today, when a whole bunch of Dromaeo runs and a scattering of other flavors were finishing up on merge-related pushes, on both aurora and central. Instead of properly dying from the graphserver unreachable failures, they just sat.

But that isn't what I want to talk about; what I want to talk about is how long they sat: 21600 seconds. https://tbpl.mozilla.org/php/getParsedLog.php?id=12374222&tree=Firefox and at least 34 other slaves spent 6 hours sitting muttering to themselves, when our longest actual runs seem to be around 40 minutes, and some of the flavors that timed out after 6 hours actually take only 5 minutes. We might have needed a 6 hour timeout sometime in the dim dark past, but I don't think we do anymore.
(Assignee)

Comment 1

6 years ago
How would you feel about an aggressive 3600 seconds? Or should we play it safer with 7200?
Comment 2

6 years ago
(In reply to Aki Sasaki [:aki] from comment #1)
> How would you feel about an aggressive 3600 seconds? Or should we play it
> safer with 7200?

I think a 3600 second timeout with a 10800 second *max* time is probably the safest/best choice.

That is, our thinking here is "1 hour with no output is too long" and "3 hours total is too long". I can't think of anything that should run for an hour with no output, and *every* green [tegra] job I have seen takes <3 hours.
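The policy proposed above amounts to a simple dual-timeout watchdog: kill a job if it goes too long without producing output, or if its total runtime exceeds a hard cap, whichever comes first. A minimal sketch of that check follows; the function and constant names are hypothetical and not taken from the attached patch, but in buildbot terms this is roughly what a step's `timeout` and `maxTime` parameters enforce.

```python
# Hypothetical illustration of the two-timeout policy discussed above:
# kill a job after 3600 s with no output, or 10800 s total runtime.
# Names and defaults are illustrative, not the actual buildbot config.

NO_OUTPUT_TIMEOUT = 3600   # "1 hour with no output is too long"
MAX_TIME = 10800           # "every green job I have seen takes <3 hours"

def kill_reason(total_elapsed, since_last_output,
                timeout=NO_OUTPUT_TIMEOUT, max_time=MAX_TIME):
    """Return why the job should be killed, or None to let it keep running."""
    if since_last_output >= timeout:
        return "no output for %d seconds" % since_last_output
    if total_elapsed >= max_time:
        return "exceeded max run time of %d seconds" % max_time
    return None
```

The two limits serve different purposes: the no-output timeout catches hung jobs (like the graphserver stalls in comment 0) within an hour, while the max-time cap bounds even jobs that keep chattering to the log.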
(Assignee)

Comment 3

6 years ago
Created attachment 630195 [details] [diff] [review]
as requested
Attachment #630195 - Flags: review?(bugspam.Callek)
Comment on attachment 630195 [details] [diff] [review]
as requested

Looks good. IMO a staging run of this would be useful, but not required. Either way, we need to watch the pass/fail ratio and the affected jobs carefully for about 24 hours after this deploys, to be extra safe that we are not causing real failures in legitimately long-running cases.
Attachment #630195 - Flags: review?(bugspam.Callek) → review+

Comment 6

6 years ago
The change was merged to the production branch, and the reconfiguration happened at 6:15 AM PDT.
(Assignee)

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED

Updated

6 years ago
Assignee: nobody → aki
Product: mozilla.org → Release Engineering