21600 seconds without output is a very long time to wait on Talos runs

Status

RESOLVED FIXED

Product: Release Engineering
Component: General Automation
Reported: 6 years ago
Modified: 5 years ago

People

(Reporter: philor, Assigned: aki)

Tracking

Firefox Tracking Flags: (Not tracked)

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
Something happened to the graphserver a bit after 15:00 today, when a whole bunch of Dromaeo runs and a scattering of other flavors were finishing up on merge-related pushes, on both aurora and central. Instead of properly dying from the graphserver unreachable failures, they just sat.

But that isn't what I want to talk about; what I want to talk about is how long they sat: 21600 seconds. https://tbpl.mozilla.org/php/getParsedLog.php?id=12374222&tree=Firefox and at least 34 other slaves spent 6 hours sitting muttering to themselves, when our longest actual runs seem to be around 40 minutes, and some of the flavors that timed out after 6 hours actually take only 5 minutes. We might have needed a 6 hour timeout sometime in the dim dark past, but I don't think we do anymore.
(Assignee)

Comment 1

6 years ago
How would you feel about an aggressive 3600 seconds? Or should we play it safer with 7200?
Comment 2

6 years ago
(In reply to Aki Sasaki [:aki] from comment #1)
> How would you feel about an aggressive 3600 seconds? Or should we play it
> safer with 7200?

I think a 3600 second timeout with a 10800 second *max* time is probably the safest/best choice.

That is, our thinking here is "1 hour with no output is too long" and "3 hours total is too long". I can't think of anything that should run for an hour with no output, and *every* green [tegra] job I have seen takes <3 hours.
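The policy proposed above amounts to a simple dual-timeout watchdog: kill a job if it goes too long without producing output, or if its total runtime exceeds a hard cap, whichever comes first. A minimal sketch of that check follows; the function and constant names are hypothetical and not taken from the attached patch, but in buildbot terms this is roughly what a step's `timeout` and `maxTime` parameters enforce.

```python
# Hypothetical illustration of the two-timeout policy discussed above:
# kill a job after 3600 s with no output, or 10800 s total runtime.
# Names and defaults are illustrative, not the actual buildbot config.

NO_OUTPUT_TIMEOUT = 3600   # "1 hour with no output is too long"
MAX_TIME = 10800           # "every green job I have seen takes <3 hours"

def kill_reason(total_elapsed, since_last_output,
                timeout=NO_OUTPUT_TIMEOUT, max_time=MAX_TIME):
    """Return why the job should be killed, or None to let it keep running."""
    if since_last_output >= timeout:
        return "no output for %d seconds" % since_last_output
    if total_elapsed >= max_time:
        return "exceeded max run time of %d seconds" % max_time
    return None
```

The two limits serve different purposes: the no-output timeout catches hung jobs (like the graphserver stalls in comment 0) within an hour, while the max-time cap bounds even jobs that keep chattering to the log.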
(Assignee)

Comment 3

6 years ago
Created attachment 630195 [details] [diff] [review]
as requested
Attachment #630195 - Flags: review?(bugspam.Callek)
Comment on attachment 630195 [details] [diff] [review]
as requested

Looks good. IMO a staging run of this would be useful, but not required. Either way, we need to watch the pass/fail ratio and the affected jobs carefully for about 24 hours after this deploys, to be extra safe that we are not causing real failures in legitimately long-running cases.
Attachment #630195 - Flags: review?(bugspam.Callek) → review+

Comment 6

6 years ago
The change was merged to the production branch, and the reconfiguration happened at 6:15 AM PDT.
(Assignee)

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED

Updated

6 years ago
Assignee: nobody → aki
Product: mozilla.org → Release Engineering