build-macosx64-noopt/debug intermittently exceeds its max-run-time
Categories
(Taskcluster :: General, task)
Tracking
(Not tracked)
People
(Reporter: gbrown, Unassigned)
References
(Blocks 1 open bug)
Details
build-macosx64-noopt/debug has a max-run-time of 3600 (1 hour). In bug 1411358, there are task timeouts for a variety of tasks, but this task is one of the most frequent and most persistent offenders.
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285807758&repo=autoland&lineNumber=20961
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285646078&repo=autoland&lineNumber=50079
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285092970&repo=autoland&lineNumber=49568
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285258774&repo=autoland&lineNumber=49763
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=285298190&repo=autoland&lineNumber=48857
Comment 1 • 5 years ago
Only looked at the first two, but they spent most of their allocated hour on cloning from mercurial and downloading fetches. The build part itself takes less than 20 minutes. This is not the first time I've seen things like this; for some reason networking and/or I/O is badly degraded on some workers.
Reporter | Comment 2 • 5 years ago
Agreed, cloning from mercurial and downloading fetches are unusually slow in these cases. We have discussed this before, e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1577598#c4. I have had some success in avoiding intermittent task failures by simply allowing an extra 30 minutes in the max-run-time, but that approach has been resisted and called out as a hack. I would like to see some sort of solution, as this type of intermittent failure is seen at least several times each week; in addition to the wasted machine time, these failures can "hide" more serious task timeouts, like test and product hangs.
Comment 3 • 5 years ago
So this could be anything from a cloud instance with bad I/O or a bad network connection to an issue with falling back to some slower option in either the hg client or the hg server.
If it's the former, I'm not sure there's much to do but notice and terminate the instance. I don't think we see enough of these to be able to characterize a "sick" instance accurately and quickly. And something that takes, say, 10 minutes at worker startup to decide whether the instance was sick would end up being phenomenally expensive and add a great deal of E2E time.
If it's the latter, then one thing you could do is wrap the hg command to time out after, say, ten minutes, and retry a few times within the task before failing.
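A minimal sketch of that retry-wrapper idea (a hypothetical standalone script, not the real task harness; the clone URL, destination, timeout, and retry counts are illustrative assumptions):

#!/usr/bin/env python3
"""Sketch: run a slow hg command with a per-attempt timeout and a few retries,
so one bad worker/network stall fails fast instead of eating the whole max-run-time."""
import subprocess
import sys
import time


def run_with_retries(cmd, timeout_secs=600, attempts=3, backoff_secs=30):
    """Run cmd, killing it if it exceeds timeout_secs, and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(cmd, check=True, timeout=timeout_secs)
            return
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: command exceeded {timeout_secs}s, retrying", file=sys.stderr)
        except subprocess.CalledProcessError as exc:
            print(f"attempt {attempt}: command exited {exc.returncode}, retrying", file=sys.stderr)
        time.sleep(backoff_secs)
    sys.exit(f"command failed after {attempts} attempts: {' '.join(cmd)}")


if __name__ == "__main__":
    # Hypothetical clone; a real task would wrap its existing vcs/fetch tooling instead.
    run_with_retries(["hg", "clone", "https://hg.mozilla.org/mozilla-unified", "src"])

A real task would more likely wrap its existing checkout and fetch steps rather than shell out directly like this, but the timeout-and-retry structure would be the same.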
Comment 4 • 4 years ago
I don't see this happening any more.