releng-hardware/gecko-t-win10-64-ref-hw workers overloaded
Categories
(Testing :: Raptor, defect, P1)
Tracking
(firefox69 fixed)
| | Tracking | Status |
|---|---|---|
| firefox69 | --- | fixed |
People
(Reporter: nataliaCs, Assigned: egao)
Attachments
(1 file)
There are frequent failures on windows10-64-ref-hw-2017 opt, mostly talos and raptor jobs. They are shown as exceptions, but the jobs never ran: they were queued for more than 1000 minutes.
e.g. Duration: Not started (queued for 1442 minute(s))
The issue started with this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&searchStr=windows10-64-ref-hw-2017%2Copt&revision=8bce401b8833e8a2b4d187c507c075079275d8ca
Example task details: https://tools.taskcluster.net/groups/Jwama3OAQ8yc4Yh3Y56enw/tasks/B46v7FpaQPSrya1XOzPDpw/details
We have no access to any other details, and there are no run logs.
Comment 1•5 years ago
Looking at the run, it shows deadline-exceeded as the reason the job failed. If a task has not run within a day of being scheduled, we don't try to run it. I assume that this happened because the worker pool was overloaded or not taking jobs during that time frame.
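To make the timing concrete, here is a minimal Python sketch of the check described above, assuming the deadline is exactly one day after the task was scheduled. The function name and constant are hypothetical and are not part of any Taskcluster API; the 1442-minute figure is the queue time quoted in the report.

```python
from datetime import timedelta

# Assumption: the deadline is one day after the task was scheduled,
# as described in this comment. Names here are hypothetical.
DEADLINE = timedelta(days=1)


def is_deadline_exceeded(queued_minutes: float, deadline: timedelta = DEADLINE) -> bool:
    """Return True if a task has waited in the queue longer than its deadline."""
    return timedelta(minutes=queued_minutes) > deadline


if __name__ == "__main__":
    # 1442 minutes is the queue time from the example in the report.
    # One day is 1440 minutes, so this task is past its deadline.
    print(is_deadline_exceeded(1442))
    # A task queued for 1000 minutes has not yet reached the deadline.
    print(is_deadline_exceeded(1000))
```

Under this assumption, a job queued for 1442 minutes is two minutes past the one-day mark, which is consistent with it being resolved without ever starting.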
Comment 2•5 years ago
:bc any ideas here? Not sure this is a Raptor-specific issue - maybe a Taskcluster issue?
Comment 3•5 years ago
That push does include my test isolation changes, but those should only affect actions, which wouldn't apply here, I think.
I don't see any workers in https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-ref-hw
I do see them in https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw, however, with 4720 pending tasks.
Perhaps bug 1548614?
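The comparison above amounts to asking, per worker type, whether any workers exist to drain the queue. A minimal Python sketch of that check follows. The `PoolStatus` structure and the `pools_without_workers` helper are hypothetical and do not correspond to any Taskcluster type or call; unknown counts are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolStatus:
    """Snapshot of one worker type (hypothetical structure, not a Taskcluster type)."""
    worker_type: str
    has_workers: bool
    pending: Optional[int]  # None when the pending count is not known


def pools_without_workers(pools: List[PoolStatus]) -> List[str]:
    """Return the worker types that have no workers and so cannot pick up tasks."""
    return [p.worker_type for p in pools if not p.has_workers]


if __name__ == "__main__":
    pools = [
        # No workers were visible for this type; its pending count was not reported.
        PoolStatus("gecko-t-win10-64-ref-hw", has_workers=False, pending=None),
        # Workers were visible here, with 4720 pending tasks.
        PoolStatus("gecko-t-win10-64-hw", has_workers=True, pending=4720),
    ]
    print(pools_without_workers(pools))
```

Tasks routed to a worker type with no workers can only wait until their deadline passes, which would match the exceptions seen in the report.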
Assignee
Comment 4•5 years ago
The changes were indeed made as part of bug 1548614. I can push out a patch to revert the changes.
To confirm:
- revert the naming of windows reference hardware to pre-bug 1548614
- revert the worker used to pre-bug 1548614
Please confirm additional work that needs to be done if any.
Comment 5•5 years ago
Do we have the set of machines up and running at Bitbar? If so, then we need to ensure they are connected to Taskcluster properly.
Assignee
Comment 6•5 years ago
:jmaher - I checked with :markco. We are to move the machines and the workers back onto win10-64-ux for the time being while we get more win10-64-ref-2017-hw machines bootstrapped at Bitbar. I will make a patch to temporarily revert the changes.
Assignee
Comment 7•5 years ago
Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0a2cb6c105cc
temporarily revert changes made in bug 1548614, restore win10-64-ux workers r=jmaher
Comment 9•5 years ago
bugherder
Comment 10•5 years ago
(In reply to Edwin Gao (:egao) from comment #4)
> The changes were indeed made as part of bug 1548614. I can push out a patch to revert the changes.
> To confirm:
> - revert the naming of windows reference hardware to pre-bug 1548614
> - revert the worker used to pre-bug 1548614
> Please confirm additional work that needs to be done if any.
Clearing NI. Was confirmed on Slack.
Comment 11•5 years ago
== Change summary for alert #21605 (as of Wed, 26 Jun 2019 13:13:57 GMT) ==
Improvements:
40% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 6,299.97 -> 3,774.82
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21605