Closed Bug 1560926 Opened 5 years ago Closed 5 years ago

releng-hardware/gecko-t-win10-64-ref-hw workers overloaded

Categories

(Testing :: Raptor, defect, P1)

Tracking

(firefox69 fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox69 --- fixed

People

(Reporter: nataliaCs, Assigned: egao)

Attachments

(1 file)

There are frequent failures on windows10-64-ref-hw-2017 opt, mostly talos and raptor jobs. They are shown as exceptions, but the jobs never ran: they were queued for more than 1000 minutes.
E.g. Duration: Not started (queued for 1442 minute(s))

Pushes: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&searchStr=windows10-64-ref-hw-2017%2Copt&tochange=bfee60ff0a54cadfdedd541a8607a56fd1959df2&fromchange=2af46ed2e59b9aab02bda25eebd5c610ef373e02&selectedJob=253009528

The issue started with this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&searchStr=windows10-64-ref-hw-2017%2Copt&revision=8bce401b8833e8a2b4d187c507c075079275d8ca

Example task details: https://tools.taskcluster.net/groups/Jwama3OAQ8yc4Yh3Y56enw/tasks/B46v7FpaQPSrya1XOzPDpw/details

We have no access to any other details, and there are no run logs.

Looking at the run, it shows deadline-exceeded as the reason the job failed. If a task has not run within a day of being scheduled, we don't try to run it. I assume that this happened because the worker pool was overloaded or not taking jobs during that time frame.
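
To make the arithmetic concrete: 1442 minutes is just over 24 hours (1440 minutes), which fits the one-day cutoff described above. The snippet below is only a rough sketch of that check, not the actual Taskcluster logic. The function names, the fixed one-day window, and the timestamps are all hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the deadline is one day after scheduling, as described in the
# comment above. The real deadline is whatever the task definition specifies.
DEADLINE_WINDOW = timedelta(days=1)


def queued_minutes(scheduled, now):
    """Whole minutes a task has been waiting since it was scheduled."""
    return int((now - scheduled).total_seconds() // 60)


def is_deadline_exceeded(scheduled, now, window=DEADLINE_WINDOW):
    """True if a task that never started is past its deadline."""
    return now - scheduled > window


# Hypothetical timestamps; only the 1442-minute gap comes from the report.
scheduled = datetime(2019, 6, 20, 0, 0, tzinfo=timezone.utc)
now = scheduled + timedelta(minutes=1442)

print(queued_minutes(scheduled, now))
print(is_deadline_exceeded(scheduled, now))
```

Under these assumptions, a task queued for 1442 minutes is past the cutoff, while one queued for 1000 minutes would still be eligible to run.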

Component: General → Raptor
Product: Release Engineering → Testing
QA Contact: catlee
Summary: Tasks not started shown as exceptions → releng-hardware/gecko-t-win10-64-ref-hw workers overloaded

:bc any ideas here? Not sure this is a Raptor-specific issue; maybe a Taskcluster issue?

Flags: needinfo?(bob)
Priority: -- → P1

That push does include my test isolation stuff, but that should only affect actions, which wouldn't apply here, I think.

I don't see any workers in https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-ref-hw

I do see them in https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw, however, with 4720 pending tasks.

Perhaps Bug 1548614 ?

Flags: needinfo?(bob) → needinfo?(egao)

The changes were indeed made as part of bug 1548614. I can push out a patch to revert the changes.

To confirm:

  • revert the naming of windows reference hardware to pre-bug 1548614
  • revert the worker used to pre-bug 1548614

Please confirm additional work that needs to be done if any.

Flags: needinfo?(mcornmesser)
Flags: needinfo?(jmaher)
Flags: needinfo?(egao)

Do we have the set of machines up and running at Bitbar? If so, then we need to ensure they are connected to Taskcluster properly.

Flags: needinfo?(jmaher)

:jmaher - I checked with :markco; we are to move the machines and the workers back onto win10-64-ux for the time being while we get more win10-64-ref-2017-hw machines bootstrapped at Bitbar. I will make a patch to temporarily revert the changes.

Pushed by egao@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0a2cb6c105cc
temporarily revert changes made in bug 1548614, restore win10-64-ux workers r=jmaher
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Assignee: nobody → egao

(In reply to Edwin Gao (:egao) from comment #4)

The changes were indeed made as part of bug 1548614. I can push out a patch to revert the changes.

To confirm:

  • revert the naming of windows reference hardware to pre-bug 1548614
  • revert the worker used to pre-bug 1548614

Please confirm additional work that needs to be done if any.

Clearing NI. Was confirmed on Slack.

Flags: needinfo?(mcornmesser)

== Change summary for alert #21605 (as of Wed, 26 Jun 2019 13:13:57 GMT) ==

Improvements:

40% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 6,299.97 -> 3,774.82

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=21605
