Closed Bug 1473589 Opened 7 years ago Closed 6 years ago

Investigate releng-hardware worker failures

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: dhouse, Unassigned)

References

Details

CIDuty/RelOps/DCOps have restarted and reimaged an increased number of releng-hardware workers in the past few months. Please begin investigating whether a systemic or job problem is causing this across all hardware types: Windows, Mac, and Linux. Workers have been found missing in the taskcluster worker explorer. This happens when a machine's declaration has expired because it has not been taking jobs, or when a machine has never started taking jobs (it has not declared itself to taskcluster). There are a few known causes for workers going missing: loaners, and machines moved to a beta/staging worker type (queue). But those account for only a handful of machines, and we have seen hundreds needing to be reimaged or restarted to resume taking work.
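One way to separate the explained cases from the unexplained ones is sketched below. It is only an illustration, assuming an inventory list and a set of hostnames currently declared to taskcluster are already available; the `Machine` type, its `on_loan`/`in_staging` flags, and the hostnames are hypothetical and not taken from any real tool or from this bug.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical inventory record; field names are assumptions."""
    hostname: str
    platform: str            # "windows", "mac", or "linux"
    on_loan: bool = False    # known cause: machine is a loaner
    in_staging: bool = False # known cause: moved to a beta/staging worker type


def find_unexplained_missing(inventory, declared_hostnames):
    """Return machines that are not declared and have no known cause."""
    declared = set(declared_hostnames)
    return [
        m for m in inventory
        if m.hostname not in declared and not (m.on_loan or m.in_staging)
    ]


def count_by_platform(machines):
    """Count machines per platform, to see which platform to look at first."""
    counts = {}
    for m in machines:
        counts[m.platform] = counts.get(m.platform, 0) + 1
    return counts


# Made-up example data, for illustration only.
inventory = [
    Machine("win-001", "windows"),
    Machine("win-002", "windows"),
    Machine("win-003", "windows", on_loan=True),
    Machine("mac-001", "mac"),
    Machine("mac-002", "mac", in_staging=True),
    Machine("linux-001", "linux"),
]
declared = {"win-001", "linux-001"}

missing = find_unexplained_missing(inventory, declared)
print(count_by_platform(missing))
```

With this sample data, the loaner and the staging machine are filtered out, leaving one Windows and one Mac machine as unexplained. How the declared set is obtained (worker explorer, queue API, or otherwise) is left open here.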
The machines that fail most are Windows and Mac. Linux is pretty stable; we re-image maybe 1-3 per day. For Windows it may be 20 to 30 per day, and for MacOSX 10 to 20 per day. So if we start to investigate, the priority could be Windows -> MacOSX -> Linux.
(In reply to Danut Labici [:dlabici] from comment #1)
> Main machines that fail are: Windows and Mac.
> Linux is pretty stable, maybe we have to re-image 1-3/day.
> Windows we may have 20 to 30 per day.
> MacOSX we may have 10 to 20 per day.
>
> So if we start to investigate, the priority could be Windows -> MacOSX ->
> Linux

I didn't know the number was so high for Windows also. Are there tracker bugs for the individual machines, or is there a record of which ones were reimaged? Are these all moonshot Windows instances?
Blocks: 1452133
Since we are tracking by machines, issues, and nodes, there is no point in keeping this bug around. If someone considers this bug useful, please reopen it.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard