Android workers not taking jobs.
Categories
(Taskcluster :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: bogdan_tara, Assigned: wcosta)
References
Details
At the moment there are many pending Android jobs after some machines were quarantined (bug 1569856).
Comment 1•5 years ago
Comment 2•5 years ago
~1400 pending jobs on terraform-packet/gecko-t-linux, 11 machines quarantined. Unquarantined machines appear to have last done work 8 hours ago, with their last job in the claim-expired exception state. So they haven't recovered after the network work at the provider?
Comment 3•5 years ago
https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux currently shows ~2000 pending jobs.
Comment 4•5 years ago
Wander is sick today, so I'm doing what I can. :/
I'm starting by recreating a few instances to see whether they start taking jobs.
Comment 5•5 years ago
No joy, it's already complaining about the new machines:
Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}
Based on the papertrail logs, that seems to be the theme, which points to bug 1569856.
Comment 6•5 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #5)
Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}
Turns out everything you need to understand this bug is in the log message.
We're setting the perTaskThreshold to 100000000000, and then using the availableWorkerCapacity of 4 to naively scale that up to the totalthreshold of 400000000000. Of course, this is bigger than the whole disk (total is 234007384064, free is 218328801280), so we bail hard after that.
Wander is up now, so I'm handing this off to him.
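The arithmetic from the log message can be replayed in a few lines. This is only a sketch, not docker-worker's actual code: the function name `diskspace_check` is hypothetical, and it assumes the alert fires whenever free space is below perTaskThreshold multiplied by availableWorkerCapacity. The numbers are copied from the log line quoted above.

```python
# Numeric fields copied from the "[alert-operator] diskspace threshold reached" log line.
EVENT = {
    "free": 218328801280,
    "total": 234007384064,
    "perTaskThreshold": 100000000000,
    "availableWorkerCapacity": 4,
}


def diskspace_check(event):
    """Hypothetical re-creation of the check; the comparison used is an assumption."""
    # Scale the per-task threshold by the number of free task slots.
    total_threshold = event["perTaskThreshold"] * event["availableWorkerCapacity"]
    return {
        "total_threshold": total_threshold,
        # Assumed rule: alert when free space is below the scaled threshold.
        "threshold_reached": event["free"] < total_threshold,
        # If the threshold exceeds the whole volume, no amount of cleanup can satisfy it.
        "satisfiable": total_threshold <= event["total"],
    }


if __name__ == "__main__":
    print(diskspace_check(EVENT))
```

With these values the scaled threshold is larger than the entire /mnt volume, so under this assumed rule a freshly created machine at 6.7% usage would still trip the alert.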
Comment 7•5 years ago
The config is fixed and I can see jobs running.
Comment 8•5 years ago
FYI trees have been reopened.
Comment 9•5 years ago
The queue (terraform-packet/gecko-t-linux) is still very large as a result of this issue. I think jobs are expiring, which would explain the large drop-offs in the graph.
https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&refresh=5m&from=1564535897232&to=1564708697233&fullscreen&panelId=10 (click 'sign in with Oauth' for mozauth)
Should we spin up some more workers temporarily? I know they take 24 hours, but I'm not sure we'll catch up otherwise.
Comment 10•5 years ago
I've spun up 10 more instances (== 40 more workers) and the backlog seems to be going down.