Closed Bug 1570570 Opened 5 years ago Closed 5 years ago

Android workers not taking jobs.

Categories

(Taskcluster :: General, defect, P1)

Unspecified
Android
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bogdan_tara, Assigned: wcosta)

At the moment there are many pending Android jobs after some machines were quarantined (bug 1569856).

Flags: needinfo?(wcosta)

~1400 pending jobs on terraform-packet/gecko-t-linux, 11 machines quarantined. Unquarantined machines appear to have last done work 8 hours ago, with a claim-expired exception state on their last job. So they haven't recovered after the network work at the provider?

Priority: -- → P1

Wander is sick today, so I'm doing what I can. :/

I'm starting by recreating a few instances to see whether they start taking jobs.

Assignee: nobody → coop
Status: NEW → ASSIGNED

No joy, it's already complaining about the new machines:

Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}

Based on the papertrail logs, that seems to be the theme, which points to bug 1569856.

(In reply to Chris Cooper [:coop] pronoun: he from comment #5)

Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}

Turns out everything you need to understand this bug is in the log message.

We're setting perTaskThreshold to 100000000000 (100 GB) and then naively multiplying it by availableWorkerCapacity (4) to get a totalthreshold of 400000000000 (400 GB). Of course, this is bigger than the whole disk (total is 234007384064, about 234 GB), so we bail hard after that.
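To make the arithmetic concrete, here is a rough Python sketch that redoes the calculation from the logged payload. It is not docker-worker's actual code: the function name `explain_threshold` is made up for illustration, and the assumption that the total is simply perTaskThreshold times availableWorkerCapacity is inferred from the numbers in the log line, not from the worker source.

```python
import json

# JSON payload copied from the docker-worker log line (syslog prefix stripped).
LOG_PAYLOAD = (
    '{"type":"[alert-operator] diskspace threshold reached","source":"top",'
    '"provisionerId":"terraform-packet","workerId":"machine-0",'
    '"workerGroup":"packet-sjc1","workerType":"gecko-t-linux",'
    '"workerNodeType":"packet.net","volume":"/mnt","free":218328801280,'
    '"total":234007384064,"used":15678582784,"pctUsed":"6.7000",'
    '"perTaskThreshold":100000000000,"availableWorkerCapacity":4,'
    '"totalthreshold":400000000000}'
)


def explain_threshold(payload: str) -> dict:
    """Hypothetical helper: recompute the disk requirement from a log payload."""
    event = json.loads(payload)
    # Assumption: required space = per-task threshold * worker capacity.
    required = event["perTaskThreshold"] * event["availableWorkerCapacity"]
    return {
        "required": required,
        # Does our guess at the formula reproduce the logged totalthreshold?
        "matches_logged_total": required == event["totalthreshold"],
        "free": event["free"],
        # Bytes missing even though the volume is nearly empty.
        "shortfall": required - event["free"],
        # True when no amount of cleanup could ever satisfy the threshold.
        "exceeds_disk": required > event["total"],
    }


result = explain_threshold(LOG_PAYLOAD)
print(result)
```

With the logged values, the requirement exceeds the size of the volume itself, so the worker would trip the alert even on a freshly created machine, which matches what we saw with the recreated instances.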

Wander is up now, so I'm handing this off to him.

Assignee: coop → wcosta
Flags: needinfo?(wcosta)

The config is fixed and I can see jobs running.

FYI trees have been reopened.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Summary: TREES CLOSED Android workers not taking jobs. → Android workers not taking jobs.

The queue (terraform-packet/gecko-t-linux) is still very large as a result of this issue. I think jobs are expiring, which would explain the large dropoffs in the graph.

https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&refresh=5m&from=1564535897232&to=1564708697233&fullscreen&panelId=10 (click 'sign in with Oauth' for mozauth)

Should we spin up some more workers temporarily? I know they take 24 hours, but I'm not sure we'll catch up otherwise.

I've spun up 10 more instances (== 40 more workers) and the backlog seems to be going down.
