~1400 pending jobs on terraform-packet/gecko-t-linux, 11 machines quarantined. Unquarantined machines appear to have last done work 8 hours ago, with a claim-expired exception state on their last job. So they haven't recovered after the network work at the provider?

Cristina Coroiu [:ccoroiu]

Comment 3

•

5 years ago

https://tools.taskcluster.net/provisioners/terraform-packet/worker-types/gecko-t-linux currently shows ~2000 pending jobs.

Cristina Coroiu [:ccoroiu]

Updated

•

5 years ago

Priority: -- → P1

Chris Cooper [:coop] (he/him)

Comment 4

•

5 years ago

Wander is sick today, so I'm doing what I can. :/

I'm starting by recreating a few instances to see whether they start taking jobs.

Assignee: nobody → coop

Status: NEW → ASSIGNED

Chris Cooper [:coop] (he/him)

Comment 5

•

5 years ago

No joy, it's already complaining about the new machines:

Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}

Based on the papertrail logs, that seems to be the theme, which points to bug 1569856.
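For anyone triaging similar papertrail output, here is a minimal sketch of pulling the disk figures out of a line like the one above. The helper name `parse_worker_log` is hypothetical, and the line is trimmed to the disk-related fields; it assumes the JSON payload always follows the `docker-worker: ` prefix, as it does in this sample.

```python
import json

# Log line copied from this comment, trimmed to the disk-related fields.
LINE = (
    'Aug 01 09:32:26 machine-0 docker-worker: '
    '{"type":"[alert-operator] diskspace threshold reached",'
    '"workerId":"machine-0","volume":"/mnt",'
    '"free":218328801280,"total":234007384064,"used":15678582784,'
    '"pctUsed":"6.7000","perTaskThreshold":100000000000,'
    '"availableWorkerCapacity":4,"totalthreshold":400000000000}'
)


def parse_worker_log(line: str) -> dict:
    """Return the JSON payload that follows the 'docker-worker: ' prefix.

    Hypothetical helper: assumes the syslog prefix format seen in the sample.
    """
    _, _, payload = line.partition("docker-worker: ")
    return json.loads(payload)


event = parse_worker_log(LINE)
free_gb = event["free"] / 1e9
needed_gb = event["totalthreshold"] / 1e9
print(f"{event['volume']}: {free_gb:.1f} GB free, {needed_gb:.1f} GB required")
```

Run against the sample line, this makes the mismatch easy to spot: the required space is larger than the free space even though the volume is under 7% used.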

Chris Cooper [:coop] (he/him)

Comment 6

•

5 years ago

(In reply to Chris Cooper [:coop] (he/him) from comment #5)

> Aug 01 09:32:26 machine-0 docker-worker: {"type":"[alert-operator] diskspace threshold reached","source":"top","provisionerId":"terraform-packet","workerId":"machine-0","workerGroup":"packet-sjc1","workerType":"gecko-t-linux","workerNodeType":"packet.net","volume":"/mnt","free":218328801280,"total":234007384064,"used":15678582784,"pctUsed":"6.7000","perTaskThreshold":100000000000,"availableWorkerCapacity":4,"totalthreshold":400000000000}

Turns out everything you need to understand this bug is in the log message.

We're setting the perTaskThreshold to 100000000000 (100 GB), and then using the availableWorkerCapacity (4) to naively scale that up to the totalthreshold of 400000000000 (400 GB). Of course, this is bigger than the available disk (about 218 GB free on a 234 GB volume), so we bail hard after that.
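The arithmetic can be sketched as follows. This is only an illustration inferred from the fields in the log message, not docker-worker's actual code; the function name `disk_check` is hypothetical, and the assumption is that the worker compares free space against perTaskThreshold multiplied by availableWorkerCapacity.

```python
# Values taken from the log message quoted in this comment.
FREE_BYTES = 218_328_801_280          # "free"
PER_TASK_THRESHOLD = 100_000_000_000  # "perTaskThreshold"
CAPACITY = 4                          # "availableWorkerCapacity"


def disk_check(free_bytes: int, per_task_threshold: int, capacity: int):
    """Return (total_threshold, passes) for the assumed scaling rule.

    Assumption: the total threshold is the per-task threshold times the
    number of task slots, and the check passes only if free space covers it.
    """
    total_threshold = per_task_threshold * capacity
    return total_threshold, free_bytes >= total_threshold


total, ok = disk_check(FREE_BYTES, PER_TASK_THRESHOLD, CAPACITY)
print(total, ok)  # 400000000000 False

# Largest per-task threshold that would still pass with 4 slots
# on this disk, under the same assumed rule.
max_per_task = FREE_BYTES // CAPACITY
print(max_per_task)  # 54582200320
```

Under this reading, any per-task threshold above roughly 54.6 GB fails on these machines with four slots, regardless of how empty the disk is.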

Wander is up now, so I'm handing this off to him.

Assignee: coop → wcosta

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 7

•

5 years ago

The config is fixed and I can see jobs running.

Andreea Pavel [:apavel]

Comment 8

•

5 years ago

FYI trees have been reopened.

Andreea Pavel [:apavel]

Updated

•

5 years ago

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Summary: TREES CLOSED Android workers not taking jobs. → Android workers not taking jobs.

Andrew Erickson [:aerickson]

Comment 9

•

5 years ago

•

Edited

The queue (terraform-packet/gecko-t-linux) is still very large as a result of this issue. I think jobs are expiring, which would explain the large dropoffs in the graph.

https://earthangel-b40313e5.influxcloud.net/d/wIJoZ4HWk/android-queues?orgId=1&refresh=5m&from=1564535897232&to=1564708697233&fullscreen&panelId=10 (click 'sign in with Oauth' for mozauth)

Should we spin up some more workers temporarily? I know they take 24 hours, but I'm not sure we'll catch up otherwise.

Chris Cooper [:coop] (he/him)

Comment 10

•

5 years ago

I've spun up 10 more instances (== 40 more workers) and the backlog seems to be going down.

Bugzilla

Android workers not taking jobs.

Categories

(Taskcluster :: General, defect, P1)

Tracking

(Not tracked)

People

(Reporter: bogdan_tara, Assigned: wcosta)
