Closed Bug 1475711 Opened 7 years ago Closed 6 years ago

tasks in queue, but worker does not get tasks

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dhouse, Unassigned)

Sometimes a worker is running and requesting tasks, but it does not get tasks despite there being tasks in the queue. Please record details when this happens. We need to know at what time the queue was checked, which worker was not taking work, and that the worker was polling for tasks. Once we have some examples recorded, we can report the issue to the Taskcluster team for investigation.
I've just checked the Windows moonshots, as we have 1500 pending tasks. The current time is 9:00 am UTC+3. Workers that were lazy, i.e. had not taken a task for at least 60 minutes despite the queue mentioned above, are:
T-W1064-MS-063 T-W1064-MS-066 T-W1064-MS-068 T-W1064-MS-069 T-W1064-MS-075 T-W1064-MS-081 T-W1064-MS-089 T-W1064-MS-106 T-W1064-MS-107 T-W1064-MS-110 T-W1064-MS-115 T-W1064-MS-123 T-W1064-MS-127 T-W1064-MS-129 T-W1064-MS-134 T-W1064-MS-157 T-W1064-MS-161 T-W1064-MS-162 T-W1064-MS-164 T-W1064-MS-165 T-W1064-MS-167 T-W1064-MS-168 T-W1064-MS-169 T-W1064-MS-220 T-W1064-MS-256 T-W1064-MS-264 T-W1064-MS-265 T-W1064-MS-294
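For anyone repeating this check, the "lazy" criterion above (no task taken for 60+ minutes while tasks are pending) could be sketched roughly as follows. This is only an illustration: the `WorkerStatus` record, its `last_task_started` field, the `find_lazy_workers` helper, the worker names, and the date are all hypothetical, and the sketch assumes the last-task times and pending count have already been gathered by some other means.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class WorkerStatus:
    # Hypothetical record; the field names are assumptions for this sketch.
    worker_id: str
    last_task_started: datetime  # when the worker last took a task


def find_lazy_workers(workers, pending_tasks, checked_at,
                      idle_threshold=timedelta(minutes=60)):
    """Return sorted IDs of workers idle for at least idle_threshold
    while the queue still has pending tasks."""
    if pending_tasks <= 0:
        # An idle worker is only "lazy" if there was work it could have taken.
        return []
    return sorted(
        w.worker_id
        for w in workers
        if checked_at - w.last_task_started >= idle_threshold
    )


# Illustrative data only: made-up worker names and an arbitrary date.
checked_at = datetime(2018, 1, 1, 6, 0, tzinfo=timezone.utc)  # 9:00 am UTC+3
workers = [
    WorkerStatus("worker-a", checked_at - timedelta(minutes=90)),
    WorkerStatus("worker-b", checked_at - timedelta(minutes=10)),
]
lazy = find_lazy_workers(workers, pending_tasks=1500, checked_at=checked_at)
print(checked_at.isoformat(), lazy)
```

Recording `checked_at` alongside the result keeps the time of the queue check attached to the list of workers, which is one of the details requested in the description.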
I rebooted all of them, and a little over half are still lazy. I've checked the Papertrail logs for the ones that didn't recover. To help dhouse and the Taskcluster team, I'll leave a few of the machines untouched so they can troubleshoot the problem themselves, and I'll proceed with re-imaging the rest.

Workers left untouched after reboot:
T-W1064-MS-066 - Simple worker issue: doesn't pick up tasks after reboot. https://papertrailapp.com/systems/1737001651/events
T-W1064-MS-068 - Hasn't reported to Papertrail since July 13th. https://papertrailapp.com/systems/1670078631/events
T-W1064-MS-089 - Seems to be looking for jobs. From Papertrail: generic-worker: Checking for C:\dsc\task-claim-state.valid file... #015 https://papertrailapp.com/systems/1644165731/events
T-W1064-MS-106 - An example of a worker becoming lazy, with its last task completed as an exception. https://papertrailapp.com/systems/1730304451/events
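To capture evidence that a worker was still looking for work, exported log lines could be filtered for the message quoted above for T-W1064-MS-089. A minimal sketch, assuming the log has already been saved as a list of plain-text lines; the `polling_lines` helper and the second sample line are hypothetical, and treating this message as proof of polling is an assumption based on the observation above, not confirmed generic-worker behaviour.

```python
# Message quoted from the Papertrail log above (raw string keeps the backslashes).
POLL_MARKER = r"Checking for C:\dsc\task-claim-state.valid file..."


def polling_lines(log_lines):
    """Return the log lines that contain the task-claim check message."""
    return [line for line in log_lines if POLL_MARKER in line]


sample_log = [
    r"generic-worker: Checking for C:\dsc\task-claim-state.valid file... #015",
    "generic-worker: some unrelated message",  # hypothetical line for contrast
]
matches = polling_lines(sample_log)
print(len(matches), "matching line(s)")
```

An empty result for a worker that is powered on would point more towards the T-W1064-MS-066 or T-W1064-MS-068 cases than towards a worker that is asking for tasks and not receiving them.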
I want to check the logs to see whether this problem has recurred. I have also heard of it happening on Windows workers, and it may be an issue that affects specific workers for periods of time.
Flags: needinfo?(dhouse)
We do not have historical data for the Taskcluster queues, so we cannot easily check this right now. If we see this happen again, we can record it and investigate further.
Flags: needinfo?(dhouse)

Hey Dave, can this be closed? I don't think this issue has been occurring lately.

Flags: needinfo?(dhouse)

(In reply to Zsolt Fay [:zfay] from comment #5)

> Hey Dave, can this be closed? I don't think this issue has been occurring lately.

Sure. I'll close it out. Thanks!

Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(dhouse)
Resolution: --- → WORKSFORME
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard