Closed Bug 1493467 Opened 7 years ago Closed 7 years ago

[MDC1] t-w1064-ms-036 not picking up tasks

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Assigned: markco)

References

Details

Not exactly sure what is happening. The worker is not appearing in TC. I tried to re-image it 2 days ago and it hasn't recovered. The papertrail logs show generic-worker repeating:
Sep 22 16:11:16 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file...
Sep 22 16:11:17 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file...
Sep 22 16:11:18 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file...
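When a node is suspected of being stuck in this loop, one option is to count the repeated wait message in exported log lines and measure how long it has been going on. The following is a minimal sketch, assuming the syslog-style line layout shown in the papertrail output ("<Mon> <day> <HH:MM:SS> <host> <program>: <message>") and that "#015" is papertrail's rendering of a trailing carriage return. The helper names `parse_line` and `wait_loop_summary` are hypothetical, and the year (2018) is supplied by hand because syslog timestamps omit it.

```python
import re
from datetime import datetime

# Hypothetical parser for syslog-style lines as shown in papertrail:
# "<Mon> <day> <HH:MM:SS> <host> <program>: <message>"
LINE_RE = re.compile(r"^(\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) ([\w.-]+): (.*)$")
WAIT_MSG = r"Checking for C:\dsc\task-claim-state.valid file..."


def parse_line(line, year=2018):
    """Return (timestamp, host, program, message), or None if the line doesn't match.

    Syslog timestamps omit the year, so it is passed in (assumed 2018 here).
    """
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts_text, host, program, message = m.groups()
    ts = datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S")
    # papertrail renders a trailing carriage return as "#015"; drop it.
    message = message.removesuffix("#015").rstrip()
    return ts, host, program, message


def wait_loop_summary(lines):
    """Count generic-worker wait-for-file messages and the time span they cover."""
    stamps = [
        parsed[0]
        for parsed in map(parse_line, lines)
        if parsed is not None and parsed[2] == "generic-worker" and parsed[3] == WAIT_MSG
    ]
    if not stamps:
        return 0, None
    return len(stamps), stamps[-1] - stamps[0]


SAMPLE = [
    r"Sep 22 16:11:16 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file... #015",
    r"Sep 22 16:11:17 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file... #015",
    r"Sep 22 16:11:18 T-W1064-MS-036.mdc1.mozilla.com generic-worker: Checking for C:\dsc\task-claim-state.valid file... #015",
]

count, span = wait_loop_summary(SAMPLE)
print(count, span)  # 3 0:00:02
```

Run over a longer export, a large count with a span of hours would point at a node that never got past this check.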
@pmoore I cc-ed you here as well to keep you in the loop. I'm not sure what the issue is yet or how much it has to do with generic-worker.
Assignee: relops → mcornmesser
This indicates that OCC (OpenCloudConfig) either did not complete or never ran. rundsc.ps1 creates the file here: https://github.com/mozilla-releng/OpenCloudConfig/blob/fb0c0f3dd021cee4c02c82feb3714ce38d0c2a82/userdata/rundsc.ps1#L1394 and the generic-worker wrapper script looks for it here: https://github.com/mozilla-releng/OpenCloudConfig/blob/fb0c0f3dd021cee4c02c82feb3714ce38d0c2a82/userdata/Configuration/GenericWorker/run-hw-generic-worker-10-and-reboot.bat#L43 I am going to kick off a new re-image. If it fails, we have the time of this comment as a marker for when to start looking in the logs.
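The handshake described here is a semaphore-file pattern: one script (rundsc.ps1) creates C:\dsc\task-claim-state.valid when configuration finishes, and the wrapper refuses to start the worker until that file exists. If OCC never finishes, the wrapper keeps printing the "Checking for ..." line, which matches the logs. Below is a minimal Python sketch of that waiting side only; it is not a port of the actual .bat wrapper, and the function name `wait_for_file`, the 1-second poll interval (chosen to match the once-per-second cadence in the logs), and the optional timeout are all assumptions for illustration.

```python
import time
from pathlib import Path


def wait_for_file(path, poll_interval=1.0, timeout=None, log=print):
    """Poll until `path` exists (hypothetical sketch of a semaphore-file wait).

    Returns True once the file is found, or False if `timeout` seconds pass
    first. With timeout=None it waits forever, which is what a node whose
    configuration never completes would appear to do.
    """
    path = Path(path)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        log(f"Checking for {path} file...")
        if path.exists():
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Hypothetical usage: only start the worker once configuration has signalled success.
    if wait_for_file(r"C:\dsc\task-claim-state.valid", timeout=5):
        print("configuration complete; starting worker")
    else:
        print("semaphore file never appeared; configuration did not complete")
```

Adding a timeout is a design choice that turns a silent infinite loop into a detectable failure; whether the real wrapper bounds its wait is not stated here.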
It has finished installing and is now looking for a task:
Sep 23 15:03:03 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/09/23 22:03:02 Disk available: 43561398272 bytes
Sep 23 15:03:23 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/09/23 22:03:23 No task claimed. Idle for 1m12.1304384s (will exit if no task claimed in 1h58m47.8695616s).
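The two durations in the "No task claimed" line add up to exactly 2 hours (72.1304384 s idle + 7127.8695616 s remaining = 7200 s), which suggests the worker was configured with a 2-hour idle limit. That is an inference from a single log line, not a confirmed setting. The sketch below checks the arithmetic. It assumes the durations use Go's `time.Duration` text form (generic-worker is written in Go) and handles only the h/m/s components seen here, not ms/us/ns; `go_duration_seconds` and `idle_limit_seconds` are hypothetical helper names. `Decimal` is used so the fractional seconds sum exactly.

```python
import re
from decimal import Decimal

# Only the hours/minutes/seconds subset of Go duration strings, e.g. "1h58m47.8695616s".
DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?$")
# Matches the idle message format shown in the log above.
IDLE_RE = re.compile(r"Idle for ([\d.hms]+) \(will exit if no task claimed in ([\d.hms]+)\)")


def go_duration_seconds(text):
    """Convert an h/m/s Go-style duration string to exact seconds."""
    m = DURATION_RE.match(text)
    if not text or m is None:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (Decimal(g) if g else Decimal(0) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds


def idle_limit_seconds(message):
    """Return idle time + remaining time, i.e. the implied idle limit, or None."""
    m = IDLE_RE.search(message)
    if m is None:
        return None
    idle, remaining = (go_duration_seconds(g) for g in m.groups())
    return idle + remaining


msg = "No task claimed. Idle for 1m12.1304384s (will exit if no task claimed in 1h58m47.8695616s)."
print(idle_limit_seconds(msg))  # 7200.0000000
```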
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Re-opened the ticket, as the machine hadn't taken any jobs for more than 8 hours. I've rebooted it and will check it again later: https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-036
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This node is up and running now. It looks like it spent 12 hours trying to recover: https://papertrailapp.com/groups/1141234/events?q=ms-036%20AND%20Generic-worker.exe%20has%20not%20started%20within%20the%20expected%20time&focus=988477445416062984 It then recovered here: https://papertrailapp.com/systems/2115494792/events?focus=988477156034240528&selected=988477156034240528 I will need to catch a node while it is looping in order to troubleshoot this issue. Closing this bug for now.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED