Yesterday I triggered a windows builder, the build started, but didn't complete. There was no live log nor anything: https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/ In the end I cancelled it because only one win2012r2 can be allocated currently.
Although I've cancelled it, it appears that this is still holding onto the builder in some manner & more win2012r2 builds aren't starting.
Severity: normal → critical
Thanks for raising this Mark. This has highlighted some issues with the worker... (In reply to Mark Banner (:standard8) from comment #0) > Yesterday I triggered a windows builder, the build started, but didn't > complete. There was no live log nor anything: > > https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/ From the logs we can see the livelog was uploaded at 16:08 UTC: 2016/10/03 16:08:49 Uploading artifact: public/logs/live.log However, we expire the livelog to match the maxRunTime of the task: https://github.com/taskcluster/generic-worker/blob/45949063112b61c9496090f9df026a4b7162f10c/livelog.go#L116 Since the maxRunTime was 1200 seconds, this artifact would have expired around 16:28 UTC (20 mins later), and thus is no longer shown in the Task Inspector UI. It is possible if a browser page was open where the artifact was shown, that the livelog could have stopped working, even with the livelog listed on the page (I believe expired artifacts don't dynamically get removed from the web view of the task inspector). When I logged onto the worker 19 hours later, it was still running the task. It believed it was running the 12th command of the task payload: `npm run funcnonbash`. The worker log (secrets removed) and the task log are attached to demonstrate this. Two obvious things are amiss though: 1) Why did the task not notice that the process had finished? (see task log, attached to this bug) I'm wondering if a process got orphaned, or the process tree was somehow mutated. I'll look into this. 2) Why did the maxRunTime timeout not abort the task? We do see in the worker log: 2016/10/03 16:28:49 Not able to update status to Aborted - current status Reclaimed, allowed current status for update: map[Claimed:true Reclaimed:true] At the moment I'm not sure why this is - this looks like a worker bug, hopefully should be simple to fix.
Attachment #8797577 - Attachment mime type: text/x-log → text/plain
Attachment #8797578 - Attachment mime type: text/x-log → text/plain
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/3f66d2089d1168e2a212ec37e8393560f776299a Bug 1307383 - bug fix: check if task status change can be made against current status, not target status
(In reply to Pete Moore [:pmoore][:pete] from comment #4) > 2) Why did the maxRunTime timeout not abort the task? > > We do see in the worker log: > > 2016/10/03 16:28:49 Not able to update status to Aborted - current status > Reclaimed, allowed current status for update: map[Claimed:true > Reclaimed:true] > > At the moment I'm not sure why this is - this looks like a worker bug, > hopefully should be simple to fix. Fixed in the above commit (comment 5).
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/c9ce3f448bcf6dc0479bdc5ef331a49793959447 Bug 1307383 - fix for the fix ... whoops
Mark, I believe this is fixed now. Are you happy to close? Thanks, Pete
The windows builders seem to be running OK. The tests that I'm running are now failing, but I'm not sure why that is. For now, I'll assume that's a separate issue.
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.