Closed
Bug 1307383
Opened 8 years ago
Closed 8 years ago
win2012r2 hung / didn't run correctly
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: standard8, Assigned: pmoore)
Details
Attachments
(2 files)
Yesterday I triggered a windows builder, the build started, but didn't complete. There was no live log nor anything: https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/ In the end I cancelled it because only one win2012r2 can be allocated currently.
Reporter | ||
Comment 1•8 years ago
|
||
Although I've cancelled it, it appears that this is still holding onto the builder in some manner & more win2012r2 builds aren't starting.
Severity: normal → critical
Assignee | ||
Comment 2•8 years ago
|
||
Assignee: nobody → pmoore
Assignee | ||
Comment 3•8 years ago
|
||
Assignee | ||
Comment 4•8 years ago
|
||
Thanks for raising this Mark. This has highlighted some issues with the worker... (In reply to Mark Banner (:standard8) from comment #0) > Yesterday I triggered a windows builder, the build started, but didn't > complete. There was no live log nor anything: > > https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/ From the logs we can see the livelog was uploaded at 16:08 UTC: 2016/10/03 16:08:49 Uploading artifact: public/logs/live.log However, we expire the livelog to match the maxRunTime of the task: https://github.com/taskcluster/generic-worker/blob/45949063112b61c9496090f9df026a4b7162f10c/livelog.go#L116 Since the maxRunTime was 1200 seconds, this artifact would have expired around 16:28 UTC (20 mins later), and thus is no longer shown in the Task Inspector UI. It is possible if a browser page was open where the artifact was shown, that the livelog could have stopped working, even with the livelog listed on the page (I believe expired artifacts don't dynamically get removed from the web view of the task inspector). When I logged onto the worker 19 hours later, it was still running the task. It believed it was running the 12th command of the task payload: `npm run funcnonbash`. The worker log (secrets removed) and the task log are attached to demonstrate this. Two obvious things are amiss though: 1) Why did the task not notice that the process had finished? (see task log, attached to this bug) I'm wondering if a process got orphaned, or the process tree was somehow mutated. I'll look into this. 2) Why did the maxRunTime timeout not abort the task? We do see in the worker log: 2016/10/03 16:28:49 Not able to update status to Aborted - current status Reclaimed, allowed current status for update: map[Claimed:true Reclaimed:true] At the moment I'm not sure why this is - this looks like a worker bug, hopefully should be simple to fix.
Assignee | ||
Updated•8 years ago
|
Attachment #8797577 -
Attachment mime type: text/x-log → text/plain
Assignee | ||
Updated•8 years ago
|
Attachment #8797578 -
Attachment mime type: text/x-log → text/plain
Comment 5•8 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/3f66d2089d1168e2a212ec37e8393560f776299a Bug 1307383 - bug fix: check if task status change can be made against current status, not target status
Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #4) > 2) Why did the maxRunTime timeout not abort the task? > > We do see in the worker log: > > 2016/10/03 16:28:49 Not able to update status to Aborted - current status > Reclaimed, allowed current status for update: map[Claimed:true > Reclaimed:true] > > At the moment I'm not sure why this is - this looks like a worker bug, > hopefully should be simple to fix. Fixed in the above commit (comment 5).
Comment 7•8 years ago
|
||
Commit pushed to master at https://github.com/taskcluster/generic-worker https://github.com/taskcluster/generic-worker/commit/c9ce3f448bcf6dc0479bdc5ef331a49793959447 Bug 1307383 - fix for the fix ... whoops
Assignee | ||
Comment 8•8 years ago
|
||
Mark, I believe this is fixed now. Are you happy to close? Thanks, Pete
Flags: needinfo?(standard8)
Reporter | ||
Comment 9•8 years ago
|
||
The windows builders seem to be running OK. The tests that I'm running are now failing, but I'm not sure why that is. For now, I'll assume that's a separate issue.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(standard8)
Resolution: --- → FIXED
Assignee | ||
Comment 10•6 years ago
|
||
Released in https://github.com/taskcluster/generic-worker/releases/tag/v6.0.0
QA Contact: pmoore
Updated•5 years ago
|
Component: Generic-Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•