Closed Bug 1307383 Opened 8 years ago Closed 8 years ago

win2012r2 hung / didn't run correctly

Categories

(Taskcluster :: Workers, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: standard8, Assigned: pmoore)

Details

Attachments

(2 files)

Yesterday I triggered a windows builder, the build started, but didn't complete. There was no live log nor anything:

https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/

In the end I cancelled it because only one win2012r2 can be allocated currently.
Although I've cancelled it, the task still appears to be holding onto the builder in some manner, and more win2012r2 builds aren't starting.
Severity: normal → critical
Attached file generic-worker.log
Assignee: nobody → pmoore
Thanks for raising this Mark. This has highlighted some issues with the worker...

(In reply to Mark Banner (:standard8) from comment #0)
> Yesterday I triggered a windows builder, the build started, but didn't
> complete. There was no live log nor anything:
> 
> https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/

From the logs we can see the livelog was uploaded at 16:08 UTC:

2016/10/03 16:08:49 Uploading artifact: public/logs/live.log

However, we expire the livelog to match the maxRunTime of the task:
https://github.com/taskcluster/generic-worker/blob/45949063112b61c9496090f9df026a4b7162f10c/livelog.go#L116

Since the maxRunTime was 1200 seconds, this artifact would have expired around 16:28 UTC (20 mins later), and thus is no longer shown in the Task Inspector UI.

If a browser page showing the artifact was already open, the livelog could have stopped working even while it was still listed on the page (I believe the Task Inspector web view doesn't dynamically remove expired artifacts).

When I logged onto the worker 19 hours later, it was still running the task. It believed it was running the 12th command of the task payload: `npm run funcnonbash`. The worker log (secrets removed) and the task log are attached to demonstrate this.

Two obvious things are amiss though:

1) Why did the task not notice that the process had finished? (see task log, attached to this bug)

I'm wondering if a process got orphaned, or the process tree was somehow mutated. I'll look into this.

2) Why did the maxRunTime timeout not abort the task?

We do see in the worker log:

2016/10/03 16:28:49 Not able to update status to Aborted - current status Reclaimed, allowed current status for update: map[Claimed:true Reclaimed:true]

At the moment I'm not sure why this is - this looks like a worker bug, hopefully should be simple to fix.
Attachment #8797577 - Attachment mime type: text/x-log → text/plain
Attachment #8797578 - Attachment mime type: text/x-log → text/plain
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/3f66d2089d1168e2a212ec37e8393560f776299a
Bug 1307383 - bug fix: check if task status change can be made against current status, not target status
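Going only by the commit message and the log line "Not able to update status to Aborted - current status Reclaimed, allowed current status for update: map[Claimed:true Reclaimed:true]", the problem can be sketched roughly as follows. This is an illustration under assumptions, not the code from the commit: the type `TaskStatus` and the functions `canUpdateBuggy` and `canUpdateFixed` are hypothetical names.

```go
package main

import "fmt"

// TaskStatus and its values are hypothetical names for this sketch.
type TaskStatus string

const (
	Claimed   TaskStatus = "Claimed"
	Reclaimed TaskStatus = "Reclaimed"
	Aborted   TaskStatus = "Aborted"
)

// canUpdateBuggy mirrors the suspected bug: it looks up the TARGET status
// in the set of allowed CURRENT statuses, so a move to Aborted is always
// rejected because Aborted is never in that set.
func canUpdateBuggy(current, target TaskStatus, allowed map[TaskStatus]bool) bool {
	return allowed[target]
}

// canUpdateFixed checks the CURRENT status against the allowed set, which
// is what the commit message describes.
func canUpdateFixed(current, target TaskStatus, allowed map[TaskStatus]bool) bool {
	return allowed[current]
}

func main() {
	// Same allowed set as shown in the worker log.
	allowed := map[TaskStatus]bool{Claimed: true, Reclaimed: true}

	fmt.Println(canUpdateBuggy(Reclaimed, Aborted, allowed)) // prints false
	fmt.Println(canUpdateFixed(Reclaimed, Aborted, allowed)) // prints true
}
```

With the buggy check, a task in the Reclaimed state can never be moved to Aborted, which would be consistent with the maxRunTime timeout failing to abort the task.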
(In reply to Pete Moore [:pmoore][:pete] from comment #4)
> 2) Why did the maxRunTime timeout not abort the task?
> 
> We do see in the worker log:
> 
> 2016/10/03 16:28:49 Not able to update status to Aborted - current status
> Reclaimed, allowed current status for update: map[Claimed:true
> Reclaimed:true]
> 
> At the moment I'm not sure why this is - this looks like a worker bug,
> hopefully should be simple to fix.

Fixed in the above commit (comment 5).
Mark,

I believe this is fixed now. Are you happy to close?

Thanks,
Pete
Flags: needinfo?(standard8)
The windows builders seem to be running OK. The tests that I'm running are now failing, but I'm not sure why that is.

For now, I'll assume that's a separate issue.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(standard8)
Resolution: --- → FIXED
Component: Generic-Worker → Workers