win2012r2 hung / didn't run correctly

RESOLVED FIXED

Status

Taskcluster
Generic-Worker
--
critical
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: standard8, Assigned: pmoore)

Tracking

Details

Attachments

(2 attachments)

(Reporter)

Description

a year ago
Yesterday I triggered a windows builder, the build started, but didn't complete. There was no live log nor anything:

https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/

In the end I cancelled it because only one win2012r2 can be allocated currently.
(Reporter)

Comment 1

a year ago
Although I've cancelled it, it appears that this is still holding onto the builder in some manner & more win2012r2 builds aren't starting.
Severity: normal → critical
(Assignee)

Comment 2

a year ago
Created attachment 8797577 [details]
generic-worker.log
Assignee: nobody → pmoore
(Assignee)

Comment 3

a year ago
Created attachment 8797578 [details]
live_backing.log
(Assignee)

Comment 4

a year ago
Thanks for raising this Mark. This has highlighted some issues with the worker...

(In reply to Mark Banner (:standard8) from comment #0)
> Yesterday I triggered a windows builder, the build started, but didn't
> complete. There was no live log nor anything:
> 
> https://tools.taskcluster.net/task-inspector/#HEt5RUoeSISpYn0tJj44Pg/

From the logs we can see the livelog was uploaded at 16:08 UTC:

2016/10/03 16:08:49 Uploading artifact: public/logs/live.log

However, we expire the livelog to match the maxRunTime of the task:
https://github.com/taskcluster/generic-worker/blob/45949063112b61c9496090f9df026a4b7162f10c/livelog.go#L116

Since the maxRunTime was 1200 seconds, this artifact would have expired around 16:28 UTC (20 mins later), and thus is no longer shown in the Task Inspector UI.

It is possible if a browser page was open where the artifact was shown, that the livelog could have stopped working, even with the livelog listed on the page (I believe expired artifacts don't dynamically get removed from the web view of the task inspector).

When I logged onto the worker 19 hours later, it was still running the task. It believed it was running the 12th command of the task payload: `npm run funcnonbash`. The worker log (secrets removed) and the task log are attached to demonstrate this.

Two obvious things are amiss though:

1) Why did the task not notice that the process had finished? (see task log, attached to this bug)

I'm wondering if a process got orphaned, or the process tree was somehow mutated. I'll look into this.

2) Why did the maxRunTime timeout not abort the task?

We do see in the worker log:

2016/10/03 16:28:49 Not able to update status to Aborted - current status Reclaimed, allowed current status for update: map[Claimed:true Reclaimed:true]

At the moment I'm not sure why this is - this looks like a worker bug, hopefully should be simple to fix.
(Assignee)

Updated

a year ago
Attachment #8797577 - Attachment mime type: text/x-log → text/plain
(Assignee)

Updated

a year ago
Attachment #8797578 - Attachment mime type: text/x-log → text/plain

Comment 5

a year ago
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/3f66d2089d1168e2a212ec37e8393560f776299a
Bug 1307383 - bug fix: check if task status change can be made against current status, not target status
(Assignee)

Comment 6

a year ago
(In reply to Pete Moore [:pmoore][:pete] from comment #4)
> 2) Why did the maxRunTime timeout not abort the task?
> 
> We do see in the worker log:
> 
> 2016/10/03 16:28:49 Not able to update status to Aborted - current status
> Reclaimed, allowed current status for update: map[Claimed:true
> Reclaimed:true]
> 
> At the moment I'm not sure why this is - this looks like a worker bug,
> hopefully should be simple to fix.

Fixed in the above commit (comment 5).
(Assignee)

Comment 8

a year ago
Mark,

I believe this is fixed now. Are you happy to close?

Thanks,
Pete
Flags: needinfo?(standard8)
(Reporter)

Comment 9

a year ago
The windows builders seem to be running OK. The tests that I'm running are now failing, but I'm not sure why that is.

For now, I'll assume that's a separate issue.
Status: NEW → RESOLVED
Last Resolved: a year ago
Flags: needinfo?(standard8)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.