hung generic-worker on macos
Categories
(Infrastructure & Operations :: RelOps: Posix OS, task)
People
(Reporter: dhouse, Assigned: dhouse)
Details
I found a macOS Mojave worker (old version, single-user) with the generic-worker (g-w) process still active. The logs show it was running a task, but there have been no log entries for 47h:
[dhouse@t-mojave-r7-324.test.releng.mdc1.mozilla.com ~]$ gdate -r /var/log/genericworker/stdout.log
Tue Jan 21 20:13:09 GMT 2020
[dhouse@t-mojave-r7-324.test.releng.mdc1.mozilla.com ~]$ date
Thu Jan 23 21:47:42 GMT 2020
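The same check can be scripted to flag workers whose log has gone quiet. The following is a minimal sketch, not part of generic-worker: the function names and the one-hour threshold are assumptions chosen for illustration, and the two timestamps are copied from the `gdate`/`date` output above.

```python
from datetime import datetime, timedelta, timezone


def log_staleness(last_write: datetime, now: datetime) -> timedelta:
    """Return how long it has been since the log was last written."""
    return now - last_write


def is_stale(last_write: datetime, now: datetime,
             threshold: timedelta = timedelta(hours=1)) -> bool:
    """True if the log has been silent for longer than the threshold.

    The one-hour default is an arbitrary illustrative value.
    """
    return log_staleness(last_write, now) > threshold


# Timestamps taken from the terminal output above (both reported in GMT).
last_write = datetime(2020, 1, 21, 20, 13, 9, tzinfo=timezone.utc)
now = datetime(2020, 1, 23, 21, 47, 42, tzinfo=timezone.utc)

gap = log_staleness(last_write, now)
print(f"log silent for {gap.total_seconds() / 3600:.1f} hours")
print("stale:", is_stale(last_write, now))
```

On a live host, `last_write` could instead be derived from `os.path.getmtime("/var/log/genericworker/stdout.log")`, and `now` from `datetime.now(timezone.utc)`.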
The last few g-w log entries are:
Date: Tue, 21 Jan 2020 20:38:15 GMT
Etag: "bc3c79f7270432ee019890c00afd06d8"
Server: AmazonS3
X-Amz-Id-2: TKRKbGJ2xR3zNlP67mcohD/W+4nb9S8i//FRO34dAuJn7uHNvha3LIDuPLzV4hn5qY9rdA4MfJY=
X-Amz-Request-Id: 8A6574FCE71B62AB
X-Amz-Version-Id: p7KPJe5pxyXEhWzdj3mcQ4WgwE7bDk2I
2020/01/21 20:38:14 Resolving task J_tjhz25RE-zDdx8EVL7gQ ...
2020/01/21 20:38:14 Reclaimed task J_tjhz25RE-zDdx8EVL7gQ successfully.
2020/01/21 20:38:14 Successfully reclaimed task J_tjhz25RE-zDdx8EVL7gQ
Resolving a task and then reclaiming it doesn't sound right.
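To look for the same pattern in other workers' logs, a small scan could flag any task that is reclaimed after a "Resolving task" line has already been logged for it. This is a rough sketch: the regular expression is inferred only from the three log lines quoted above, so other g-w versions may word these messages differently, and `reclaims_after_resolve` is a hypothetical helper name.

```python
import re

# Message formats inferred from the log excerpt above (an assumption,
# not taken from the generic-worker source).
LINE_RE = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} "
    r"(?P<event>Resolving task|Reclaimed task|Successfully reclaimed task) "
    r"(?P<task_id>[A-Za-z0-9_-]+)"
)


def reclaims_after_resolve(lines):
    """Return task IDs that show a reclaim line after a resolving line.

    File order is used rather than timestamps, since the entries can
    share the same one-second timestamp.
    """
    resolving = set()
    flagged = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        task_id = match.group("task_id")
        if match.group("event") == "Resolving task":
            resolving.add(task_id)
        elif task_id in resolving and task_id not in flagged:
            flagged.append(task_id)
    return flagged


sample = [
    "2020/01/21 20:38:14 Resolving task J_tjhz25RE-zDdx8EVL7gQ ...",
    "2020/01/21 20:38:14 Reclaimed task J_tjhz25RE-zDdx8EVL7gQ successfully.",
    "2020/01/21 20:38:14 Successfully reclaimed task J_tjhz25RE-zDdx8EVL7gQ",
]
print(reclaims_after_resolve(sample))
```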
The task shows as timed out; it was retried and completed on a different machine.
Time-out recorded at 2020-01-21T20:58:15.786Z (started 2020-01-21T20:21:14.279Z, taken-until 2020-01-21T20:58:14.411Z)
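Lining up these timestamps with the last reclaim in the worker log makes the sequence easier to read. The sketch below only does the arithmetic; it assumes the worker log timestamps are in UTC (the host's `date` output reports GMT), and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def parse_queue_ts(text: str) -> datetime:
    """Parse a queue timestamp such as 2020-01-21T20:58:15.786Z."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_log_ts(text: str) -> datetime:
    """Parse a worker log timestamp such as 2020/01/21 20:38:14.

    Assumes the log is written in UTC, which the log does not state.
    """
    parsed = datetime.strptime(text, "%Y/%m/%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


started = parse_queue_ts("2020-01-21T20:21:14.279Z")
taken_until = parse_queue_ts("2020-01-21T20:58:14.411Z")
timeout_recorded = parse_queue_ts("2020-01-21T20:58:15.786Z")
last_reclaim_logged = parse_log_ts("2020/01/21 20:38:14")

# Time between the last logged reclaim and the end of the claim.
claim_after_reclaim = taken_until - last_reclaim_logged
# Delay between the claim running out and the time-out being recorded.
expiry_lag = timeout_recorded - taken_until

print("claim remaining after last reclaim:", claim_after_reclaim)
print("time-out recorded after taken-until by:", expiry_lag)
```

The time-out was recorded about 1.4 seconds after taken-until, and taken-until falls almost exactly 20 minutes after the last reclaim in the log. That is consistent with the worker making no further reclaim after 20:38:14, although this is an inference from the timestamps rather than something the logs state.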
The livelog attached to that run shows it as success+complete, but it is recorded as an exception/timeout (https://firefox-ci-tc.services.mozilla.com/tasks/J_tjhz25RE-zDdx8EVL7gQ/runs/0/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FJ_tjhz25RE-zDdx8EVL7gQ%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log#L22100).
I've rebooted the machine to recover it.
I'll keep this in mind in case we see other g-w hangs on macOS. We are moving to the new g-w soon, but a repeat of this might push us to do it sooner.