1474729 - T-w1064-ms-036 is running 2 task at a time

Assignee

Description

•

7 years ago

Attached image ms-036.jpg — Details

https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-036. This is currently running GW 10.8.5 . Previously I have seen 2 task user directories created at the same time, but this is first time I have seen a node run 2 task at one time. It was reimaged this afternoon, but it looks like it may had been doing it earlier in the day as well. It is now quarantined.

Mark Cornmesser [:markco]

Assignee

Comment 1

•

7 years ago

Pete: Rob: Any ideas?

Assignee: nobody → mcornmesser

Flags: needinfo?(rthijssen)

Flags: needinfo?(pmoore)

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 2

•

7 years ago

i've often been confused by papertrail logs which appear to show new tasks starting before the previous task has resolved but don't know enough about how gw works to know if i was interpreting the logs correctly. i will invest some more effort in the app i wrote to record & investigate issues like this (github.com/minionator). it may give us better statistics on the frequency that this occurs. pete will probably be able to shed more light on the immediate issue.

Flags: needinfo?(rthijssen)

Mark Cornmesser [:markco]

Assignee

Updated

•

7 years ago

Blocks: 1443589

Pete Moore [:pmoore][:pete]

Comment 3

•

7 years ago

The two tasks runs from the screenshot are: Task dH0AL8HARDWaatDRHm5AWA Run 0 ================================= Worker: T-W1064-MS-036 Claimed: Time according to Queue: * 2018/07/10 21:06:36 UTC source: https://queue.taskcluster.net/v1/task/dH0AL8HARDWaatDRHm5AWA/status Time according to worker: * 2018/07/10 21:06:36 source: https://papertrailapp.com/systems/1725222761/events?focus=953447405515337752&q=program%3Ageneric-worker&selected=953447405515337752 Resolved by Queue (not by worker) Time according to Queue: * 2018/07/10 21:26:37 UTC source: https://queue.taskcluster.net/v1/task/dH0AL8HARDWaatDRHm5AWA/status Task LaJ9FBQZQrKZcyYy3oFu_g Run 0 ================================= Worker: T-W1064-MS-036 Claimed: Time according to Queue: * 2018/07/10 21:15:20 UTC source: https://queue.taskcluster.net/v1/task/LaJ9FBQZQrKZcyYy3oFu_g/status Time according to worker: * 2018/07/10 21:15:20 UTC source: https://papertrailapp.com/systems/1725222761/events?focus=953449598985338909&q=program%3Ageneric-worker&selected=953449598985338909 Resolved: Time according to Queue: * 2018/07/10 21:47:52 UTC source: https://queue.taskcluster.net/v1/task/LaJ9FBQZQrKZcyYy3oFu_g/status Time according to worker: * 2018/07/10 21:47:51 UTC source: https://papertrailapp.com/systems/1725222761/events?focus=953457783175544836&q=program%3Ageneric-worker&selected=953457783175544836 Looking at the logs, I can see that generic-worker was busy executing task dH0AL8HARDWaatDRHm5AWA and then at 21:07:19, we get a log entry: 21:07:19 Microsoft-Windows-DSC: The local configuration manager was shut down. After this all logs buffer up, and 6 minutes later all arrive at papertrail at precisely 21:13:36. In these buffered up logs, we see that the generic-worker was sent a termination signal: T-W1064-MS-036.mdc1.mozilla.com generic-worker: ^CTerminate batch job (Y/N)? 1 file(s) copied.#015 Then we see that the generic-worker logs which were already logged to papertrail before, get resent to papertrail, so for example we see all log lines repeated. Here is one, as an example: https://papertrailapp.com/systems/1725222761/events?focus=953447405515337752&selected=953447405515337752 https://papertrailapp.com/systems/1725222761/events?focus=953449158566641686&selected=953449158566641686 => Jul 10 21:06:38 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/07/10 21:06:36 Running task https://tools.taskcluster.net/task-inspector/#dH0AL8HARDWaatDRHm5AWA/0#015 Jul 10 21:13:36 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/07/10 21:06:36 Running task https://tools.taskcluster.net/task-inspector/#dH0AL8HARDWaatDRHm5AWA/0#015 As you see the papertrail timestamps are different (arriving Jul 10 21:06:38 vs arriving Jul 10 21:13:36) but the generic-worker timestamp is the same (2018/07/10 21:06:36). This is because it is the same log line, but it was sent to papertrail twice. We also see: Jul 10 23:13:36 T-W1064-MS-036.mdc1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015 So my guess, is this all comes back to bug 1372172. This Windows 10 machine crashed and rebooted while it was running task dH0AL8HARDWaatDRHm5AWA. It therefore could not resolve the task, so the queue thought it was still running. After it rebooted, it claimed a task and started running that, so indeed from the Queue's perspective, the worker appeared to be running two tasks in parallel. When the worker failed to reclaim task dH0AL8HARDWaatDRHm5AWA within 20 minutes, the Queue resolved the task run as claim-expired, and issued a new task run (run 1) that was claimed and executed by T-W1064-MS-297 (running generic-worker 8.3.0). Summary ======= * The worker and the Queue behaved correctly * Bug 1372172 urgently needs to be fixed as it is causing both AWS and hardware workers to lose network connectivity and crash * Monitoring/alerting should be set up to track when these crashes happen ("The system has rebooted without cleanly shutting down first") * All the information for troubleshooting is in the logs - they just need to be read * The generic-worker logs should not be sent to papertrail twice

Depends on: 1372172

Flags: needinfo?(pmoore)

Kendall Libby [:fubar] (he/him)

Comment 4

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #3) > > So my guess, is this all comes back to bug 1372172. > > * Bug 1372172 urgently needs to be fixed as it is causing both AWS and > hardware workers to lose network connectivity and crash I don't disagree that bug 1372172 needs fixing asap, but I don't see any indication in that bug that it has ever affected hardware workers. > * Monitoring/alerting should be set up to track when these crashes happen > ("The system has rebooted without cleanly shutting down first") That's an easy papertrail alert. > * All the information for troubleshooting is in the logs - they just need to > be read ... > * The generic-worker logs should not be sent to papertrail twice Depending on what's happening, this might be difficult; my guess is that nxlog is losing state of what it has sent when the system crashes.

Pete Moore [:pmoore][:pete]

Comment 5

•

7 years ago

(In reply to Kendall Libby [:fubar] from comment #4) > I don't disagree that bug 1372172 needs fixing asap, but I don't see any > indication in that bug that it has ever affected hardware workers. In both AWS and on our hardware, we have Windows 10 workers that lose network connectivity, buffer up logs, eventually reboot and send their buffered logs to papertrail, and show in their logs that they crashed unexpectedly without shutting down properly. These symptoms are shared between Windows 10 hardware workers and Windows 10 EC2 instances. Eric and Marco provided steps to reliably reproduce the issue in: https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c29 https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c30 Has anyone tried to reproduce, for example, to see if it fails at exactly the same point each time, trace what processes were running at the time of failure, disable individual tests around the time of failure, etc? This may (or may not) help identify the code that causes the crashes.

Pete Moore [:pmoore][:pete]

Comment 6

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #5) > Has anyone tried to reproduce, for example, to see if it fails at exactly > the same point each time, trace what processes were running at the time of > failure, disable individual tests around the time of failure, etc? This may > (or may not) help identify the code that causes the crashes. Just seen in later comments in the bug that Mark is looking into it... Many thanks.

Kendall Libby [:fubar] (he/him)

Comment 7

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #5) > > In both AWS and on our hardware, we have Windows 10 workers that lose > network connectivity, buffer up logs, eventually reboot and send their > buffered logs to papertrail, and show in their logs that they crashed > unexpectedly without shutting down properly. The issues on the hardware side are relatively recent, or more particularly have increased noticeably in the last few weeks.

Mark Cornmesser [:markco]

Assignee

Comment 8

•

7 years ago

I am going to close this bug. I have not seen this issue reooccur.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Bugzilla

T-w1064-ms-036 is running 2 task at a time

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

Tracking

(Not tracked)

People

(Reporter: markco, Assigned: markco)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Attachment

General

Description

File Name

Content Type