Closed
Bug 1474729
Opened 7 years ago
Closed 7 years ago
T-w1064-ms-036 is running 2 task at a time
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)
Infrastructure & Operations
RelOps: OpenCloudConfig
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: markco)
References
Details
Attachments
(1 file)
916.35 KB,
image/jpeg
|
Details |
https://tools.taskcluster.net/provisioners/releng-hardware/worker-types/gecko-t-win10-64-hw/workers/mdc1/T-W1064-MS-036.
This is currently running GW 10.8.5 .
Previously I have seen 2 task user directories created at the same time, but this is first time I have seen a node run 2 task at one time.
It was reimaged this afternoon, but it looks like it may had been doing it earlier in the day as well.
It is now quarantined.
Assignee | ||
Comment 1•7 years ago
|
||
Pete: Rob: Any ideas?
Assignee: nobody → mcornmesser
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Comment 2•7 years ago
|
||
i've often been confused by papertrail logs which appear to show new tasks starting before the previous task has resolved but don't know enough about how gw works to know if i was interpreting the logs correctly.
i will invest some more effort in the app i wrote to record & investigate issues like this (github.com/minionator). it may give us better statistics on the frequency that this occurs.
pete will probably be able to shed more light on the immediate issue.
Flags: needinfo?(rthijssen)
Comment 3•7 years ago
|
||
The two tasks runs from the screenshot are:
Task dH0AL8HARDWaatDRHm5AWA Run 0
=================================
Worker: T-W1064-MS-036
Claimed:
Time according to Queue:
* 2018/07/10 21:06:36 UTC
source: https://queue.taskcluster.net/v1/task/dH0AL8HARDWaatDRHm5AWA/status
Time according to worker:
* 2018/07/10 21:06:36
source: https://papertrailapp.com/systems/1725222761/events?focus=953447405515337752&q=program%3Ageneric-worker&selected=953447405515337752
Resolved by Queue (not by worker)
Time according to Queue:
* 2018/07/10 21:26:37 UTC
source: https://queue.taskcluster.net/v1/task/dH0AL8HARDWaatDRHm5AWA/status
Task LaJ9FBQZQrKZcyYy3oFu_g Run 0
=================================
Worker: T-W1064-MS-036
Claimed:
Time according to Queue:
* 2018/07/10 21:15:20 UTC
source: https://queue.taskcluster.net/v1/task/LaJ9FBQZQrKZcyYy3oFu_g/status
Time according to worker:
* 2018/07/10 21:15:20 UTC
source: https://papertrailapp.com/systems/1725222761/events?focus=953449598985338909&q=program%3Ageneric-worker&selected=953449598985338909
Resolved:
Time according to Queue:
* 2018/07/10 21:47:52 UTC
source: https://queue.taskcluster.net/v1/task/LaJ9FBQZQrKZcyYy3oFu_g/status
Time according to worker:
* 2018/07/10 21:47:51 UTC
source: https://papertrailapp.com/systems/1725222761/events?focus=953457783175544836&q=program%3Ageneric-worker&selected=953457783175544836
Looking at the logs, I can see that generic-worker was busy executing task dH0AL8HARDWaatDRHm5AWA and then at 21:07:19, we get a log entry:
21:07:19 Microsoft-Windows-DSC: The local configuration manager was shut down.
After this all logs buffer up, and 6 minutes later all arrive at papertrail at precisely 21:13:36.
In these buffered up logs, we see that the generic-worker was sent a termination signal:
T-W1064-MS-036.mdc1.mozilla.com generic-worker: ^CTerminate batch job (Y/N)? 1 file(s) copied.#015
Then we see that the generic-worker logs which were already logged to papertrail before, get resent to papertrail, so for example we see all log lines repeated. Here is one, as an example:
https://papertrailapp.com/systems/1725222761/events?focus=953447405515337752&selected=953447405515337752
https://papertrailapp.com/systems/1725222761/events?focus=953449158566641686&selected=953449158566641686
=>
Jul 10 21:06:38 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/07/10 21:06:36 Running task https://tools.taskcluster.net/task-inspector/#dH0AL8HARDWaatDRHm5AWA/0#015
Jul 10 21:13:36 T-W1064-MS-036.mdc1.mozilla.com generic-worker: 2018/07/10 21:06:36 Running task https://tools.taskcluster.net/task-inspector/#dH0AL8HARDWaatDRHm5AWA/0#015
As you see the papertrail timestamps are different (arriving Jul 10 21:06:38 vs arriving Jul 10 21:13:36) but the generic-worker timestamp is the same (2018/07/10 21:06:36). This is because it is the same log line, but it was sent to papertrail twice.
We also see:
Jul 10 23:13:36 T-W1064-MS-036.mdc1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
So my guess, is this all comes back to bug 1372172. This Windows 10 machine crashed and rebooted while it was running task dH0AL8HARDWaatDRHm5AWA. It therefore could not resolve the task, so the queue thought it was still running.
After it rebooted, it claimed a task and started running that, so indeed from the Queue's perspective, the worker appeared to be running two tasks in parallel.
When the worker failed to reclaim task dH0AL8HARDWaatDRHm5AWA within 20 minutes, the Queue resolved the task run as claim-expired, and issued a new task run (run 1) that was claimed and executed by T-W1064-MS-297 (running generic-worker 8.3.0).
Summary
=======
* The worker and the Queue behaved correctly
* Bug 1372172 urgently needs to be fixed as it is causing both AWS and hardware workers to lose network connectivity and crash
* Monitoring/alerting should be set up to track when these crashes happen ("The system has rebooted without cleanly shutting down first")
* All the information for troubleshooting is in the logs - they just need to be read
* The generic-worker logs should not be sent to papertrail twice
Depends on: 1372172
Flags: needinfo?(pmoore)
Comment 4•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
>
> So my guess, is this all comes back to bug 1372172.
>
> * Bug 1372172 urgently needs to be fixed as it is causing both AWS and
> hardware workers to lose network connectivity and crash
I don't disagree that bug 1372172 needs fixing asap, but I don't see any indication in that bug that it has ever affected hardware workers.
> * Monitoring/alerting should be set up to track when these crashes happen
> ("The system has rebooted without cleanly shutting down first")
That's an easy papertrail alert.
> * All the information for troubleshooting is in the logs - they just need to
> be read
...
> * The generic-worker logs should not be sent to papertrail twice
Depending on what's happening, this might be difficult; my guess is that nxlog is losing state of what it has sent when the system crashes.
Comment 5•7 years ago
|
||
(In reply to Kendall Libby [:fubar] from comment #4)
> I don't disagree that bug 1372172 needs fixing asap, but I don't see any
> indication in that bug that it has ever affected hardware workers.
In both AWS and on our hardware, we have Windows 10 workers that lose network connectivity, buffer up logs, eventually reboot and send their buffered logs to papertrail, and show in their logs that they crashed unexpectedly without shutting down properly.
These symptoms are shared between Windows 10 hardware workers and Windows 10 EC2 instances.
Eric and Marco provided steps to reliably reproduce the issue in:
https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c29
https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c30
Has anyone tried to reproduce, for example, to see if it fails at exactly the same point each time, trace what processes were running at the time of failure, disable individual tests around the time of failure, etc? This may (or may not) help identify the code that causes the crashes.
Comment 6•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> Has anyone tried to reproduce, for example, to see if it fails at exactly
> the same point each time, trace what processes were running at the time of
> failure, disable individual tests around the time of failure, etc? This may
> (or may not) help identify the code that causes the crashes.
Just seen in later comments in the bug that Mark is looking into it... Many thanks.
Comment 7•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
>
> In both AWS and on our hardware, we have Windows 10 workers that lose
> network connectivity, buffer up logs, eventually reboot and send their
> buffered logs to papertrail, and show in their logs that they crashed
> unexpectedly without shutting down properly.
The issues on the hardware side are relatively recent, or more particularly have increased noticeably in the last few weeks.
Assignee | ||
Comment 8•7 years ago
|
||
I am going to close this bug. I have not seen this issue reooccur.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•