Closed Bug 1339136 Opened 7 years ago Closed 6 years ago

Taskcluster jobs that hang don't save their logs and just retry themselves instead

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1372172

People

(Reporter: kats, Assigned: wcosta)

References

Details

I'm trying to stand up some mochitest suites (the webgl one, right now) on the QuantumRender builds. Sample try push at [1]. What happens is the job starts up, and via the "live log" I can see it starts to run the mochitests. At some point Firefox seems to hang and the "live log" shows no new output for a long time. Then the job expires or times out, turns blue on Treeherder, and the log is thrown away. This makes it quite hard to debug the failure because the logs are not accessible afterwards.

I have a new try push in progress right now (I just started it a few minutes ago) at [2] which will probably have the same behaviour for the M-e10s(gl) jobs.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b6935fb0f717dd969a26a67a759d439d35d7c11&group_state=expanded&selectedJob=75996268
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7fcf8751457d522675d4d34fe8b27e97ee136d95
For posterity, :garndt diagnosed the problem here as the worker running out of memory. It's quite possible that the QR webgl mochitest is leaking memory somewhere so I'll follow up on that elsewhere. Leaving this bug for any taskcluster-side improvements to allow easier diagnosis of these sorts of problems.
Some notes:
1. The worker is crashing because the machine has run out of memory 
2. By design, when a worker crashes and restarts, we allow it to die off at the end of its billing cycle rather than claim more tasks, because we do not know why the worker crashed to begin with. This includes not resuming whatever tasks were running prior to the crash, which is why the task ends up resolved as claim-expired and no logs are uploaded.
3. We have talked about uploading task statistics at the end of a task, which would help in spotting whether a task has started consuming more memory over time, but it will not help when a task has already caused the machine to run out of memory
4. We could limit the memory a container is allowed to use to some percentage of total system memory, to prevent the worker from crashing or hitting other undefined behavior (see the sketch after these notes)
5. We capture memory statistics within the worker and report them to SignalFx, but that is meant more for managing pools of machines over time than for responding to specific incidents on a smaller portion of the machines. We could log to Papertrail when a certain threshold is reached and set up an email alert around that.
6. One of the modules that causes the worker to crash when no memory is available is the one used for checking disk space on the volumes Docker uses. We could catch the exception being thrown and log the failure, though that could in turn cause other unintended behavior if the system is out of memory.
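To make point 4 concrete, here is a minimal sketch of that kind of cap using the docker-py client. The image name, command, and 90% figure are illustrative assumptions, and docker-worker itself is Node-based, so this only shows the shape of the idea, not the actual implementation:

    # Sketch only: cap a task container at ~90% of host memory so the
    # worker process itself is never starved. Image/command are made up.
    import docker
    import psutil

    client = docker.from_env()
    mem_cap = int(psutil.virtual_memory().total * 0.9)  # bytes

    container = client.containers.run(
        "taskcluster/test-image",   # hypothetical image
        command="run-tests",        # hypothetical command
        mem_limit=mem_cap,          # kernel OOM-kills the container, not the host
        detach=True,
    )

With a limit like this the task fails (its processes get OOM-killed inside the container) while the worker stays alive long enough to upload the log.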
No longer blocks: 1342966
Blocks: 1343223
No longer depends on: 1343223
For the record, this is still an issue and it is quite annoying.
Blocks: 1424755
There's very little we can do in OOM situations - just about any activity in Node requires allocation. Currently the workers often misbehave; bug 1458737 will have us restart instead, which should help with the misbehavior. But those logs will still be missing, because they live only on the worker until the task completes.

Probably the right fix is either to avoid tests that consume unbounded memory (acknowledging this might happen in failure conditions), or to use instances with enough memory to run these tasks without failing.
The problem is that in order to fix tests that are erroneously consuming infinite memory, we need the logs to debug what's going on.
Hm, https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory suggests that Docker has options to do this.

Options:
 - fork a second process that just watches the test runner's memory consumption in a loop and kills it if it starts growing (a rough sketch follows this list)
 - if these failures aren't highly intermittent, load the raw live.log in a browser tab while running -- that will stream the logs as they happen, so even if the worker crashes, it will have exported every bit of data to your browser.
 - reproduce locally or in a loaner and do something similar
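For the first option, a rough watchdog sketch; the 6 GB threshold and polling interval are assumptions, and in practice you'd pick numbers suited to the instance type:

    # Sketch only: poll the test runner's RSS and kill it before the host OOMs.
    import time
    import psutil

    LIMIT_BYTES = 6 * 1024 ** 3   # illustrative threshold

    def watch(pid, poll_seconds=5):
        try:
            proc = psutil.Process(pid)
            while proc.is_running():
                if proc.memory_info().rss > LIMIT_BYTES:
                    proc.kill()   # fail the test run instead of taking down the worker
                    break
                time.sleep(poll_seconds)
        except psutil.NoSuchProcess:
            pass                  # runner already exited

Killing the runner early means the task fails normally and its log still gets uploaded, which is the whole point.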

Wander, how do you feel about adding something like a `memory` capability or `memoryLimit` feature that can limit the amount of memory allocated to a task?  I think that would cause the task to fail before the kernel starts OOM'ing.  I'm pretty strongly opposed to continuing to spend staff time on docker-worker, as it's slowing our progress toward using taskcluster-worker, but that's just my opinion.
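Purely to illustrate what such a capability might look like from the task author's side - the memoryLimit field below is hypothetical and not an existing docker-worker payload option:

    # Hypothetical task payload fragment; docker-worker does not currently
    # support memoryLimit -- this is only what the proposal could look like.
    task_payload = {
        "image": "taskcluster/desktop-test",   # illustrative image name
        "command": ["run-mochitests.sh"],      # illustrative command
        "capabilities": {
            "memoryLimit": "6g",               # hypothetical cap on container memory
        },
    }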
Component: General → Docker-Worker
Flags: needinfo?(wcosta)
I can work on this after packet.net is done.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
:kats, do you have a recent failed task I can use for testing?
Flags: needinfo?(bugmail)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> You can see a few in this push from this morning:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de0ed14a21da6a34c04447804847d33a5d789df&filter-searchStr=qr%20debug%20mochi%20win

That looks like a generic-worker issue, rather than a docker-worker issue - I will create a separate bug for that...
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> You can see a few in this push from this morning:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de0ed14a21da6a34c04447804847d33a5d789df&filter-searchStr=qr%20debug%20mochi%20win

:kats, did you see this bug happening on Linux tests? If not, I will close this as pmoore will investigate the Windows problem in another bug.
Flags: needinfo?(bugmail)
I haven't seen this on Linux recently, no. Please let me know the bug number for the one pmoore is investigating.
Flags: needinfo?(bugmail)
:pmoore, could you please link here the bug you created to investigate the generic-worker problem?
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(pmoore)
Resolution: --- → WORKSFORME
(In reply to Wander Lairson Costa [:wcosta] from comment #13)
> :pmoore, could you please link here the bug you created to investigate the generic-worker problem?

I think this was an OCC issue, but the logs have expired now, so I don't have access any more. Let's open a new bug if it happens again.
Flags: needinfo?(pmoore)
This happens all the time. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fbd50025ab477e455b1b6f29c8756e18e3e94d7 for a more recent example.
Flags: needinfo?(pmoore)
It looks like this is a case of bug 1372172.

The worker loses network connectivity and eventually the queue resolves the task as claim-expired.

Looking at Papertrail, at a certain point it suddenly stops logging:

https://papertrailapp.com/systems/1973914161/events
Component: Docker-Worker → Relops: OpenCloudConfig
Depends on: 1372172
Flags: needinfo?(pmoore) → needinfo?(rthijssen)
Product: Taskcluster → Infrastructure & Operations
QA Contact: rthijssen
yes. the blue stutter on gecko-t-win10-64-gpu workers in treeherder occurs when we lose the instance mid-task. we don't really know what causes it, but it tends to show up mostly when a webgl test is running. my theory is that there is a driver-level address conflict between the gpu and the network card, and that some call to a gpu-specific function triggers the fault and we lose network connectivity with the instance.
Flags: needinfo?(rthijssen)
Resolution: WORKSFORME → DUPLICATE
Hi Kartikaya,

It might be worth tailing the livelog via a curl session while the task is running, so that when the worker goes offline you've still got logs showing what it was doing at the time.

To do this, go to the Artifacts tab, look for live.log, and grab its URL. In a shell, run curl -L <URL> to tail it; once the task fails, you should have a better idea of whereabouts it was just before it lost network connectivity.
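For anyone who prefers a script over curl, the same idea as a small Python sketch; the URL below is a placeholder for whatever live.log artifact URL the task exposes:

    # Sketch only: stream the live.log artifact so the output survives
    # even if the worker dies mid-task.
    import requests

    LIVE_LOG_URL = "https://.../live.log"   # placeholder: paste the artifact URL here

    with requests.get(LIVE_LOG_URL, stream=True, allow_redirects=True) as resp:
        for line in resp.iter_lines():
            print(line.decode("utf-8", errors="replace"))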

If you could paste your findings in bug 1372172 that should help the relops team with diagnosis. Thanks!
Flags: needinfo?(bugmail)
I curl'd the livelog and posted the results at https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c33
Flags: needinfo?(bugmail)