Closed Bug 1339136 Opened 7 years ago Closed 6 years ago

Taskcluster jobs that hang don't save their logs and just retry themselves instead

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1372172

People

(Reporter: kats, Assigned: wcosta)

References

Details

I'm trying to stand up some mochitest suites (the webgl one, right now) on the QuantumRender builds. Sample try push at [1]. What happens is the job starts up, and via the "live log" I can see it starts to run the mochitests. At some point Firefox seems to hang and the "live log" shows no new output for a long time. Then the job expires or times out, turns blue on Treeherder, and the log is thrown away. This makes it quite hard to debug the failure because the logs are not accessible afterwards.

I have a new try push in progress right now (I just started it a few minutes ago) at [2] which will probably have the same behaviour for the M-e10s(gl) jobs.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=5b6935fb0f717dd969a26a67a759d439d35d7c11&group_state=expanded&selectedJob=75996268
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7fcf8751457d522675d4d34fe8b27e97ee136d95
For posterity, :garndt diagnosed the problem here as the worker running out of memory. It's quite possible that the QR webgl mochitest is leaking memory somewhere so I'll follow up on that elsewhere. Leaving this bug for any taskcluster-side improvements to allow easier diagnosis of these sorts of problems.
Some notes:
1. The worker is crashing because the machine has run out of memory 
2. By design, when a worker crashes and restarts, we allow it to die off at the end of its billing cycle rather than claim more tasks, because we do not know why the worker crashed to begin with. This includes not resuming whatever tasks were running prior to the crash, which is why the task ends up resolved as claim-expired and no logs are uploaded.
3. We have talked about uploading task statistics at the end of a task, which would help in spotting whether a task has started consuming more memory over time, but it will not help when a task has already caused the machine to run out of memory
4. We could limit the memory a container is allowed to use to some percentage of total system memory, to prevent the worker from crashing or hitting other undefined behavior (see the sketch after these notes)
5. We capture memory statistics within the worker and report them to SignalFx, but that is meant more for managing pools of machines over time than for responding to specific incidents on a smaller portion of the machines. We could log to Papertrail when a certain threshold is reached and set up an email alert around that.
6. One of the modules that causes the worker to crash when no memory is available is the one used for checking disk space on the volumes Docker uses. We could catch the exception being thrown and log the failure, though that could in turn cause other unintended behavior if the system is out of memory.
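To make point 4 concrete, here is a minimal sketch of that kind of cap using the docker-py client. The image name, command, and 90% figure are illustrative assumptions, and docker-worker itself is Node-based, so this only shows the shape of the idea, not the actual implementation:

    # Sketch only: cap a task container at ~90% of host memory so the
    # worker process itself is never starved. Image/command are made up.
    import docker
    import psutil

    client = docker.from_env()
    mem_cap = int(psutil.virtual_memory().total * 0.9)  # bytes

    container = client.containers.run(
        "taskcluster/test-image",   # hypothetical image
        command="run-tests",        # hypothetical command
        mem_limit=mem_cap,          # kernel OOM-kills the container, not the host
        detach=True,
    )

With a limit like this the task fails (its processes get OOM-killed inside the container) while the worker stays alive long enough to upload the log.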
No longer blocks: 1342966
Blocks: 1343223
No longer depends on: 1343223
For the record, this is still an issue and it is quite annoying.
Blocks: 1424755
There's very little we can do in OOM situations - just about any activity in Node requires allocation. Currently the workers often misbehave; bug 1458737 will have us restart instead, which should help with the misbehavior. But those logs will still be missing, because they live only on the worker until the task completes.

Probably the right fix is either to avoid tests that consume unbounded memory (acknowledging this might happen in failure conditions), or to use instances with enough memory to run these tasks without failing.
The problem is that in order to fix tests that are erroneously consuming infinite memory, we need the logs to debug what's going on.
Hm, https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory suggests that Docker has options to do this.

Options:
 - fork a second process that just watches the test runner's memory consumption in a loop and kills it if it starts growing (a rough sketch follows this list)
 - if these failures aren't highly intermittent, load the raw live.log in a browser tab while running -- that will stream the logs as they happen, so even if the worker crashes, it will have exported every bit of data to your browser.
 - reproduce locally or in a loaner and do something similar
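For the first option, a rough watchdog sketch; the 6 GB threshold and polling interval are assumptions, and in practice you'd pick numbers suited to the instance type:

    # Sketch only: poll the test runner's RSS and kill it before the host OOMs.
    import time
    import psutil

    LIMIT_BYTES = 6 * 1024 ** 3   # illustrative threshold

    def watch(pid, poll_seconds=5):
        try:
            proc = psutil.Process(pid)
            while proc.is_running():
                if proc.memory_info().rss > LIMIT_BYTES:
                    proc.kill()   # fail the test run instead of taking down the worker
                    break
                time.sleep(poll_seconds)
        except psutil.NoSuchProcess:
            pass                  # runner already exited

Killing the runner early means the task fails normally and its log still gets uploaded, which is the whole point.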

Wander, how do you feel about adding something like a `memory` capability or `memoryLimit` feature that can limit the amount of memory allocated to a task?  I think that would cause the task to fail before the kernel starts OOM'ing.  I'm pretty strongly opposed to continuing to spend staff time on docker-worker, as it's slowing our progress toward using taskcluster-worker, but that's just my opinion.
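Purely to illustrate what such a capability might look like from the task author's side - the memoryLimit field below is hypothetical and not an existing docker-worker payload option:

    # Hypothetical task payload fragment; docker-worker does not currently
    # support memoryLimit -- this is only what the proposal could look like.
    task_payload = {
        "image": "taskcluster/desktop-test",   # illustrative image name
        "command": ["run-mochitests.sh"],      # illustrative command
        "capabilities": {
            "memoryLimit": "6g",               # hypothetical cap on container memory
        },
    }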
Component: General → Docker-Worker
Flags: needinfo?(wcosta)
I can work on this after packet.net is done.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
:kats, do you have a recent failed task I can use for testing?
Flags: needinfo?(bugmail)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> You can see a few in this push from this morning:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de0ed14a21da6a34c04447804847d33a5d789df&filter-searchStr=qr%20debug%20mochi%20win

That looks like a generic-worker issue, rather than a docker-worker issue - I will create a separate bug for that...
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> You can see a few in this push from this morning:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=1de0ed14a21da6a34c04447804847d33a5d789df&filter-searchStr=qr%20debug%20mochi%20win

:kats, did you see this bug happening on Linux tests? If not, I will close this as pmoore will investigate the Windows problem in another bug.
Flags: needinfo?(bugmail)
I haven't seen this on Linux recently, no. Please let me know the bug number for the one pmoore is investigating.
Flags: needinfo?(bugmail)
:pmoore, could you please link here the bug you created to investigate the generic-worker problem?
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(pmoore)
Resolution: --- → WORKSFORME
(In reply to Wander Lairson Costa [:wcosta] from comment #13)
> :pmoore, could you please link here the bug you created to investigate the generic-worker problem?

I think this was an OCC issue, but the logs have expired now, so I don't have access any more. Let's open a new bug if it happens again.
Flags: needinfo?(pmoore)
This happens all the time. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fbd50025ab477e455b1b6f29c8756e18e3e94d7 for a more recent example.
Flags: needinfo?(pmoore)
It looks like this is a case of bug 1372172.

The worker loses network connectivity and eventually the queue resolves the task as claim-expired.

Looking at Papertrail, at a certain point it suddenly stops logging:

https://papertrailapp.com/systems/1973914161/events
Component: Docker-Worker → Relops: OpenCloudConfig
Depends on: 1372172
Flags: needinfo?(pmoore) → needinfo?(rthijssen)
Product: Taskcluster → Infrastructure & Operations
QA Contact: rthijssen
yes. the blue stutter on gecko-t-win10-64-gpu workers in treeherder occurs when we lose the instance mid-task. we don't really know what causes it, but it tends to show up mostly when a webgl test is running. my theory is that there is a driver-level address conflict between the gpu and the network card, and that some call to a gpu-specific function triggers the fault and we lose network connectivity with the instance.
Flags: needinfo?(rthijssen)
Resolution: WORKSFORME → DUPLICATE
Hi Kartikaya,

It might be worth tailing the livelog via a curl session while the task is running, so that when the worker goes offline you've still got logs showing what it was doing at the time.

To do this, go to the Artifacts tab, look for live.log, and grab its URL. In a shell, run curl -L <URL> to tail it; once the task fails, you should have a better idea of whereabouts it was just before it lost network connectivity.
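For anyone who prefers a script over curl, the same idea as a small Python sketch; the URL below is a placeholder for whatever live.log artifact URL the task exposes:

    # Sketch only: stream the live.log artifact so the output survives
    # even if the worker dies mid-task.
    import requests

    LIVE_LOG_URL = "https://.../live.log"   # placeholder: paste the artifact URL here

    with requests.get(LIVE_LOG_URL, stream=True, allow_redirects=True) as resp:
        for line in resp.iter_lines():
            print(line.decode("utf-8", errors="replace"))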

If you could paste your findings in bug 1372172 that should help the relops team with diagnosis. Thanks!
Flags: needinfo?(bugmail)
I curl'd the livelog and posted the results at https://bugzilla.mozilla.org/show_bug.cgi?id=1372172#c33
Flags: needinfo?(bugmail)