Closed Bug 1296547 Opened 8 years ago Closed 8 years ago

[docker-worker] Pulling docker image for Ubuntu 16.04 fails for desktop-test-large workers: No space left on device (ENOSPC)

Categories: Taskcluster :: Workers, defect
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: whimboo; Assigned: garndt

Today we got a failure when running our Firefox UI tests on our qa-3-linux-fx-tests worker:

[taskcluster 2016-08-18 20:23:32.762Z] Download Progress: 97.76%
[taskcluster 2016-08-18 20:23:37.762Z] Download Progress: 99.11%
[taskcluster 2016-08-18 20:23:41.340Z] Downloaded artifact successfully.
[taskcluster 2016-08-18 20:23:41.341Z] Downloaded 3519.436 mb
[taskcluster 2016-08-18 20:23:41.342Z] Loading docker image from downloaded archive.
[taskcluster:error] Pulling docker image {"path":"public/image.tar","type":"task-image","taskId":"CSimoiBLQ5CD3aYSkIZQPw"} has failed: ENOSPC, write
[taskcluster 2016-08-18 20:23:56.271Z] Unsuccessful task run with exit code: -1 completed in 429.544 seconds

Looks like the disk is full during extraction of the docker image. So we might either have to increase the disk size, if possible, or find a solution for bug 1294264.
Flags: needinfo?(dustin)
Flags: needinfo?(dustin) → needinfo?(garndt)
Looks like normal desktop-test instances are causing the same issue:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&fromchange=20885c842fd32bde80566eacf5e8ad04ae0723bd&selectedJob=2197055

Maybe it is related to the large desktop-test changes? Do those instances have less disk space?
I'm slowly digging into this, but I think the worker will need some deeper profiling done at some point in the near future.

desktop-test-large most certainly has less space than desktop-test: desktop-test-large has 30G, whereas desktop-test has 410G. Also, from looking at our metrics, it's clear that there was a spike in machines running out of disk space starting yesterday afternoon. I spoke with Joel and we enabled more tests/branches on this worker type, which will cause even more space to be taken up.

The worker will do what it can to free up disk space during each garbage collection cycle (every 60 seconds by default), but in some cases this might not be quick enough.
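Conceptually the cycle looks something like this (a minimal JavaScript sketch, not docker-worker's actual code; freeBytes and removeUnusedImagesAndContainers are hypothetical helpers):

    // Illustrative only: every cycle, check free space and garbage
    // collect once it drops below the configured threshold.
    function startGarbageCollector(config) {
      setInterval(async () => {
        const free = await freeBytes('/');  // hypothetical helper
        if (free < config.capacityManagement.diskspaceThreshold) {
          await removeUnusedImagesAndContainers();  // hypothetical helper
        }
      }, 60 * 1000);  // default cycle: every 60 seconds
    }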

Also, with only a 30G disk, just downloading/extracting/importing one large docker image can initially take up quite a bit of space.

For instance, a 4GB task image could potentially use 16GB of space. The image is downloaded (4gb), then extracted and manipulated before importing (4gb extracted plus 4gb for the repacked archive), and then it's imported (another 4gb).

There are definitely some optimizations I have in mind here for the future, including cutting out some of the steps in how we download and import the images. We've been fortunate to run on workers that had a lot of disk on them, but we need to consider running on instances that might not be as large.


Some investigation also needs to go into how the workers are claiming tasks. We set a per-task threshold of 20GB for that worker type, so it shouldn't have claimed a task until it had at least that much free. That should have been more than enough to prevent the failure described in this bug, unless somehow something is causing / to run out of space as well.
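For clarity, the claim-time gating described above amounts to something like this (sketch only, with the same hypothetical freeBytes helper as above):

    // Illustrative only: don't claim a task unless at least the
    // configured per-task diskspace threshold is currently free.
    async function shouldClaimTask(config) {
      const free = await freeBytes('/');  // hypothetical helper
      return free >= config.capacityManagement.diskspaceThreshold;
    }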
Flags: needinfo?(garndt)
Assignee: nobody → garndt
Component: Docker-Worker → Worker
Summary: Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC) → [docker-worker] Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC)
I noticed in the configuration that the per-task diskspace threshold* is 2000000000 bytes, which is just 2gb (the default for docker-worker is 10gb). Chances are this was intended to be a 20gb threshold but is missing an extra zero.

Our other instances generally have quite a bit of room on them (120gb for builders, 410gb for desktop-test), so even with the default setting of 10gb they generally do not hit issues before the worker dies off.

I propose that we go back to the default on these workers and see how that plays out. I don't think it could be worse than what we have now.

Specifically, I'm removing this piece of instanceTypes.userdata in the workerType definition:
        "capacityManagement": {
          "diskspaceThreshold": 2000000000
        },
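For reference, the presumably intended 20gb value would have read (note the extra zero):

        "capacityManagement": {
          "diskspaceThreshold": 20000000000
        },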



*The threshold indicates how much space should be free before any task runs; once free space drops below it, things are garbage collected. With this setting, the worker would not even try to free up space until less than 2gb was free, and in that case it would hit this issue extremely quickly when downloading/extracting an image.
Thanks Greg! Generally speaking, those workers should ideally be identical to the desktop-test workers. We only created them to have our own task queue.
Worth noting this was run on desktop-test-large, not the qa-3-linux-fx-tests workers, per https://tools.taskcluster.net/task-inspector/#C2l-UzeTRUSJ4eGqkyutJw/

As I noted in bug 1281241, the qa-3-* appear to be unused.
Summary: [docker-worker] Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC) → [docker-worker] Pulling docker image for Ubuntu 16.04 fails for desktop-test-large workers: No space left on device (ENOSPC)
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> As I noted in bug 1281241, the qa-3-* appear to be unused.

How so? We use it daily for the nightly builds:
https://github.com/mozilla/mozmill-ci/blob/master/lib/tasks/functional.yml#L15

Here is an example:
https://tools.taskcluster.net/task-inspector/#T30oqimoSQWZKjcR2XhrRw/
Ah, I see -- so the out-of-tree tasks use the qa-3-* workerTypes, but the in-tree tasks use desktop-test.  Do you want to unify those?
Both worker types should be identically configured. If you can do that, it would be great.
(In reply to Henrik Skupin (:whimboo) from comment #7)
> (In reply to Dustin J. Mitchell [:dustin] from comment #6)
> > As I noted in bug 1281241, the qa-3-* appear to be unused.
> 
> How that? We use it daily for the nightly builds:
> https://github.com/mozilla/mozmill-ci/blob/master/lib/tasks/functional.yml#L15

I'd like to point out that we wouldn't have this misunderstanding if this code were in-tree.
We cannot have this code in-tree right now. There are substantial pieces missing, like the taskgraph for Nightly builds, which block us from doing so. That means the situation will persist for the next couple of months.
So I think fixing the diskspace threshold misconfiguration definitely helped things out, but this issue is still coming up. I'm guessing a lot of it has to do with the overhead of loading a task image (about 4x the image's size is needed to load it). The next step would be to investigate how we save/load images to cut out some unnecessary space. There are also some things the worker could do to incrementally reclaim space while loading an image, rather than removing the temporary files only at the end.
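To sketch the incremental idea (hypothetical download/extract/retagAndLoad helpers; not the worker's current code), each intermediate artifact would be freed as soon as the next stage no longer needs it:

    const fs = require('fs');

    // Illustrative only: reclaim intermediate files stage by stage
    // instead of cleaning everything up once loading has finished.
    async function loadTaskImage(url) {
      const tarball = await download(url);          // ~1x image size on disk
      const extractedDir = await extract(tarball);  // ~another 1x
      await fs.promises.unlink(tarball);            // reclaim the download immediately
      await retagAndLoad(extractedDir);             // import into docker
      await fs.promises.rm(extractedDir, { recursive: true });  // reclaim extraction dir
    }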
I investigated removing some of the middle steps of image loading, but it appears that it might not be possible, so we'll need to figure out how to optimize elsewhere.

What I was attempting to do is use docker export/import rather than save/load, because "import" allows an image to be retagged when importing. However, export/import is meant for containers, not images. Because of this, we first extract the tarball and rename the image within the included json metadata prior to loading. This is the step I was hoping to remove, but I did not have much luck.
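For illustration, the metadata rename boils down to rewriting the top-level "repositories" file inside the extracted "docker save" archive, which maps image name to tag to layer id (a sketch, not the worker's exact code):

    const fs = require('fs');

    function retagExtractedImage(extractedDir, oldName, newName) {
      const repoPath = extractedDir + '/repositories';
      const repos = JSON.parse(fs.readFileSync(repoPath, 'utf8'));
      // Move the tag mapping from the original image name to the new one.
      repos[newName] = repos[oldName];
      delete repos[oldName];
      fs.writeFileSync(repoPath, JSON.stringify(repos));
    }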
I believe the trees could be opened now. I switched the diskspace thresholds on these workers to require at least 20gb free before a worker will claim/run a task. This should account for the space needed to download a task image, load it, and run the task.

There is also a PR that was merged and will be deployed in the next 24 hours to help with the diskspace issues. https://github.com/taskcluster/docker-worker/pull/242
Flags: needinfo?(garndt)
I just reopened trees.
Severity: blocker → major
Looking at Orange Factor, this issue has become pretty much non-existent after the recent changes. I'm going to vote for marking this resolved: fixed by changes made to our workers and to the worker type definitions (diskspace threshold). It doesn't look like we need to move to larger instance types after all.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Worker → Workers