Closed Bug 1296547 Opened 8 years ago Closed 8 years ago

[docker-worker] Pulling docker image for Ubuntu 16.04 fails for desktop-test-large workers: No space left on device (ENOSPC)

Categories: Taskcluster :: Workers, defect
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: whimboo; Assigned: garndt

Today we got a failure when running our Firefox UI tests on our qa-3-linux-fx-tests worker:

[taskcluster 2016-08-18 20:23:32.762Z] Download Progress: 97.76%
[taskcluster 2016-08-18 20:23:37.762Z] Download Progress: 99.11%
[taskcluster 2016-08-18 20:23:41.340Z] Downloaded artifact successfully.
[taskcluster 2016-08-18 20:23:41.341Z] Downloaded 3519.436 mb
[taskcluster 2016-08-18 20:23:41.342Z] Loading docker image from downloaded archive.
[taskcluster:error] Pulling docker image {"path":"public/image.tar","type":"task-image","taskId":"CSimoiBLQ5CD3aYSkIZQPw"} has failed: ENOSPC, write
[taskcluster 2016-08-18 20:23:56.271Z] Unsuccessful task run with exit code: -1 completed in 429.544 seconds

Looks like the disk is full during extraction of the docker image. So we might either have to increase the disk size, if possible, or find a solution for bug 1294264.
Flags: needinfo?(dustin)
Flags: needinfo?(dustin) → needinfo?(garndt)
Looks like normal desktop-test instances are causing the same issue:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=Firefox%20UI&bugfiler&fromchange=20885c842fd32bde80566eacf5e8ad04ae0723bd&selectedJob=2197055

Maybe it is related to the large desktop-test changes? Do those instances have less disk space?
I'm slowly digging into this, but I think the worker will need some deeper profiling done at some point in the near future.

desktop-test-large most certainly has less space than desktop-test: desktop-test-large has 30G, whereas desktop-test has 410G. Also, from looking at our metrics, it's clear that there was a spike in machines running out of disk space starting yesterday afternoon. I spoke with Joel and we enabled more tests/branches on this worker type, which will cause even more space to be taken up.

The worker will do what it can to free up disk space during each garbage collection cycle (every 60 seconds by default), but in some cases this might not be quick enough.
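Conceptually the cycle looks something like this (a minimal JavaScript sketch, not docker-worker's actual code; freeBytes and removeUnusedImagesAndContainers are hypothetical helpers):

    // Illustrative only: every cycle, check free space and garbage
    // collect once it drops below the configured threshold.
    function startGarbageCollector(config) {
      setInterval(async () => {
        const free = await freeBytes('/');  // hypothetical helper
        if (free < config.capacityManagement.diskspaceThreshold) {
          await removeUnusedImagesAndContainers();  // hypothetical helper
        }
      }, 60 * 1000);  // default cycle: every 60 seconds
    }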

Also, with only a 30G disk, just downloading/extracting/importing one large docker image can initially take up quite a bit of space.

For instance, a 4GB task image could potentially use 16GB of space. The image is downloaded (4gb), then extracted and manipulated before importing (4gb extracted plus 4gb for the repacked archive), and then it's imported (another 4gb).

There are definitely some optimizations I have in mind here for the future, including cutting out some of the steps in how we download and import the images. We've been fortunate to run on workers that had a lot of disk on them, but we need to consider running on instances that might not be as large.


Some investigation also needs to go into how the workers are claiming tasks. We set a per-task threshold of 20GB for that worker type, so it shouldn't have claimed a task until it had at least that much free. That should have been more than enough to prevent the failure described in this bug, unless somehow something is causing / to run out of space as well.
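For clarity, the claim-time gating described above amounts to something like this (sketch only, with the same hypothetical freeBytes helper as above):

    // Illustrative only: don't claim a task unless at least the
    // configured per-task diskspace threshold is currently free.
    async function shouldClaimTask(config) {
      const free = await freeBytes('/');  // hypothetical helper
      return free >= config.capacityManagement.diskspaceThreshold;
    }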
Flags: needinfo?(garndt)
Assignee: nobody → garndt
Component: Docker-Worker → Worker
Summary: Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC) → [docker-worker] Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC)
I noticed in the configuration that the per-task diskspace threshold* is 2000000000 bytes, which is just 2gb (the default for docker-worker is 10gb). Chances are this was intended to be a 20gb threshold but is missing an extra zero.

Our other instances generally have quite a bit of room on them (120gb for builders, 410gb for desktop-test), so even with the default setting of 10gb they generally do not hit issues before the worker dies off.

I propose that we go back to the default on these workers and see how that plays out. I don't think it could be worse than what we have now.

Specifically, I'm removing this piece of instanceTypes.userdata in the workerType definition:
        "capacityManagement": {
          "diskspaceThreshold": 2000000000
        },
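For reference, the presumably intended 20gb value would have read (note the extra zero):

        "capacityManagement": {
          "diskspaceThreshold": 20000000000
        },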



*The threshold indicates how much space should be free before any task runs; once free space drops below it, things are garbage collected. With this setting, the worker would not even try to free up space until less than 2gb was free, and in that case it would hit this issue extremely quickly when downloading/extracting an image.
Thanks Greg! Generally speaking, those workers should ideally be identical to the desktop-test workers. We only created them to have our own task queue.
Worth noting this was run on desktop-test-large, not the qa-3-linux-fx-tests workers, per https://tools.taskcluster.net/task-inspector/#C2l-UzeTRUSJ4eGqkyutJw/

As I noted in bug 1281241, the qa-3-* appear to be unused.
Summary: [docker-worker] Pulling docker image for Ubuntu 16.04 fails for qa-3-linux-fx-tests workers: No space left on device (ENOSPC) → [docker-worker] Pulling docker image for Ubuntu 16.04 fails for desktop-test-large workers: No space left on device (ENOSPC)
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> As I noted in bug 1281241, the qa-3-* appear to be unused.

How so? We use it daily for the nightly builds:
https://github.com/mozilla/mozmill-ci/blob/master/lib/tasks/functional.yml#L15

Here is an example:
https://tools.taskcluster.net/task-inspector/#T30oqimoSQWZKjcR2XhrRw/
Ah, I see -- so the out-of-tree tasks use the qa-3-* workerTypes, but the in-tree tasks use desktop-test.  Do you want to unify those?
Both worker types should be identically configured. If you can do that, it would be great.
(In reply to Henrik Skupin (:whimboo) from comment #7)
> (In reply to Dustin J. Mitchell [:dustin] from comment #6)
> > As I noted in bug 1281241, the qa-3-* appear to be unused.
> 
> How that? We use it daily for the nightly builds:
> https://github.com/mozilla/mozmill-ci/blob/master/lib/tasks/functional.yml#L15

I'd like to point out that we wouldn't have this misunderstanding if this code were in-tree.
We cannot have this code in-tree right now. There are substantial pieces missing, like the taskgraph for Nightly builds, which block us from doing so. That means the situation will persist for the next couple of months.
So I think fixing the diskspace threshold misconfiguration definitely helped things out, but this issue is still coming up. I'm guessing a lot of it has to do with the overhead of loading a task image (about 4x the image's size is needed to load it). The next step would be to investigate how we save/load images to cut out some unnecessary space. There are also some things the worker could do to incrementally reclaim space while loading an image, rather than removing the temporary files only at the end.
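To sketch the incremental idea (hypothetical download/extract/retagAndLoad helpers; not the worker's current code), each intermediate artifact would be freed as soon as the next stage no longer needs it:

    const fs = require('fs');

    // Illustrative only: reclaim intermediate files stage by stage
    // instead of cleaning everything up once loading has finished.
    async function loadTaskImage(url) {
      const tarball = await download(url);          // ~1x image size on disk
      const extractedDir = await extract(tarball);  // ~another 1x
      await fs.promises.unlink(tarball);            // reclaim the download immediately
      await retagAndLoad(extractedDir);             // import into docker
      await fs.promises.rm(extractedDir, { recursive: true });  // reclaim extraction dir
    }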
I investigated removing some of the middle steps of image loading, but it appears that it might not be possible, so we'll need to figure out how to optimize elsewhere.

What I was attempting to do is use docker export/import rather than save/load, because "import" allows an image to be retagged when importing. However, export/import is meant for containers, not images. Because of this, we first extract the tarball and rename the image within the included json metadata prior to loading. This is the step I was hoping to remove, but I did not have much luck.
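For illustration, the metadata rename boils down to rewriting the top-level "repositories" file inside the extracted "docker save" archive, which maps image name to tag to layer id (a sketch, not the worker's exact code):

    const fs = require('fs');

    function retagExtractedImage(extractedDir, oldName, newName) {
      const repoPath = extractedDir + '/repositories';
      const repos = JSON.parse(fs.readFileSync(repoPath, 'utf8'));
      // Move the tag mapping from the original image name to the new one.
      repos[newName] = repos[oldName];
      delete repos[oldName];
      fs.writeFileSync(repoPath, JSON.stringify(repos));
    }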
I believe the trees could be opened now. I switched the diskspace thresholds on these workers to require at least 20gb free before a worker will claim/run a task. This should account for the space needed to download a task image, load it, and run the task.

There is also a PR that was merged and will be deployed in the next 24 hours to help with the diskspace issues. https://github.com/taskcluster/docker-worker/pull/242
Flags: needinfo?(garndt)
I just reopened trees.
Severity: blocker → major
Looking at Orange Factor, this issue has become pretty much non-existent after the recent changes. I'm going to vote for marking this resolved: fixed by changes made to our workers and to the worker type definitions (diskspace threshold). It doesn't look like we need to move to larger instance types after all.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Worker → Workers