1587611 - open /var/lib/docker/tmp/docker-import-.../repo/.../json: no such file or directory

Reporter

Description

•

5 years ago

My job on mobile-1-images is failing while squashing the docker image.

2019-10-09 20:58:12,547 root         DEBUG    Cleaning up /tmp/docker-squash-lkai3x72 temporary directory
2019-10-09 20:58:14,719 root         ERROR    404 Client Error: Not Found ("b'open /var/lib/docker/tmp/docker-import-786951861/repo/73f371facc8aecf7b130ec6faa9b1539969ea2eb6d5f2fbb070ca8d545ed08e0/json: no such file or directory'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 222, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localunixsocket/v1.18/images/load
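
One way to narrow down the `.../json: no such file or directory` failure is to inspect the tarball that gets posted to `/images/load`. The sketch below assumes the archive uses the legacy image layout, in which each layer directory holds a `json` file next to its `layer.tar`. The function name `layers_missing_json` is hypothetical and is not part of docker-squash.

```python
import posixpath
import tarfile
from typing import List


def layers_missing_json(tar: tarfile.TarFile) -> List[str]:
    """Return layer directories that have a layer.tar but no json file.

    Assumes the legacy layout: <layer-id>/json and <layer-id>/layer.tar.
    """
    # Normalise names so "./abc/json" and "abc/json" compare equal.
    names = {posixpath.normpath(member.name) for member in tar.getmembers()}
    missing = []
    for name in sorted(names):
        directory, base = posixpath.split(name)
        if base == "layer.tar" and directory:
            if posixpath.join(directory, "json") not in names:
                missing.append(directory)
    return missing
```

If this returns an empty list for a tarball that still fails to load, the archive itself is probably complete and the file is going missing on the daemon side.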

FWIW, I tried building and squashing the image locally using the following commands, and both finished without an error.

$ taskgraph build-image linux -t mitchhentges:v1
$ docker-squash -v -t "mitchhentges:v1-squashed" "mitchhentges:v1"

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

•

5 years ago

Is this intermittent?

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 2

•

5 years ago

It seems consistent: I've seen more than 5 failures and no successes so far.

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

5 years ago

Does docker squash work in other tasks?

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 4

•

5 years ago

•

Edited

Yeah, this task from 7 hours ago on mobile-1-images worked.

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

5 years ago

What's different between that task and this one?

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 6

•

5 years ago

Quite a bit: the project, the image that's being built, the environment.
I'm going to play with the Dockerfile and see if the issue is related to its contents. If so, I'll bisect to determine what specifically is causing the squash issue.
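
The bisection mentioned here could be sketched as a binary search over Dockerfile instructions. This assumes failures are monotonic (once a prefix of the Dockerfile fails, every longer prefix fails too), which may not hold for an intermittent error. `build_fails` is a hypothetical callback standing in for "build and squash a Dockerfile made of these instructions".

```python
from typing import Callable, List, Optional


def first_failing_prefix(
    instructions: List[str],
    build_fails: Callable[[List[str]], bool],
) -> Optional[int]:
    """Return the smallest n such that instructions[:n] fails, or None.

    Assumes the empty prefix passes and that failure is monotonic.
    """
    if not build_fails(instructions):
        return None  # the full Dockerfile passes, nothing to bisect
    lo, hi = 0, len(instructions)  # [:lo] is assumed to pass, [:hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if build_fails(instructions[:mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

The instruction at index `n - 1` is then the first one whose addition makes the build fail.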

Mitchell Hentges [:mhentges] 🦀

Reporter

Updated

•

5 years ago

Comment 7

•

5 years ago

•

Edited

Sometimes, I'm getting a different error:

2019-10-11 21:48:49,340 urllib3.connectionpool DEBUG    http://localhost:None "POST /v1.18/images/load HTTP/1.1" 500 4
2019-10-11 21:48:49,340 root         DEBUG    Cleaning up /tmp/docker-squash-10vn23ow temporary directory
2019-10-11 21:48:51,259 root         ERROR    500 Server Error: Internal Server Error ("b'EOF'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 222, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localunixsocket/v1.18/images/load

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/cli.py", line 87, in run
    from_layer=args.from_layer, tag=args.tag, output_path=args.output_path, tmp_dir=args.tmp_dir, development=args.development, cleanup=args.cleanup).run()
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/squash.py", line 59, in run
    return self.squash(image)
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/squash.py", line 92, in squash
    image.load_squashed_image()
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/image.py", line 254, in load_squashed_image
    self._load_image(self.new_image_dir)
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/image.py", line 299, in _load_image
    self.docker.load_image(f)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/image.py", line 298, in load_image
    self._raise_for_status(res)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 224, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("b'EOF'")

I'm unsure if this is intermittent or not. I've been slowly removing commands from my Dockerfile and committing the changes one-by-one, and some of the builds fail due to .../images/load, and others fail due to /var/lib/docker/tmp/docker-import-.../repo/.../json. It doesn't seem to be strictly related to the docker image size.
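
Since the two failures alternate between runs, it can help to tally them across task logs. A minimal sketch that classifies a log by the two messages quoted in this bug; the labels and function names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Patterns taken from the two error messages quoted in this bug.
MISSING_JSON = re.compile(
    r"docker-import-\d+/repo/[0-9a-f]+/json: no such file or directory"
)
LOAD_EOF = re.compile(r"500 Server Error: Internal Server Error .*EOF")


def classify(log: str) -> str:
    """Label a task log with the failure mode it contains, if any."""
    if MISSING_JSON.search(log):
        return "missing-json"
    if LOAD_EOF.search(log):
        return "load-eof"
    return "other"


def tally(logs: Iterable[str]) -> Counter:
    """Count how often each failure mode appears across several logs."""
    return Counter(classify(log) for log in logs)
```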

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 8

•

5 years ago

This task failed because it couldn't upload the artifact:

[taskcluster:error] Error uploading "public/image.tar.zst" artifact. Could not upload artifact. Status Code: 400

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 9

•

5 years ago

I've reduced the problem Dockerfile as much as possible while still seeing the same error. The smallest I've got it is visible here.

I'm going to create a minimal standalone repository that can cause the issue so that I can confirm that other application-services config isn't affecting this problem.

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 10

•

5 years ago

After investigating some more, I've found that this is entirely a docker image size issue. Above a certain size, one of two possible errors will fail the image build. Note that the error that occurs isn't consistent: you can re-run the same docker build and get a different error.

I created a standalone repository that just builds docker images when commits are pushed or a release happens. After creating the first commit, I re-ran the build until I saw both errors:

The most convenient workaround I can think of right now is to move some dependency installations out of the Dockerfile so that they instead happen each time the image is used.

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

5 years ago

Is this a disk-space issue?

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 12

•

5 years ago

I'm not sure how mobile-1-images works internally; these are just the error messages I see for the different Dockerfiles I'm providing.
Since this depends on docker image size, it sounds like it could be related to disk space or memory, but that's an assumption.

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 13

•

5 years ago

To test if this is a disk space issue, I doubled the volumeSize and diskspaceThreshold of an images worker and tried again. Building the image still failed with the 500 Server Error: Internal Server Error for url: http+docker://localunixsocket/v1.18/images/load error.
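
A complementary check for the disk-space theory is to compare the free space on the filesystem Docker imports into against the size of the tarball being loaded. This is only a sketch: it assumes Python can be run on the worker next to the Docker daemon, and the `headroom` factor of 2 is an arbitrary guess at how much extra space the import needs, not a documented Docker figure.

```python
import os
import shutil


def enough_space_to_load(tarball_path, import_dir="/var/lib/docker/tmp", headroom=2.0):
    """Return True if import_dir has at least headroom x the tarball size free.

    import_dir defaults to the directory named in the error message;
    headroom is a guess, not a value taken from Docker.
    """
    needed = os.path.getsize(tarball_path) * headroom
    free = shutil.disk_usage(import_dir).free
    return free >= needed
```

If this reports plenty of room and the load still fails, as the doubled volumeSize suggests, free disk space is unlikely to be the cause.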

Dustin J. Mitchell [:dustin] (he/him)

Comment 14

•

5 years ago

Interesting, so this is a docker bug. I think we are blocked at the moment from upgrading docker by requiring an ubuntu upgrade and kernel versions and packet and mumble mumble mumble. It'd be interesting to know if this occurs with newer dockers, or if it's correlated with some aspect of how docker-worker uses docker.

But for the moment, I don't think any of that can change, so .. do you have an adequate workaround?

Mitchell Hentges [:mhentges] 🦀

Reporter

Comment 15

•

5 years ago

I can work around this by removing installations from the docker image and adding them as a "pre-run" list of commands for each task using the docker image.
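
The workaround amounts to running the installation steps at task start instead of at image build time. A rough sketch of composing such a task command follows; the `pre_run` list, the function name, and the example commands are hypothetical and do not reflect the actual task configuration.

```python
import shlex
from typing import List


def with_pre_run(pre_run: List[str], command: List[str]) -> List[str]:
    """Wrap command so each pre_run shell line executes first.

    Lines are chained with && so the task stops at the first failure.
    """
    main = " ".join(shlex.quote(arg) for arg in command)
    script = " && ".join(pre_run + [main])
    return ["/bin/bash", "-c", script]


# Hypothetical usage: install a dependency, then run the real task command.
task_command = with_pre_run(["apt-get install -y foo"], ["./gradlew", "build"])
```

The trade-off is that every task using the image pays the installation cost on each run, in exchange for a smaller image.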

Dustin J. Mitchell [:dustin] (he/him)

Comment 16

•

5 years ago

So, I'm guessing that this will work better with a newer docker version, which requires a newer Ubuntu version on the host. I expect relops will be working on that at some point, so moving over to that component for triage.

Assignee: nobody → relops

Component: General → RelOps: General

Product: Taskcluster → Infrastructure & Operations

QA Contact: klibby

ablew

Updated

•

3 years ago

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → WONTFIX

Categories

(Infrastructure & Operations :: RelOps: General, defect)

Tracking

(Not tracked)

People

(Reporter: mhentges, Unassigned)
