Closed Bug 1587611 Opened 5 years ago Closed 3 years ago

open /var/lib/docker/tmp/docker-import-.../repo/.../json: no such file or directory

Categories

(Infrastructure & Operations :: RelOps: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mhentges, Unassigned)

My job on mobile-1-images is failing while squashing the docker image.

2019-10-09 20:58:12,547 root         DEBUG    Cleaning up /tmp/docker-squash-lkai3x72 temporary directory
2019-10-09 20:58:14,719 root         ERROR    404 Client Error: Not Found ("b'open /var/lib/docker/tmp/docker-import-786951861/repo/73f371facc8aecf7b130ec6faa9b1539969ea2eb6d5f2fbb070ca8d545ed08e0/json: no such file or directory'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 222, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localunixsocket/v1.18/images/load

FWIW, I tried building the image and squashing locally using the following commands, and it finishes without an error.

$ taskgraph build-image linux -t mitchhentges:v1
$ docker-squash -v -t "mitchhentges:v1-squashed" "mitchhentges:v1"

Is this intermittent?

It seems consistent: I've seen more than five failures and no successes so far.

Does docker squash work in other tasks?

What's different about that task and this one?

Quite a bit: the project, the image that's being built, the environment.
I'm going to play with the Dockerfile and see if the issue is related to its contents. If so, I'll bisect and determine what specifically is causing the squash issue.
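
A rough sketch of that bisection idea in Python, for illustration only. It assumes the failure is monotonic (if a prefix of the Dockerfile fails to squash, every longer prefix fails too), which nothing in this bug establishes. `fails` is a hypothetical callback that would build and squash an image from the given instructions and report whether the squash failed; it is not part of taskgraph or docker-squash.

```python
from typing import Callable, Optional, Sequence


def first_failing_prefix(
    instructions: Sequence[str],
    fails: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the smallest k such that instructions[:k] fails, or None.

    Assumes monotonic failure and that the empty prefix passes.
    `fails` is a hypothetical callback, not a real API.
    """
    if not fails(instructions):
        return None  # the full Dockerfile squashes fine; nothing to bisect
    # Invariant: the prefix of length lo passes, the prefix of length hi fails.
    lo, hi = 0, len(instructions)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(instructions[:mid]):
            hi = mid
        else:
            lo = mid
    return hi
```

The instruction at index `k - 1` is then the first one whose addition makes the squash fail, under the assumptions above.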

See Also: → 1501720

Sometimes, I'm getting a different error:

2019-10-11 21:48:49,340 urllib3.connectionpool DEBUG    http://localhost:None "POST /v1.18/images/load HTTP/1.1" 500 4
2019-10-11 21:48:49,340 root         DEBUG    Cleaning up /tmp/docker-squash-10vn23ow temporary directory
2019-10-11 21:48:51,259 root         ERROR    500 Server Error: Internal Server Error ("b'EOF'")
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 222, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localunixsocket/v1.18/images/load

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/cli.py", line 87, in run
    from_layer=args.from_layer, tag=args.tag, output_path=args.output_path, tmp_dir=args.tmp_dir, development=args.development, cleanup=args.cleanup).run()
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/squash.py", line 59, in run
    return self.squash(image)
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/squash.py", line 92, in squash
    image.load_squashed_image()
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/image.py", line 254, in load_squashed_image
    self._load_image(self.new_image_dir)
  File "/usr/local/lib/python3.6/dist-packages/docker_squash/image.py", line 299, in _load_image
    self.docker.load_image(f)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/image.py", line 298, in load_image
    self._raise_for_status(res)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 224, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.6/dist-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error: Internal Server Error ("b'EOF'")

I'm unsure if this is intermittent or not. I've been slowly removing commands from my Dockerfile and committing the changes one-by-one, and some of the builds fail due to .../images/load, and others fail due to /var/lib/docker/tmp/docker-import-.../repo/.../json. It doesn't seem to be strictly related to the docker image size.
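
To keep track of which failure each build hits, something like the following sketch could tally the two signatures across saved build logs. The regular expressions are derived from the two error messages quoted in this bug; the labels `missing-layer-json` and `load-eof` are made-up names for illustration, and each log is classified once by the first signature that matches.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Patterns based on the two error messages quoted in this bug.
# The dictionary keys are hypothetical labels, not Docker terminology.
SIGNATURES = {
    "missing-layer-json": re.compile(
        r"404 Client Error: Not Found .*docker-import-\d+/repo/[0-9a-f]+/json: "
        r"no such file or directory"
    ),
    "load-eof": re.compile(r"500 Server Error: Internal Server Error \(.*EOF.*\)"),
}


def classify(log_text: str) -> Optional[str]:
    """Return the label of the first signature found in one build log."""
    for name, pattern in SIGNATURES.items():
        if pattern.search(log_text):
            return name
    return None


def tally(logs: Iterable[str]) -> Counter:
    """Count how many build logs show each known failure signature."""
    return Counter(name for name in map(classify, logs) if name is not None)
```

Logs that match neither pattern are simply left out of the tally.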

This task failed because it couldn't upload the artifact:

[taskcluster:error] Error uploading "public/image.tar.zst" artifact. Could not upload artifact. Status Code: 400

I've reduced the problem Dockerfile as much as possible while still seeing the same error. The smallest I've got it is visible here.

I'm going to create a minimal standalone repository that can cause the issue so that I can confirm that other application-services config isn't affecting this problem.

After investigating this some more: this is entirely a docker image size issue. Above a certain size, the image build fails with one of two possible errors. Which error occurs isn't consistent: you can re-run the same docker build and get a different error.

I created a standalone repository that just builds docker images when commits are pushed or a release happens. After creating the first commit, I re-ran the build until I saw both errors:

The most convenient workaround I can think of right now is to move some dependency installations out of the Dockerfile so that they instead run each time the resulting image is used.

Is this a disk-space issue?

I'm not sure how mobile-1-images works internally; these are just the error messages I see for the different Dockerfiles I'm providing.
Since this depends on docker image size, it sounds like it could be related to disk space or memory, but that's an assumption.

To test if this is a disk space issue, I doubled the volumeSize and diskspaceThreshold of an images worker and tried again. When I built the image, I still received the "500 Server Error: Internal Server Error for url: http+docker://localunixsocket/v1.18/images/load" error.

Interesting, so this is a docker bug. I think we are blocked at the moment from upgrading docker by requiring an ubuntu upgrade and kernel versions and packet and mumble mumble mumble. It'd be interesting to know if this occurs with newer dockers, or if it's correlated with some aspect of how docker-worker uses docker.

But for the moment, I don't think any of that can change, so .. do you have an adequate workaround?

I can work around this by removing installations from the docker image and adding them as a "pre-run" list of commands for each task using the docker image.
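
As a minimal sketch of what such a "pre-run" wrapper could look like (the actual task configuration is not shown in this bug, so `with_pre_run` and the example commands below are hypothetical), the task's command can be wrapped in a small shell script that runs the installation steps first:

```python
import shlex
from typing import List, Sequence


def with_pre_run(pre_run: Sequence[str], command: Sequence[str]) -> List[str]:
    """Wrap `command` so that each `pre_run` shell line executes first.

    `set -e` stops the wrapper at the first failing pre-run step, so the
    main command never runs against a half-prepared environment.
    Requires Python 3.8+ for shlex.join.
    """
    script = "\n".join(["set -e", *pre_run, "exec " + shlex.join(command)])
    return ["/bin/bash", "-c", script]


# Hypothetical usage: install a dependency at task start instead of
# baking it into the image.
task_command = with_pre_run(
    ["pip install --user some-package"],
    ["./gradlew", "build"],
)
```

The trade-off is that every task pays the installation time on each run, in exchange for an image small enough to squash and load.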

So, I'm guessing that this will work better with a newer docker version, which requires a newer Ubuntu version on the host. I expect relops will be working on that at some point, so moving over to that component for triage.

Assignee: nobody → relops
Component: General → RelOps: General
Product: Taskcluster → Infrastructure & Operations
QA Contact: klibby
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX