Closed Bug 1302596 Opened 8 years ago Closed 4 years ago

Intermittent [taskcluster:error] Pulling docker image has failed.

Categories

(Taskcluster :: Workers, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: wcosta)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

From Greg in https://bugzilla.mozilla.org/show_bug.cgi?id=1296659#c5

Looking back at the last dozen or so of these recorded, the workers are removing an image during a garbage collection cycle at the same time the worker is importing an image.  I'm guessing that during import of an image tarball there are fewer locks on shared layers than when doing a pull from Docker Hub.
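(A minimal sketch of the kind of guard that would prevent this race: the GC pass skips any image that a concurrent tarball import is holding a reference to. The class and method names here are hypothetical, not docker-worker's actual API.)

```typescript
// Hedged sketch: reference-count images while a tarball import is in flight
// so a concurrent GC cycle knows not to remove them. Names are illustrative.
class ImageGcGuard {
  private inUse = new Map<string, number>();

  // Hold a reference on the image for the duration of the import.
  async withImageLock<T>(image: string, fn: () => Promise<T>): Promise<T> {
    this.inUse.set(image, (this.inUse.get(image) ?? 0) + 1);
    try {
      return await fn();
    } finally {
      const count = (this.inUse.get(image) ?? 1) - 1;
      if (count <= 0) this.inUse.delete(image);
      else this.inUse.set(image, count);
    }
  }

  // The GC pass should skip any image that is currently being imported.
  canCollect(image: string): boolean {
    return !this.inUse.has(image);
  }
}
```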
This problem has been increasing as of yesterday: either a fix was backed out, or we have a new problem.

What I see failing in the last couple of days is primarily build tasks, whereas before we had other *lint* and unittest tasks hitting this issue.

As of now, this is the #3 intermittent orange on the trees (ignoring 2 others with fixes landed)

:garndt, can you make sure this gets on the radar?
Flags: needinfo?(garndt)
Aurora has run into this problem about 300 times, and according to dustin it is now affecting all branches, so closing trees and bumping to blocker.
Severity: normal → blocker
(In reply to Carsten Book [:Tomcat] from comment #32)
> Aurora has run into this problem about 300 times, and according to dustin it
> is now affecting all branches, so closing trees and bumping to blocker.

Note that the trees have been reopened for a while now; also, no other trees were affected.
> :garndt, can you make sure this gets on the radar?

This has been an ongoing issue we've been looking into.  Changes have been deployed this morning that we hope will eliminate these errors.  We will be monitoring.

More information is in https://bugzilla.mozilla.org/show_bug.cgi?id=1314284
Flags: needinfo?(garndt)
:garndt, this is still ongoing at a pretty steady rate. The data isn't clustered, which indicates we aren't dealing with a common outage (even one that happens once/day); instead this is happening 5-6 times/day, spread out evenly throughout the day.

Should we make this an auto-retry case?
Flags: needinfo?(garndt)
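(If these failures were purely transient, an auto-retry around the image-load step could look roughly like the sketch below; loadImage is a hypothetical stand-in for the real load step, and the attempt/backoff numbers are illustrative, not anything docker-worker actually does.)

```typescript
// Hedged sketch of auto-retrying a flaky image load with exponential backoff.
async function loadImageWithRetry(
  loadImage: () => Promise<void>,  // hypothetical image-load step
  maxAttempts = 3,
  baseDelayMs = 5_000,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await loadImage();
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delay = baseDelayMs * 2 ** (attempt - 1);
      console.warn(`Image load failed (attempt ${attempt}), retrying in ${delay}ms`, err);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```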
:garndt, I am not sure if this will be fixed by the sidecar work (bug 1307493), but I wanted to make sure this is on your radar for the intermittent work in this quarter.
Now that sidecar has landed, we still have a very high failure rate for this bug. :garndt, who can take this work and figure out a solution?
I think for the most part this error has become much less frequent because we are now bundling the images inside the AMI.  There are still issues sometimes with linking the containers, but I think those are being captured under other bugs.
Flags: needinfo?(garndt)
Whiteboard: [stockwell infra]
This had a spike on April 7th; otherwise we are still seeing a slow and steady rate.
The spike appears to be from a single machine having an issue, which is a rare occurrence.  Mostly with this in the log:
"Error: 140511475332992:error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac:../deps/openssl/openssl/ssl/s3_pkt.c:532"
(In reply to Greg Arndt [:garndt] from comment #77)
> The spike appears to be from a single machine having an issue, which is a
> rare occurrence.  Mostly with this in the log:
> "Error: 140511475332992:error:1408F119:SSL
> routines:SSL3_GET_RECORD:decryption failed or bad record
> mac:../deps/openssl/openssl/ssl/s3_pkt.c:532"

If a single machine is responsible for the bulk of the reports, then I wouldn't be surprised if that machine has bad hardware, most likely memory or a NIC.
This was all caused by 1 worker, i-0efc6eabb37ba822d.  Currently none of our workers have health checks to see if they should stop claiming tasks.  In the case of pulling and loading a docker image, perhaps if the last N jobs have failed, we should have the worker self-terminate.
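(A hedged sketch of that idea, tracking the last N task results and shutting the worker down once the whole window is failures; TaskHealthTracker and shutdownWorker are hypothetical names, not existing docker-worker hooks.)

```typescript
// Hedged sketch of "self-terminate after N consecutive failures".
class TaskHealthTracker {
  private results: boolean[] = [];

  constructor(
    private readonly windowSize: number,
    private readonly shutdownWorker: () => Promise<void>,  // hypothetical hook
  ) {}

  async record(succeeded: boolean): Promise<void> {
    this.results.push(succeeded);
    if (this.results.length > this.windowSize) this.results.shift();
    // Only act once we have a full window containing nothing but failures.
    if (
      this.results.length === this.windowSize &&
      this.results.every((ok) => !ok)
    ) {
      await this.shutdownWorker();
    }
  }
}
```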
A mixture of corrupted tarballs that corrected themselves and S3 slowdown errors from when AWS had an outage.
:garndt, this has picked up a bit in the last week, can you look to see if there is a new failure?
Flags: needinfo?(garndt)
Sorry, I should have commented yesterday when I looked into it.  These failures seem to be much the same as what we've been seeing.  In this case some machines reported the issue multiple times, which made it look worse.  So far it looks like a corrupted download on some of the machines.  We are working towards having content hashes for artifacts this quarter, which will prevent something like this from happening, or at least ensure that this particular reason is no longer an issue.
Flags: needinfo?(garndt)
So far what I've determined for the failures that occur when extracting the downloaded image and result in "Error 39 : Read error : premature end":

1. Not related to artifact transfers between regions: failures have been noted in us-west-2.  Looking at most of the recent failures this month, it has been machines only in us-west-2.  This is not surprising: over 90% of the gecko-t-linux-medium machines are provisioned in us-west-2, so it's more likely to see errors there.
2. Usually isolated to particular machines.
3. Once those machines are bad, they are always bad [1][2][3] and will fail at downloading the file (so retrying within a single task won't help, since claiming additional tasks and downloading again has the same outcome as retrying).
4. It gets to different points in the tar extract before failing (not consistent).
5. The downloaded file size is always the same between failed and successful loads.  Perhaps the content is corrupted in a way that results in the same size but bad content.  Work going on now to add headers defining the content hash for S3 objects will help answer this and rule it out (see the sketch after this list).
6. Always m1.medium machines.



[1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-09f11ee7e67b88535
[2] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-0ed128d92620dd676
[3] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-0a6b06f9b6969a7ff
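(A hedged sketch of the content-hash check mentioned in point 5: hash the downloaded tarball and compare against an expected SHA-256 digest before loading it. How the expected digest is delivered, e.g. via an S3 object header, is an assumption here, and verifyArtifact is an illustrative name, not the actual taskcluster implementation.)

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Hedged sketch: validate a downloaded image tarball against an expected
// SHA-256 digest before attempting to load it into docker.
async function verifyArtifact(path: string, expectedSha256: string): Promise<void> {
  const hash = createHash("sha256");
  await new Promise<void>((resolve, reject) => {
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("end", () => resolve())
      .on("error", reject);
  });
  const actual = hash.digest("hex");
  if (actual !== expectedSha256) {
    throw new Error(
      `Artifact ${path} is corrupt: expected sha256 ${expectedSha256}, got ${actual}`,
    );
  }
}
```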
At least for the 3 machines referenced in comment 109, this is the result of a bad volume.  I imagine this is the root cause for most of these machines.

Errors are filled with things like:
end_request: I/O error, dev xvdb, sector 48349928
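(A hedged sketch of a health check that flags this failure mode by scanning the kernel log for block-device I/O errors; the log path and pattern are assumptions and this is not part of any existing worker check.)

```typescript
import { readFile } from "node:fs/promises";

// Hedged sketch: detect a failing block device by looking for kernel log
// lines like "end_request: I/O error, dev xvdb, ...".
async function hasDiskIoErrors(logPath = "/var/log/kern.log"): Promise<boolean> {
  try {
    const log = await readFile(logPath, "utf8");
    return /end_request: I\/O error, dev \w+/.test(log);
  } catch {
    // If the log can't be read, don't treat that as a disk failure.
    return false;
  }
}
```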
Assignee: nobody → wcosta
Severity: blocker → normal
Priority: -- → P1
Status: NEW → ASSIGNED
Component: Docker Images → Docker-Worker
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=177474014&repo=autoland&lineNumber=17

[taskcluster 2018-05-08 14:53:03.367Z] Task ID: QWVrbXTnSNih9y294DIQwA
[taskcluster 2018-05-08 14:53:03.367Z] Worker ID: i-0f28725baacfaf09f
[taskcluster 2018-05-08 14:53:03.367Z] Worker Group: us-west-2
[taskcluster 2018-05-08 14:53:03.367Z] Worker Node Type: c5.4xlarge
[taskcluster 2018-05-08 14:53:03.367Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-05-08 14:53:03.367Z] Public IP: 34.217.128.94
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-checkouts-v3-697eee6c304d222fdb39" -> /builds/worker/checkouts
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-autoland-build-linux64-st-an-opt-workspace-v3-697eee6c304d222fdb39" -> /builds/worker/workspace
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-tooltool-cache-v3-697eee6c304d222fdb39" -> /builds/worker/tooltool-cache

[taskcluster 2018-05-08 14:53:05.098Z] Downloading artifact "public/image.tar.zst" from task ID: D06NaEbzTcO-TgeZNpSGQQ.
[taskcluster 2018-05-08 14:53:08.071Z] Downloaded artifact successfully.
[taskcluster 2018-05-08 14:53:08.071Z] Downloaded 184.803 mb
[taskcluster 2018-05-08 14:53:08.071Z] Decompressing downloaded image
[taskcluster 2018-05-08 14:53:09.315Z] Loading docker image from downloaded archive.

[taskcluster:error] Pulling docker image has failed.

[taskcluster:error] Error: Error loading docker image. socket hang up
[taskcluster 2018-05-08 14:53:17.745Z] Unsuccessful task run with exit code: -1 completed in 14.379 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=182261458&repo=autoland&lineNumber=17
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
:wcosta, I don't see a lot of failures in recent times, can we close this bug?
Flags: needinfo?(wcosta)
(In reply to Joel Maher ( :jmaher ) (UTC+2) (PTO: back Aug 2) from comment #138)
> :wcosta, I don't see a lot of failures in recent times, can we close this
> bug?

I am okay with that
Flags: needinfo?(wcosta)
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
Recent log file and snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?job_id=191373208&repo=mozilla-inbound&lineNumber=17

[taskcluster 2018-08-01 15:16:21.527Z] Task ID: D0le_IUFTCGBGhY8ix7Ztg
[taskcluster 2018-08-01 15:16:21.527Z] Worker ID: i-0b22f1ec30e9ad807
[taskcluster 2018-08-01 15:16:21.527Z] Worker Group: us-east-1
[taskcluster 2018-08-01 15:16:21.527Z] Worker Node Type: m4.4xlarge
[taskcluster 2018-08-01 15:16:21.527Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-08-01 15:16:21.527Z] Public IP: 18.207.103.206
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-tooltool-cache-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/tooltool-cache
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-checkouts-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/checkouts
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-mozilla-inbound-build-linux-rusttests-opt-workspace-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/workspace

[taskcluster 2018-08-01 15:16:22.688Z] Downloading artifact "public/image.tar.zst" from task ID: MTemE-0iQ0KdDx3ElnTA-g.
[taskcluster 2018-08-01 15:16:26.224Z] Downloaded artifact successfully.
[taskcluster 2018-08-01 15:16:26.224Z] Downloaded 206.282 mb
[taskcluster 2018-08-01 15:16:26.224Z] Decompressing downloaded image
[taskcluster 2018-08-01 15:16:27.931Z] Loading docker image from downloaded archive.

[taskcluster:error] Pulling docker image has failed.

[taskcluster:error] Error: Error loading docker image. socket hang up
[taskcluster 2018-08-01 15:16:34.961Z] Unsuccessful task run with exit code: -1 completed in 13.435 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
This is still happening.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=214528439&repo=autoland&lineNumber=17

[taskcluster 2018-11-29 02:54:56.678Z] Task ID: CVGnvjBzRMqt6YJ2gR13oA
[taskcluster 2018-11-29 02:54:56.678Z] Worker ID: i-0e05de9f46eb09899
[taskcluster 2018-11-29 02:54:56.678Z] Worker Group: us-east-1
[taskcluster 2018-11-29 02:54:56.678Z] Worker Node Type: m5d.4xlarge
[taskcluster 2018-11-29 02:54:56.678Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-11-29 02:54:56.678Z] Public IP: 54.173.22.168
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-checkouts-v3-2f7c6cb7dd63ce826370" -> /builds/worker/checkouts
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-tooltool-cache-v3-2f7c6cb7dd63ce826370" -> /builds/worker/tooltool-cache
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-autoland-build-linux64-fuzzing-debug-workspace-v3-2f7c6cb7dd63ce826370" -> /builds/worker/workspace
[taskcluster 2018-11-29 02:54:57.764Z] Downloading artifact "public/image.tar.zst" from task ID: Y1XN7465QyaC9vE5DeD_kg.
[taskcluster 2018-11-29 02:55:00.863Z] Downloaded artifact successfully.
[taskcluster 2018-11-29 02:55:00.863Z] Downloaded 217.126 mb
[taskcluster 2018-11-29 02:55:00.863Z] Decompressing downloaded image
[taskcluster 2018-11-29 02:55:02.307Z] Loading docker image from downloaded archive.
[taskcluster:error] Pulling docker image has failed.
[taskcluster:error] Error: Error loading docker image. connect ENOENT /var/run/docker.sock
[taskcluster 2018-11-29 02:55:04.362Z] Unsuccessful task run with exit code: -1 completed in 7.684 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to Bogdan Tara[:bogdan_tara] from comment #155)
> This is still happening.

Sure, but at a failure rate of 1/3000, it's not going to jump to the front of the queue any time soon.
Yeah, the current process is that for intermittent failures we have a bug to associate them with so we can track and annotate.  I agree that a rate of once/week is not useful; in fact we probably see 1 failure in 300,000+.

This will hopefully die a quiet death when we move to generic-worker.

Component: Docker-Worker → Workers
Priority: P1 → P3

There are 48 failures associated with this bug in the last 7 days. These are occurring on Linux 64 opt, debug, and asan builds.

Attached image image.png

It looks like this is basically fixed by now.

Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Wander, there is a recent occurrence of this issue: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297365053&repo=mozilla-central&lineNumber=33

Should we reopen the bug?

Flags: needinfo?(wcosta)

Hrm, it seems to be some issue with the docker daemon. I don't think it is worth reopening the bug for such an isolated case, but feel free to do so if you disagree.

Flags: needinfo?(wcosta)