Closed Bug 1302596 Opened 8 years ago Closed 4 years ago

Intermittent [taskcluster:error] Pulling docker image has failed.

Categories

(Taskcluster :: Workers, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: wcosta)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

From Greg in https://bugzilla.mozilla.org/show_bug.cgi?id=1296659#c5

Looking back at the last dozen or so of these recorded, the workers are removing an image during a garbage collection cycle at the same time the worker is importing an image.  I'm guessing that during import of an image tarball there are fewer locks on shared layers than when doing a pull from Docker Hub.
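(A minimal sketch of the kind of guard that would prevent this race: the GC pass skips any image that a concurrent tarball import is holding a reference to. The class and method names here are hypothetical, not docker-worker's actual API.)

```typescript
// Hedged sketch: reference-count images while a tarball import is in flight
// so a concurrent GC cycle knows not to remove them. Names are illustrative.
class ImageGcGuard {
  private inUse = new Map<string, number>();

  // Hold a reference on the image for the duration of the import.
  async withImageLock<T>(image: string, fn: () => Promise<T>): Promise<T> {
    this.inUse.set(image, (this.inUse.get(image) ?? 0) + 1);
    try {
      return await fn();
    } finally {
      const count = (this.inUse.get(image) ?? 1) - 1;
      if (count <= 0) this.inUse.delete(image);
      else this.inUse.set(image, count);
    }
  }

  // The GC pass should skip any image that is currently being imported.
  canCollect(image: string): boolean {
    return !this.inUse.has(image);
  }
}
```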
This problem has been increasing as of yesterday: either a fix was backed out, or we have a new problem.

What I see failing in the last couple of days is primarily build tasks, whereas before we had other *lint* and unittest tasks hitting this issue.

As of now, this is the #3 intermittent orange on the trees (ignoring 2 others with fixes landed)

:garndt, can you make sure this gets on the radar?
Flags: needinfo?(garndt)
Aurora has run into this problem about 300 times, and according to dustin it is now affecting all branches, so closing trees and bumping to blocker.
Severity: normal → blocker
(In reply to Carsten Book [:Tomcat] from comment #32)
> Aurora has run into this problem about 300 times, and according to dustin it
> is now affecting all branches, so closing trees and bumping to blocker.

Note that the trees have been reopened for a while now; also, no other trees were affected.
> :garndt, can you make sure this gets on the radar?

This has been an ongoing issue we've been looking into.  Changes have been deployed this morning that we hope will eliminate these errors.  We will be monitoring.

More information is in https://bugzilla.mozilla.org/show_bug.cgi?id=1314284
Flags: needinfo?(garndt)
:garndt, this is still ongoing at a pretty steady rate. The data isn't clustered, which indicates we aren't dealing with a common outage (even one that happens once/day); instead this is happening 5-6 times/day, spread out evenly throughout the day.

Should we make this an auto-retry case?
Flags: needinfo?(garndt)
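(If these failures were purely transient, an auto-retry around the image-load step could look roughly like the sketch below; loadImage is a hypothetical stand-in for the real load step, and the attempt/backoff numbers are illustrative, not anything docker-worker actually does.)

```typescript
// Hedged sketch of auto-retrying a flaky image load with exponential backoff.
async function loadImageWithRetry(
  loadImage: () => Promise<void>,  // hypothetical image-load step
  maxAttempts = 3,
  baseDelayMs = 5_000,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await loadImage();
      return;
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      const delay = baseDelayMs * 2 ** (attempt - 1);
      console.warn(`Image load failed (attempt ${attempt}), retrying in ${delay}ms`, err);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```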
:garndt, I am not sure if this will be fixed by the sidecar work (bug 1307493), but I wanted to make sure this is on your radar for the intermittent work in this quarter.
Now that sidecar has landed, we still have a very high failure rate for this bug. :garndt, who can take this work and figure out a solution?
I think for the most part this error has become much less frequent because we are now bundling the images inside the AMI.  There are still issues sometimes with linking the containers, but I think those are being captured under other bugs.
Flags: needinfo?(garndt)
Whiteboard: [stockwell infra]
This had a spike on April 7th; otherwise we are still seeing a slow and steady rate.
The spike appears to be from a single machine having an issue, which is a rare occurrence.  Mostly with this in the log:
"Error: 140511475332992:error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac:../deps/openssl/openssl/ssl/s3_pkt.c:532"
(In reply to Greg Arndt [:garndt] from comment #77)
> The spike appears to be from a single machine having an issue, which is a
> rare occurrence.  Mostly with this in the log:
> "Error: 140511475332992:error:1408F119:SSL
> routines:SSL3_GET_RECORD:decryption failed or bad record
> mac:../deps/openssl/openssl/ssl/s3_pkt.c:532"

If a single machine is responsible for the bulk of the reports, then I wouldn't be surprised if that machine has bad hardware, most likely memory or a NIC.
This was all caused by 1 worker, i-0efc6eabb37ba822d.  Currently none of our workers have health checks to see if they should stop claiming tasks.  In the case of pulling and loading a docker image, perhaps if the last N jobs have failed, we should have the worker self-terminate.
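(A hedged sketch of that idea, tracking the last N task results and shutting the worker down once the whole window is failures; TaskHealthTracker and shutdownWorker are hypothetical names, not existing docker-worker hooks.)

```typescript
// Hedged sketch of "self-terminate after N consecutive failures".
class TaskHealthTracker {
  private results: boolean[] = [];

  constructor(
    private readonly windowSize: number,
    private readonly shutdownWorker: () => Promise<void>,  // hypothetical hook
  ) {}

  async record(succeeded: boolean): Promise<void> {
    this.results.push(succeeded);
    if (this.results.length > this.windowSize) this.results.shift();
    // Only act once we have a full window containing nothing but failures.
    if (
      this.results.length === this.windowSize &&
      this.results.every((ok) => !ok)
    ) {
      await this.shutdownWorker();
    }
  }
}
```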
A mixture of corrupted tarballs that corrected themselves and S3 slowdown errors from when AWS had an outage.
:garndt, this has picked up a bit in the last week, can you look to see if there is a new failure?
Flags: needinfo?(garndt)
Sorry, I should have commented yesterday when I looked into it.  These failures seem to be much the same as what we've been seeing.  In this case some machines reported the issue multiple times, which made it look worse.  So far it looks like a corrupted download on some of the machines.  We are working towards having content hashes for artifacts this quarter, which will prevent something like this from happening, or at least ensure that this particular reason is no longer an issue.
Flags: needinfo?(garndt)
So far what I've determined for the failures that occur when extracting the downloaded image and result in "Error 39 : Read error : premature end":

1. Not related to artifact transfers between regions: failures have been noted in us-west-2.  Looking at most of the recent failures this month, it has been machines only in us-west-2.  This is not surprising: over 90% of the gecko-t-linux-medium machines are provisioned in us-west-2, so it's more likely to see errors there.
2. Usually isolated to particular machines.
3. Once those machines are bad, they are always bad [1][2][3] and will fail at downloading the file (so retrying within a single task won't help, since claiming additional tasks and downloading again has the same outcome as retrying).
4. It gets to different points in the tar extract before failing (not consistent).
5. The downloaded file size is always the same between failed and successful loads.  Perhaps the content is corrupted in a way that results in the same size but bad content.  Work going on now to add headers defining the content hash for S3 objects will help answer this and rule it out (see the sketch after this list).
6. Always m1.medium machines.



[1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-09f11ee7e67b88535
[2] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-0ed128d92620dd676
[3] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-linux-medium/workers/us-west-2/i-0a6b06f9b6969a7ff
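(A hedged sketch of the content-hash check mentioned in point 5: hash the downloaded tarball and compare against an expected SHA-256 digest before loading it. How the expected digest is delivered, e.g. via an S3 object header, is an assumption here, and verifyArtifact is an illustrative name, not the actual taskcluster implementation.)

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Hedged sketch: validate a downloaded image tarball against an expected
// SHA-256 digest before attempting to load it into docker.
async function verifyArtifact(path: string, expectedSha256: string): Promise<void> {
  const hash = createHash("sha256");
  await new Promise<void>((resolve, reject) => {
    createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("end", () => resolve())
      .on("error", reject);
  });
  const actual = hash.digest("hex");
  if (actual !== expectedSha256) {
    throw new Error(
      `Artifact ${path} is corrupt: expected sha256 ${expectedSha256}, got ${actual}`,
    );
  }
}
```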
At least for the 3 machines referenced in comment 109, this is the result of a bad volume.  I imagine this is the root cause for most of these machines.

Errors are filled with things like:
end_request: I/O error, dev xvdb, sector 48349928
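(A hedged sketch of a health check that flags this failure mode by scanning the kernel log for block-device I/O errors; the log path and pattern are assumptions and this is not part of any existing worker check.)

```typescript
import { readFile } from "node:fs/promises";

// Hedged sketch: detect a failing block device by looking for kernel log
// lines like "end_request: I/O error, dev xvdb, ...".
async function hasDiskIoErrors(logPath = "/var/log/kern.log"): Promise<boolean> {
  try {
    const log = await readFile(logPath, "utf8");
    return /end_request: I\/O error, dev \w+/.test(log);
  } catch {
    // If the log can't be read, don't treat that as a disk failure.
    return false;
  }
}
```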
Assignee: nobody → wcosta
Severity: blocker → normal
Priority: -- → P1
Status: NEW → ASSIGNED
Component: Docker Images → Docker-Worker
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=177474014&repo=autoland&lineNumber=17

[taskcluster 2018-05-08 14:53:03.367Z] Task ID: QWVrbXTnSNih9y294DIQwA
[taskcluster 2018-05-08 14:53:03.367Z] Worker ID: i-0f28725baacfaf09f
[taskcluster 2018-05-08 14:53:03.367Z] Worker Group: us-west-2
[taskcluster 2018-05-08 14:53:03.367Z] Worker Node Type: c5.4xlarge
[taskcluster 2018-05-08 14:53:03.367Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-05-08 14:53:03.367Z] Public IP: 34.217.128.94
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-checkouts-v3-697eee6c304d222fdb39" -> /builds/worker/checkouts
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-autoland-build-linux64-st-an-opt-workspace-v3-697eee6c304d222fdb39" -> /builds/worker/workspace
[taskcluster 2018-05-08 14:53:03.367Z] using cache "level-3-tooltool-cache-v3-697eee6c304d222fdb39" -> /builds/worker/tooltool-cache

[taskcluster 2018-05-08 14:53:05.098Z] Downloading artifact "public/image.tar.zst" from task ID: D06NaEbzTcO-TgeZNpSGQQ.
[taskcluster 2018-05-08 14:53:08.071Z] Downloaded artifact successfully.
[taskcluster 2018-05-08 14:53:08.071Z] Downloaded 184.803 mb
[taskcluster 2018-05-08 14:53:08.071Z] Decompressing downloaded image
[taskcluster 2018-05-08 14:53:09.315Z] Loading docker image from downloaded archive.

[taskcluster:error] Pulling docker image has failed.

[taskcluster:error] Error: Error loading docker image. socket hang up
[taskcluster 2018-05-08 14:53:17.745Z] Unsuccessful task run with exit code: -1 completed in 14.379 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=182261458&repo=autoland&lineNumber=17
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
:wcosta, I don't see a lot of failures in recent times, can we close this bug?
Flags: needinfo?(wcosta)
(In reply to Joel Maher ( :jmaher ) (UTC+2) (PTO: back Aug 2) from comment #138)
> :wcosta, I don't see a lot of failures in recent times, can we close this
> bug?

I am okay with that
Flags: needinfo?(wcosta)
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Resolution: FIXED → WORKSFORME
Recent log file and snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?job_id=191373208&repo=mozilla-inbound&lineNumber=17

[taskcluster 2018-08-01 15:16:21.527Z] Task ID: D0le_IUFTCGBGhY8ix7Ztg
[taskcluster 2018-08-01 15:16:21.527Z] Worker ID: i-0b22f1ec30e9ad807
[taskcluster 2018-08-01 15:16:21.527Z] Worker Group: us-east-1
[taskcluster 2018-08-01 15:16:21.527Z] Worker Node Type: m4.4xlarge
[taskcluster 2018-08-01 15:16:21.527Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-08-01 15:16:21.527Z] Public IP: 18.207.103.206
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-tooltool-cache-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/tooltool-cache
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-checkouts-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/checkouts
[taskcluster 2018-08-01 15:16:21.527Z] using cache "level-3-mozilla-inbound-build-linux-rusttests-opt-workspace-v3-ef4ffaeaa8ce9aa76898" -> /builds/worker/workspace

[taskcluster 2018-08-01 15:16:22.688Z] Downloading artifact "public/image.tar.zst" from task ID: MTemE-0iQ0KdDx3ElnTA-g.
[taskcluster 2018-08-01 15:16:26.224Z] Downloaded artifact successfully.
[taskcluster 2018-08-01 15:16:26.224Z] Downloaded 206.282 mb
[taskcluster 2018-08-01 15:16:26.224Z] Decompressing downloaded image
[taskcluster 2018-08-01 15:16:27.931Z] Loading docker image from downloaded archive.

[taskcluster:error] Pulling docker image has failed.

[taskcluster:error] Error: Error loading docker image. socket hang up
[taskcluster 2018-08-01 15:16:34.961Z] Unsuccessful task run with exit code: -1 completed in 13.435 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
This is still happening.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=214528439&repo=autoland&lineNumber=17

[taskcluster 2018-11-29 02:54:56.678Z] Task ID: CVGnvjBzRMqt6YJ2gR13oA
[taskcluster 2018-11-29 02:54:56.678Z] Worker ID: i-0e05de9f46eb09899
[taskcluster 2018-11-29 02:54:56.678Z] Worker Group: us-east-1
[taskcluster 2018-11-29 02:54:56.678Z] Worker Node Type: m5d.4xlarge
[taskcluster 2018-11-29 02:54:56.678Z] Worker Type: gecko-3-b-linux
[taskcluster 2018-11-29 02:54:56.678Z] Public IP: 54.173.22.168
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-checkouts-v3-2f7c6cb7dd63ce826370" -> /builds/worker/checkouts
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-tooltool-cache-v3-2f7c6cb7dd63ce826370" -> /builds/worker/tooltool-cache
[taskcluster 2018-11-29 02:54:56.678Z] using cache "level-3-autoland-build-linux64-fuzzing-debug-workspace-v3-2f7c6cb7dd63ce826370" -> /builds/worker/workspace
[taskcluster 2018-11-29 02:54:57.764Z] Downloading artifact "public/image.tar.zst" from task ID: Y1XN7465QyaC9vE5DeD_kg.
[taskcluster 2018-11-29 02:55:00.863Z] Downloaded artifact successfully.
[taskcluster 2018-11-29 02:55:00.863Z] Downloaded 217.126 mb
[taskcluster 2018-11-29 02:55:00.863Z] Decompressing downloaded image
[taskcluster 2018-11-29 02:55:02.307Z] Loading docker image from downloaded archive.
[taskcluster:error] Pulling docker image has failed.
[taskcluster:error] Error: Error loading docker image. connect ENOENT /var/run/docker.sock
[taskcluster 2018-11-29 02:55:04.362Z] Unsuccessful task run with exit code: -1 completed in 7.684 seconds
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to Bogdan Tara[:bogdan_tara] from comment #155)
> This is still happening.

Sure, but at a failure rate of 1/3000, it's not going to jump to the front of the queue any time soon.
Yeah, the current process is that for intermittent failures we have a bug to associate them with so we can track and annotate.  I agree that a rate of once/week is not useful; in fact we probably see 1 failure in 300,000+.

This will hopefully die a quiet death when we move to generic-worker.

Component: Docker-Worker → Workers
Priority: P1 → P3

There are 48 failures associated with this bug in the last 7 days. These are occurring on Linux 64 opt, debug, and asan builds.

Attached image image.png

It looks like this is basically fixed by now.

Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Wander, there is a recent occurrence of this issue: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297365053&repo=mozilla-central&lineNumber=33

Should we reopen the bug?

Flags: needinfo?(wcosta)

Hrm, it seems to be some issue with the docker daemon. I don't think it is worth reopening the bug for such an isolated case, but feel free to do so if you disagree.

Flags: needinfo?(wcosta)