Closed Bug 1314284 Opened 9 years ago Closed 9 years ago

cloud-mirror redirecting to S3 URLs that 404

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure)

We have been in a call investigating this for 3 hours now :) The error has not recurred beyond this instance, as the object eventually stopped 404'ing and started 200'ing. We were able to find another object illustrating this issue that was not causing failures but which we could reproduce using some simple `curl` operations. We purged this object and it is now stuck in the copying state.

We have not yet found a root cause, although we have a few possibilities:
* eventual-consistency issue with S3
* bug in cloud-mirror flagging a failed copy as successful

And have made a change which may help:
* Disable backfilling (which causes a HEAD request to S3 objects before copying them, which disables S3's read-after-write consistency)
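For anyone wanting to script the `curl` reproduction, here is a minimal sketch of the check: request the mirror URL without following the redirect, then request the redirect target and see whether it 404s. The names `http_status` and `check_mirror` are hypothetical helpers, not part of cloud-mirror, and the sketch assumes the mirror answers with an absolute `Location` header.

```python
import urllib.error
import urllib.request

REDIRECT_CODES = (301, 302, 303, 307, 308)


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 3xx response is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx response


def http_status(url, follow_redirects=True):
    """GET url and return (status_code, Location header or None)."""
    handlers = [] if follow_redirects else [NoRedirect()]
    opener = urllib.request.build_opener(*handlers)
    try:
        # Only the status line is inspected; the body is never read.
        with opener.open(url, timeout=30) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def check_mirror(mirror_url, probe=http_status):
    """Return (verdict, final_status) for a mirror URL (hypothetical helper)."""
    status, location = probe(mirror_url, follow_redirects=False)
    if status not in REDIRECT_CODES or not location:
        return "no-redirect", status
    target_status, _ = probe(location, follow_redirects=True)
    if target_status == 404:
        # The case described in this bug: a redirect to a missing object.
        return "broken-redirect", target_status
    return ("ok" if target_status == 200 else "unexpected"), target_status
```

The `probe` parameter exists only so the check can be exercised against canned responses without touching the network.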
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1314479
This is recurring quite a lot, per some emails from papertrail and per a current treeclosure.
Blocks: 1302596
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Summary: Intermittent [taskcluster:error] Error: Error loading docker image. Could not download artifact "public/image.tar from task "ZsXD2f8BTq61uQ-T6mkyMw" after 1 attempt(s). Error: Not Found → cloud-mirror redirecting to S3 URLs that 404
We stopped doing HEAD requests (by removing the backfilling support) on Nov 1.
https://github.com/taskcluster/cloud-mirror/pull/24

I believe jonas and john landed some error-handling fixes yesterday (they were merged, but I don't know about deployment):
https://github.com/taskcluster/cloud-mirror/pull/27
https://github.com/taskcluster/cloud-mirror/pull/28
https://github.com/taskcluster/cloud-mirror/pull/29
https://github.com/taskcluster/cloud-mirror/pull/30
Regarding S3 consistency:
- we have determined that we are using the read-after-write consistency endpoints everywhere
- we have seen this with mirrored artifacts in us-west-1 as well
- while read-after-write fails if there is a HEAD request just before the PUT, this effect is short-lived, not the hours we're seeing
- S3 is eventually consistent for "existing" objects. Discussion ongoing as to what that means.
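One way to tell a short-lived consistency window apart from the hours-long 404s seen here is to measure how long an object keeps returning 404. A rough sketch follows, assuming a caller-supplied `probe(url)` that returns an HTTP status code; `time_until_available` and its parameters are hypothetical names for illustration.

```python
import time


def time_until_available(probe, url, timeout_s=60.0, interval_s=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll url until probe() returns 200.

    Returns the elapsed seconds at the first 200, or None if the object
    was still unavailable once timeout_s had passed.
    """
    start = clock()
    while True:
        status = probe(url)
        elapsed = clock() - start
        if status == 200:
            return elapsed
        if elapsed >= timeout_s:
            return None  # still failing: longer than a brief consistency delay
        sleep(interval_s)
```

A result of a few seconds would be in line with the HEAD-before-PUT effect described above; a `None` after a generous timeout would point elsewhere, for example at a copy wrongly flagged as successful.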
The failure today is for ZsXD2f8BTq61uQ-T6mkyMw/public/image.tar
Fixes for cloud-mirror were deployed around 09:15 CDT; we are hopeful they will correct this issue. We'll be monitoring for recurrences.
I just flushed the redis cache to force all possible invalid present values to be removed:

> flushdb
OK (0.89s)
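A narrower alternative to flushing everything would be to drop only the "present" entries whose backing object no longer responds with 200. The sketch below uses a plain dict as a stand-in for the redis cache; the URL-to-state layout and the `purge_stale_present` helper are assumptions for illustration and do not reflect cloud-mirror's actual key schema.

```python
def purge_stale_present(cache, probe):
    """Remove 'present' entries whose object does not return 200.

    cache: dict mapping object URL -> state string (e.g. 'present', 'copying').
           This layout is an assumption, not cloud-mirror's real schema.
    probe: callable taking a URL and returning an HTTP status code.
    Returns the sorted list of URLs that were removed.
    """
    stale = [url for url, state in cache.items()
             if state == "present" and probe(url) != 200]
    for url in stale:
        del cache[url]  # the next request would trigger a fresh copy
    return sorted(stale)
```

Entries in other states, such as "copying", are left untouched.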
That 7-day window extends back before the fix, which landed 4 days ago, so *done*
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED