Closed Bug 1182739 Opened 6 years ago Closed 6 years ago

abort: HTTP error fetching bundle: HTTP Error 403: Forbidden on talos

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nigelb, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Seeing this consistently across all trees

19:49:27 INFO - Setting /builds/slave/test-pgo/build/talos_repo to https://hg.mozilla.org/build/talos revision 554aa164ba0b.
19:49:27 INFO - Cloning https://hg.mozilla.org/build/talos to /builds/slave/test-pgo/build/talos_repo.
19:49:27 INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', 'clone', 'https://hg.mozilla.org/build/talos', '/builds/slave/test-pgo/build/talos_repo']
19:49:27 INFO - Copy/paste: hg --config ui.merge=internal:merge clone https://hg.mozilla.org/build/talos /builds/slave/test-pgo/build/talos_repo
19:49:27 INFO - Calling ['hg', '--config', 'ui.merge=internal:merge', 'clone', 'https://hg.mozilla.org/build/talos', '/builds/slave/test-pgo/build/talos_repo'] with output_timeout 1200
19:49:27 INFO - downloading bundle https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/build/talos/b4b41ebeec4181e03574c9eabafc53f2217afeb9.stream.hg
19:49:27 ERROR - abort: HTTP error fetching bundle: HTTP Error 403: Forbidden
19:49:27 ERROR - Automation Error: hg not responding
19:49:27 INFO - (consider contacting the server operator if this error persists)
19:49:27 ERROR - Return code: 255 

https://treeherder.mozilla.org/logviewer.html#?job_id=2295538&repo=b2g-inbound
Found _no_ (zero) bundles on S3 for build/talos, in either us-west-2 or us-east-1

Regenerated the bundle as user hg on hgssh1 with:
 /repo/hg/scripts/outputif /repo/hg/scripts/generate-hg-s3-bundles build/talos
(line taken from user hg's crontab)

After regeneration, curl reports bundle there as expected:
 $ curl -I https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/build/talos/deac3ce69268c787de370974889e843b647a4486.stream.hg
HTTP/1.1 200 OK
x-amz-id-2: TrkW8yjsCC8dz0fbFCFYmFtC5dfl6nat/j7G7sEDGgeI8DxjEgcBhxJSpDcQhUh/cDWmByaKdCA=
x-amz-request-id: 415F84CC7CAE3C3C
Date: Sat, 11 Jul 2015 04:18:18 GMT
x-amz-version-id: wO_fGSKXneOyrqesQinX8isXNbytNWUo
Last-Modified: Sat, 11 Jul 2015 04:15:41 GMT
x-amz-expiration: expiry-date="Wed, 15 Jul 2015 00:00:00 GMT", rule-id="prune after a few days"
ETag: "c51745cca47ab0ac3edfc4d92ddd731b"
Accept-Ranges: bytes
Content-Type: application/octet-stream
Content-Length: 33391062
Server: AmazonS3
Assignee: nobody → hwine
Note that the build/talos repository hasn't been changed much recently. 
    deac3ce69268 2015-07-10 16:50 +0100
    b4b41ebeec41 2015-07-06 18:11 +0200

It was changed today, but the bundle-generating cron job had not yet run (it is scheduled for 0145 PT daily). The last good job on b2g-inbound downloaded the bundle successfully:
    17:30:25     INFO -  downloading bundle https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/build/talos/b4b41ebeec4181e03574c9eabafc53f2217afeb9.stream.hg

First fail:
    20:10:19     INFO -  downloading bundle https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/build/talos/b4b41ebeec4181e03574c9eabafc53f2217afeb9.stream.hg
    20:10:19    ERROR -  abort: HTTP error fetching bundle: HTTP Error 403: Forbidden

We're currently configured to permanently delete objects 4 days after creation (I think; I'm not positive about the S3 rule behavior). Those 4 days expired between the last-good and first-fail runs.

My guess is we need to "touch" the s3 object when we find the key already uploaded.

ni: gps for confirmation and fix
Flags: needinfo?(gps)
gah - timezone math in comment 2 is incorrect.
  2015-07-06 18:11 +0200 is 2015-07-06 09:11 -0700
so the "last good" was already over 8 hours after the 4 day window.
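The corrected arithmetic can be checked with a short sketch. It treats the commit time as the start of the 4-day window, as the comment does. The date and time zone of the "last good" job are assumptions: the log lines show only 17:30:25, so the sketch takes it to be 2015-07-10 in Pacific time.

```python
from datetime import datetime, timedelta, timezone

# Commit timestamp of b4b41ebeec41 as listed above (UTC+2).
created = datetime(2015, 7, 6, 18, 11, tzinfo=timezone(timedelta(hours=2)))

# Pacific Daylight Time (UTC-7), the zone used in the corrected figure.
pdt = timezone(timedelta(hours=-7))
created_pdt = created.astimezone(pdt)

# ASSUMPTION: the "last good" job ran on 2015-07-10 and its 17:30:25 log
# timestamp is Pacific time; the log itself carries no date or zone.
last_good = datetime(2015, 7, 10, 17, 30, 25, tzinfo=pdt)

# End of the 4-day window, and how far past it the last good job ran.
window_end = created_pdt + timedelta(days=4)
overrun = last_good - window_end

print(created_pdt.isoformat())  # 2015-07-06T09:11:00-07:00
print(overrun)                  # 8:19:25
```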
Hal's analysis and resolution are spot on: lifecycle rules on the S3 bucket caused deletion of the bundle after N days, and regenerating the bundles manually was the proper workaround.

The lifecycle pruning is apparently performed around UTC day boundaries, which explains why the gap between creation time and deletion time is not a whole multiple of 24 hours.
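The UTC day-boundary behavior can be illustrated with a small sketch. It assumes the rule adds the configured number of days to the object's Last-Modified time and rounds up to the next midnight UTC. The function name `lifecycle_expiry` is hypothetical, and the 3-day value is taken from the commit message later in this bug.

```python
from datetime import datetime, timedelta, timezone


def lifecycle_expiry(last_modified, days):
    """Return the UTC midnight at which an object becomes eligible for expiry.

    Adds the rule's day count to the modification time, then rounds up to
    the following midnight UTC. An object due exactly at midnight is an
    edge case this sketch does not try to settle.
    """
    due = last_modified.astimezone(timezone.utc) + timedelta(days=days)
    midnight = due.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


# Last-Modified header from the curl output earlier in this bug.
last_modified = datetime(2015, 7, 11, 4, 15, 41, tzinfo=timezone.utc)

print(lifecycle_expiry(last_modified, 3))  # 2015-07-15 00:00:00+00:00
```

Under these assumptions the result agrees with the expiry-date in the x-amz-expiration header shown above (Wed, 15 Jul 2015 00:00:00 GMT), which is consistent with a 3-day rule.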

I have temporarily disabled the lifecycle policy on the 2 relevant S3 buckets to ensure bundles aren't prematurely deleted in the next few days (read: over the weekend).

A long-term solution will be to "touch" the objects (as Hal suggested) or something along those lines. I'll figure out the resolution on Monday. We should also consider having the Mercurial server verify that the content listed in its bundle manifests is available. Although something tells me automation will effectively do this for us, so it might not be worth the effort.
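One way the manifest verification idea might look is sketched below. It is not what was implemented. The helper names `http_head_status` and `unavailable_bundles` are hypothetical, and the sketch assumes the bundle URLs have already been extracted from the manifest into a list.

```python
import urllib.error
import urllib.request


def http_head_status(url, timeout=10):
    """Issue a HEAD request and return the HTTP status code."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 403 (as seen in this bug) and 404 arrive here as exceptions.
        return error.code


def unavailable_bundles(urls, head=http_head_status):
    """Return (url, status) pairs for bundle URLs that do not answer 200."""
    failures = []
    for url in urls:
        status = head(url)
        if status != 200:
            failures.append((url, status))
    return failures
```

Passing the `head` callable as a parameter keeps the check testable without network access. A cron job could run something like this after bundle generation and alert on a non-empty result.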

FWIW, the behavior of the clone hard failing on missing bundle and not falling back to hg.mozilla.org is by design. See https://hg.mozilla.org/hgcustom/version-control-tools/rev/7fd53f81bc94. tl;dr it helps prevent clone flooding and overwhelming hg.mozilla.org.

Finally, there are two hard problems in computer science and this bug is due to one of them (caching). That makes me feel slightly better about things :)
Assignee: hwine → gps
Status: NEW → ASSIGNED
Flags: needinfo?(gps)
scripts: remote copy S3 object to reset lifecycle expiration (bug 1182739); r?bkero

We have a lifecycle policy on the S3 buckets for bundles that expires
objects after a few days (currently 3). If a repository is inactive for
a few days, we wouldn't upload bundles and the utilized S3 objects would
be expired.

This commit changes the behavior when the S3 object exists to perform a
remote copy, which will reset the object's mtime, thus resetting the
object expiration timer.
Attachment #8632867 - Flags: review?(bkero)
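The remote-copy approach described in the commit message can be sketched as follows. The bug does not show the actual patch, so this is only an illustration: it assumes a boto3 S3 client, and the function name `refresh_s3_object` is hypothetical.

```python
def refresh_s3_object(client, bucket, key):
    """Copy an S3 object onto itself so its Last-Modified time is reset.

    ASSUMPTION: `client` is a boto3 S3 client, e.g. boto3.client("s3").
    The copy happens server-side, so the bundle is not downloaded or
    re-uploaded.
    """
    # Read the current metadata so it can be carried over unchanged.
    head = client.head_object(Bucket=bucket, Key=key)
    client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        # S3 rejects a self-copy that changes nothing, so the metadata is
        # re-sent with REPLACE instead of being copied implicitly.
        MetadataDirective="REPLACE",
        Metadata=head.get("Metadata", {}),
        ContentType=head.get("ContentType", "application/octet-stream"),
    )
```

Because the copy writes a new object under the same key, its Last-Modified time is reset and the lifecycle rule's expiration is counted from the new time.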
Comment on attachment 8632867 [details]
MozReview Request: scripts: remote copy S3 object to reset lifecycle expiration (bug 1182739); r?bkero

https://reviewboard.mozilla.org/r/13149/#review11735

I wish we could accomplish this with something like a touch(1) to make it a little faster/cheaper, but this should do.
Attachment #8632867 - Flags: review?(bkero) → review+
url:        https://hg.mozilla.org/hgcustom/version-control-tools/rev/d558ddea8f6f0873b0bdf084423198b78f8e814d
changeset:  d558ddea8f6f0873b0bdf084423198b78f8e814d
user:       Gregory Szorc <gps@mozilla.com>
date:       Mon Jul 13 12:23:48 2015 -0700
description:
scripts: remote copy S3 object to reset lifecycle expiration (bug 1182739); r=bkero

We have a lifecycle policy on the S3 buckets for bundles that expires
objects after a few days (currently 3). If a repository is inactive for
a few days, we wouldn't upload bundles and the utilized S3 objects would
be expired.

This commit changes the behavior when the S3 object exists to perform a
remote copy, which will reset the object's mtime, thus resetting the
object expiration timer.
This is now deployed. I'll wait until tomorrow to turn back on the lifecycle policy for these buckets, otherwise objects may get deleted before the bundles are regenerated in the next ~14 hours.
The lifecycle policies on the 2 buckets have been manually re-enabled through the AWS web console.

I've also adjusted the policies to prune after 7 days, not 3. We'll pay slightly more money for having more "live" objects around. But, this gives us a whole week to detect potential issues instead of just 3 days.

I also verified that objects like the latest build/talos bundle had their mtime bumped in last night's bundle generation.

Calling this bug done.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1183857