Closed Bug 1041173 Opened 6 years ago Closed 5 years ago

Deploy bundleclone extension

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/734] )

Attachments

(10 files, 2 obsolete files)

39 bytes, text/x-review-board-request (bkero: review+)
39 bytes, text/x-review-board-request (bkero: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
39 bytes, text/x-review-board-request (smacleod: review+)
As is tradition, I hacked on a "fun" project while on a flight. In the sky between San Francisco and Austin, I cobbled together a Mercurial extension that performs bundle-assisted clones. Essentially, the server pre-generates a Mercurial bundle and advertises the URL to clients. When clients perform `hg clone`, they first check if a bundle URL is advertised. If so, they download and apply the bundle then perform a pull.

This is all transparent from the perspective of the client: all you need to do is install an extension and talk to a server that advertises bundles.

The server side is slightly more complicated. You need to generate a bundle and make it available somehow. This likely boils down to a CRON job and most likely an HTTP server serving static files from a directory.

The benefit of this extension is that the Mercurial server doesn't generate a new bundle for each clone (hg clone calls a "getbundle" API internally). Instead, you pay an up-front, out-of-band cost for generating a single bundle file and the clone cost is effectively static file transfer (dirt cheap) plus an incremental hg pull (cost is proportional to the difference since bundle creation - this should also be very cheap).

I have an old clone of mozilla-central with 191669 commits. If I start a local HTTP Mercurial server and do an `hg clone`, I get the following CPU times:

Client Clone:     774s CPU time
Mercurial Server: 205s CPU time

With bundle clone using a gzip bundle (~877 MB), I start a separate HTTP server for the bundle and get:

Client Clone:     842s CPU time
Mercurial Server:   0.267s CPU time
Bundle Server:      1.122s CPU time (using a Python HTTP server)

With bundle clone with 3346 additional changesets (we'll do an incremental pull in addition to bundle apply):

Client Clone:     852s CPU time
Mercurial Server:  15.943s CPU time

The major difference here is the server CPU time went from 205s to essentially nothing for the full bundle clone and very small for the partial bundle clone.

When you realize 0.164s of the Mercurial server time is the start-up and teardown time, the bundle-based clone is even more impressive (~100 ms CPU time on server).
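
For anyone who wants to check the arithmetic behind those claims, here is a quick sketch in Python. It only uses the CPU times quoted above; the helper name `speedup` is made up for illustration.

```python
# Server CPU times in seconds, copied from the measurements in this comment.
FULL_CLONE_SERVER = 205.0        # regular `hg clone`
BUNDLE_CLONE_SERVER = 0.267      # bundle clone, nothing extra to pull
PARTIAL_BUNDLE_SERVER = 15.943   # bundle clone plus 3346 changesets pulled
SERVER_STARTUP_TEARDOWN = 0.164  # process start-up and teardown


def speedup(before, after):
    """Return how many times less CPU `after` needs compared to `before`."""
    return before / after


# Server CPU attributable to the clone itself, excluding start-up/teardown.
clone_only = BUNDLE_CLONE_SERVER - SERVER_STARTUP_TEARDOWN

print("full bundle clone:    %.0fx less server CPU"
      % speedup(FULL_CLONE_SERVER, BUNDLE_CLONE_SERVER))
print("partial bundle clone: %.1fx less server CPU"
      % speedup(FULL_CLONE_SERVER, PARTIAL_BUNDLE_SERVER))
print("clone-only server CPU: %.0f ms" % (clone_only * 1000))
```

The last figure is where the "~100 ms CPU time on server" number comes from.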

There is an absurdly long pause after incoming changesets are added. I'm almost certain this is something running during the "changegroup" hook and is due to an extension I'm running. The time seems similar in both runs, although the bundle code path seems to incur two pauses (presumably leading to the longer client clone time). I need to investigate this. (Unfortunately the blackbox log isn't being helpful.)

I expect proper deployment of this extension by Firefox automation to reduce Mercurial server load drastically.

*I reckon this extension could have prevented Thursday's outage due to decreasing server load.*

Here's how I see this bug playing out.

Release Automation:

1) Enable bundleclone extension on all clients ASAP. Installing it will change nothing until the server extension is in place.

HG Server:

1) Enable bundleclone extension globally on server.
2) Set up a public HTTP or FTP server to hold bundle files. This could potentially be S3.
3) Set up a CRON to generate a bundle of mozilla-central periodically (every hour or so or perhaps on checkin). Have it publish files to new server. Have it update the bundle manifest files for related repos (mozilla-central, mozilla-inbound, fx-team, etc)
4) Repeat 3 for other large and/or highly-cloned repos (mozilla-aurora, mozilla-beta, mozilla-release, mozharness, etc).

We *should* be able to share bundles across the Firefox integration repos. (At least for the automation support case.) Although if we don't want to do this, that's acceptable.

I only had time to make the extension work for Mercurial 3.0 on the plane. There's no reason it can't be backported (to 2.5.4 and whatnot) for deployment on hg.mozilla.org and in Firefox automation.

Code lives at https://hg.mozilla.org/hgcustom/version-control-tools/file/default/hgext/bundleclone. There are even tests!

This work is inspired by Augie Fackler's lookaside extension proof-of-concept.

Augie: If you want to look at the code and let me know of any obvious shortcomings, it would be appreciated. Things I've thought of:

* Download resume support on bundles. But I don't think that's trivial to implement.
* Intercepting pull to an empty repo to go through bundle code path (not sure if needed)
Flags: needinfo?(raf)
The extension now works with Mercurial 2.5 through 3.0 and it should be safe to deploy on hg.mozilla.org and all of Firefox automation.
Is the bundleclone.manifest file something that would be included on an `hg pull` of the repo? If not, then we'd need to figure out how to include that in repo mirroring.
Yeah, resuming an interrupted pull is hard - I'm not sure how we could do that if the user ^Cs in the middle. We could probably have some sort of retry loop though for flaky internet connections?

As for the second point, it might be nice (some day) to figure out how to cache bundles that clients are likely to want to pull and hook pull in a general way as well. Is it common for people to pull into an empty repo? I don't think I've ever done it outside of hgsubversion development bugs.
Flags: needinfo?(raf)
(In reply to Kendall Libby [:fubar] from comment #2)
> Is the bundleclone.manifest file something that would be included on an `hg
> pull` of the repo? If not, then we'd need to figure out how to include that
> in repo mirroring.

Yes, this can be added.
The extension now supports copying the manifest at pull time. You need to enable it though.

https://hg.mozilla.org/hgcustom/version-control-tools/rev/c1b6926c0de6
(In reply to Gregory Szorc [:gps] from comment #0)
> HG Server:
> 
> 2) Set up a public HTTP or FTP server to hold bundle files. This could
> potentially be S3.
> 3) Set up a CRON to generate a bundle of mozilla-central periodically (every
> hour or so or perhaps on checkin). Have it publish files to new server. Have
> it update the bundle manifest files for related repos (mozilla-central,
> mozilla-inbound, fx-team, etc)
> 4) Repeat 3 for other large and/or highly-cloned repos (mozilla-aurora,
> mozilla-beta, mozilla-release, mozharness, etc).

These all already exist (and have for years) - see https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Source_Code/Mercurial/Bundles

And, the moral equivalent of the client side code is already deployed throughout the build farm as I understand your description. Automation always explicitly prefers a bundle when a clone is needed.

The one change from current operations is you are proposing to create bundles far more often. We only do them once a week (iirc) due to the cost of producing a bundle (at least at the time things were set up).

What tests do you propose to run to show the benefit to existing build farm operations? Or is this targeted for developers?
With this extension, all the complicated code in automation around using bundles can be deleted: Mercurial will handle it transparently. We already have cases where not all consumers of `hg clone` do the bundle beforehand. See mozharness.

This extension shifts the burden of maintenance from Firefox automation into the Mercurial server. Code and procedures in Firefox automation become much simpler: "just install an extension."

Developers also benefit from this change, since they can likely download a bundle file from a static server faster than the sometimes-overloaded hg.mozilla.org can dynamically generate one.

I would recommend producing bundles as quickly as we can justify. Daily is a good first target. There's no reason we can't up the frequency later if we want. We could point at the existing weekly bundles for an initial test. But we have enough new code landing that incremental pull will eat a lot of CPU cycles, eating into the benefits of this extension.

The extension has automated tests. The next steps are to deploy it somewhere and have people start using it. I would like that to be hg.mozilla.org and I would like the initial audience to be developers and possibly a project branch of automation (ash?).

The overall risk for clients and servers should be minimal. If things blow up, we zero out the manifests on the servers to revert to old behavior and/or we uninstall the extension.
Product: Release Engineering → Developer Services
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/200]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/200] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/734] [kanban:engops:https://kanbanize.com/ctrl_board/6/200]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/734] [kanban:engops:https://kanbanize.com/ctrl_board/6/200] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/734]
I've deployed the bundleclone extension on hg.mozilla.org. I did a one-time generation of bundles for mozilla-central and integration/mozilla-inbound.

catlee: could you install bundleclone on a test project branch so we can see if it "just works?"
Assignee: nobody → gps
Status: NEW → ASSIGNED
Flags: needinfo?(catlee)
we don't have a great way of deploying bundleclone to just one branch. closest I can come up with is to land the extension in our tools repo, and update hgtool to enable it when given a specific cmdline option, or perhaps for particular repositories.
Flags: needinfo?(catlee)
Hmmm.

OK. Let me put some polish on the bundleclone extension then we can coordinate on a for real deployment. I'll associate the commits with this bug.
Attached file MozReview Request: bz://1041173/gps (obsolete)
/r/9145 - bundleclone: support client-side preferences for choosing bundles (bug 1041173)
/r/9147 - ansible: create a playbook for writing hg bundles to s3
/r/9149 - ansible: install CRON to periodically generate repository bundles

Pull down these commits:

hg pull -r e308dd2243853d27ddc871e23ff5f46e269412e2 https://reviewboard-hg.mozilla.org/version-control-tools/
Attachment #8608357 - Flags: review?(smacleod)
Attachment #8608357 - Flags: review?(bkero)
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

/r/9145 - bundleclone: support client-side preferences for choosing bundles (bug 1041173)
/r/9147 - ansible: create a playbook for writing hg bundles to s3
/r/9149 - ansible: install CRON to periodically generate repository bundles

Pull down these commits:

hg pull -r 5fd5fcfb6531459385e78af61dd3ce2700033bfc https://reviewboard-hg.mozilla.org/version-control-tools/
There are now 2 S3 buckets, with replication configured between the two. One bucket is in us-west-2, the other in us-east-1. This corresponds to where Firefox automation is currently running. If we set up automation elsewhere or start wanting to clone from other geographic regions, we can always create buckets in more regions. The cost of storage should be <$100/mo per region, which is pocket change as far as I'm concerned. I have an auto-expiration policy set up to ensure we don't retain GB-sized S3 objects for more than a few days.

I'm in the process of updating manifests for mozilla-central and other popular repos to reference the new URLs. I've also incorporated the "compression" and "ec2region" attributes in the manifests. We're also now generating 3 versions of each bundle (gzip, bzip2, and uncompressed). So, each manifest will have 6 URLs (2 regions x 3 compression types). I need to do some measuring, but I suspect uncompressed doesn't give us that much of an advantage over gzip and we can kill it. But we'll likely want to retain bzip2, as it is a few hundred MB smaller than gzip and people with slow connections may want to trade reduced bandwidth for increased one-time CPU cost to decode the bundle.
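
To make the "6 URLs per manifest" and client-side preference idea concrete, here is a rough Python sketch. It assumes a manifest format of one URL per line followed by space-separated `key=value` attributes; that format, the `bundles.example.com` URLs, and the function names `parse_manifest` and `choose_bundle` are all illustrative assumptions, not the extension's actual code.

```python
# Hypothetical manifest: 2 regions x 3 compression types = 6 entries.
MANIFEST = """\
https://bundles.example.com/west/tip.gzip.hg ec2region=us-west-2 compression=gzip
https://bundles.example.com/west/tip.bzip2.hg ec2region=us-west-2 compression=bzip2
https://bundles.example.com/west/tip.none.hg ec2region=us-west-2 compression=none
https://bundles.example.com/east/tip.gzip.hg ec2region=us-east-1 compression=gzip
https://bundles.example.com/east/tip.bzip2.hg ec2region=us-east-1 compression=bzip2
https://bundles.example.com/east/tip.none.hg ec2region=us-east-1 compression=none
"""


def parse_manifest(text):
    """Parse manifest text into a list of (url, attributes) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        attrs = dict(field.split("=", 1) for field in fields[1:])
        entries.append((fields[0], attrs))
    return entries


def choose_bundle(entries, prefers):
    """Pick the entry that best matches an ordered list of (key, value)
    preferences. Earlier preferences win over later ones; ties keep
    manifest order. Returns None if the manifest is empty."""
    def rank(entry):
        _url, attrs = entry
        return tuple(0 if attrs.get(key) == value else 1
                     for key, value in prefers)
    return min(entries, key=rank) if entries else None


# A client in us-east-1 on a slow link that would rather spend CPU than bandwidth.
url, attrs = choose_bundle(parse_manifest(MANIFEST),
                           [("ec2region", "us-east-1"),
                            ("compression", "bzip2")])
print(url)
```

Ranking by a tuple keeps the selection deterministic, and an empty preference list simply falls back to the first entry the server advertises.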
Hmm... is S3 bucket replication not synchronous with the upload? I notice a lag of several dozen seconds between when we upload to us-west-2 and the object appears in the us-east-1 replicated bucket. We need to wait on publishing the manifest until all bundles are available, or clients will fail to apply the bundle and fall back to the Mercurial server. This could be disastrous if our server capacity is provisioned on the assumption the majority of clones are bootstrapped by S3-based bundles.
It's not synchronous -- that's why tooltool's still doing explicit COPY operations and only adding the replicated copy to its index when the operation completes.  You might do the same, btw -- put one bundle up immediately, and let clients hit that even if it's not closest, until the replication completes.
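
One way to handle the replication lag is to hold back the manifest until every bundle URL is readable. Below is a minimal Python sketch of that polling loop. The availability check is passed in as a callable (`is_available` is a hypothetical parameter; a real one would presumably issue an HTTP HEAD request), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_until_available(urls, is_available, timeout=300.0, interval=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll until every URL is available or the timeout expires.

    Returns True if all URLs became available, False on timeout.
    `is_available(url)` is supplied by the caller.
    """
    deadline = clock() + timeout
    pending = list(urls)
    while pending:
        # Only re-check URLs that were not yet available last time around.
        pending = [url for url in pending if not is_available(url)]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
    return True
```

A publishing script would only write the new manifest when this returns True. On False it could keep advertising the previous bundles, or advertise just the copies that are already up, as suggested above.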
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

/r/9145 - bundleclone: support client-side preferences for choosing bundles (bug 1041173)
/r/9215 - bundleclone: copy stream clone handling code from Mercurial
/r/9217 - bundleclone: support stream bundles (bug 1041173)
/r/9147 - ansible: create a playbook for writing hg bundles to s3 (bug 1041173)
/r/9149 - ansible: install CRON to periodically generate repository bundles (bug 1041173)

Pull down these commits:

hg pull -r b495e3852095d7baea5da571dc4aa042e91a80cc https://reviewboard-hg.mozilla.org/version-control-tools/
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

/r/9145 - bundleclone: support client-side preferences for choosing bundles (bug 1041173)
/r/9215 - bundleclone: copy stream clone handling code from Mercurial
/r/9217 - bundleclone: support stream bundles (bug 1041173)
/r/9147 - ansible: create a playbook for writing hg bundles to s3 (bug 1041173)
/r/9149 - ansible: install CRON to periodically generate repository bundles (bug 1041173)
/r/9219 - bundleclone: do not do bundle cloning with stream is requested
/r/9221 - bundleclone: do not fall back to regular clone by default (bug 1041173)

Pull down these commits:

hg pull -r 6c161ecf0caa04e7e84be54058a8c7362aec52f0 https://reviewboard-hg.mozilla.org/version-control-tools/
With support for "stream bundles", a Mercurial clone of mozilla-central weighing in at ~1480MB is *faster* to clone over the wireless network in the SF office with my MBP than a git clone, which weighs in at ~1150MB. Here are the raw times (using -U with Mercurial and --bare with Git to avoid working copy checkout):

Mercurial
real	2m20.542s
user    0m21.594s
sys     0m27.592s

Git
real    3m58.166s
user    3m 6.994s
sys     1m32.405s

<5 seconds was spent counting and compressing objects on the Git server (the server was recently repacked). Git is transferring the packfile fast enough. But "Resolving deltas" takes over a minute.

A Git shallow clone with --depth 1 is better:

real    0m34.789s
user    0m 8.342s
sys     0m 3.099s

Although, Git shallow clones mean more work for the server to recompress objects in packfiles. Serving pre-generated bundles/packfiles is infinitely better than *any* type of bundling/packing on the server if server scalability is what you care about.

When executed on a beefy server in a data center with a fat pipe to S3 that is capable of transferring ~90MB/s:

Mercurial
real    1m8.910s
user    0m40.515s
sys     0m17.152s

Git
real    3m17.629s
user    4m 5.196s
sys     2m 3.881s

Mercurial from a datacenter is ~2x faster. This is almost certainly due to the fat network pipe.

Git from a datacenter is slightly faster. But it looks like transfer speed was about the same as from the SF office (~6.5MB/s). I reckon Linux's superior I/O performance and Git's Linux-heavy developer base have something to do with this result. But don't hold me to that.

For reference, `time wget https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-central/cb33535b93f2cc8905104449e1a5e242ca86420f.stream.hg` from the office wi-fi:

real	2m46.782s
user	0m 3.929s
sys	0m 6.444s

This tells me network throughput is the limiting factor over a <200 Mbps network (expected).

From a server in a datacenter:

real	0m16.928s
user	0m 5.676s
sys	0m 5.194s

Looking at the network graphs, TCP slow start hurts us here: it takes 4-5s to approach max transfer speed. We could certainly muck about with network settings if we really wanted to, I reckon.

We still have ~45s overhead in Mercurial. However, looking at the code, I don't think there's much we can do. The code for applying stream bundles is all pretty low-level and I reckon Python is a bigger bottleneck than Mercurial. But I could certainly profile and make Mercurial faster if people think we need to. Before I do that, let's not lose sight of the fact that 60s to clone is significantly faster than 340s to clone, which is how much CPU time Mercurial spends without the optimal stream clones.
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

/r/9145 - bundleclone: support client-side preferences for choosing bundles (bug 1041173)
/r/9215 - bundleclone: copy stream clone handling code from Mercurial
/r/9217 - bundleclone: support stream bundles (bug 1041173)
/r/9147 - ansible: create a playbook for writing hg bundles to s3 (bug 1041173)
/r/9149 - ansible: install CRON to periodically generate repository bundles (bug 1041173)
/r/9219 - bundleclone: do not do bundle cloning with stream is requested
/r/9221 - bundleclone: do not fall back to regular clone by default (bug 1041173)
/r/9231 - docs: document how pre-generated bundles work

Pull down these commits:

hg pull -r 10dee37a6b13b86de8c2b508e1033ec031065f58 https://reviewboard-hg.mozilla.org/version-control-tools/
For reference, `tar -xf` of an .hg directory on OS X:

real	0m21.482s
user	0m1.160s
sys	0m18.503s

Also, Augie from the Mercurial project had a sinister idea: we could implement a custom store/repo class that knew how to read into the stream bundle file. Essentially, we would download the stream bundle in <20s and immediately have access to the data therein. No writing 200,000 files. Just build a mapping of filename to offset and we'd be good. From there, you could use an overlay repo to apply additional bundles on top of the base repo, allowing us to pull other changesets in. It's a crazy idea. It would enable us to access repo data with essentially just a file download. But it has downsides, such as managing stacks of bundles and continuously pulling changesets/bundles. But for certain use cases, we could have a usable clone in <20s on a 1Gbps connection. Not bad. But I think I'll leave that as a follow-up if people aren't satisfied with existing performance :)
https://reviewboard.mozilla.org/r/9217/#review7967

Looks good to me - Only issue is the edge case I wanted to be sure about (If I'm incorrect I'd also be interested in knowing why).

::: hgext/bundleclone/__init__.py:336
(Diff revision 1)
> +    requires = repo.requirements & repo.supportedformats
> +    if requires - set(['revlogv1']):

Will this fail in the case of a local based clone, since requirements will be a list instead of a set?

Though, I might be mis-understanding `class localrepository(object)` in localrepo.py and how it relates to the repository object we have here. Please take a look at this.
https://reviewboard.mozilla.org/r/9219/#review7971

Commit message nit: "do not do bundle cloning *with* stream is requested"
https://reviewboard.mozilla.org/r/9221/#review7975

::: hgext/bundleclone/__init__.py:506
(Diff revision 1)
>              except urllib2.URLError as e:
>                  self.ui.warn(_('error fetching bundle; using normal clone: %s\n') % e.reason)
>                  return super(bundleclonerepo, self).clone(remote, heads=heads,
>                          stream=stream)

This is dead code now since the exception handler above it will catch the `urllib2.URLError`
Attachment #8608357 - Flags: review?(smacleod) → review+
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

https://reviewboard.mozilla.org/r/9143/#review7979

LGTM. r+ with the few nits and issues addressed / fixed.
https://reviewboard.mozilla.org/r/9217/#review8011

> Will this fail in the case of a local based clone, since requirements will be a list instead of a set?
> 
> Though, I might be mis-understanding `class localrepository(object)` in localrepo.py and how it relates to the repository object we have here. Please take a look at this.

You are correct: localrepository.requirements is supposed to be a list. But AFAICT, consumers only care about it being iterable. There's probably some code path that needs it to be a list. I'll throw some casts in here to make it happy.
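
A small standalone Python illustration of the failure mode and the cast, for the record. The requirement names are sample values and `format_requirements` is a hypothetical helper, not the code that landed.

```python
def format_requirements(requirements, supportedformats):
    """Intersect requirements with supported formats, accepting any iterable.

    Casting both sides to set() makes this work whether the repo object
    exposes a list or a set.
    """
    return set(requirements) & set(supportedformats)


# Sample values for illustration only.
requirements = ['revlogv1', 'store', 'fncache']   # a list, as on a local repo
supportedformats = set(['revlogv1', 'generaldelta'])

try:
    requirements & supportedformats   # list & set is not defined in Python
except TypeError as e:
    print("without the cast: %s" % e)

requires = format_requirements(requirements, supportedformats)
if requires - set(['revlogv1']):
    print("formats beyond revlogv1 are required: %s" % sorted(requires))
else:
    print("only revlogv1 is required")
```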
I landed the bundleclone bits of the series. Thanks for the review, Steven!

catlee: https://hg.mozilla.org/hgcustom/version-control-tools/raw-file/392be0afcb49/hgext/bundleclone/__init__.py contains the extension you'll want to start deploying. It's only tested against Mercurial 3.1 and later. If you need to support older versions, while I'd highly prefer you upgrade Mercurial, I can make it compatible with earlier versions as needed. https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/bundleclone.html contains some docs on configuring it.

*Please let us know when this gets rolled out so we can look at logs and our S3 bill to be sure there aren't any surprises.*

I'll submit a new review series for the Ansible configs for the hg.mozilla.org servers.
Comment on attachment 8608357 [details]
MozReview Request: bz://1041173/gps

Will submit new series for Ansible foo.
Attachment #8608357 - Flags: review?(bkero)
Attached file MozReview Request: bz://1041173/gps2 (obsolete)
/r/9311 - ansible: create a playbook for writing hg bundles to s3 (bug 1041173)
/r/9313 - ansible: install CRON to periodically generate repository bundles (bug 1041173)

Pull down these commits:

hg pull -r 21e84bc1f84135b21f0a273bf0438c39c07974a6 https://reviewboard-hg.mozilla.org/version-control-tools/
Attachment #8609685 - Flags: review?(bkero)
TIL about the wonkiness of the US Standard S3 region.

According to https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region, requests sent to s3.amazonaws.com get routed to either Virginia or the Pacific Northwest (read: us-east-1 or us-west-2) even though the US Standard region is in Virginia, in us-east-1.

Normally, your region specific hostnames are s3-<region>.amazonaws.com. However, s3-us-east-1.amazonaws.com doesn't exist. Instead, the equivalent is s3-external-1.amazonaws.com.

Wonky.

Long story short, the bundle manifests are now advertising s3-us-west-2.amazonaws.com and s3-external-1.amazonaws.com in URLs. The "ec2region" attribute is still us-west-2 and us-east-1, however, as this is sane.
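
In code, the hostname mapping described above boils down to one special case. This is a sketch of the naming as I've described it in this comment (`s3_hostname` is a hypothetical helper, and the endpoint names are not guaranteed to stay valid):

```python
def s3_hostname(ec2region):
    """Map an "ec2region" manifest attribute to the S3 hostname to advertise.

    us-east-1 (US Standard) has no s3-us-east-1 hostname; its
    region-specific equivalent is s3-external-1.
    """
    if ec2region == "us-east-1":
        return "s3-external-1.amazonaws.com"
    return "s3-%s.amazonaws.com" % ec2region


for region in ("us-west-2", "us-east-1"):
    print("%s -> %s" % (region, s3_hostname(region)))
```

Keeping "ec2region" as the plain region name means clients never have to know about the hostname quirk.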
Also, US Standard S3 region doesn't have READ after WRITE guarantees, whereas us-west-2 buckets do (c.f. http://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html)
Comment on attachment 8609685 [details]
MozReview Request: bz://1041173/gps2

/r/9311 - ansible: create a playbook for writing hg bundles to s3 (bug 1041173)
/r/9313 - ansible: install CRON to periodically generate repository bundles (bug 1041173)

Pull down these commits:

hg pull -r cdebce01066943b8bdb545ab8f6d2a38b767e083 https://reviewboard-hg.mozilla.org/version-control-tools/
(In reply to Chris AtLee [:catlee] from comment #34)
> Also, US Standard S3 region doesn't have READ after WRITE guarantees,
> whereas us-west-2 buckets do (c.f.
> http://shlomoswidler.com/2009/12/read-after-write-consistency-in-amazon.html)

Good to know.

We're uploading to us-east-1 via the s3-external-1 URL. I /think/ this means that the data will be stored in the eastern part of US Standard (as opposed to it being random), meaning READ after WRITE is guaranteed.

But if you think we should poll the URLs and ensure reads are available before publishing the new bundle URLs, I'd totally understand.
https://reviewboard.mozilla.org/r/9311/#review8075

::: ansible/tasks/generate-hg-s3-bundles.yml:55
(Diff revision 2)
> +    shell: hg -R /repo/hg/mozilla/{{ repo }} streambundle /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.stream.hg.tmp && mv /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.stream.hg.tmp /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.stream.hg

I can't find a streambundle command/extension defined anywhere. If this is a local extension, gate on including/deploying that before this.

::: ansible/tasks/generate-hg-s3-bundles.yml:48
(Diff revision 2)
> +    shell: hg -R /repo/hg/mozilla/{{ repo }} bundle -a -t {{ item }} /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.{{ item }}.hg.tmp && mv /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.{{ item }}.hg.tmp /repo/hg/mozilla/{{ repo }}/.hg/bundles/{{ tip.stdout }}.{{ item }}.hg

This is going to write files like 'e5b507efb36e2b9ad8edb1a38459d26c934d74dd.gzip.hg' and 'e5b507efb36e2b9ad8edb1a38459d26c934d74dd.bzip2.hg' instead of the standard '.gz' and '.bz2' nomenclatures. Since they're unusable by standard tools as regular archives, this might not be an issue worth fixing.

::: ansible/tasks/generate-hg-s3-bundles.yml:67
(Diff revision 2)
> +  # don't want to advertise the bundle until it is distributed, as this

How are we advertising this?

::: ansible/tasks/generate-hg-s3-bundles.yml:29
(Diff revision 2)
> +  - name: prune old bundle files

Due to replication lag, and depending on how these are advertised, isn't it possible that automated hosts would be instructed to pull files that this would have already deleted?

I'm thinking this:

1. Bundlefile list/tip rev is fetched by build slave.
2. cronjob triggers, deleting old bundlefile(s)
3. Build slave tries to fetch bundle

::: ansible/tasks/generate-hg-s3-bundles.yml:6
(Diff revision 2)
> +# * AWS credentials are available to the environment (likely in ~/.boto)

We need to make sure the boto binary is also installed for this to work

::: ansible/tasks/generate-hg-s3-bundles.yml:18
(Diff revision 2)
> +    command: hg -R /repo/hg/mozilla/{{ repo | mandatory }} log -r tip -T '{node}'

Do we care about hardcoding /repo/hg/mozilla/ in all these commands? In the past I've tried to un-hardcode this in the configuration management.

::: ansible/tasks/generate-hg-s3-bundles.yml:100
(Diff revision 2)
> +    file: path=/repo/hg/mozilla/{{ repo }}/.hg/bundlemanifestfragments state=directory

These will be the only directories in /repo/hg/mozilla owned by root.  They should all be 'hg' user and inherit group from the parent dir.
https://reviewboard.mozilla.org/r/9311/#review8097

> Do we care about hardcoding /repo/hg/mozilla/ in all these commands? In the past I've tried to un-hardcode this in the configuration management.

Enough tools assume the /repo/hg/mozilla bits. I'm fine with adding one more. It certainly makes things easier.

> Due to replication lag, and depending on how these are advertised, isn't it possible that automated hosts would be instructed to pull files that this would have already deleted?
> 
> I'm thinking this:
> 
> 1. Bundlefile list/tip rev is fetched by build slave.
> 2. cronjob triggers, deleting old bundlefile(s)
> 3. Build slave tries to fetch bundle

No, because the local files are never used for anything except caching whether the bundle operation is current.

> This is going to write files like 'e5b507efb36e2b9ad8edb1a38459d26c934d74dd.gzip.hg' and 'e5b507efb36e2b9ad8edb1a38459d26c934d74dd.bzip2.hg' instead of the standard '.gz' and '.bz2' nomenclatures. Since they're unusable by standard tools as regular archives, this might not be an issue worth fixing.

Correct.

> I can't find a streambundle command/extension defined anywhere. If this is a local extension, gate on including/deploying that before this.

It is something we had to hack together in the bundleclone extension. I would like to integrate this into vanilla Mercurial, however, as it is a useful feature.

> How are we advertising this?

Via the bundle manifest file.

> These will be the only directories in /repo/hg/mozilla owned by root.  They should all be 'hg' user and inherit group from the parent dir.

This goes away after rewriting the functionality as a Python script.
Comment on attachment 8609685 [details]
MozReview Request: bz://1041173/gps2

/r/9311 - hgmo: create a script for writing hg bundles to s3 (bug 1041173)
/r/9313 - ansible: install CRON to periodically generate repository bundles (bug 1041173)

Pull down these commits:

hg pull -r 0720764ace0b9abb12bb1b4d1affe83349248edf https://reviewboard-hg.mozilla.org/version-control-tools/
https://reviewboard.mozilla.org/r/9311/#review8131

::: scripts/generate-hg-s3-bundles:56
(Diff revision 3)
> +    repo_full = os.path.join('/repo/hg/mozilla', repo)

Do we want to hardcode '/repo/hg/mozilla'? I'd propose a ROOT_DIR variable at the top for this

::: scripts/generate-hg-s3-bundles:88
(Diff revision 3)
> +        os.unlink(full)

Will this create a race condition of clients fetching an unlinked file?

::: scripts/generate-hg-s3-bundles:101
(Diff revision 3)
> +    # This is because an aborted bundle process may result in a partial file,

Is this something that can be fixed in upstream Mercurial?

Cloning, etc already uses this sort of behavior (ie: if I ^C during a clone, the directory will never be created)

::: scripts/generate-hg-s3-bundles:154
(Diff revision 3)
> +            upload_to_s3(host, bucket, final_path, remote_path)

Would this be worth parallelizing? Some of these are going to be awfully big.
https://reviewboard.mozilla.org/r/9313/#review8133

::: ansible/tasks/hgmo-bundle-cron.yml:9
(Diff revision 3)
> +  cron: name="Generate Mercurial bundles for {{ item.repo }}"

Can we run this as the 'hg' user instead of root? It strikes me as bad practice to be running serving applications as root.
https://reviewboard.mozilla.org/r/9311/#review8143

> Would this be worth parallelizing? Some of these are going to be awfully big.

I think this can be deferred to a follow-up.

FWIW, I'm able to upload at >30MB/s from hgssh1. I'm not too worried about performance here.
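
If we do pick this up in a follow-up, the parallel version could be as small as the Python sketch below. The job tuples mirror the `upload_to_s3(host, bucket, final_path, remote_path)` call in the script; `upload_all`, the worker count, and the stand-in upload function are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(jobs, upload, max_workers=4):
    """Run `upload(host, bucket, final_path, remote_path)` for every job
    concurrently and return the results in job order.

    Any exception raised by an upload propagates when its result is read,
    so a failed upload still aborts the run.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload, *job) for job in jobs]
        return [future.result() for future in futures]


def fake_upload(host, bucket, final_path, remote_path):
    """Stand-in for the real upload function; just reports what it would do."""
    return "%s/%s/%s" % (host, bucket, remote_path)


# Hypothetical hosts, buckets and paths.
jobs = [
    ("s3.example.com", "bundles-west", "/tmp/tip.gzip.hg", "repo/tip.gzip.hg"),
    ("s3.example.com", "bundles-east", "/tmp/tip.gzip.hg", "repo/tip.gzip.hg"),
]
print(upload_all(jobs, fake_upload))
```

Threads are enough here because the work is network-bound, not CPU-bound.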

> Is this something that can be fixed in upstream Mercurial?
> 
> Cloning, etc already uses this sort of behavior (ie: if I ^C during a clone, the directory will never be created)

^C (SIGINT) is one thing. SIGKILL (kill -9) is another. All bets are off when SIGKILL is used. Mercurial/Python may not get the opportunity to clean up temp files!

Besides, the temp file is all about an abundance of caution. If we were to accidentally upload an incomplete bundle, bad things would happen (it would likely cause a tree closure).

That being said, Mercurial should arguably be writing to a temp file and using an atomic rename at the end of the bundle. It doesn't do this today. I'll consider sending a patch upstream because that would be more robust behavior.
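
For reference, the temp-file-plus-rename pattern looks roughly like this in Python. It is a generic sketch of the technique, not Mercurial's code; `write_atomically` and the `.tmp` suffix are illustrative choices.

```python
import os


def write_atomically(path, chunks):
    """Write `chunks` (an iterable of bytes) to `path` via a temp file.

    Readers never see a partial file at `path`: the data goes to
    `path + ".tmp"` first and is renamed into place only once fully
    written. If the process dies mid-write (even via SIGKILL), at worst
    a stale .tmp file is left behind.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the data is on disk before renaming
    os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
```

Anything that only uploads files without the `.tmp` suffix can then never pick up an incomplete bundle.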
https://reviewboard.mozilla.org/r/9313/#review8155

> Can we run this as the 'hg' user instead of root? It strikes me as bad practice to be running serving applications as root.

What if I told you no servers were started as part of this CRON job?

Running as root is easiest partially because of the mirroring step at the end. That requires running as the `hg` user. I'm also pretty sure the `hg` user doesn't have write access to every repo since it isn't a member of the various scm_* groups. I think it just happens to get lucky and inherit "other" read and execute privileges.
https://reviewboard.mozilla.org/r/9313/#review8255

> What if I told you no servers were started as part of this CRON job?
> 
> Running as root is easiest, partially because of the mirroring step at the end. That requires running as the `hg` user. I'm also pretty sure the `hg` user doesn't have write access to every repo since it isn't a member of the various scm_* groups. I think it just happens to get lucky and inherit "other" read and execute privileges.

I understand that these are single-running processes and not long-running servers. Additionally, the pushing script already covers logic for this:

$ cat /repo/hg/scripts/push-repo.sh 
#!/usr/bin/env bash
WHOAMI=$(id -un)
if [ "${WHOAMI}" == "hg" ]; then
    /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})
else
    sudo -u hg /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})
fi

In terms of repos, there are some ownership issues due to the weird way that we're handling authentication and writing, although all repositories (sans users) are owned by the 'hg' user. Mainly the unix group is responsible for providing write permissions to users.

In particular I'm scared of a couple things:
  1. generatebundles() method not validating non-absolute paths (line 55)
  2. the hg command running without a fully-qualified path (lines 65 and 119). As an aside, wouldn't this be better implemented in the mercurial python library?
  3. The longevity of these processes. I understand they're not persistent server processes, but their lifetimes will range from seconds to minutes depending on disk io and network bandwidth.
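
To make concerns 1 and 2 concrete, the kind of validation meant here could look like the sketch below. The helper names are hypothetical and none of this is taken from the script under review; it only shows rejecting relative paths and insisting on an absolute, executable path for hg.

```python
import os


def require_absolute(path):
    """Return the normalized path, or raise ValueError if it is not absolute."""
    if not os.path.isabs(path):
        raise ValueError("expected an absolute path, got %r" % path)
    return os.path.normpath(path)


def require_executable(path):
    """Return path if it is an absolute path to an executable file."""
    path = require_absolute(path)
    if not (os.path.isfile(path) and os.access(path, os.X_OK)):
        raise ValueError("%r is not an executable file" % path)
    return path
```

The script would then run the validated hg path rather than relying on whatever `hg` resolves to through PATH in the CRON environment.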
https://reviewboard.mozilla.org/r/9313/#review8347

> I understand that these are single-running processes and not long-running servers. Additionally, the pushing script already covers logic for this:
> 
> $ cat /repo/hg/scripts/push-repo.sh 
> #!/usr/bin/env bash
> WHOAMI=$(id -un)
> if [ "${WHOAMI}" == "hg" ]; then
>     /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})
> else
>     sudo -u hg /usr/local/bin/repo-push.sh $(echo ${PWD/\/repo\/hg\/mozilla\/})
> fi
> 
> In terms of repos, there are some ownership issues due to the weird way that we're handling authentication and writing, although all repositories (sans users) are owned by the 'hg' user. Mainly the unix group is responsible for providing write permissions to users.
> 
> In particular I'm scared of a couple things:
>   1. generatebundles() method not validating non-absolute paths (line 55)
>   2. the hg command running without a fully-qualified path (lines 65 and 119). As an aside, wouldn't this be better implemented in the mercurial python library?
>   3. The longevity of these processes. I understand they're not persistent server processes, but their lifetimes will range from seconds to minutes depending on disk io and network bandwidth.

OK. Will have it run as "hg" user. But this is critique for the CRON commit, not this one.
https://reviewboard.mozilla.org/r/9313/#review8349

> OK. Will have it run as "hg" user. But this is critique for the CRON commit, not this one.

Bah. Wrong commit.
hgmo: create a script for writing hg bundles to s3 (bug 1041173); r?bkero

We want a way to easily generate repository bundles and store them in
S3. Create a script that we can run on hgssh to do this.
Attachment #8612410 - Flags: review?(bkero)
ansible: install CRON to periodically generate repository bundles (bug 1041173); r?bkero

Now that we support generating Mercurial bundles and uploading them to
S3, let's make generating them part of our standard operating procedure.

This commit installs daily CRON entries to do the bundle generation.

We introduce the "outputif" script to execute a process and only print
its output if it exits with an error. This makes CRON email tolerable by
not sending it if everything is OK.
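
The behaviour described for "outputif" can be sketched roughly as below. This is an illustration, not the script from the commit: `run_quietly` and `main` are hypothetical names, and merging stderr into stdout is an assumption.

```python
import subprocess
import sys


def run_quietly(args):
    """Run args and return (exit_code, output); output is empty on success."""
    proc = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    if proc.returncode == 0:
        return 0, b""
    return proc.returncode, proc.stdout


def main(argv):
    """Print the captured output only when the command failed."""
    code, output = run_quietly(argv)
    sys.stdout.buffer.write(output)
    sys.stdout.flush()
    return code
```

A wrapper script would call `sys.exit(main(sys.argv[1:]))`. Because nothing is printed when the command exits 0, CRON has nothing to mail on a successful run.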
Attachment #8612411 - Flags: review?(bkero)
Attachment #8609685 - Flags: review?(bkero)
Comment on attachment 8612411 [details]
MozReview Request: ansible: install CRON to periodically generate repository bundles (bug 1041173); r?bkero

https://reviewboard.mozilla.org/r/9313/#review8423

Ship It!
Attachment #8612411 - Flags: review?(bkero) → review+
Comment on attachment 8612410 [details]
MozReview Request: hgmo: create a script for writing hg bundles to s3 (bug 1041173); r?bkero

https://reviewboard.mozilla.org/r/9311/#review8425

Ship It!
Attachment #8612410 - Flags: review?(bkero) → review+
Attachment #8612410 - Flags: review+ → review?(bkero)
Attachment #8612411 - Flags: review+ → review?(bkero)
Comment on attachment 8612410 [details]
MozReview Request: hgmo: create a script for writing hg bundles to s3 (bug 1041173); r?bkero

https://reviewboard.mozilla.org/r/9311/#review8429

Ship It!
Attachment #8612410 - Flags: review?(bkero) → review+
Attachment #8612411 - Flags: review?(bkero) → review+
Comment on attachment 8612411 [details]
MozReview Request: ansible: install CRON to periodically generate repository bundles (bug 1041173); r?bkero

https://reviewboard.mozilla.org/r/9313/#review8431

Ship It!
https://reviewboard.mozilla.org/r/9311/#review8435

> We need to make sure the boto binary is also installed for this to work

This is addressed in the next commit.
Bundle file generation is officially in production on hg.mozilla.org \o/.

Instructions for using it are at https://mozilla-version-control-tools.readthedocs.org/en/latest/hgmo/bundleclone.html.

I'd like to emphasize that you'll want to define bundle preferences in your /etc/mercurial/hgrc file to prefer fetches from the same EC2 region *and* that stream clones are preferred for performance reasons inside EC2 (see timings in comment #18, intra-region transfer is free).

Release automation should deploy bundleclone yesterday to offload clone load from hg.mozilla.org so the thundering herd can't take down the service from excessive clones.

I'm marking this bug closed. I'll file new ones to track deploying to automation.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Blocks: 1144872
Attachment #8609685 - Attachment is obsolete: true
Attachment #8608357 - Attachment is obsolete: true
Attachment #8618229 - Flags: review+
Attachment #8618230 - Flags: review+
Attachment #8618231 - Flags: review+
Attachment #8618232 - Flags: review+
Attachment #8618233 - Flags: review+
Attachment #8618234 - Flags: review+
Attachment #8618235 - Flags: review+
Attachment #8618236 - Flags: review+
Duplicate of this bug: 1055302
Blocks: 1185261
bundleclone's functionality has been integrated into Mercurial 3.6, which should be released on or around November 1.
Blocks: 1216216
Blocks: 1352494