Closed Bug 1181783 Opened 6 years ago Closed 5 years ago
TaskCluster pulls too much
+++ This bug was initially created as a clone of Bug #1096653 +++

Clones of gaia-central from hg.mozilla.org currently consume 100-170 GB per day. In a post-bundleclone world, clones of gaia-central consume more bandwidth than any other repository, including mozilla-central. Utilizing bundleclone for cloning gaia-central has the potential to reduce hg.mozilla.org traffic by 20%.

Note: bundleclone should almost certainly be universally enabled. While I care most about gaia-central right now, chances are this bug is really "system X isn't using bundleclone." But I don't know what "system X" is.
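For reference, turning on bundleclone for a client is a one-line extension entry in an hgrc. This is a sketch; the extension lives in Mozilla's version-control-tools repository, and the local checkout path below is a placeholder, not an actual deployment path:

```ini
[extensions]
# bundleclone ships in Mozilla's version-control-tools repository;
# point this at wherever that repo is checked out locally.
bundleclone = /path/to/version-control-tools/hgext/bundleclone
```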
On UTC day July 8, 2015, gaia-central accounted for 245,924,895,240 bytes of outbound traffic, which was 19.46% of total outbound traffic. It had 76 GB more data transferred than the next closest repository.
Digging into the data a bit deeper, I /think/ what's happening is that gaia-central is just being pulled a lot. `hg clone` and `hg pull` perform a "getbundle" Mercurial wire protocol command to retrieve repository data from the server. If you break down the distribution of the size of "getbundle" responses for the gaia-central repository, the sizes are reasonable. In fact, on at least one server, there were no full clones for this repository! Instead, we have thousands of requests for 1 MB, 5 MB, 20 MB bundles. In aggregate, these add up to over 100 GB.

I suspect the heavy volume is coming from TaskCluster test jobs pulling the repository. See e.g. https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Dq_pI68aTpyWqyAupQ8F3A/0/public/logs/live_backing.log. Here, we see a pull of 256 changesets after uncompressing the gaia-central tarball.

I see 3 potential solutions to reducing gaia-central's volume:

1) Stop leveraging gaia-central from test jobs (I'm assuming this is difficult)
2) Distribute gaia-central completely from within TaskCluster (I think this is unnecessary)
3) Generate gaia-central archives more frequently

If we generate gaia-central archives more frequently, the delta between what's in the archive and what needs to be fetched from hg.mozilla.org will be smaller, thus reducing overall transfer size. This feels like an easy and quick solution.
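To make option 3 concrete, here's a back-of-envelope sketch of why a shorter archive interval shrinks per-job transfer. All numbers here are invented assumptions for illustration, not measurements from hg.mozilla.org:

```python
# Back-of-envelope sketch (numbers are assumptions, not measurements):
# if jobs pull everything pushed since the last archive was generated,
# shrinking the archive interval proportionally shrinks the average
# delta each job has to transfer over the wire.
def avg_delta_mb(pushes_per_hour, mb_per_push, archive_interval_hours):
    """Average backlog a job pulls, assuming pulls arrive uniformly
    within the archive interval (so the mean backlog is half the max)."""
    max_backlog_mb = pushes_per_hour * mb_per_push * archive_interval_hours
    return max_backlog_mb / 2

daily = avg_delta_mb(4, 5, 24)  # archives regenerated once a day
often = avg_delta_mb(4, 5, 4)   # archives regenerated every 4 hours
print(daily, often)  # 240.0 40.0
```

Going from daily archives to 4-hour archives is a 6x reduction in average pull size under these (made-up) traffic assumptions.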
Summary: gaia-central isn't using bundleclone → gaia-central pulls too much
This isn't just gaia-central: here is a log from fx-team showing 2,062 changesets being pulled: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/wqtq4u6kSoiaocNrw9waKg/0/public/logs/live_backing.log

Doing some maths, I reckon bundleclone on stream repos will be faster than xz files. That changes if the xz archives are decently up to date, however.

Anyway, from that log, the decision graph job is taking 30s to pull fx-team. That's a 30s delay between push and job scheduling. Anything we can do to decrease this time will reduce overall job turnaround time.
OK, this is a bigger problem than I thought. From https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/NbRWQdVGSKCGIjwDr7mEhw/0/public/logs/live_backing.log, we see TaskCluster pulling 2,079 changesets from Try. I was looking at the per-repository CPU time breakdown the other day and was a bit surprised as to why Try was so high, and I reckon this is why.

For server performance reasons, the Try repository on the server is encoded in the "generaldelta" format. This makes every part of Mercurial faster except when it generates bundles to send to clients (as part of `hg clone` or `hg pull` operations). (Well, at least this is true until bundle2 becomes more widespread in Mercurial 3.5+.) Anyway, the performance regression isn't noticeable unless you are pulling, say, 100+ changesets.

The good news is not all pulls from Try are retrieving hundreds of changesets. https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/HTaKk9WwRky1Nr1o5yxl8w/0/public/logs/live_backing.log only pulls 10 changesets. That's acceptable.

The trick here is going to be a) ensuring the seeded bundles are generated more often and b) using shared repo storage.
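For context, generaldelta is a standard Mercurial storage-format knob set in the repository's hgrc. A server-side sketch (whether hg.mozilla.org enables it exactly this way is an assumption on my part):

```ini
[format]
# Store revlog deltas against a parent revision rather than the
# physically previous revlog entry. Faster for most read operations,
# but bundle generation pays a re-delta cost until bundle2 lands.
generaldelta = true
```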
Summary: gaia-central pulls too much → TaskCluster pulls too much
Component: General Automation → General
Product: Release Engineering → Taskcluster
QA Contact: catlee
I submitted a pull request at https://github.com/taskcluster/taskcluster-vcs/pull/18 that attempts to fix the Try pull problem. I also chatted with Selena IRL regarding this. I need to follow up with Greg on the state of TaskCluster and VCS interaction.
We are now running caches every 4 hours, and I haven't heard about issues recently. Our path forward is to replace tc-vcs with an incremental tar solution in the decision task (Q1/Q2 timeframe). Please let us know if we need to do more here.
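The 4-hour cadence mentioned above could be driven by something as simple as a cron entry. A hypothetical sketch; the script name is invented, not the actual deployment:

```
# crontab sketch: regenerate repository caches every 4 hours
# (/usr/local/bin/regenerate-vcs-caches is a placeholder name)
0 */4 * * * /usr/local/bin/regenerate-vcs-caches
```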
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED