Closed Bug 1198951 Opened 9 years ago Closed 7 years ago

High git load causing frequent B2G build failures

Category: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: RyanVM
Assignee: Unassigned

This is currently closing trunk trees. We've been fighting git load all day without a clear reason why. Whatever the cause, it is making B2G builds fail at a >50% rate at the moment.

https://tools.taskcluster.net/task-inspector/#XPaGrDXMTLK9-6kyVeyHjA/0
All trees reopened.
Severity: blocker → normal
Following up on an IRC conversation: mdoglio and I noticed that the shape, times, and duration of the Treeherder log-parsing slowness and the git.mo load matched up surprisingly well. Correlation implies causation?
Following up further: we've seen the same issue, to varying degrees, on several consecutive days. But when mdoglio pulled the longer-term graph, the two did not match up; when we had a spike, they had a trough.
I'm trying to generate a fresh tarball of the git shared directory. No real progress so far: the script keeps failing on 500s and I have to rerun it.
At the moment, the 500s are coming from Zeus because Apache on git1 is at MaxClients.
I seem to recall that we lowered that in the past because it was too high. I've temporarily bumped it from 96 to 144 (cores*6) to see how it goes.
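For context, that knob lives in the prefork MPM section of httpd.conf. A minimal sketch of the temporary bump, assuming Apache 2.2-style prefork directives (the real git1 config almost certainly sets more than this):

    <IfModule prefork.c>
        # Temporarily raised from 96 to 144 (cores * 6) to absorb the clone spike.
        MaxClients   144
        # ServerLimit must be >= MaxClients (prefork default is 256, so 144 fits).
        ServerLimit  144
    </IfModule>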
https://docs.google.com/spreadsheets/d/1rvrPCz_DgLwBIDaVYPRO5MdLIGILOpewYRYXmJnEO9s/edit?usp=sharing

Response bytes for git-upload-pack (clone/fetch operation) per repo for July 27. As you can see, /external/caf/platform/prebuilts/ndk, /external/caf/platform/prebuilts/sdk, and /releases/gaia.git are the dominant service consumers.
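The spreadsheet above is the source of record; for illustration, similar numbers can be pulled from the Apache access logs, assuming the standard combined log format (request path in field 7, response bytes in field 10):

    # Sum git-upload-pack response bytes per repo (illustrative sketch, not the actual report).
    awk '$7 ~ /git-upload-pack/ {
             path = $7
             sub(/\/(info\/refs.*|git-upload-pack)$/, "", path)
             bytes[path] += $10
         }
         END { for (r in bytes) printf "%15d %s\n", bytes[r], r }' access_log | sort -rn | head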
Looks like some tests use a bare "git clone .../gaia.git", which is suboptimal. See https://treeherder.mozilla.org/logviewer.html#?job_id=4415963&repo=fx-team

It would be better to optimize these.
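For jobs that only need a working tree rather than full history, a shallow, single-branch clone would cut the server-side pack-objects work dramatically. A sketch, assuming the /releases/gaia.git path listed above lives on git.mozilla.org (the exact URL is elided in the log snippet):

    # Shallow clone: the server only has to pack the objects for one recent commit.
    git clone --depth 1 --single-branch --branch master \
        https://git.mozilla.org/releases/gaia.git gaia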
Regenerating the git-shared tarball didn't work: the new tarball has grown to 28G and won't fit on our current root device, which is 35G.

We should either slim down the tarball (fewer devices? skip some repos?) or bump the size of the instances.

Bumping the size involves a couple of steps: we'll need to regenerate our base AMI and then update the AMI ID in the configs so the golden AMIs use the new base AMI.

I tried to resize the root device online, but it didn't go well on the first attempt. In any case, resizing isn't the optimal thing to do.

I'm reverting the new tarball, so our overnight golden AMI generation works.
Depends on: 1199524
FTR, Hal is updating git-shared for the try machines. It should work fine because the increase in size is not big (8.3G vs 8.5G).
An exciting morning keeping an eye on gitmo as it goes through its morning spike. Looking at the graphs (http://hwine.github.io/gitstats/), I see a flat spot at the top of the Apache in-flight stats, which means we would have been returning 500s for a couple of minutes. :-(

It looks like the culprit is the ndk/sdk clones eating up too many slots in Apache. We're configured to allow only 96 MaxClients so that we don't thrash the server to death, but those clones are eating up at least two thirds of them, and some of them run for > 45 minutes.

Yep, and now we're holding steady at MaxClients. Killing long-running clones...
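For reference, one way to spot the offenders before killing anything by hand (a sketch; nothing git.mo-specific is assumed beyond the process names):

    # List server-side clone/fetch workers with their elapsed times; eyeball the
    # long-running ones and kill confirmed orphans manually.
    ps -eo etime,pid,args | grep -E 'git[- ](upload-pack|pack-objects)'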
(In reply to Rail Aliiev [:rail] from comment #11)
> Gaia related load may come from
> https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/
> mozilla/gaia.py#31
> which is used by several tests:
> https://dxr.mozilla.org/mozilla-central/
> search?q=GaiaMixin&redirect=false&case=true&limit=68&offset=0

This normally results in an hg clone of gaia-central, and has no impact on git.m.o.  I checked the Gij jobs, and that is in fact what they do.  So it _probably_ isn't the test jobs where this load is coming from.
Depends on: 1200350
Running into MaxClients on gitmo again. Some of the connecting clients are hitting gitmo for dozens of repo updates at once; :garndt will link the bug about making the repo tool less egregious.
Depends on: 1203614
Backfilling: we had a large hit yesterday afternoon that resulted in a significant load spike. philor closed trees for a while to help git cool off. garndt and jonasfj fixed a taskcluster issue that was responsible for this particular go-around:

20:28:21 <&garndt> hwine-commuting: I think you should see the clones of repos that we typically don't see go down. There was an issue with pulling down the cached copies of the repos from s3 that prevented the cache from being prewarmed and causing full clones to be done.
And the problem is back, it seems.
The problem this morning has been manifesting differently on the server side, but may yet have common roots in what we're doing with caches and cloning.

The repo of the day is /external/caf/platform/frameworks/base.git.
It appears to be a 3GB repo with 1GB pack files. Clones were causing pack-objects jobs on the server to eat all the RAM, and were also using up enough slots that I think regular jobs filled the rest and caused 500s. We also found a whole bunch of orphaned pack-objects jobs that were pushing us deep into swap (16GB of 17GB).

We've killed the orphans, so we're better, but we're still using more swap than we should be. Some of that may be a single orphan that :gps is looking at. He's also repacking the repo to make it behave better.

Lastly, to spread out the load I enabled gitweb[13] in the git-http Zeus pool. It's possible that we'll see git clones fail on missing revs due to the mirroring (in particular because the master is still serving HTTP data), but at the moment it's worth the extra breathing room.
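For reference, a one-off deep repack along these lines is the usual fix for a repo in this state (a sketch only; the exact invocation :gps used isn't recorded in this bug, and the on-disk path is hypothetical):

    # Recompute deltas and collapse everything into a single well-packed pack, so later
    # clones mostly reuse existing deltas instead of burning server CPU and RAM.
    cd /repo/external/caf/platform/frameworks/base.git   # hypothetical path
    git repack -a -d -f --depth=50 --window=250 --window-memory=256m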
Relevant Git config options we should consider adjusting:

       core.packedGitWindowSize
           Number of bytes of a pack file to map into memory in a single mapping operation. Larger window sizes may allow your system to process a smaller number of large pack files
           more quickly. Smaller window sizes will negatively affect performance due to increased calls to the operating system's memory manager, but may improve performance when
           accessing a large number of large pack files.

           Default is 1 MiB if NO_MMAP was set at compile time, otherwise 32 MiB on 32 bit platforms and 1 GiB on 64 bit platforms. This should be reasonable for all users/operating
           systems. You probably do not need to adjust this value.

           Common unit suffixes of k, m, or g are supported.

I noticed the orphaned processes had a large mmap segment of a packfile. If all processes are memory mapping 1 GiB, that's a good way to eat memory and swap.
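A sketch of how that could be capped system-wide on the server (the value is illustrative; the point is just to stop dozens of concurrent processes from each mapping up to 1 GiB):

    # Shrink the per-mapping pack window from the 64-bit default of 1 GiB.
    git config --system core.packedGitWindowSize 32m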

       pack.windowMemory
           The maximum size of memory that is consumed by each thread in git-pack-objects(1) for pack window memory when no limit is given on the command line. The value can be
           suffixed with "k", "m", or "g". When left unconfigured (or set explicitly to 0), there will be no limit.

When clients fetch from Git, the Git server needs to produce a single packfile from existing packfiles and loose objects. It does this by invoking git-pack-objects. The actual work it is doing is described in gory detail at https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt. If you read that, you'll quickly realize why cloning from large Git repos that aren't optimally packed results in tons of CPU usage on the server.

Anyway, when packing objects, there is an in-memory sliding window of objects that the next object will be diffed against. By default, the window is N=10 objects (pack.window) and there is no size limit. I /think/ (and I'm not 100% sure only because I haven't looked at the source code) that if you have a repository of very large objects/files, the window could consist of a bunch of very large objects and subsequently consume a lot of memory.

Of course, each clone of an unchanged Git repository incurs the same packing operation to produce (frequently) the same bits of output. So if you have a repository that takes a long time to pack (because it has many packfiles and/or loose objects), the server will collapse under the CPU and possibly memory load of many clients cloning.
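A sketch of bounding that window memory on the server (values untested, purely illustrative):

    # Cap the delta-search window memory per pack-objects thread during clones/fetches.
    git config --system pack.windowMemory 256m
    # Fewer threads also lowers peak memory per clone (assumes the default auto setting today).
    git config --system pack.threads 4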

       pack.deltaCacheSize
           The maximum memory in bytes used for caching deltas in git-pack-objects(1) before writing them out to a pack. This cache is used to speed up the writing object phase by not
           having to recompute the final delta result once the best match for all objects is found. Repacking large repositories on machines which are tight with memory might be badly
           impacted by this though, especially if this cache pushes the system into swapping. A value of 0 means no limit. The smallest size of 1 byte may be used to virtually disable
           this cache. Defaults to 256 MiB.

Could contribute to memory exhaustion (but likely not a primary contributor).
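If we want to take it out of the equation entirely, the man page's own 1-byte trick applies (sketch):

    # Virtually disable the delta cache in pack-objects (default is 256 MiB per process).
    git config --system pack.deltaCacheSize 1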

Also, my explanation about servers having to produce a packfile on every fetch/clone is why we should be running repacks frequently: an expensive job once a day or so will reduce CPU and memory requirements from subsequent fetches, as the amount of work the Git server does is proportional to the number and size of objects not stored in the largest packfile.
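A minimal sketch of what that daily job could look like, assuming the bare repos live under a single hypothetical /repo root:

    # Nightly maintenance (run from cron on the git server): repack every bare repo so
    # most objects sit in one big pack and upload-pack has less work to do per clone.
    find /repo -type d -name '*.git' -prune -print | while read -r repo; do
        git --git-dir="$repo" repack -a -d --window-memory=256m
    done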
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General