Closed Bug 1198951 Opened 9 years ago Closed 7 years ago

High git load causing frequent B2G build failures

Category: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
Reporter: RyanVM
Assignee: Unassigned

This is currently closing trunk trees. We've been fighting git load all day without a clear reason why. Whatever the cause, it is making B2G builds fail at a >50% rate at the moment.

https://tools.taskcluster.net/task-inspector/#XPaGrDXMTLK9-6kyVeyHjA/0
All trees reopened.
Severity: blocker → normal
Following up on an IRC conversation: mdoglio and I noticed that the shape, times, and duration of the Treeherder log-parsing slowness and the git.mo load matched up surprisingly well. Correlation implies causation?
Following up further: we've seen the same issue, to varying degrees, on several consecutive days. But when mdoglio pulled the longer-term graph, the two did not match up; when we had a spike, they had a trough.
I'm trying to generate a fresh tarball of the git shared directory. No real progress so far: the script keeps failing on 500s and I have to rerun it.
At the moment, the 500s are coming from Zeus because Apache on git1 is at MaxClients.
I seem to recall that we lowered that in the past because it was too high. I've temporarily bumped it from 96 to 144 (cores*6) to see how it goes.
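For context, that knob lives in the prefork MPM section of httpd.conf. A minimal sketch of the temporary bump, assuming Apache 2.2-style prefork directives (the real git1 config almost certainly sets more than this):

    <IfModule prefork.c>
        # Temporarily raised from 96 to 144 (cores * 6) to absorb the clone spike.
        MaxClients   144
        # ServerLimit must be >= MaxClients (prefork default is 256, so 144 fits).
        ServerLimit  144
    </IfModule>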
https://docs.google.com/spreadsheets/d/1rvrPCz_DgLwBIDaVYPRO5MdLIGILOpewYRYXmJnEO9s/edit?usp=sharing

Response bytes for git-upload-pack (clone/fetch operation) per repo for July 27. As you can see, /external/caf/platform/prebuilts/ndk, /external/caf/platform/prebuilts/sdk, and /releases/gaia.git are the dominant service consumers.
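The spreadsheet above is the source of record; for illustration, similar numbers can be pulled from the Apache access logs, assuming the standard combined log format (request path in field 7, response bytes in field 10):

    # Sum git-upload-pack response bytes per repo (illustrative sketch, not the actual report).
    awk '$7 ~ /git-upload-pack/ {
             path = $7
             sub(/\/(info\/refs.*|git-upload-pack)$/, "", path)
             bytes[path] += $10
         }
         END { for (r in bytes) printf "%15d %s\n", bytes[r], r }' access_log | sort -rn | head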
Looks like some tests use a bare "git clone .../gaia.git", which is suboptimal. See https://treeherder.mozilla.org/logviewer.html#?job_id=4415963&repo=fx-team

It would be better to optimize these.
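For jobs that only need a working tree rather than full history, a shallow, single-branch clone would cut the server-side pack-objects work dramatically. A sketch, assuming the /releases/gaia.git path listed above lives on git.mozilla.org (the exact URL is elided in the log snippet):

    # Shallow clone: the server only has to pack the objects for one recent commit.
    git clone --depth 1 --single-branch --branch master \
        https://git.mozilla.org/releases/gaia.git gaia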
Regenerating the git-shared tarball didn't work: the new tarball has grown to 28G and won't fit on our current root device, which is 35G.

We should either slim down the tarball (fewer devices? skip some repos?) or bump the size of the instances.

Bumping the size involves a couple of steps: we'll need to regenerate our base AMI and then update the AMI ID in the configs so the golden AMIs use the new base AMI.

I tried to resize the root device online, but it didn't go well on the first attempt. In any case, resizing isn't the optimal thing to do.

I'm reverting the new tarball, so our overnight golden AMI generation works.
Depends on: 1199524
FTR, Hal is updating git-shared for the try machines. It should work fine because the increase in size is not big (8.3G vs 8.5G).
An exciting morning keeping an eye on gitmo as it goes through its morning spike. Looking at the graphs (http://hwine.github.io/gitstats/), I see a flat spot at the top of the Apache in-flight stats, which means we would have been returning 500s for a couple of minutes. :-(

It looks like the culprit is the ndk/sdk clones eating up too many slots in Apache. We're configured to allow only 96 MaxClients so that we don't thrash the server to death, but those clones are eating up at least two thirds of them, and some of them run for > 45 minutes.

Yep, and now we're holding steady at MaxClients. Killing long-running clones...
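For reference, one way to spot the offenders before killing anything by hand (a sketch; nothing git.mo-specific is assumed beyond the process names):

    # List server-side clone/fetch workers with their elapsed times; eyeball the
    # long-running ones and kill confirmed orphans manually.
    ps -eo etime,pid,args | grep -E 'git[- ](upload-pack|pack-objects)'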
(In reply to Rail Aliiev [:rail] from comment #11)
> Gaia related load may come from
> https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/
> mozilla/gaia.py#31
> which is used by several tests:
> https://dxr.mozilla.org/mozilla-central/
> search?q=GaiaMixin&redirect=false&case=true&limit=68&offset=0

This normally results in an hg clone of gaia-central, and has no impact on git.m.o.  I checked the Gij jobs, and that is in fact what they do.  So it _probably_ isn't the test jobs where this load is coming from.
Depends on: 1200350
Running into MaxClients on gitmo again. Some of the connecting clients are hitting gitmo for dozens of repo updates at once; :garndt will link the bug about making the repo tool less egregious.
Depends on: 1203614
Backfilling: we had a large hit yesterday afternoon that resulted in a significant load spike. philor closed trees for a while to help git cool off. garndt and jonasfj fixed a taskcluster issue that was responsible for this particular go-around:

20:28:21 <&garndt> hwine-commuting: I think you should see the clones of repos that we typically don't see go down. There was an issue with pulling down the cached copies of the repos from s3 that prevented the cache from being prewarmed and causing full clones to be done.
And the problem is back, it seems.
The problem this morning has been manifesting differently on the server side, but may yet have common roots in what we're doing with caches and cloning.

The repo of the day is /external/caf/platform/frameworks/base.git.
It appears to be a 3GB repo with 1GB pack files. Clones were causing pack-objects jobs on the server to eat all the RAM, and were also using up enough slots that I think regular jobs filled the rest and caused 500s. We also found a whole bunch of orphaned pack-objects jobs that were pushing us deep into swap (16GB of 17GB).

We've killed the orphans, so we're better, but we're still using more swap than we should be. Some of that may be a single orphan that :gps is looking at. He's also repacking the repo to make it behave better.

Lastly, to spread out the load I enabled gitweb[13] in the git-http Zeus pool. It's possible that we'll see git clones fail on missing revs due to the mirroring (in particular because the master is still serving HTTP data), but at the moment it's worth the extra breathing room.
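For reference, a one-off deep repack along these lines is the usual fix for a repo in this state (a sketch only; the exact invocation :gps used isn't recorded in this bug, and the on-disk path is hypothetical):

    # Recompute deltas and collapse everything into a single well-packed pack, so later
    # clones mostly reuse existing deltas instead of burning server CPU and RAM.
    cd /repo/external/caf/platform/frameworks/base.git   # hypothetical path
    git repack -a -d -f --depth=50 --window=250 --window-memory=256m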
Relevant Git config options we should consider adjusting:

       core.packedGitWindowSize
           Number of bytes of a pack file to map into memory in a single mapping operation. Larger window sizes may allow your system to process a smaller number of large pack files
           more quickly. Smaller window sizes will negatively affect performance due to increased calls to the operating system's memory manager, but may improve performance when
           accessing a large number of large pack files.

           Default is 1 MiB if NO_MMAP was set at compile time, otherwise 32 MiB on 32 bit platforms and 1 GiB on 64 bit platforms. This should be reasonable for all users/operating
           systems. You probably do not need to adjust this value.

           Common unit suffixes of k, m, or g are supported.

I noticed the orphaned processes had a large mmap segment of a packfile. If all processes are memory mapping 1 GiB, that's a good way to eat memory and swap.
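A sketch of how that could be capped system-wide on the server (the value is illustrative; the point is just to stop dozens of concurrent processes from each mapping up to 1 GiB):

    # Shrink the per-mapping pack window from the 64-bit default of 1 GiB.
    git config --system core.packedGitWindowSize 32m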

       pack.windowMemory
           The maximum size of memory that is consumed by each thread in git-pack-objects(1) for pack window memory when no limit is given on the command line. The value can be
           suffixed with "k", "m", or "g". When left unconfigured (or set explicitly to 0), there will be no limit.

When clients fetch from Git, the Git server needs to produce a single packfile from existing packfiles and loose objects. It does this by invoking git-pack-objects. The actual work it is doing is described in gory detail at https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt. If you read that, you'll quickly realize why cloning from large Git repos that aren't optimally packed results in tons of CPU usage on the server.

Anyway, when packing objects, there is an in-memory sliding window of objects that the next object will be diffed against. By default, the window is N=10 objects (pack.window) and there is no size limit. I /think/ (and I'm not 100% sure only because I haven't looked at the source code) that if you have a repository of very large objects/files, the window could consist of a bunch of very large objects and subsequently consume a lot of memory.

Of course, each clone of an unchanged Git repository incurs the same packing operation to produce (frequently) the same bits of output. So if you have a repository that takes a long time to pack (because it has many packfiles and/or loose objects), the server will collapse under the CPU and possibly memory load of many clients cloning.
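A sketch of bounding that window memory on the server (values untested, purely illustrative):

    # Cap the delta-search window memory per pack-objects thread during clones/fetches.
    git config --system pack.windowMemory 256m
    # Fewer threads also lowers peak memory per clone (assumes the default auto setting today).
    git config --system pack.threads 4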

       pack.deltaCacheSize
           The maximum memory in bytes used for caching deltas in git-pack-objects(1) before writing them out to a pack. This cache is used to speed up the writing object phase by not
           having to recompute the final delta result once the best match for all objects is found. Repacking large repositories on machines which are tight with memory might be badly
           impacted by this though, especially if this cache pushes the system into swapping. A value of 0 means no limit. The smallest size of 1 byte may be used to virtually disable
           this cache. Defaults to 256 MiB.

Could contribute to memory exhaustion (but likely not a primary contributor).
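If we want to take it out of the equation entirely, the man page's own 1-byte trick applies (sketch):

    # Virtually disable the delta cache in pack-objects (default is 256 MiB per process).
    git config --system pack.deltaCacheSize 1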

Also, my explanation about servers having to produce a packfile on every fetch/clone is why we should be running repacks frequently: an expensive job once a day or so will reduce CPU and memory requirements from subsequent fetches, as the amount of work the Git server does is proportional to the number and size of objects not stored in the largest packfile.
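A minimal sketch of what that daily job could look like, assuming the bare repos live under a single hypothetical /repo root:

    # Nightly maintenance (run from cron on the git server): repack every bare repo so
    # most objects sit in one big pack and upload-pack has less work to do per clone.
    find /repo -type d -name '*.git' -prune -print | while read -r repo; do
        git --git-dir="$repo" repack -a -d --window-memory=256m
    done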
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General