Bug 985864 (Closed) - Opened 10 years ago - Closed 10 years ago

Problems (500 etc) on git.m.o - Integration trees and m-c closed

Categories

(Developer Services :: General, task)

Platform: x86 Gonk (Firefox OS)
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Assigned: fubar)

References

Details

Attachments

(3 files)

TBPL reports ongoing issues with git.m.o, with errors like HTTP 500 and others when trying to access it. Bkero told me on IRC that this is now known to IT and being worked on, but I'm filing this bug for tracking.
00:58:15     INFO -  error: The requested URL returned error: 500 while accessing https://git.mozilla.org/b2g/platform_external_apriori/info/refs
The box in question is having problems because there are currently 47 git-pack-object processes running (each with a number of threads equal to the # of visible processors in the box). This is causing the box to OOM.

This leads me to believe that a script issued a massive amount of repository syncing all at the same time.

Does anybody know anything about that?
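For anyone triaging a recurrence, a rough shell sketch for counting the pack processes and guessing which repos they serve; the process name and /proc paths are standard Linux, nothing here is taken from the actual box:

# count running git-pack-objects processes and their thread counts
ps -eLf | grep '[g]it-pack-objects' | awk '{print $2}' | sort | uniq -c | sort -rn

# map each one to a repository via its working directory
for pid in $(pgrep -f git-pack-objects); do
  echo "$pid $(readlink /proc/$pid/cwd)"
done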
Since this affected device builds and other tests on m-c, m-i and b2g-i, closing these trees now.
Severity: critical → blocker
Summary: Problems (500 etc) on git.m.o → Problems (500 etc) on git.m.o - Integration trees and m-c closed
We started getting cron mail at 00:52 Pacific; the first was fairly cryptic:

From: vcs2vcs@github-sync3.dmz.scl3
[vcs2vcs]: git.m.o push failed: 128

There are a handful of these messages, and tens of mails from beagle about not being able to push (which makes sense if the machines are unresponsive).
Assignee: relops → server-ops-webops
Component: RelOps → WebOps: Source Control
QA Contact: arich → nmaul
Assignee: server-ops-webops → ludovic
Taking this because it's paging oncall. bkero did a bunch of work and things are recovering right now.
The problems have recovered after I SIGSTOPped all of the packing processes and resumed them one by one so as not to swamp the box.
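Roughly what that looks like as a shell sketch, assuming the packing processes are the git-pack-objects jobs noted earlier (not the exact commands used):

# pause every pack process to relieve memory pressure
pkill -STOP -f git-pack-objects

# then resume them one at a time, waiting for each to finish
for pid in $(pgrep -f git-pack-objects); do
  kill -CONT "$pid"
  while kill -0 "$pid" 2>/dev/null; do sleep 5; done
done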
nthomas, can you look in the logs to see which repositories were pushed right before the unavailability?
Flags: needinfo?(nthomas)
Trees reopened at 2:30 PDT; looking good so far.
I need to defer to hwine for a proper check, for his detailed knowledge and because there are several places that are pushing. 

FTR the first nagios alert was
 Thu 00:34:13 PDT [5548] git-zlb.vips.scl3.mozilla.com:HTTP - Port 80 is CRITICAL: CRITICAL - Socket timeout after 30 seconds
Flags: needinfo?(nthomas) → needinfo?(hwine)
nagios-scl3: Thu 06:11:48 PDT [5714] git-zlb.vips.scl3.mozilla.com:HTTPS - Port 443 is CRITICAL: CRITICAL - Socket timeout after 30 seconds (http://m.mozilla.org/HTTPS+-+Port+443)
[2:12pm] nagios-scl3: Thu 06:12:08 PDT [5715] git1.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 99.84, 62.14, 29.79 (http://m.mozilla.org/Load)
And here we go again; closing the trees again.
<armenzg_mtg> Usul, Tomcat|sheriffduty wrt to the "beagle" emails - our vcs2vcs system is now working since 6:38

FTR it started failing around 6:20 AM (pasted failing log)


##################################
Unable to push these repos:
mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!



Summary is non-zero:

error - Unable to push mozilla-inbound.

06:20:49     INFO - Copy/paste: git push gitolite3@git.mozilla.org:integration/gecko-dev.git +refs/heads/inbound:refs/heads/inbound
06:20:49     INFO - Using partial env: {'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh',
06:20:49     INFO -  'GIT_SSH_KEY': '~/.ssh/vcs-sync_rsa'}
06:20:49     INFO - Calling ['git', 'push', 'gitolite3@git.mozilla.org:integration/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] with output_timeout 1800
06:20:51     INFO -  Connection to git.mozilla.org closed by remote host.
06:20:51    ERROR -  fatal: The remote end hung up unexpectedly
06:20:51    ERROR - Return code: 128
06:20:51    ERROR - mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!
06:20:51     INFO - retry: Calling <bound method HgGitScript.run_command of <__main__.HgGitScript object at 0x20628d0>> with args: (['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'],), kwargs: {'error_list': [{'substr': 'command not found', 'level': 'error'}, {'substr': 'Permission denied (publickey).', 'level': 'error'}, {'substr': 'fatal: The remote end hung up unexpectedly', 'level': 'error'}, {'substr': 'does not appear to be a git repository', 'level': 'error'}, {'substr': 'error: src refspec', 'level': 'error'}, {'substr': 'invalid author/committer line -', 'level': 'error'}, {'substr': 'remote: fatal: Error in object', 'level': 'error'}, {'substr': "fatal: sha1 file '<stdout>' write error: Broken pipe", 'level': 'error'}, {'substr': 'error: failed to push some refs to ', 'level': 'error'}, {'substr': 'remote: error: denying non-fast-forward ', 'level': 'error'}, {'substr': '! [remote rejected] ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x1deea20>, 'level': 'error'}], 'partial_env': {'GIT_SSH_KEY': '~/.ssh/releng-github-id_rsa', 'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh'}, 'output_timeout': 1800, 'cwd': '/opt/vcs2vcs/build/conversion/beagle/.git'}, attempt #1
06:20:51     INFO - Running command: ['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] in /opt/vcs2vcs/build/conversion/beagle/.git
06:20:51     INFO - Copy/paste: git push git@github.com:mozilla/gecko-dev.git +refs/heads/inbound:refs/heads/inbound
06:20:51     INFO - Using partial env: {'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh',
06:20:51     INFO -  'GIT_SSH_KEY': '~/.ssh/releng-github-id_rsa'}
06:20:51     INFO - Calling ['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] with output_timeout 1800
06:20:54     INFO -  To git@github.com:mozilla/gecko-dev.git
06:20:54     INFO -     a77b209..845933d  inbound -> inbound
06:20:54     INFO - Return code: 0
06:20:54    ERROR - Unable to push mozilla-inbound.
06:20:54     INFO - Writing to file /opt/vcs2vcs/build/upload/repo_update.json
06:20:54     INFO - Contents:
06:20:54     INFO -  {
06:20:54     INFO -      "last_pull_datetime": "2014-03-20 06:19 PDT", 
06:20:54     INFO -      "last_pull_timestamp": 1395321561, 
--
06:21:14     INFO -              "push_datetime": "2014-03-18 18:34 PDT", 
06:21:14     INFO -              "push_timestamp": 1395192851
06:21:14     INFO -          }
06:21:14     INFO -      }
06:21:14     INFO -  }
06:21:14    FATAL - Unable to push these repos:
06:21:14    FATAL - mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!
06:21:14    FATAL - 
06:21:14    FATAL - Running post_fatal callback...
06:21:14     INFO - Job took 132 seconds.
06:21:14     INFO - Getting output from command: ['egrep', '-C5', '^[0-9:]+ +(ERROR|CRITICAL|FATAL) -', '/opt/vcs2vcs/logs/beagle_info.log']
06:21:14     INFO - Copy/paste: egrep -C5 "^[0-9:]+ +(ERROR|CRITICAL|FATAL) -" /opt/vcs2vcs/logs/beagle_info.log
Trees reopened at 7:20am PDT.
Found in logs for pushes to git.m.o:releases/gecko.git:
2014-03-20T07:39+0000
Everything up-to-date
2014-03-20T07:50+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:02+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:14+0000
Everything up-to-date
2014-03-20T08:30+0000
Everything up-to-date
2014-03-20T08:42+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:53+0000
Everything up-to-date

Can we determine why these disconnects happened?
Flags: needinfo?(hwine) → needinfo?(ludovic)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #14)

> Can we determine why these disconnects happened?


Mar 20 06:13:17 git1 puppet-agent[17728]: (/Stage[main]/Gitolite3::Config/Exec[refresh-authkeys]) /usr/bin/gitolite trigger SSH_AUTHKEYS returned 25 instead of one of [0]
Mar 20 13:35:07 git1.dmz.scl3.mozilla.com kernel: [ 5442]     0  5442    16024       25  17     -17         -1000 sshd


Mar 20 08:42:14 git1 sshd[30835]: Connection from 10.22.74.210 port 49234
Mar 20 08:42:14 git1 sshd[30836]: Connection from 10.22.74.212 port 20752
Mar 20 08:42:14 git1 sshd[30836]: Did not receive identification string from 10.22.74.212
Mar 20 08:42:14 git1 sshd[30835]: Did not receive identification string from 10.22.74.210
Mar 20 08:42:14 git1 sshd[30837]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:14 git1 sshd[30837]: Connection from 10.22.74.208 port 46613
Mar 20 08:42:14 git1 sshd[30837]: Did not receive identification string from 10.22.74.208
Mar 20 08:42:16 git1 sshd[30840]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30841]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30842]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30843]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:17 git1 sshd[30841]: Connection from 10.22.74.210 port 52620
Mar 20 08:42:17 git1 sshd[30841]: Did not receive identification string from 10.22.74.210
Mar 20 08:42:40 git1 sshd[30885]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30884]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30880]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30882]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30881]: Set /proc/self/oom_score_adj to 0

Looks like we were running out of memory.
Flags: needinfo?(ludovic)
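A quick way to confirm the OOM-killer activity behind log excerpts like the above, assuming the standard RHEL syslog location:

# look for oom-killer activity around the incident window
grep -iE 'out of memory|oom-killer|killed process' /var/log/messages | tail -50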
From:

1) 4:58:12 fubar: from what I saw earlier, upload-packs eating all of the memory. I don't know which repos, yet, though.

and:

2) "The box in question is having problems because there are currently 47 git-pack-object processes running (each with a number of threads equal to the # of visible processors in the box)."

I would expect this to be a problem with git clones, not git pushes, since:

1) http://git-scm.com/docs/git-upload-pack.html
"Invoked by git fetch-pack, learns what objects the other side is missing, and sends them after packing."

2) http://git-scm.com/docs/git-pack-objects.html
"Reads list of objects from the standard input, and writes a packed archive with specified base-name, or to the standard output."

So it looks to me like git-upload-pack is calling git-pack-objects in response to a git clone on the client side. If these clones are hanging, eventually there are so many in flight that the server runs out of memory and falls over.

I *don't* think we need to look at the vcs2vcs push process; I think we need to see whether clones are hanging or taking an excessive amount of time.
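One hedged way to check that, for whoever has shell on git1 (the port list is a guess at what the ZLB forwards, not confirmed):

# long-running clone servers: any git-upload-pack with a large ETIME is suspect
ps -eo pid,etime,rss,args | grep '[g]it-upload-pack'

# per-client connection counts, to spot a single noisy source behind the releng firewall
netstat -tn | awk '$4 ~ /:(9418|443|80)$/ {print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn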
Status Update, with thanks to :fubar & :pmoore for lots of help:

Findings to date:
 - all data consistent with all visible symptoms being caused by OOM
 - all data consistent with OOM being caused by "excessive" simultaneous clones from clients
 - the source of the clones appears to be the build farm (via the releng firewall)
 - no data points to any issues with the vcs-sync conversion systems (legacy or new)

A git server will occasionally decide a repack is needed while building an "upload pack" for the client to download. Exactly when that happens is a bit squishy if all settings are left at their defaults.

If the same repository is cloned simultaneously by many clients and needs a repack, each serving process is likely to decide independently that a repack is needed. By default, repacks are free to use as many cores and as much memory as they like. During the events noted in the logs, between 35 and 47 repacks of a single repo appeared to be running at the same time.

As the number of supported b2g devices and branches increases, this becomes a more common condition. Each push can start 9-16 b2g device builds, and each device build (worst case) clones the 90 git repos needed for a b2g build in parallel, i.e. between 810 and 1,440 clone operations (worst case) in a short period of time. And we have 8 b2g branches active at the current time.

Fortunately, there is a lot of caching in the build farm, so few of the builds need a full clone, but there is the potential for spikes here, and for improvements.

Mitigation steps:
 - :fubar is limiting the amount of memory available for on-the-fly repacks (git config value pack.windowMemory) (today)
 - :hwine will file a bug to ensure all b2g scheduled start times are staggered by at least 10 minutes (today)
 - ensure scheduled repacks happen in a controlled manner so the on-the-fly repacks are never needed (bug 980457)

Remaining open questions:
 - one event noted 35 simultaneous clones of releases/gecko.git -- that repo is not cloned by any build farm machinery. What is the source of that load?
 - should the number of parallel repack jobs (git config pack.threads) be reduced as well?
Depends on: 980457
commit 857d9e5bc049716ed898a905465da11b9293acd0

repo @all
    config pack.windowMemory = 512m
Assignee: ludovic → klibby
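For reference, the equivalent settings applied directly to a bare repo rather than via gitolite; the pack.threads value is illustrative (it addresses the open question above) and was not deployed here:

# run inside the bare repository on the server
git config pack.windowMemory 512m   # cap memory used by the delta window during on-the-fly packs
git config pack.threads 2           # illustrative: stop each pack from using every visible CPU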
This spiked the alert threshold just now. Looking at the top output, the high-CPU culprit is the git-pack job again. I also noticed that the git-pack jobs have a nice setting of 20 (same as apache); renicing them to 19 dropped the load significantly. I only did this for the 4 most taxing jobs at the moment, and they finished within 5 minutes.

Starting these jobs with nice 19 is easy and maybe the cure:

nice -n 19 ./command <args>
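A hedged sketch of doing the same renicing in one go for whatever pack jobs are already running; ionice is an extra idea, not something that was done here:

# drop CPU priority of all running pack jobs (19 = lowest)
renice -n 19 -p $(pgrep -f git-pack-objects)

# optionally also deprioritize their disk IO
ionice -c 3 -p $(pgrep -f git-pack-objects)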
Hit this again tonight but I wasn't aware of this bug... Shall try re-nicing, if I get paged for this again.
See Also: → 990344
Another spike hit this morning and ashish caught it - minimal influence on the trees.
This seems to line up with our periodic b2g builds. Those generally happen every 6 hours on various trees.

This morning they were triggered at 6am, and previously at midnight, and 6pm last night.
Blocks: 990344
Got this again today - but it didn't close the tree.
Okay, we shouldn't have had any issue here with this low load.

I've attached a screenshot of the load during this event; I see the following differences from prior events:
 a) no significant bump in swap usage
 b) very long time to free all swap from the 0130Z (1830PT) load

:bkero, can you see what was happening at this time?
Flags: needinfo?(bkero)
I've looked in apache logs and found that apache was running out of listeners, so I've adjusted httpd.conf to make up for that.

However, these settings will not go into effect until apache is hard-restarted. Releng is concerned about this burning builds, so this will need to be scheduled. Or if things break again, IT on-call has permission to 'service httpd restart' on git.mozilla.org
Flags: needinfo?(bkero)
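Rough shape of the change, with placeholder numbers; the actual values adjusted in httpd.conf are not recorded in this bug:

# confirm which MPM is in use
httpd -V | grep -i mpm

# directives to raise in /etc/httpd/conf/httpd.conf (placeholder values):
#   ServerLimit 512
#   MaxClients  512

# takes effect only after a hard restart, which releng wants scheduled
service httpd restart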
(In reply to Ben Kero [:bkero] from comment #26)
> I've looked in apache logs and found that apache was running out of
> listeners, so I've adjusted httpd.conf to make up for that.
> 
> However, these settings will not go into effect until apache is
> hard-restarted. Releng is concerned about this burning builds, so this will
> need to be scheduled. Or if things break again, IT on-call has permission to
> 'service httpd restart' on git.mozilla.org

This hit the trees again; trees closed at 01:13:48am Pacific. Ashish (IT on-call) is working on restarting httpd per comment 26.
Per comment 26, httpd was kicked at Apr 16 01:15 Pacific.
Trees reopened at 2:26 after confirming with hwine, pmoore and ashish that the load had reduced.
Trees closed at 1:32am since builds/tests were failing again with error 500 after another spike.
I also had to reload apache @ 01:15 PST to clear dead apache workers.
Rather, 01:45 PST.
Trees reopened at 2:11am since the load had reduced.
When I first looked at the server, swap was pegged at 98-100% and RAM was getting close to 100% as well: http://mzl.la/1gKltpU (hwine's graph suggests the same). I think increasing the swap space from 2 GB to 18 GB (half of the RAM) will help carry the load of these git-upload job spikes; swap = RAM x 0.5 is a general rule of thumb for servers. Disk space is cheap, and we can also do this on the fly with a swapfile. Not to say this was configured improperly - most of the time we don't run servers to the point that they swap. In this case it seems OK to crunch the server for a few minutes a day as long as apache and git keep working. Worth a try, imho.
Swapping is bad, but if we're going to swap then running out of it is probably worse. I'm very leery of IO wait, though; graphite shows pretty substantial wait times last night. In any case, it's worth a try - it's easy enough to add/remove the extra swap if it doesn't go in our favor.
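A minimal sketch of adding swap on the fly as proposed above; the path and size are illustrative:

# create and enable an extra 16 GB swapfile
dd if=/dev/zero of=/swapfile bs=1M count=16384
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# persist across reboots
echo '/swapfile none swap defaults 0 0' >> /etc/fstab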
Depends on: 997646
Depends on: 997702
Blocks: 997746
Have we tried disabling automatic git gc on the server-side git repos? In my experience it's very difficult, if possible at all, to repack without causing OOMs in random things running on the same machine, and it buys you very little in terms of the amount of disk space saved.

git config gc.auto 0 on the server, and run git gc from cron.
In running various tests, I've found that running gc manually has either produced zero effect or actually made clones worse; same goes for manually repacking.
Unless your storage is SSD, I wouldn't advise letting the number of packs explode.
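If we do go that route, a minimal sketch; the gitolite repository path is an assumption:

# in each server-side bare repo: never gc/repack on the fly
git config gc.auto 0

# run gc from cron during a quiet window instead, one repo at a time
for r in /var/lib/gitolite3/repositories/*.git; do
  git --git-dir="$r" gc --quiet
done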
(In reply to Ben Kero [:bkero] from comment #2)
> The box in question is having problems because there are currently 47
> git-pack-object processes running (each with a number of threads equal to
> the # of visible processors in the box). This is causing the box to OOM.

Were all 47 for different repos, or were several running for the same repo? If the latter, there's a bug to report to git developers.
Load spiking past 54. git1.dmz.scl3:~root/server-status-2014-04-22--06:17.txt
FWIW - load spikes of <100 are "normal" and "not a major issue". Load spikes >200 with swap usage are where IT now has instructions on what actions to take.
It ended up well past 200.

If <200 is normal then the monitoring should be updated to reflect that. We shouldn't be getting paged for normal events.
It's only normal in the immediate term, while we try to fight this fire. Manual intervention (i.e. babysitting) is needed to ensure it doesn't completely thrash itself to death.

There isn't a clear line below which it can be ignored, though <70 might be the ballpark. It's still possible for trees to burn if the load is ~100 but we're OOM.

I realize it's frustrating for oncall, but we're firefighting at the moment and trying to get mirroring set up to spread out the load.
Attached file elinks output
Flapping on http, https and load again.
Attached file elinks output 2
A spike hit the trees again, so trees closed at 1:17am.
Trees reopened at 2:09am.
After a big spike that ashish fixed/recovered from, load went high again and we ran into another spike affecting the trees, so trees closed again at 3:20am.
Just as info for the current tree closure, here is our current plan to reopen the trees:

1. Suspend merges to reduce load until further notice (the sheriffs list has been informed to coordinate merges with oncall and releng).
2. Retrigger the b2g nightlies based on aurora -> done (nightly generation on m-c was also kicked off automatically).
3. Wait until the nightlies are finished, then retrigger the failed builds on the trees once git is stable.
4. Reopen the trees.
(In reply to Carsten Book [:Tomcat] from comment #50)
> Just as info for the current tree closure, here is our current plan to reopen the trees: [...]

Seems this worked! Trees reopened at 5:14am.
Depends on: 1000256
(In reply to Carsten Book [:Tomcat] from comment #51)
> Seems this worked! Trees reopened at 5:14am.

TBPL still says that trees are closed and links to this bug; am I missing something?
They were re-closed.
(In reply to Benoit Jacob [:bjacob] from comment #53)
> TBPL still says that trees are closed and links to this bug; am I missing
> something?

This is happening so frequently that the bug isn't always being updated. The issue is happening again and all trees are closed. Bug 997702 (via bug 1000245) should hopefully improve the current situation; an email was sent to dev.platform & dev.tree-management.
For random passers-by in this bug: the server that powers git.m.o just had a RAM upgrade and is back online. I've retriggered B2G jobs on all trees and will reopen once things are green and RelEng gives an OK from their end.
We've also added a bundle for kernel/msm, which the repo tool looks for at https://git.mozilla.org/external/caf/kernel/msm/clone.bundle. This should avoid the CPU load on the server when doing initial clones.
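For reference, roughly how such a bundle is produced and consumed; the server-side path is illustrative, and since the repo tool fetches clone.bundle automatically the manual client steps below only matter for non-repo users:

# server side: build the bundle from the bare repo and publish it at .../clone.bundle
git --git-dir=/srv/git/external/caf/kernel/msm.git bundle create clone.bundle --all

# client side (without the repo tool): seed the clone from the bundle, then fetch only the delta
curl -LO https://git.mozilla.org/external/caf/kernel/msm/clone.bundle
git clone clone.bundle msm
cd msm
git remote set-url origin https://git.mozilla.org/external/caf/kernel/msm
git fetch origin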
Given that we have not seen the crisis issue recur, I'm closing this bug and moving the remaining "good things" to the tracker bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
No longer depends on: 980457, 1000256
Blocks: 1013554
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services