Closed
Bug 985864
Opened 11 years ago
Closed 11 years ago
Problems (500 etc) on git.m.o - Integration trees and m-c closed
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: fubar)
References
Details
Attachments
(3 files)
TBPL reports ongoing issues with git.m.o, such as HTTP 500 errors, when trying to access git.m.o. Bkero told me on IRC that this is now known to IT and being worked on, but filing for tracking etc.
Reporter
Comment 1•11 years ago
00:58:15 INFO - error: The requested URL returned error: 500 while accessing https://git.mozilla.org/b2g/platform_external_apriori/info/refs
Comment 2•11 years ago
The box in question is having problems because there are currently 47 git-pack-object processes running (each with a number of threads equal to the # of visible processors in the box). This is causing the box to OOM.
This leads me to believe that a script issued a massive amount of repository syncing all at the same time.
Does anybody know anything about that?
Reporter
Comment 3•11 years ago
Since this affected device builds and other tests on m-c, m-i and b2g-i, closing these trees now.
Severity: critical → blocker
Summary: Problems (500 etc) on git.m.o → Problems (500 etc) on git.m.o - Integration trees and m-c closed
Comment 4•11 years ago
We started getting cron mail at 00:52 Pacific; the first was fairly cryptic:
From: vcs2vcs@github-sync3.dmz.scl3
[vcs2vcs]: git.m.o push failed: 128
There are a handful of these messages, and 10s of mails from beagle about not being able to push (makes sense if the machines are unresponsive).
Reporter
Updated•11 years ago
Assignee: relops → server-ops-webops
Component: RelOps → WebOps: Source Control
QA Contact: arich → nmaul
Updated•11 years ago
Assignee: server-ops-webops → ludovic
Comment 5•11 years ago
Taking this because it's paging oncall. bkero did a bunch of work and things are recovering right now.
Comment 6•11 years ago
The problems have recovered after I SIGSTOPped all of the packing processes and started them one-by-one so as to not swamp the box.
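For reference, a minimal sketch of that pause-and-resume approach, assuming the offending processes match 'git pack-objects' (the process pattern and sleep interval are illustrative, not the exact commands run on the box):
# Pause every running pack-objects process so the box stops thrashing.
for pid in $(pgrep -f 'git pack-objects'); do kill -STOP "$pid"; done
# Resume them one at a time, waiting for each to exit before continuing.
for pid in $(pgrep -f 'git pack-objects'); do
  kill -CONT "$pid"
  while kill -0 "$pid" 2>/dev/null; do sleep 10; done
done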
Comment 7•11 years ago
nthomas, can you look in the logs to see which repositories were pushed right before the unavailability?
Flags: needinfo?(nthomas)
Reporter
Comment 8•11 years ago
Trees reopened at 2:30 PDT; looking good so far.
Comment 9•11 years ago
I need to defer to hwine for a proper check, for his detailed knowledge and because there are several places that are pushing.
FTR the first nagios alert was
Thu 00:34:13 PDT [5548] git-zlb.vips.scl3.mozilla.com:HTTP - Port 80 is CRITICAL: CRITICAL - Socket timeout after 30 seconds
Flags: needinfo?(nthomas) → needinfo?(hwine)
Comment 10•11 years ago
nagios-scl3: Thu 06:11:48 PDT [5714] git-zlb.vips.scl3.mozilla.com:HTTPS - Port 443 is CRITICAL: CRITICAL - Socket timeout after 30 seconds (http://m.mozilla.org/HTTPS+-+Port+443)
[2:12pm] nagios-scl3: Thu 06:12:08 PDT [5715] git1.dmz.scl3.mozilla.com:Load is CRITICAL: CRITICAL - load average: 99.84, 62.14, 29.79 (http://m.mozilla.org/Load)
Reporter
Comment 11•11 years ago
and here we go again, closing the trees again
Comment 12•11 years ago
<armenzg_mtg> Usul, Tomcat|sheriffduty wrt to the "beagle" emails - our vcs2vcs system is now working since 6:38
FTR it started failing around 6:20 AM (pasted failing log)
##################################
Unable to push these repos:
mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!
Summary is non-zero:
error - Unable to push mozilla-inbound.
06:20:49 INFO - Copy/paste: git push gitolite3@git.mozilla.org:integration/gecko-dev.git +refs/heads/inbound:refs/heads/inbound
06:20:49 INFO - Using partial env: {'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh',
06:20:49 INFO - 'GIT_SSH_KEY': '~/.ssh/vcs-sync_rsa'}
06:20:49 INFO - Calling ['git', 'push', 'gitolite3@git.mozilla.org:integration/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] with output_timeout 1800
06:20:51 INFO - Connection to git.mozilla.org closed by remote host.
06:20:51 ERROR - fatal: The remote end hung up unexpectedly
06:20:51 ERROR - Return code: 128
06:20:51 ERROR - mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!
06:20:51 INFO - retry: Calling <bound method HgGitScript.run_command of <__main__.HgGitScript object at 0x20628d0>> with args: (['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'],), kwargs: {'error_list': [{'substr': 'command not found', 'level': 'error'}, {'substr': 'Permission denied (publickey).', 'level': 'error'}, {'substr': 'fatal: The remote end hung up unexpectedly', 'level': 'error'}, {'substr': 'does not appear to be a git repository', 'level': 'error'}, {'substr': 'error: src refspec', 'level': 'error'}, {'substr': 'invalid author/committer line -', 'level': 'error'}, {'substr': 'remote: fatal: Error in object', 'level': 'error'}, {'substr': "fatal: sha1 file '<stdout>' write error: Broken pipe", 'level': 'error'}, {'substr': 'error: failed to push some refs to ', 'level': 'error'}, {'substr': 'remote: error: denying non-fast-forward ', 'level': 'error'}, {'substr': '! [remote rejected] ', 'level': 'error'}, {'regex': <_sre.SRE_Pattern object at 0x1deea20>, 'level': 'error'}], 'partial_env': {'GIT_SSH_KEY': '~/.ssh/releng-github-id_rsa', 'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh'}, 'output_timeout': 1800, 'cwd': '/opt/vcs2vcs/build/conversion/beagle/.git'}, attempt #1
06:20:51 INFO - Running command: ['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] in /opt/vcs2vcs/build/conversion/beagle/.git
06:20:51 INFO - Copy/paste: git push git@github.com:mozilla/gecko-dev.git +refs/heads/inbound:refs/heads/inbound
06:20:51 INFO - Using partial env: {'GIT_SSH': '/opt/vcs2vcs/mozharness/external_tools/git-ssh-wrapper.sh',
06:20:51 INFO - 'GIT_SSH_KEY': '~/.ssh/releng-github-id_rsa'}
06:20:51 INFO - Calling ['git', 'push', 'git@github.com:mozilla/gecko-dev.git', '+refs/heads/inbound:refs/heads/inbound'] with output_timeout 1800
06:20:54 INFO - To git@github.com:mozilla/gecko-dev.git
06:20:54 INFO - a77b209..845933d inbound -> inbound
06:20:54 INFO - Return code: 0
06:20:54 ERROR - Unable to push mozilla-inbound.
06:20:54 INFO - Writing to file /opt/vcs2vcs/build/upload/repo_update.json
06:20:54 INFO - Contents:
06:20:54 INFO - {
06:20:54 INFO - "last_pull_datetime": "2014-03-20 06:19 PDT",
06:20:54 INFO - "last_pull_timestamp": 1395321561,
--
06:21:14 INFO - "push_datetime": "2014-03-18 18:34 PDT",
06:21:14 INFO - "push_timestamp": 1395192851
06:21:14 INFO - }
06:21:14 INFO - }
06:21:14 INFO - }
06:21:14 FATAL - Unable to push these repos:
06:21:14 FATAL - mozilla-inbound: Can't push /opt/vcs2vcs/build/conversion/beagle to gitmo-beagle!
06:21:14 FATAL -
06:21:14 FATAL - Running post_fatal callback...
06:21:14 INFO - Job took 132 seconds.
06:21:14 INFO - Getting output from command: ['egrep', '-C5', '^[0-9:]+ +(ERROR|CRITICAL|FATAL) -', '/opt/vcs2vcs/logs/beagle_info.log']
06:21:14 INFO - Copy/paste: egrep -C5 "^[0-9:]+ +(ERROR|CRITICAL|FATAL) -" /opt/vcs2vcs/logs/beagle_info.log
Reporter
Comment 13•11 years ago
trees reopened at 7:20am PDT
Comment 14•11 years ago
Found in logs for pushes to git.m.o:releases/gecko.git:
2014-03-20T07:39+0000
Everything up-to-date
2014-03-20T07:50+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:02+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:14+0000
Everything up-to-date
2014-03-20T08:30+0000
Everything up-to-date
2014-03-20T08:42+0000
ssh_exchange_identification: Connection closed by remote host
fatal: The remote end hung up unexpectedly
2014-03-20T08:53+0000
Everything up-to-date
Can we determine why these disconnects happened?
Flags: needinfo?(hwine) → needinfo?(ludovic)
Comment 15•11 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #14)
> Can we determine why these disconnects happened?
Mar 20 06:13:17 git1 puppet-agent[17728]: (/Stage[main]/Gitolite3::Config/Exec[refresh-authkeys]) /usr/bin/gitolite trigger SSH_AUTHKEYS returned 25 instead of one of [0]
Mar 20 13:35:07 git1.dmz.scl3.mozilla.com kernel: [ 5442] 0 5442 16024 25 17 -17 -1000 sshd
Mar 20 08:42:14 git1 sshd[30835]: Connection from 10.22.74.210 port 49234
Mar 20 08:42:14 git1 sshd[30836]: Connection from 10.22.74.212 port 20752
Mar 20 08:42:14 git1 sshd[30836]: Did not receive identification string from 10.22.74.212
Mar 20 08:42:14 git1 sshd[30835]: Did not receive identification string from 10.22.74.210
Mar 20 08:42:14 git1 sshd[30837]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:14 git1 sshd[30837]: Connection from 10.22.74.208 port 46613
Mar 20 08:42:14 git1 sshd[30837]: Did not receive identification string from 10.22.74.208
Mar 20 08:42:16 git1 sshd[30840]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30841]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30842]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:16 git1 sshd[30843]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:17 git1 sshd[30841]: Connection from 10.22.74.210 port 52620
Mar 20 08:42:17 git1 sshd[30841]: Did not receive identification string from 10.22.74.210
Mar 20 08:42:40 git1 sshd[30885]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30884]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30880]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30882]: Set /proc/self/oom_score_adj to 0
Mar 20 08:42:40 git1 sshd[30881]: Set /proc/self/oom_score_adj to 0
Looks like we were running out of memory.
Flags: needinfo?(ludovic)
Comment 16•11 years ago
From:
1) 4:58:12 fubar: from what I saw earlier, upload-packs eating all of the memory. I don't know which repos, yet, though.
and:
2) "The box in question is having problems because there are currently 47 git-pack-object processes running (each with a number of threads equal to the # of visible processors in the box)."
I would expect this to be a problem with git clones, not git pushes, since:
1) http://git-scm.com/docs/git-upload-pack.html
"Invoked by git fetch-pack, learns what objects the other side is missing, and sends them after packing."
2) http://git-scm.com/docs/git-pack-objects.html
"Reads list of objects from the standard input, and writes a packed archive with specified base-name, or to the standard output."
So it looks to me like git-upload-pack is calling git-pack-objects in response to a git clone on the client side. If these are hanging, maybe there is a problem whereby clone processes hang, and eventually there are so many that the server runs out of memory and falls over.
I *don't* think we need to look at the vcs2vcs push process, I think we need to see if clones are hanging or taking an excessive amount of time.
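A rough sketch of how to check that on the server (the process pattern and output fields are assumptions, not what was actually run on git1):
# Show upload-pack/pack-objects processes with elapsed time and memory use;
# long values in the ELAPSED column would point at hanging or very slow clones.
ps -eo pid,etime,rss,args | egrep 'git[ -](upload-pack|pack-objects)'
# Count how many pack jobs are running at once.
pgrep -fc 'git pack-objects'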
Comment 17•11 years ago
Status Update, with thanks to :fubar & :pmoore for lots of help:
Findings to date:
- all data consistent with all visible symptoms being caused by OOM
- all data consistent with OOM being caused by "excessive" simultaneous clones from clients
- the source of the clones appears to be the build farm (via the releng firewall)
- no data points to any issues with the vcs-sync conversion systems (legacy or new)
A git server will occasionally decide a repack is needed while building an "upload pack" for the client to download. Exactly when that happens is a bit squishy if all settings are left at their defaults.
If the same repository is cloned simultaneously by several clients and needs a repack, it is likely that each main process will decide a repack is needed. By default, repacks are free to use as many cores and as much memory as they'd like. During the events noted in the logs, between 35 and 47 repacks of a single repo appeared to be happening at the same time.
As the number of supported b2g devices and branches increases, this is a more common condition. Each "push" can start 9-16 b2g device builds, with each device build (worst case) cloning the 90 git repos needed for a b2g build in parallel. So between 810 and 1440 clone operations (worst case) in a short period of time. And then we have 8 b2g branches active at the current time.
Fortunately, there is a lot of caching in the build farm, such that few of the builds need a full clone, but there is the potential for spikes here. And the potential for improvements.
Mitigation steps:
- :fubar is limiting the amount of memory available for on-the-fly repacks (git config value pack.windowMemory) (today)
- :hwine will file bug to ensure all b2g scheduled start times are staggered by at least 10 minutes (today)
- ensure scheduled repacks happen in a controlled manner so the on-the-fly repacks are never needed (bug 980457)
Remaining open questions:
- one event noted 35 simultaneous clones of releases/gecko.git -- that repo is not cloned by any build farm machinery. What is source of that load?
- should number of parallel repack jobs (git config pack.threads) be reduced as well?
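For illustration, the server-side knobs referenced above look roughly like this; the values here are examples only (the actual change landed via gitolite in the next comment):
# Cap memory used per thread when packs are built on the fly, and limit pack threads.
git config --system pack.windowMemory 512m
git config --system pack.threads 2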
Assignee
Comment 18•11 years ago
commit 857d9e5bc049716ed898a905465da11b9293acd0
repo @all
config pack.windowMemory = 512m
Updated•11 years ago
Assignee: ludovic → klibby
Comment 19•11 years ago
This spiked the alert threshold just now. Looking at the top output, the high-CPU culprit is the git-pack job again. I also noticed that the git-pack job has a nice setting of 20 (same as apache); renicing these jobs to 19 dropped the load significantly. I only did this for the 4 most taxing jobs at the moment, and they finished within 5 minutes.
Starting these jobs with nice 19 is easy and may be the cure:
nice -n 19 ./command <args>
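For jobs that are already running, a sketch of the same idea with renice (the process match is an assumption):
# Drop the priority of all running pack jobs to nice 19.
renice -n 19 -p $(pgrep -f 'git pack-objects')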
Comment 20•11 years ago
Hit this again tonight but I wasn't aware of this bug... Shall try re-nicing, if I get paged for this again.
Reporter
Comment 21•11 years ago
Another spike hit this morning and ashish caught it - minimal influence on the trees.
Comment 22•11 years ago
This seems to line up with our periodic b2g builds. Those generally happen every 6 hours on various trees.
This morning they were triggered at 6am, and previously at midnight, and 6pm last night.
Comment 24•11 years ago
Got this again today - but it didn't close the tree.
Comment 25•11 years ago
Okay, we shouldn't have had any issue here with this low load.
I've attached a screen shot of the load during this event, and see the following differences from prior events:
a) no significant bump in swap usage
b) very long time to free all swap from the 0130Z (1830PT) load
:bkero can you see what was happening at this time?
Flags: needinfo?(bkero)
Comment 26•11 years ago
I've looked in apache logs and found that apache was running out of listeners, so I've adjusted httpd.conf to make up for that.
However, these settings will not go into effect until apache is hard-restarted. Releng is concerned about this burning builds, so this will need to be scheduled. Or if things break again, IT on-call has permission to 'service httpd restart' on git.mozilla.org
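For context, "running out of listeners" on an Apache prefork setup of this era is typically addressed with directives along these lines; the values below are illustrative, not the ones actually deployed in httpd.conf:
<IfModule prefork.c>
    StartServers        16
    MinSpareServers     16
    MaxSpareServers     64
    ServerLimit         512
    MaxClients          512
    MaxRequestsPerChild 4000
</IfModule>
Raising ServerLimit in particular only takes effect on a full restart, which matches the need for a hard restart noted above.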
Flags: needinfo?(bkero)
Reporter
Comment 27•11 years ago
(In reply to Ben Kero [:bkero] from comment #26)
> I've looked in apache logs and found that apache was running out of
> listeners, so I've adjusted httpd.conf to make up for that.
>
> However, these settings will not go into effect until apache is
> hard-restarted. Releng is concerned about this burning builds, so this will
> need to be scheduled. Or if things break again, IT on-call has permission to
> 'service httpd restart' on git.mozilla.org
Hit the trees again; trees closed at 01:13:48am Pacific. Ashish (IT on call) is working on restarting httpd per comment 26.
Comment 28•11 years ago
Per Comment 26 httpd was kicked at Apr 16 01:15 Pacific.
Reporter
Comment 29•11 years ago
Trees reopened at 2:26 since it was confirmed the load had reduced, after checking with hwine, pmoore and ashish.
Reporter
Comment 30•11 years ago
Trees closed at 1:32am since builds/tests were failing again with error 500 after another spike.
Comment 31•11 years ago
I also had to reload apache @ 01:15PST to clear dead apache workers.
Comment 32•11 years ago
rather 01:45 PST
Reporter
Comment 33•11 years ago
Trees reopened at 2:11am since the load had reduced.
Comment 34•11 years ago
When I first looked at the server, swap was pegged at 98-100%. RAM was getting close to 100% as well; hwine's graph at http://mzl.la/1gKltpU suggests the same. Adjusting the swap space from 2GB to 18GB (half of the RAM) should, I think, help carry the load of these git-upload job spikes. SWAP = RAM x 0.5 is a general rule-of-thumb setting for servers. Hard drive space is cheap, and we can also do this on the fly with a swapfile. Not to say this was configured improperly - most of the time we don't run servers to the point that they swap. In this case it seems OK to crunch the server for a few minutes a day as long as apache and git keep working. Worth a try, imho.
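A sketch of adding swap on the fly with a swapfile, as suggested (the path and size are placeholders):
# Create a 16G swapfile, restrict its permissions, and enable it immediately.
dd if=/dev/zero of=/swapfile bs=1M count=16384
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Make it persistent across reboots.
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab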
Assignee
Comment 35•11 years ago
Swapping is bad but if we're going to swap then running out of it is probably worse. I'm very leery of IO wait, though; graphite shows pretty substantial wait times last night. In any case, worth a try. It's easy enough to add/remove the extra swap if it doesn't go in our favor.
Comment 36•11 years ago
Have we tried disabling the git gc on the server side git repos? In my experience, it's very difficult if at all possible to repack without causing OOMs on random things running on the same machine, and it buys you very little in terms of the amount of space saved on the disk.
Comment 37•11 years ago
git config gc.auto 0 on the server, and run git gc from cron.
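A minimal sketch of that approach, assuming the bare repos live under /var/lib/gitolite3/repositories (the path and schedule are guesses):
# Turn off automatic gc in every bare repo on the server.
find /var/lib/gitolite3/repositories -type d -name '*.git' | while read -r repo; do
  git --git-dir="$repo" config gc.auto 0
done
# /etc/cron.d/git-gc: repack repos one at a time during a quiet window.
# 0 3 * * * root find /var/lib/gitolite3/repositories -type d -name '*.git' -exec git --git-dir={} gc --quiet \;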
Assignee
Comment 38•11 years ago
In running various tests, I've found that running gc manually has either produced zero effect or actually made clones worse; same goes for manually repacking.
Comment 39•11 years ago
Unless your storage is SSD, I wouldn't advise letting the number of packs explode.
Comment 40•11 years ago
(In reply to Ben Kero [:bkero] from comment #2)
> The box in question is having problems because there are currently 47
> git-pack-object processes running (each with a number of threads equal to
> the # of visible processors in the box). This is causing the box to OOM.
Were all 47 for different repos, or were several running for the same repo? If the latter, there's a bug to report to git developers.
Comment 41•11 years ago
Load spiking past 54. git1.dmz.scl3:~root/server-status-2014-04-22--06:17.txt
Comment 42•11 years ago
fwiw - load spikes of <100 are "normal" and "not a major issue". Load spikes > 200 with swap usage are where IT now has instructions on actions to take.
Comment 43•11 years ago
It ended up well past 200.
If <200 is normal then the monitoring should be updated to reflect that. We shouldn't be getting paged for normal events.
Assignee
Comment 44•11 years ago
It's only normal in the immediate term, while we try to fight this fire. Manual intervention (i.e. babysitting) is needed to ensure it doesn't completely trash itself to death.
There isn't a clear line where it can be ignored, though <70 might be ballpark. It's still possible for trees to burn if load's ~100 but we're OOM.
I realize it's frustrating for oncall, but we're firefighting atm and trying to get mirroring set up to spread out the load.
Comment 45•11 years ago
Flapping on http, https and load again.
Comment 46•11 years ago
Reporter
Comment 47•11 years ago
A spike hit the trees again, so trees closed at 1:17am.
Reporter
Comment 48•11 years ago
Trees reopened at 2:09am.
Reporter
Comment 49•11 years ago
After a big spike that ashish fixed/recovered, load etc. went high again and we ran into another spike that affected the trees, so trees were closed again at 3:20am.
Reporter
Comment 50•11 years ago
Just as info for the current tree closure, here is our current plan to reopen the trees:
1. suspend merges to reduce load until further notice (the sheriffs list has been informed to coordinate merges with oncall and releng)
2. retrigger the b2g nightlies based on aurora -> done (nightly generation on m-c was also kicked off automatically)
3. wait until the nightlies are finished and retrigger the failed builds on the trees once git is stable
4. -> reopen the trees
Reporter
Comment 51•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #50)
> Just as info for the current tree closure, here is our current plan to
> reopen the trees:
>
> 1. suspend merges to reduce load until further notice (the sheriffs list has
> been informed to coordinate merges with oncall and releng)
> 2. retrigger the b2g nightlies based on aurora -> done (nightly generation
> on m-c was also kicked off automatically)
> 3. wait until the nightlies are finished and retrigger the failed builds on
> the trees once git is stable
> 4. -> reopen the trees
Seems this worked! Trees reopened at 5:14am.
Comment 53•11 years ago
(In reply to Carsten Book [:Tomcat] from comment #51)
> seems this worked! trees reopened 5:14am
TBPL still says that trees are closed and links to this bug; am I missing something?
Comment 54•11 years ago
They were re-closed.
Comment 55•11 years ago
(In reply to Benoit Jacob [:bjacob] from comment #53)
> TBPL still says that trees are closed and links to this bug; am I missing
> something?
This is happening so frequently that the bug isn't always being updated. The issue is happening again and all trees are closed. Bug 997702 via bug 1000245 should hopefully improve the current situation - an email was sent to dev.platform & dev.tree-management.
Comment 56•11 years ago
For random passers-by in this bug: the server that powers git.m.o just had a RAM upgrade performed and is back online. I've got B2G jobs retriggered on all trees and will reopen once things are green and RelEng gives an OK from their end.
Comment 57•11 years ago
We've also added a bundle for kernel/msm, which the repo tool looks for at https://git.mozilla.org/external/caf/kernel/msm/clone.bundle. This should avoid the CPU load on the server when doing initial clones.
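For anyone curious how such a bundle is produced, a rough sketch (the repository path and web root below are placeholders, not the actual deployment):
# Bundle all refs of the kernel/msm repo and publish the result where the repo
# tool probes for <repo-url>/clone.bundle.
git --git-dir=/path/to/external/caf/kernel/msm.git bundle create /tmp/clone.bundle --all
cp /tmp/clone.bundle /var/www/webroot/external/caf/kernel/msm/clone.bundle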
Comment 58•11 years ago
Given that we have not seen the crisis issue recur, I'm closing this bug and moving the remaining "good things" to the tracker bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services