Closed
Bug 1040308
Opened 10 years ago
Closed 10 years ago
Trees closed due to constant slowness and unavailability of hg.mozilla.org
Categories
(mozilla.org Graveyard :: Server Operations, task, P1)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: Callek, Assigned: fubar)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
Trees have been closed for some time now. With :AutomatedTester coming into #releng asking for a bug, and being told by fubar that there was none yet, I'm filing this.

From various discussions it looks like RelEng systems are hammering hg.m.o pretty badly, although we don't really know what changed. Remediation for that is bug 1040255, but it's unclear how much, if at all, it will help.

IT was also bringing new hgweb servers down and up as part of swapping the disks out in bug 1026620. That is claimed to have an impact here, since there was fallout from getting and installing the wrong disks, which is leaving a handful of machines out of service.

That is not much info to go on, but I'm hoping that now that this is filed, :fubar and the related teams can put any relevant data here, so we have someplace to follow along.
Assignee
Comment 1•10 years ago
OK, here's the current status: load appears to be release related, though a few systems were found to be hitting hg.m.o more than expected (e.g. transvisionmozfr.org); they appear to be benign so far.

The two nodes out for rebuild were set up with the disks, but we ran into an issue where commit SHAs weren't matching. After much wailing and gnashing of teeth, the 127 repos were re-re-synced and everything appears to be happy. hgweb2 is currently back in rotation on hg.m.o. hgweb4 can go in, but I'm being wary; barring other issues, I will add it to rotation tomorrow morning.

The nagios alerts on httpd max-clients turned out to be a red herring. I'm not sure how the check ever worked, since all but one host was missing the server-status configuration. Fixed in puppet, and nagios is happy again (nodes are at ~4 out of 256 clients).

Apologies for being short in IRC; I was attempting to juggle all of this, plus an opsec issue, at the same time.
Things seem to be loading slowly or outright failing to load again, so I just reclosed everything again.
Comment 3•10 years ago
At about 1525 Pacific, fubar restarted httpd and added in the last node (which I think means hgweb4). Jobs in the release automation are much better now, but we're just sorting out some fallout before we open the trees. I suggest we do that gently when it's time.
Briefly reopened most non-trunk trees before we had to re-close things because of https://tbpl.mozilla.org/php/getParsedLog.php?id=44063643&tree=B2g-Inbound
(In reply to Wes Kocher (:KWierso) from comment #4)
> Briefly reopened most non-trunk trees before we had to re-close things
> because of
> https://tbpl.mozilla.org/php/getParsedLog.php?id=44063643&tree=B2g-Inbound

[17:11] nthomas | actually, I think that's fallout from when things were broken earlier
[17:14] KWierso|sheriffduty | nthomas: happy to reopen if that's the case
[17:15] nthomas | we got load issues, so not yet
[17:15] nthomas | it's as bad as when we got wedged at 3pm
Comment 6•10 years ago
We gradually reopened the trees and monitored load.

[17:37] KWierso|sheriffduty | nthomas: good to reopen non-trunk stuff again? or wait a bit more?
nthomas | KWierso|sheriffduty: lets go ahead, but make try special and do it later
[17:38] KWierso|sheriffduty | nthomas: done
[17:39] nthomas | Let the record show - b2g-i, fx-team, m-b, m-i, try still closed
[17:55] nthomas | so far so good, although not much started
[18:01] nthomas | KWierso|sheriffduty: lets do something busier now, one of the inbounds maybe
KWierso|sheriffduty | nthomas: inbound's open
nthomas | ok
[18:25] nthomas | KWierso|sheriffduty: still fine for load, b2g-i and maybe fx-team too ? only had on push to m-i so far
nthomas | *one push
KWierso|sheriffduty | nthomas: okay
[18:26] KWierso|sheriffduty | nthomas: everything but beta's back to their normal state, and beta's only closed for the merge
nthomas | KWierso|sheriffduty: and try still closed I see, which is good
[18:51] nthomas | load is still looking good
[19:15] nthomas | KWierso|sheriffduty: so wha do you think will be the bigger deal, gaia or try ? in terms of # of landings
[19:18] nthomas | load peaked at 8 at 7pm, and has dropped back below 5
nthomas | I don't think gaia is closed anyway, based on b2g-inbound
KWierso|sheriffduty | nthomas: yeah, we opened gaia a while back
[19:25] nthomas | ah, ok
[19:35] nthomas | KWierso|sheriffduty|phone: lets open try whenever you're ready
[19:37] -->| fabrice (fabrice@495A1E63.5994028D.9CA62458.IP) has joined #releng
KWierso|sheriffduty|phone | nthomas: boom
[19:38] nthomas | I sure hope not
Comment 7•10 years ago
FWIW, I'm seeing significant slowness pushing a mere 14 changesets to Try, due to lock contention:

remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:29394'

I waited three minutes, twice, then gave up.
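Tooling that watches pushes can pick the lock holder out of that message. A minimal sketch — the helper name is mine, and the message format is taken from the output quoted above:

```python
import re

# hg prints this line while a push waits on another transaction's lock.
LOCK_RE = re.compile(
    r"waiting for lock on repository (?P<repo>\S+) held by '(?P<holder>[^']+)'"
)

def parse_lock_holder(line):
    """Return (repo, holder) if `line` is hg's lock-wait message, else None."""
    m = LOCK_RE.search(line)
    return (m.group("repo"), m.group("holder")) if m else None
```

With the message above, this yields the repo path and the `host:pid` pair that identifies which ssh head holds the lock.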
Comment 8•10 years ago
Cancelling a try push can delay later pushes, because the server spends time cleaning up and rebuilding a cache. More details at http://dtor.com/halfire/2014/07/02/2014_06_try_server_update.html.
Comment 9•10 years ago
And try is unwedged, and the load from 10 pushes in 10 minutes hasn't killed us. Trees are back in business.
Updated•10 years ago
Comment 10•10 years ago
And we are back with the problem:

program finished with exit code 255
elapsedTime=60.338841
========= Finished 'hg clone ...' failed (results: 5, elapsed: 1 mins, 4 secs) (at 2014-07-18 00:45:28.540087) =========

Closing trees.
Comment 11•10 years ago
After a load test with just mozilla-inbound open (in coordination with RelEng and infra) we saw no more 500 errors on hg, so we reopened everything at 2:21.
Comment 12•10 years ago
The load on the webheads spiked again between 2340 and 0110 Pacific. Usul responded with httpd restarts. From the graph you can see that fairly large numbers of compile jobs were starting on try and non-try branches around this time. The trends are published at http://builddata.pub.build.mozilla.org/reports/pending/, specifically running.html under that path.
Updated•10 years ago
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 13•10 years ago
The trees may not be closed again, but hgweb front ends are being hammered into the ground again. httpd processes taking up all available memory and swap.
Comment 14•10 years ago
Alerts came in through the time range of Sun 02:24:21 PDT - Sun 02:58:20 PDT. hgweb[135] required multiple graceful restarts of httpd and the OOM killers took out a few httpd processes when I was too slow dealing with the other machines. Seems to have calmed down now, keeping an eye on it.
Comment 15•10 years ago
(In reply to Peter Radcliffe [:pir] from comment #13)
> The trees may not be closed again, but hgweb front ends are being hammered
> into the ground again. httpd processes taking up all available memory and
> swap.

Did you manage to see where the traffic was originating from?
Flags: needinfo?(pradcliffe)
Comment 16•10 years ago
No, I don't know anything about the logging for hg and was dealing with several things at the same time.
Flags: needinfo?(pradcliffe)
Comment 17•10 years ago
(In reply to David Burns :automatedtester from comment #15)
> (In reply to Peter Radcliffe [:pir] from comment #13)
> > The trees may not be closed again, but hgweb front ends are being hammered
> > into the ground again. httpd processes taking up all available memory and
> > swap.
>
> Did you manage to see where the traffic was originating from?

hg tanked again, but this time I managed to gather more data since we aren't in peak hours anywhere yet.

Load is at 24 on all webheads, but that is okay considering each machine has 12 hyperthreaded CPUs. Swap is barely used. So the webheads are all CPU bound now (no I/O wait either).

Looking through request logs, the top 5-6 IPs are all internal. Zeus graphs don't show any significant change in incoming requests. However, at around 00:15 Pacific the rate of requests actually dropped, which is likely the result of the webheads not responding and Zeus promptly returning 500s.

Next would be to look at requests just before the time hg started to fail (~00:20 Pacific) and determine whether the extra requests were for a specific tree. Or something else...

FWIW, a couple of webheads have recoveries in Nagios without my interference, but the service as a whole is still out sick.
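The "top 5-6 IPs" sift described above can be sketched as a small log filter. A minimal sketch, assuming the common/combined Apache access-log layout where the client IP is the first whitespace-separated field (that assumption is mine, not from the comment):

```python
from collections import Counter

def top_clients(log_lines, n=5):
    """Tally requests per client IP in access-log lines and return the
    n busiest clients, assuming the IP is the first field of each line."""
    counts = Counter(
        line.split(None, 1)[0] for line in log_lines if line.strip()
    )
    return counts.most_common(n)
```

Fed the webhead request logs, this gives the per-IP request counts that showed the heaviest hitters were all internal.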
Comment 18•10 years ago
Jobs are timing out (eg https://tbpl.mozilla.org/php/getParsedLog.php?id=45311481&tree=Mozilla-Inbound) and TBPL is failing to load the hg pushlog to display results, so I've just closed all trees. This should hopefully reduce load soon too.
Comment 19•10 years ago
I've kicked httpd on all webheads but that hasn't helped much at all, even with trees closed :-/
Comment 20•10 years ago
I've ack'd the b2g bumper alerts too...
Comment 21•10 years ago
Trees are still closed. Conditions for reopening:
* Being able to load each of the following used by TBPL:
** https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24
** https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&maxhours=24
** https://hg.mozilla.org/integration/b2g-inbound/json-pushes?full=1&maxhours=24
** https://hg.mozilla.org/integration/fx-team/json-pushes?full=1&maxhours=24
* Being able to load https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs and other parts of self-serve (bug 1049430, depends on hg.m.o)
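That reopening checklist is easy to poll mechanically. A minimal sketch with an injectable fetcher so it can be exercised without the network; the URL list is copied from this comment, the helper names are mine:

```python
import urllib.request

# URLs from the reopening conditions above.
CHECK_URLS = [
    "https://hg.mozilla.org/mozilla-central/json-pushes?full=1&maxhours=24",
    "https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&maxhours=24",
    "https://hg.mozilla.org/integration/b2g-inbound/json-pushes?full=1&maxhours=24",
    "https://hg.mozilla.org/integration/fx-team/json-pushes?full=1&maxhours=24",
    "https://secure.pub.build.mozilla.org/buildapi/self-serve/jobs",
]

def fetch(url, timeout=30):
    """Raise if the URL cannot be fully loaded within the timeout."""
    urllib.request.urlopen(url, timeout=timeout).read()

def failing_urls(urls, fetch=fetch):
    """Return the subset of `urls` that fail to load."""
    failed = []
    for url in urls:
        try:
            fetch(url)
        except Exception:
            failed.append(url)
    return failed
```

An empty result from `failing_urls(CHECK_URLS)` would correspond to the reopening conditions being met at that instant.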
Comment 22•10 years ago
Some of the URLs in comment 21 are now working intermittently, albeit very slowly, but the try pushlog is still seeing issues: https://hg.mozilla.org/try/json-pushes?full=1&maxhours=24
Assignee
Comment 23•10 years ago
We were down one node so that hgweb3 could be rebuilt. I have since powered it back up, resynced it, and put it back into rotation.

There was a significant load spike at around 0722 UTC. https://pastebin.mozilla.org/5836042 shows a bunch of device nightlies kicking off at :15. Hits to /build/{mozharness,tools}/ account for approximately 30% of the hits; even though they're not all full pulls, that seems high.

Looking at graphite, there's also another noticeable spike in the evening, which was considerably worse last night with one node out. While I would like to see load spread out more evenly, it seems clear that having even one node out can be a significant problem, which is bad. Will see about getting extra nodes before attempting to rebuild the remaining four, which need more disk.
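The ~30% figure can be reproduced from the access logs with a small filter. A sketch, again assuming Apache common/combined log format, where the quoted request field holds 'METHOD path PROTO' (the function name and log-format assumption are mine):

```python
def path_share(log_lines, prefixes=("/build/mozharness", "/build/tools")):
    """Fraction of well-formed requests whose path starts with one of `prefixes`."""
    total = hits = 0
    for line in log_lines:
        try:
            request = line.split('"')[1]   # e.g. 'GET /build/tools/... HTTP/1.1'
            path = request.split()[1]
        except IndexError:
            continue                       # skip malformed lines
        total += 1
        if path.startswith(prefixes):
            hits += 1
    return hits / total if total else 0.0
```

Run over a webhead's log for the spike window, a result near 0.30 would match the share fubar reports above.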
Comment 24•10 years ago
(In reply to Kendall Libby [:fubar] from comment #23)
> hits to /build/{mozharness,tools}/ account for approximately 30% of the
> hits. even though they're not all full pulls, that seems high.

Seems like bug 1036122, bug 1035660, bug 851398 and similar might help with this.
Updated•10 years ago
Blocks: try-tracker
Comment 25•10 years ago
Comment 26•10 years ago
All of the URLs in comment 21 and comment 22 are loading promptly now; as such, TBPL is working, the recent try pushes have been picked up by the poller, and the self-serve nagios alerts have stopped (bug 1049430). I've therefore reopened the non-try trees, and will open try shortly afterwards presuming the additional load doesn't cause issues (gaia-try being reopened will see a wave of new PR bot commits, aiui).
Comment 27•10 years ago
(In reply to Ed Morley [:edmorley] from comment #25)
> Created attachment 8468396 [details]
> #releng chatlog

FYI, the hg poller that feeds the dashboard retries if it hits network errors. The place it was stuck at was getting the repository list of /gaia-l10n/, which is why it kept banging on that URL. That doesn't send out more requests, btw; it just sends requests at the same pace it always does, but only to the failing URL. Once hg pushes to pulse, reliably so, I'd merrily stop polling; that'd be bug 1022701.
Comment 28•10 years ago
Try trees open now too.
Comment 29•10 years ago
Comment 23 implies our release automation infrastructure is DoSing ourselves because of inefficiencies. Said inefficiencies are already reported and filed under the bugs listed in comment 24. Said bugs have seen no real activity for up to months. While these bugs may not be the root cause, fixing matters shouldn't hurt any party, so why haven't we done that? Needinfo Taras as I believe these bugs fall into his org chart.
Flags: needinfo?(taras.mozilla)
Comment 30•10 years ago
Having issues again. Trees re-closed at 11:50am PT.
Comment 31•10 years ago
Things seem to have stabilized again, so I've opened all trees except Try so we can assess where things stand with some more load on them.
Comment 32•10 years ago
(In reply to Gregory Szorc [:gps] from comment #29)
> Comment 23 implies our release automation infrastructure is DoSing ourselves
> because of inefficiencies. Said inefficiencies are already reported and
> filed under the bugs listed in comment 24. Said bugs have seen no real
> activity for up to months. While these bugs may not be the root cause,
> fixing matters shouldn't hurt any party, so why haven't we done that?
>
> Needinfo Taras as I believe these bugs fall into his org chart.

I see some random bugs. If someone confirms that not cloning mozharness will help, I can push on that. I don't see any evidence that we know what specific thing is killing our try perf.
Flags: needinfo?(taras.mozilla)
Comment 33•10 years ago
Since this relates to a RelEng issue, as far as I can see, can you find someone to confirm/deny that this is the issue?
Flags: needinfo?(taras.mozilla)
Flags: needinfo?(laura)
Comment 34•10 years ago
Right now we're gathering evidence and statistics about traffic and attempting to associate certain events with the issue here. The evidence is pointing in that direction, although nothing has been proven entirely. This is proving a thorny issue to debug. Generally alleviating load will certainly help, although it probably won't fix the issue entirely.
Comment 35•10 years ago
I should add that we're working on adding hardware as well which will help to spread the load. As Ben suggests, we are trying to debug the issue. The Dev Services team is all hands on deck on this and we'll update as we figure it out.
Flags: needinfo?(laura)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31)
> Things seem to have stabilized again, so I've open all trees except Try so
> we can assess where things stand with some more load on them.

Try and try-comm-central reopened around 3PM Pacific Time.
Try re-closed at 2014-08-06T15:48:41 at hwine's request.
And reopened at 4:07PM
Comment 39•10 years ago
Jordan said he'll look into getting clones/pulls of the mozharness/tools repos out of test runs.
Flags: needinfo?(taras.mozilla)
Try reopened yet again at 2014-08-06T16:59:46
Comment 41•10 years ago
Closed try x 3 at hwine's request in #releng
Comment 42•10 years ago
All trees closed due to job failures across multiple repos.
Comment 43•10 years ago
500 ISEs all over again. Trees closed.
Comment 44•10 years ago
It appears that the most recent spike is gone. I've reopened Try and will be gradually reopening the other branches as I get a chance to clean things up.
Comment 45•10 years ago
Try re-closed at Callek's request due to more hgmo spikes and buildbot master command queue backlog.
Try and try-comm-central reopened at Callek's request.
Everything's now reopened.
Comment 48•10 years ago
This struck again between 5pm and 6pm today -- hg.m.o giving 503 and 500 errors, and tbpl having over 50 blue auto-retries of some jobs on this push: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=046dc36c93b1 Seems to be back now, though, for the moment...
Comment 49•10 years ago
Builds might have recovered (maybe, or maybe not), but tests continue to fail after hitting "abort: HTTP Error 504: Gateway Time-out" while cloning https://hg.mozilla.org/build/mozharness. Inconveniently, URLs like https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&maxhours=24 are failing to load too, and those have to load for TBPL to be able to show the extent of the bustage. Closed every tree at 18:38.
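Test automation that trips over transient 504s like this typically wants retries with backoff around the clone. A sketch, not anything the bug's automation actually did; the command runner and sleep are injectable so the retry logic stays testable:

```python
import subprocess
import time

def clone_with_retries(url, dest, attempts=3, base_delay=30,
                       run=None, sleep=time.sleep):
    """Run `hg clone url dest`, retrying transient failures with a
    doubling delay between attempts.

    `run` defaults to invoking hg via subprocess and must return a
    zero exit code on success."""
    if run is None:
        run = lambda: subprocess.run(["hg", "clone", url, dest]).returncode
    delay = base_delay
    for attempt in range(attempts):
        if run() == 0:
            return True
        if attempt < attempts - 1:
            sleep(delay)     # back off before the next attempt
            delay *= 2
    return False
```

Backoff only papers over server-side overload, of course; it spreads the retry storm out rather than eliminating the load.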
Comment 50•10 years ago
(I'm getting Error 504 Gateway Time-out when doing 'hg pull -u' locally, too.)
Comment 51•10 years ago
So this is NOT the same problem as last time, and perhaps it should be a new bug. The last time this bug was reopened, I had no issues doing hg pulls, even remotely.
Closed gaia as well so we don't pile on even more bustage on top of the mystery bustage on the rest of the trees.
Comment 53•10 years ago
I altered a setting in mod_wsgi to a previous value, which should help availability.
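The comment doesn't say which mod_wsgi setting was reverted. For context only, hgweb under mod_wsgi is usually tuned through the daemon-process knobs, along these lines; the values and paths here are purely illustrative, not the configuration used on hg.mozilla.org:

```apache
# Hypothetical hgweb vhost fragment. processes/threads/maximum-requests are
# the usual mod_wsgi levers trading availability against memory use;
# maximum-requests recycles workers before they bloat.
WSGIDaemonProcess hgweb processes=8 threads=4 maximum-requests=1000
WSGIProcessGroup hgweb
WSGIScriptAlias / /var/hg/hgweb.wsgi
```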
Comment 54•10 years ago
(In reply to Bill Gianopoulos [:WG9s] from comment #51)
> The last time this bug was reopened I had no issues doing hg pulls
> even remotely.

I did, and other people have too, at various points (e.g. search for "pull" on http://krijnhoetmer.nl/irc-logs/developers/20140807 for various mentions of people having trouble pulling on 8/7, the last time this was reopened).
Comment 55•10 years ago
Well, I sometimes have trouble with the pulls for my daily builds even when the trees are not closed. But last time there were no pull issues whatsoever. If people cannot pull now, this is different!
Comment 56•10 years ago
The last time this was opened, I could never get the Tinderbox pushlog page to load, but hg pulls always worked. So the symptoms here are not the same.
Comment 57•10 years ago
Reopened at 19:47.
Closed try at hwine's suggestion.
Comment 59•10 years ago
Twice we've opened Try. It looks stable for a while, then performance starts dropping slowly. Closing Try results in near-immediate relief.
Blocks: 979739
Comment 60•10 years ago
Timeouts loading the json-pushes on multiple (non-try) repos -> all trees closed.
Comment 61•10 years ago
To be clearer (as well as more grammatically correct): Timeouts loading json-pushes on multiple (non-try) repos, which is required for TBPL (and thus sheriffing) -> all trees closed.
Comment 62•10 years ago
Also seen an instance of:
https://tbpl.mozilla.org/php/getParsedLog.php?id=45844596&tree=Mozilla-Central

command: START
command: hg pull -r 76b55c0850ca6ba3fe64aa1fb6f26a6024585a96 https://hg.mozilla.org/mozilla-central
command: cwd: /builds/hg-shared/mozilla-central
command: output:
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
pulling from https://hg.mozilla.org/mozilla-central
abort: unknown revision '76b55c0850ca6ba3fe64aa1fb6f26a6024585a96'!
Comment 63•10 years ago
All trees are now open again (have been staggering the reopening as requested, to minimise impact).
Comment 64•10 years ago
Getting gateway timeouts or 503 Service Temporarily Unavailable viewing hgweb pages, and timeouts on json-pushes again; sigh :-( Trees reclosed. Strangely Hal's graphite graph ("hgweb health report") doesn't show a spike yet.
Trees were opened at some point today. Lots of people reporting failures pulling from hg.m.o, so I just closed everything again.
Comment 66•10 years ago
I know this is not going to solve any actual real problem, but can we at least try a try reset? The endless tree closures are too painful to wait for a real fix.
Filed bug 1053558 to see about a try reset. Ongoing discussion in #vcs
Try reset happened in 1053558, all trees reopened for now.
No longer blocks: 979739
Comment 69•10 years ago
This bug is tracking an event that has been resolved. While the underlying root causes are still unresolved, we know what they are and we have bugs and team goals to address them. The primary issue is the scalability of Try. And it is one of my Q4 goals on the Developer Services team to change the architecture of Try so it scales to infinity. This will likely be tracked in bug 1055298. I don't see what value this bug is providing in an open state. So I'm closing it.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → WORKSFORME
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard