Closed Bug 1040308 Opened 10 years ago Closed 10 years ago

Trees closed due to constant slowness and unavailability of hg.mozilla.org

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Callek, Assigned: fubar)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

So, trees have been closed for some time now. 

With :AutomatedTester coming into #releng asking for a bug and being told by fubar that there was none yet, I'm filing this.

From various conversations it looks like RelEng systems are hammering hg.m.o pretty badly, though we don't really know what changed.

-- Remediation for that is bug 1040255, but it's unclear how much, if at all, it will help.

IT was also cycling new hgweb servers in and out as part of swapping the disks in bug 1026620. This is claimed to have an impact here, since there was fallout from getting and installing the wrong disks, which is leaving a handful of machines out of service.

That is not much info to go on, but I'm hoping that now that it's filed, :fubar and the related teams can put any relevant data in here and have someplace to follow.
OK, here's the current status...

Load appears to be release-related. A few systems were found to be hitting hg.m.o more than expected (e.g. transvisionmozfr.org), but they appear to be benign so far.

The two nodes out for rebuild got their disks, but we ran into an issue where commit SHAs weren't matching. After much wailing and gnashing of teeth, the 127 repos were re-re-synced and everything appears to be happy.

hgweb2 is currently back in rotation on hg.m.o. hgweb4 can go in, but I'm being wary; barring other issues, I'll add it to rotation tomorrow morning.

The Nagios alerts on httpd MaxClients turned out to be a red herring. I'm not sure how the check ever worked, since all but one node was missing the server-status configuration. That's now fixed in Puppet and Nagios is happy again (and nodes are at ~4 out of 256 clients).
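
(For reference, a minimal sketch of the kind of check involved; mod_status's ?auto endpoint is standard Apache, but the URL, host, and the 256-client limit here come from the comment above rather than from the actual production config.)

  # Query Apache's scoreboard via mod_status; assumes /server-status is enabled
  # and reachable from the monitoring host (path and host are illustrative).
  curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers'
  # A Nagios-style check would compare BusyWorkers against MaxClients (256 here)
  # and alert when the ratio crosses a threshold.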

Apologies for being short in IRC; I was attempting to juggle all of this, plus an opsec issue, at the same time.
Things seem to be loading slowly or outright failing to load again, so I just reclosed everything.
At about 1525 Pacific, fubar restarted httpd and added in the last node (which I think means hgweb4). Jobs in the release automation are much better now, but we're just sorting out some fallout before we open the trees. I suggest we do that gently when it's time.
(In reply to Wes Kocher (:KWierso) from comment #4)
> Briefly reopened most non-trunk trees before we had to re-close things
> because of
> https://tbpl.mozilla.org/php/getParsedLog.php?id=44063643&tree=B2g-Inbound

[17:11]	nthomas	| actually, I think that's fallout from when things were broken earlier
[17:14]	KWierso|sheriffduty | nthomas: happy to reopen if that's the case
[17:15]	nthomas	| we got load issues, so not yet
[17:15]	nthomas	| it's as bad as when we got wedged at 3pm
We gradually reopened the trees and monitored load.

[17:37]	KWierso|sheriffduty	nthomas: good to reopen non-trunk stuff again?
	or wait a bit more?
	nthomas	KWierso|sheriffduty: lets go ahead, but make try special and do it later
[17:38]	KWierso|sheriffduty	nthomas: done
[17:39]	nthomas	Let the record show - b2g-i, fx-team, m-b, m-i, try still closed

[17:55]	nthomas	so far so good, although not much started
[18:01]	nthomas	KWierso|sheriffduty: lets do something busier now, one of the inbounds maybe
	KWierso|sheriffduty	nthomas: inbound's open
	nthomas	ok

[18:25]	KWierso|sheriffduty: still fine for load, b2g-i and maybe fx-team too ?
	only had on push to m-i so far
	*one push
	KWierso|sheriffduty	nthomas: okay
[18:26]	KWierso|sheriffduty	nthomas: everything but beta's back to their normal state, and beta's only closed for the merge
	nthomas	KWierso|sheriffduty: and try still closed I see, which is good

[18:51]	load is still looking good

[19:15]	nthomas	KWierso|sheriffduty: so wha do you think will be the bigger deal, gaia or try ?
	in terms of # of landings
[19:18]	nthomas	load peaked at 8 at 7pm, and has dropped back below 5
	nthomas	I don't think gaia is closed anyway, based on b2g-inbound
	KWierso|sheriffduty	nthomas: yeah, we opened gaia a while back
[19:25]	nthomas	ah, ok

[19:35]	nthomas	KWierso|sheriffduty|phone: lets open try whenever you're ready
[19:37]	-->|	fabrice (fabrice@495A1E63.5994028D.9CA62458.IP) has joined #releng
	KWierso|sheriffduty|phone	nthomas: boom
[19:38]	nthomas	I sure hope not
FWIW, I'm seeing significant slowness pushing a mere 14 changesets to Try, due to lock contention.

remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:29394'

I waited three minutes, twice, then gave up.
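
(A minimal retry sketch for anyone hitting the same lock wait; the retry count, sleep, and use of -f are illustrative, not a recommendation from this bug.)

  # Retry the try push a few times with a fixed pause, rather than watching
  # the "waiting for lock" message interactively. Values are illustrative only.
  for i in 1 2 3; do
    hg push -f ssh://hg.mozilla.org/try/ && break
    echo "push attempt $i failed; sleeping 120s" >&2
    sleep 120
  done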
Cancelling a try push can delay later pushes, because the server spends time cleaning up and rebuilding a cache. More details at http://dtor.com/halfire/2014/07/02/2014_06_try_server_update.html.
And try is unwedged, and the load from 10 pushes in 10 minutes hasn't killed us. Trees are back in business.
See Also: → 1040255, 1040433
And we are back with the problem:


program finished with exit code 255
elapsedTime=60.338841
========= Finished 'hg clone ...' failed (results: 5, elapsed: 1 mins, 4 secs) (at 2014-07-18 00:45:28.540087) =========

closing trees
After a load test with just mozilla-inbound open (in coordination with RelEng and infra), we had no more 500 errors on hg, so everything was reopened at 2:21.
The load on the webheads spiked again between 2340 and 0110 Pacific. Usul responded with httpd restarts. From the graph you can see fairly large numbers of compile jobs starting on both try and non-try branches around this time.

The trends are published at http://builddata.pub.build.mozilla.org/reports/pending/, specifically running.html below that.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The trees may not be closed again, but hgweb front ends are being hammered into the ground again. httpd processes taking up all available memory and swap.
Alerts came in through the time range of Sun 02:24:21 PDT - Sun 02:58:20 PDT.

hgweb[135] required multiple graceful restarts of httpd, and the OOM killer took out a few httpd processes when I was too slow dealing with the other machines.
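
(For reference, a sketch of the kind of graceful restart described here; the hostnames follow the hgweb[135] pattern above, and the exact command on the production hosts is an assumption.)

  # Gracefully restart httpd on the affected webheads so in-flight requests
  # can finish instead of being dropped. Hostnames and command are illustrative.
  for host in hgweb1 hgweb3 hgweb5; do
    ssh "$host" 'sudo /usr/sbin/apachectl graceful'
  done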

Seems to have calmed down now, keeping an eye on it.
(In reply to Peter Radcliffe [:pir] from comment #13)
> The trees may not be closed again, but hgweb front ends are being hammered
> into the ground again. httpd processes taking up all available memory and
> swap.

Did you manage to see where the traffic was originating from?
Flags: needinfo?(pradcliffe)
No, I don't know anything about the logging for hg and was dealing with several things at the same time.
Flags: needinfo?(pradcliffe)
Blocks: 1042210
(In reply to David Burns :automatedtester from comment #15)
> (In reply to Peter Radcliffe [:pir] from comment #13)
> > The trees may not be closed again, but hgweb front ends are being hammered
> > into the ground again. httpd processes taking up all available memory and
> > swap.
> 
> Did you manage to see where the traffic was originating from?

hg tanked again, but this time I managed to gather more data since we aren't in peak hours anywhere yet. Load is at 24 on all webheads, but that is okay considering each machine has 12xHT CPUs. Swap is barely used. So the webheads are all CPU-bound now (no I/O wait either).

Looking through request logs, the top 5-6 IPs are all internal. Zeus graphs don't show any significant change in incoming requests. However at around 00:15 Pacific, the rate of requests actually dropped, which is likely the result of the webheads not responding and Zeus promptly returning 500s.

The next step would be to look at requests just before the time hg started to fail (~00:20 Pacific) and determine whether there were more requests for a specific tree, or something else...
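
(A sketch of the kind of log query meant here; the access log path, timestamp format, and timezone are assumptions about the webheads' Apache setup.)

  # Top client IPs in the ten minutes before the failure window (00:10-00:19);
  # field 4 of the common/combined log format is the [timestamp.
  awk '$4 ~ /2014:00:1[0-9]:/ {print $1}' /var/log/httpd/access_log \
    | sort | uniq -c | sort -rn | head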

FWIW, a couple of webheads have recovered in Nagios without my intervention, but the service as a whole is still out sick.
Jobs are timing out (eg https://tbpl.mozilla.org/php/getParsedLog.php?id=45311481&tree=Mozilla-Inbound) and TBPL is failing to load the hg pushlog to display results, so I've just closed all trees. This should hopefully reduce load soon too.
I've kicked httpd on all webheads but that hasn't helped much at all, even with trees closed :-/
I've ack'd the b2g bumper alerts too...
Blocks: 1049430
Some of the URLs in comment 21 are now working intermittently, albeit very slowly, but the try pushlog is still seeing issues:
https://hg.mozilla.org/try/json-pushes?full=1&maxhours=24
We were down one node so that hgweb3 could be rebuilt. I have since powered it back up, resynced it, and put it back into rotation.

There was a significant load spike at around 0722 UTC. https://pastebin.mozilla.org/5836042 shows a bunch of device nightlies kicking off at :15.

Hits to /build/{mozharness,tools}/ account for approximately 30% of all hits. Even though they're not all full pulls, that seems high.
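
(A sketch of how that ~30% figure can be derived from the access logs; the log path and field layout are assumptions.)

  # Fraction of requests whose path starts with /build/mozharness or
  # /build/tools; assumes combined log format, where field 7 is the request path.
  awk '{ total++ } $7 ~ /^\/build\/(mozharness|tools)\// { build++ }
       END { printf "%.1f%% (%d of %d)\n", 100*build/total, build, total }' \
    /var/log/httpd/access_log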

Looking at Graphite, there's also another noticeable spike in the evening, which was considerably worse last night with one node out.

While I would like to see load spread out more evenly, it seems clear that having even one node out can be a significant problem, which is bad. I'll see about getting extra nodes before attempting to rebuild the remaining four, which need more disk.
(In reply to Kendall Libby [:fubar] from comment #23)
> hits to /build/{mozharness,tools}/ account for approximately 30% of the
> hits. even though they're not all full pulls, that seems high.

Seems like bug 1036122, bug 1035660, bug 851398 and similar might help with this.
Blocks: try-tracker
Attached file #releng chatlog —
All of the URLs in comments 21 and 22 are loading promptly now; as such, TBPL is working, the recent try pushes have been picked up by the poller, and the self-serve Nagios alerts have stopped (bug 1049430).

I've therefore reopened the non-try trees, and will open try shortly afterwards, presuming the additional load doesn't cause issues (gaia-try being reopened will see a wave of new PR bot commits, AIUI).
(In reply to Ed Morley [:edmorley] from comment #25)
> Created attachment 8468396 [details]
> #releng chatlog

FYI, the hg poller that feeds the dashboard retries if it hits network errors. The place it was stuck was getting the repository list of /gaia-l10n/, which is why it kept banging on that URL.

That doesn't send out more requests, BTW; it just sends requests at the same pace it always does, but only to the failing URL.
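
(A minimal sketch of that behaviour, i.e. a fixed-interval retry of the one failing URL; the URL and 60-second interval are illustrative, not the dashboard's actual configuration.)

  # Keep re-requesting the same repository listing at a fixed pace until it
  # succeeds, without ramping up request volume. Illustrative values only.
  while ! curl -fsS 'https://hg.mozilla.org/gaia-l10n/' >/dev/null; do
    sleep 60
  done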

Once hg pushes to Pulse reliably, I'd merrily stop polling; that'd be bug 1022701.
Try trees open now too.
Comment 23 implies our release automation infrastructure is DoSing ourselves because of inefficiencies. Said inefficiencies are already reported and filed under the bugs listed in comment 24. Said bugs have seen no real activity for up to months. While these bugs may not be the root cause, fixing matters shouldn't hurt any party, so why haven't we done that?

Needinfo Taras as I believe these bugs fall into his org chart.
Flags: needinfo?(taras.mozilla)
Having issues again. Trees re-closed at 11:50am PT.
Things seem to have stabilized again, so I've opened all trees except Try so we can assess where things stand with some more load on them.
(In reply to Gregory Szorc [:gps] from comment #29)
> Comment 23 implies our release automation infrastructure is DoSing ourselves
> because of inefficiencies. Said inefficiencies are already reported and
> filed under the bugs listed in comment 24. Said bugs have seen no real
> activity for up to months. While these bugs may not be the root cause,
> fixing matters shouldn't hurt any party, so why haven't we done that?
> 
> Needinfo Taras as I believe these bugs fall into his org chart.

I see some random bugs. If someone confirms that not cloning mozharness will help, I can push on that. I don't see any evidence that we know what specific thing is killing our try perf.
Flags: needinfo?(taras.mozilla)
Since this relates to a RelEng issue, as far as I can see, can you find someone to confirm or deny that this is the issue?
Flags: needinfo?(taras.mozilla)
Flags: needinfo?(laura)
We're currently gathering evidence and statistics about traffic and attempting to associate certain events with the issue here. Right now the evidence is pointing towards that, although nothing has been proven conclusively. This is proving a thorny issue to debug. Generally alleviating load will certainly help, although it probably won't fix the issue entirely.
I should add that we're working on adding hardware as well which will help to spread the load.

As Ben suggests, we are trying to debug the issue. The Dev Services team is all hands on deck on this and we'll update as we figure it out.
Flags: needinfo?(laura)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31)
> Things seem to have stabilized again, so I've opened all trees except Try so
> we can assess where things stand with some more load on them.

Try and try-comm-central reopened around 3PM Pacific Time.
Try re-closed at 2014-08-06T15:48:41 at hwine's request.
Jordan said he'll look into getting clones/pulls of the mozharness/tools repos out of test runs.
Flags: needinfo?(taras.mozilla)
Try reopened yet again at 2014-08-06T16:59:46
Depends on: 1049519
Depends on: 1050075
Depends on: 1050109
Closed try x 3 at hwine's request in #releng
All trees closed due to job failures across multiple repos.
Depends on: 1050377
500 ISEs all over again. Trees closed.
It appears that the most recent spike is gone. I've reopened Try and will be gradually reopening the other branches as I get a chance to clean things up.
Try re-closed at Callek's request due to more hgmo spikes and buildbot master command queue backlog.
Try and try-comm-central reopened at Callek's request.
This struck again between 5pm and 6pm today -- hg.m.o giving 503 and 500 errors, and tbpl having over 50 blue auto-retries of some jobs on this push: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=046dc36c93b1

Seems to be back now, though, for the moment...
Builds might have recovered (maybe, or maybe not), but tests continue to fail after hitting "abort: HTTP Error 504: Gateway Time-out" while cloning https://hg.mozilla.org/build/mozharness, and, inconveniently, loading URLs like https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&maxhours=24 is also failing, which TBPL needs in order to show the extent of the bustage.
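
(Until the root cause is fixed, a hedged sketch of the kind of retry-with-backoff the automation could wrap around these clones; the repository URL is from the log above, while the delays and attempt count are arbitrary examples.)

  # Retry the mozharness clone with increasing delays instead of failing the
  # job on the first 504. Delays and attempt count are illustrative only.
  for delay in 30 60 120 240; do
    hg clone https://hg.mozilla.org/build/mozharness mozharness && break
    echo "clone failed; retrying in ${delay}s" >&2
    rm -rf mozharness   # remove any partial clone before retrying
    sleep "$delay"
  done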

Closed every tree at 18:38.
(I'm getting Error 504 Gateway Time-out when doing 'hg pull -u' locally, too.)
So this is NOT the same problem as last time, so perhaps it should be a new bug. The last time this bug was reopened I had no issues doing hg pulls even remotely.
Closed gaia as well so we don't pile on even more bustage on top of the mystery bustage on the rest of the trees.
I reverted a mod_wsgi setting to a previous value, which should help availability.
(In reply to Bill Gianopoulos [:WG9s] from comment #51)
> The last time this bug was reopened I had no issues doing hg pulls
> even remotely.

I did, and other people have too, at various points. (e.g. search for "pull" on http://krijnhoetmer.nl/irc-logs/developers/20140807 for various mentions of people having trouble pulling on 8/7, the last time this was reopened)
Well, I sometimes have trouble with my pulls for my daily builds even when the trees are not closed. So last time there were no issues with pulls whatsoever. If people cannot pull now, this is different!
So the last time this was opened, I could never get the tinderbox pushlog page to load, but hg pulls always worked. So the symptoms here are not the same.
Reopened at 19:47.
Twice we've opened Try. It looks stable for a while, then performance starts dropping slowly. Closing Try results in near-immediate relief.
Timeouts loading the json-pushes on multiple (non-try) repos -> all trees closed.
To be clearer (as well as more grammatically correct):
Timeouts loading json-pushes on multiple (non-try) repos, which is required for TBPL (and thus sheriffing) -> all trees closed.
I've also seen an instance of:

https://tbpl.mozilla.org/php/getParsedLog.php?id=45844596&tree=Mozilla-Central
command: START
command: hg pull -r 76b55c0850ca6ba3fe64aa1fb6f26a6024585a96 https://hg.mozilla.org/mozilla-central
command: cwd: /builds/hg-shared/mozilla-central
command: output:
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
pulling from https://hg.mozilla.org/mozilla-central
abort: unknown revision '76b55c0850ca6ba3fe64aa1fb6f26a6024585a96'!
All trees are now open again (have been staggering the reopening as requested, to minimise impact).
Getting gateway timeouts or 503 Service Temporarily Unavailable viewing hgweb pages, and timeouts on json-pushes again; sigh :-(

Trees reclosed.

Strangely, Hal's Graphite graph ("hgweb health report") doesn't show a spike yet.
Trees were opened at some point today.


Lots of people reporting failures pulling from hg.m.o, so I just closed everything again.
I know this is not going to solve any actual problem, but can we at least try a try reset? The endless tree closures are too painful to sit through while we wait for a real fix.
Filed bug 1053558 to see about a try reset. Ongoing discussion in #vcs
The try reset happened in bug 1053558; all trees are reopened for now.
Depends on: 1053558
Depends on: 1053705
This bug is tracking an event that has been resolved. While the underlying root causes are still unresolved, we know what they are and we have bugs and team goals to address them. The primary issue is the scalability of Try. And it is one of my Q4 goals on the Developer Services team to change the architecture of Try so it scales to infinity. This will likely be tracked in bug 1055298.

I don't see what value this bug is providing in an open state. So I'm closing it.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → mozilla.org Graveyard