Closed Bug 1040308 Opened 10 years ago Closed 10 years ago

Trees closed due to constant slowness and unavailability of hg.mozilla.org

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: Callek, Assigned: fubar)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

So, trees have been closed for some time now. 

With :AutomatedTester coming into #releng asking for a bug and being told by fubar that there was none yet, I'm filing this.

From various conversations it looks like RelEng systems are hammering hg.m.o pretty badly, though we don't really know what changed.

-- Remediation for that is bug 1040255, but it's unclear how much, if at all, it will help.

IT was also cycling new hgweb servers in and out as part of swapping the disks in bug 1026620. This is claimed to have an impact here, since there was fallout from getting and installing the wrong disks, which is leaving a handful of machines out of service.

That is not much info to go on, but I'm hoping that now that it's filed, :fubar and the related teams can put any relevant data in here and have someplace to follow.
OK, here's the current status...

Load appears to be release-related. A few systems were found to be hitting hg.m.o more than expected (e.g. transvisionmozfr.org), but they appear to be benign so far.

The two nodes out for rebuild got their disks, but we ran into an issue where commit SHAs weren't matching. After much wailing and gnashing of teeth, the 127 repos were re-re-synced and everything appears to be happy.

hgweb2 is currently back in rotation on hg.m.o. hgweb4 can go in, but I'm being wary; barring other issues, I'll add it to rotation tomorrow morning.

The Nagios alerts on httpd MaxClients turned out to be a red herring. I'm not sure how the check ever worked, since all but one node was missing the server-status configuration. That's now fixed in Puppet and Nagios is happy again (and nodes are at ~4 out of 256 clients).
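
(For reference, a minimal sketch of the kind of check involved; mod_status's ?auto endpoint is standard Apache, but the URL, host, and the 256-client limit here come from the comment above rather than from the actual production config.)

  # Query Apache's scoreboard via mod_status; assumes /server-status is enabled
  # and reachable from the monitoring host (path and host are illustrative).
  curl -s 'http://localhost/server-status?auto' | grep -E 'BusyWorkers|IdleWorkers'
  # A Nagios-style check would compare BusyWorkers against MaxClients (256 here)
  # and alert when the ratio crosses a threshold.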

Apologies for being short in IRC; I was attempting to juggle all of this, plus an opsec issue, at the same time.
Things seem to be loading slowly or outright failing to load again, so I just reclosed everything.
At about 1525 Pacific, fubar restarted httpd and added in the last node (which I think means hgweb4). Jobs in the release automation are much better now, but we're just sorting out some fallout before we open the trees. I suggest we do that gently when it's time.
(In reply to Wes Kocher (:KWierso) from comment #4)
> Briefly reopened most non-trunk trees before we had to re-close things
> because of
> https://tbpl.mozilla.org/php/getParsedLog.php?id=44063643&tree=B2g-Inbound

[17:11]	nthomas	| actually, I think that's fallout from when things were broken earlier
[17:14]	KWierso|sheriffduty | nthomas: happy to reopen if that's the case
[17:15]	nthomas	| we got load issues, so not yet
[17:15]	nthomas	| it's as bad as when we got wedged at 3pm
We gradually reopened the trees and monitored load.

[17:37]	KWierso|sheriffduty	nthomas: good to reopen non-trunk stuff again?
	or wait a bit more?
	nthomas	KWierso|sheriffduty: lets go ahead, but make try special and do it later
[17:38]	KWierso|sheriffduty	nthomas: done
[17:39]	nthomas	Let the record show - b2g-i, fx-team, m-b, m-i, try still closed

[17:55]	nthomas	so far so good, although not much started
[18:01]	nthomas	KWierso|sheriffduty: lets do something busier now, one of the inbounds maybe
	KWierso|sheriffduty	nthomas: inbound's open
	nthomas	ok

[18:25]	KWierso|sheriffduty: still fine for load, b2g-i and maybe fx-team too ?
	only had on push to m-i so far
	*one push
	KWierso|sheriffduty	nthomas: okay
[18:26]	KWierso|sheriffduty	nthomas: everything but beta's back to their normal state, and beta's only closed for the merge
	nthomas	KWierso|sheriffduty: and try still closed I see, which is good

[18:51]	load is still looking good

[19:15]	nthomas	KWierso|sheriffduty: so wha do you think will be the bigger deal, gaia or try ?
	in terms of # of landings
[19:18]	nthomas	load peaked at 8 at 7pm, and has dropped back below 5
	nthomas	I don't think gaia is closed anyway, based on b2g-inbound
	KWierso|sheriffduty	nthomas: yeah, we opened gaia a while back
[19:25]	nthomas	ah, ok

[19:35]	nthomas	KWierso|sheriffduty|phone: lets open try whenever you're ready
[19:37]	-->|	fabrice (fabrice@495A1E63.5994028D.9CA62458.IP) has joined #releng
	KWierso|sheriffduty|phone	nthomas: boom
[19:38]	nthomas	I sure hope not
FWIW, I'm seeing significant slowness pushing a mere 14 changesets to Try, due to lock contention.

remote: waiting for lock on repository /repo/hg/mozilla/try/ held by 'hgssh1.dmz.scl3.mozilla.com:29394'

I waited three minutes, twice, then gave up.
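
(A minimal retry sketch for anyone hitting the same lock wait; the retry count, sleep, and use of -f are illustrative, not a recommendation from this bug.)

  # Retry the try push a few times with a fixed pause, rather than watching
  # the "waiting for lock" message interactively. Values are illustrative only.
  for i in 1 2 3; do
    hg push -f ssh://hg.mozilla.org/try/ && break
    echo "push attempt $i failed; sleeping 120s" >&2
    sleep 120
  done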
Cancelling a try push can delay later pushes, because the server spends time cleaning up and rebuilding a cache. More details at http://dtor.com/halfire/2014/07/02/2014_06_try_server_update.html.
And try is unwedged, and the load from 10 pushes in 10 minutes hasn't killed us. Trees are back in business.
See Also: → 1040255, 1040433
And we are back with the problem:


program finished with exit code 255
elapsedTime=60.338841
========= Finished 'hg clone ...' failed (results: 5, elapsed: 1 mins, 4 secs) (at 2014-07-18 00:45:28.540087) =========

closing trees
After a load test with just mozilla-inbound open (in coordination with RelEng and infra), we had no more 500 errors on hg, so everything was reopened at 2:21.
The load on the webheads spiked again between 2340 and 0110 Pacific. Usul responded with httpd restarts. From the graph you can see fairly large numbers of compile jobs starting on both try and non-try branches around this time.

The trends are published at http://builddata.pub.build.mozilla.org/reports/pending/, specifically running.html below that.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The trees may not be closed again, but hgweb front ends are being hammered into the ground again. httpd processes taking up all available memory and swap.
Alerts came in through the time range of Sun 02:24:21 PDT - Sun 02:58:20 PDT.

hgweb[135] required multiple graceful restarts of httpd, and the OOM killer took out a few httpd processes when I was too slow dealing with the other machines.
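
(For reference, a sketch of the kind of graceful restart described here; the hostnames follow the hgweb[135] pattern above, and the exact command on the production hosts is an assumption.)

  # Gracefully restart httpd on the affected webheads so in-flight requests
  # can finish instead of being dropped. Hostnames and command are illustrative.
  for host in hgweb1 hgweb3 hgweb5; do
    ssh "$host" 'sudo /usr/sbin/apachectl graceful'
  done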

Seems to have calmed down now, keeping an eye on it.
(In reply to Peter Radcliffe [:pir] from comment #13)
> The trees may not be closed again, but hgweb front ends are being hammered
> into the ground again. httpd processes taking up all available memory and
> swap.

Did you manage to see where the traffic was originating from?
Flags: needinfo?(pradcliffe)
No, I don't know anything about the logging for hg and was dealing with several things at the same time.
Flags: needinfo?(pradcliffe)
Blocks: 1042210
(In reply to David Burns :automatedtester from comment #15)
> (In reply to Peter Radcliffe [:pir] from comment #13)
> > The trees may not be closed again, but hgweb front ends are being hammered
> > into the ground again. httpd processes taking up all available memory and
> > swap.
> 
> Did you manage to see where the traffic was originating from?

hg tanked again, but this time I managed to gather more data since we aren't in peak hours anywhere yet. Load is at 24 on all webheads, but that is okay considering each machine has 12xHT CPUs. Swap is barely used. So the webheads are all CPU-bound now (no I/O wait either).

Looking through request logs, the top 5-6 IPs are all internal. Zeus graphs don't show any significant change in incoming requests. However at around 00:15 Pacific, the rate of requests actually dropped, which is likely the result of the webheads not responding and Zeus promptly returning 500s.

The next step would be to look at requests just before the time hg started to fail (~00:20 Pacific) and determine whether there were more requests for a specific tree, or something else...
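
(A sketch of the kind of log query meant here; the access log path, timestamp format, and timezone are assumptions about the webheads' Apache setup.)

  # Top client IPs in the ten minutes before the failure window (00:10-00:19);
  # field 4 of the common/combined log format is the [timestamp.
  awk '$4 ~ /2014:00:1[0-9]:/ {print $1}' /var/log/httpd/access_log \
    | sort | uniq -c | sort -rn | head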

FWIW, a couple of webheads have recovered in Nagios without my intervention, but the service as a whole is still out sick.
Jobs are timing out (eg https://tbpl.mozilla.org/php/getParsedLog.php?id=45311481&tree=Mozilla-Inbound) and TBPL is failing to load the hg pushlog to display results, so I've just closed all trees. This should hopefully reduce load soon too.
I've kicked httpd on all webheads but that hasn't helped much at all, even with trees closed :-/
I've ack'd the b2g bumper alerts too...
Blocks: 1049430
Some of the URLs in comment 21 are now working intermittently, albeit very slowly, but the try pushlog is still seeing issues:
https://hg.mozilla.org/try/json-pushes?full=1&maxhours=24
We were down one node so that hgweb3 could be rebuilt. I have since powered it back up, resynced it, and put it back into rotation.

There was a significant load spike at around 0722 UTC. https://pastebin.mozilla.org/5836042 shows a bunch of device nightlies kicking off at :15.

Hits to /build/{mozharness,tools}/ account for approximately 30% of all hits. Even though they're not all full pulls, that seems high.
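
(A sketch of how that ~30% figure can be derived from the access logs; the log path and field layout are assumptions.)

  # Fraction of requests whose path starts with /build/mozharness or
  # /build/tools; assumes combined log format, where field 7 is the request path.
  awk '{ total++ } $7 ~ /^\/build\/(mozharness|tools)\// { build++ }
       END { printf "%.1f%% (%d of %d)\n", 100*build/total, build, total }' \
    /var/log/httpd/access_log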

Looking at Graphite, there's also another noticeable spike in the evening, which was considerably worse last night with one node out.

While I would like to see load spread out more evenly, it seems clear that having even one node out can be a significant problem, which is bad. I'll see about getting extra nodes before attempting to rebuild the remaining four, which need more disk.
(In reply to Kendall Libby [:fubar] from comment #23)
> hits to /build/{mozharness,tools}/ account for approximately 30% of the
> hits. even though they're not all full pulls, that seems high.

Seems like bug 1036122, bug 1035660, bug 851398 and similar might help with this.
Blocks: try-tracker
Attached file #releng chatlog —
All of the URLs in comments 21 and 22 are loading promptly now; as such, TBPL is working, the recent try pushes have been picked up by the poller, and the self-serve Nagios alerts have stopped (bug 1049430).

I've therefore reopened the non-try trees, and will open try shortly afterwards, presuming the additional load doesn't cause issues (gaia-try being reopened will see a wave of new PR bot commits, AIUI).
(In reply to Ed Morley [:edmorley] from comment #25)
> Created attachment 8468396 [details]
> #releng chatlog

FYI, the hg poller that feeds the dashboard retries if it hits network errors. The place it was stuck was getting the repository list of /gaia-l10n/, which is why it kept banging on that URL.

That doesn't send out more requests, BTW; it just sends requests at the same pace it always does, but only to the failing URL.
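
(A minimal sketch of that behaviour, i.e. a fixed-interval retry of the one failing URL; the URL and 60-second interval are illustrative, not the dashboard's actual configuration.)

  # Keep re-requesting the same repository listing at a fixed pace until it
  # succeeds, without ramping up request volume. Illustrative values only.
  while ! curl -fsS 'https://hg.mozilla.org/gaia-l10n/' >/dev/null; do
    sleep 60
  done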

Once hg pushes to Pulse reliably, I'd merrily stop polling; that'd be bug 1022701.
Try trees open now too.
Comment 23 implies our release automation infrastructure is DoSing ourselves because of inefficiencies. Said inefficiencies are already reported and filed under the bugs listed in comment 24. Said bugs have seen no real activity for up to months. While these bugs may not be the root cause, fixing matters shouldn't hurt any party, so why haven't we done that?

Needinfo Taras as I believe these bugs fall into his org chart.
Flags: needinfo?(taras.mozilla)
Having issues again. Trees re-closed at 11:50am PT.
Things seem to have stabilized again, so I've opened all trees except Try so we can assess where things stand with some more load on them.
(In reply to Gregory Szorc [:gps] from comment #29)
> Comment 23 implies our release automation infrastructure is DoSing ourselves
> because of inefficiencies. Said inefficiencies are already reported and
> filed under the bugs listed in comment 24. Said bugs have seen no real
> activity for up to months. While these bugs may not be the root cause,
> fixing matters shouldn't hurt any party, so why haven't we done that?
> 
> Needinfo Taras as I believe these bugs fall into his org chart.

I see some random bugs. If someone confirms that not cloning mozharness will help, I can push on that. I don't see any evidence that we know what specific thing is killing our try perf.
Flags: needinfo?(taras.mozilla)
Since this relates to a RelEng issue, as far as I can see, can you find someone to confirm or deny that this is the issue?
Flags: needinfo?(taras.mozilla)
Flags: needinfo?(laura)
We're currently gathering evidence and statistics about traffic and attempting to associate certain events with the issue here. Right now the evidence is pointing towards that, although nothing has been proven conclusively. This is proving a thorny issue to debug. Generally alleviating load will certainly help, although it probably won't fix the issue entirely.
I should add that we're working on adding hardware as well which will help to spread the load.

As Ben suggests, we are trying to debug the issue. The Dev Services team is all hands on deck on this and we'll update as we figure it out.
Flags: needinfo?(laura)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #31)
> Things seem to have stabilized again, so I've opened all trees except Try so
> we can assess where things stand with some more load on them.

Try and try-comm-central reopened around 3PM Pacific Time.
Try re-closed at 2014-08-06T15:48:41 at hwine's request.
Jordan said he'll look into getting clones/pulls of the mozharness/tools repos out of test runs.
Flags: needinfo?(taras.mozilla)
Try reopened yet again at 2014-08-06T16:59:46
Depends on: 1049519
Depends on: 1050075
Depends on: 1050109
Closed try x 3 at hwine's request in #releng
All trees closed due to job failures across multiple repos.
Depends on: 1050377
500 ISEs all over again. Trees closed.
It appears that the most recent spike is gone. I've reopened Try and will be gradually reopening the other branches as I get a chance to clean things up.
Try re-closed at Callek's request due to more hgmo spikes and buildbot master command queue backlog.
Try and try-comm-central reopened at Callek's request.
This struck again between 5pm and 6pm today -- hg.m.o giving 503 and 500 errors, and tbpl having over 50 blue auto-retries of some jobs on this push: https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=046dc36c93b1

Seems to be back now, though, for the moment...
Builds might have recovered (maybe, or maybe not), but tests continue to fail after hitting "abort: HTTP Error 504: Gateway Time-out" while cloning https://hg.mozilla.org/build/mozharness, and, inconveniently, loading URLs like https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?full=1&maxhours=24 is also failing, which TBPL needs in order to show the extent of the bustage.
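
(Until the root cause is fixed, a hedged sketch of the kind of retry-with-backoff the automation could wrap around these clones; the repository URL is from the log above, while the delays and attempt count are arbitrary examples.)

  # Retry the mozharness clone with increasing delays instead of failing the
  # job on the first 504. Delays and attempt count are illustrative only.
  for delay in 30 60 120 240; do
    hg clone https://hg.mozilla.org/build/mozharness mozharness && break
    echo "clone failed; retrying in ${delay}s" >&2
    rm -rf mozharness   # remove any partial clone before retrying
    sleep "$delay"
  done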

Closed every tree at 18:38.
(I'm getting Error 504 Gateway Time-out when doing 'hg pull -u' locally, too.)
So this is NOT the same problem as last time, so perhaps it should be a new bug. The last time this bug was reopened I had no issues doing hg pulls even remotely.
Closed gaia as well so we don't pile on even more bustage on top of the mystery bustage on the rest of the trees.
I reverted a mod_wsgi setting to a previous value, which should help availability.
(In reply to Bill Gianopoulos [:WG9s] from comment #51)
> The last time this bug was reopened I had no issues doing hg pulls
> even remotely.

I did, and other people have too, at various points. (e.g. search for "pull" on http://krijnhoetmer.nl/irc-logs/developers/20140807 for various mentions of people having trouble pulling on 8/7, the last time this was reopened)
Well, I sometimes have trouble with my pulls for my daily builds even when the trees are not closed. So last time there were no issues with pulls whatsoever. If people cannot pull now, this is different!
So the last time this was opened, I could never get the tinderbox pushlog page to load, but hg pulls always worked. So the symptoms here are not the same.
Reopened at 19:47.
Twice we've opened Try. It looks stable for a while, then performance starts dropping slowly. Closing Try results in near-immediate relief.
Timeouts loading the json-pushes on multiple (non-try) repos -> all trees closed.
To be clearer (as well as more grammatically correct):
Timeouts loading json-pushes on multiple (non-try) repos, which is required for TBPL (and thus sheriffing) -> all trees closed.
I've also seen an instance of:

https://tbpl.mozilla.org/php/getParsedLog.php?id=45844596&tree=Mozilla-Central
command: START
command: hg pull -r 76b55c0850ca6ba3fe64aa1fb6f26a6024585a96 https://hg.mozilla.org/mozilla-central
command: cwd: /builds/hg-shared/mozilla-central
command: output:
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
pulling from https://hg.mozilla.org/mozilla-central
abort: unknown revision '76b55c0850ca6ba3fe64aa1fb6f26a6024585a96'!
All trees are now open again (have been staggering the reopening as requested, to minimise impact).
Getting gateway timeouts or 503 Service Temporarily Unavailable viewing hgweb pages, and timeouts on json-pushes again; sigh :-(

Trees reclosed.

Strangely, Hal's Graphite graph ("hgweb health report") doesn't show a spike yet.
Trees were opened at some point today.


Lots of people reporting failures pulling from hg.m.o, so I just closed everything again.
I know this is not going to solve any actual problem, but can we at least try a try reset? The endless tree closures are too painful to sit through while we wait for a real fix.
Filed bug 1053558 to see about a try reset. Ongoing discussion in #vcs
The try reset happened in bug 1053558; all trees are reopened for now.
Depends on: 1053558
Depends on: 1053705
This bug is tracking an event that has been resolved. While the underlying root causes are still unresolved, we know what they are and we have bugs and team goals to address them. The primary issue is the scalability of Try. And it is one of my Q4 goals on the Developer Services team to change the architecture of Try so it scales to infinity. This will likely be tracked in bug 1055298.

I don't see what value this bug is providing in an open state. So I'm closing it.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → mozilla.org Graveyard