Bug 742129 - hg.m.o outage
Product: Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: x86 All
Importance: -- blocker
Target Milestone: ---
Assigned To: Corey Shields [:cshields]
QA Contact: matthew zeier [:mrz]
Reported: 2012-04-03 18:15 PDT by Nick Thomas [:nthomas]
Modified: 2015-03-12 08:17 PDT


Description Nick Thomas [:nthomas] 2012-04-03 18:15:24 PDT
Nagios is reporting errors, initially for assorted dm-vcviewNN hosts, and I get corresponding 500 errors when making my own requests. rbryce & cshields are looking into it.
Comment 1 Nick Thomas [:nthomas] 2012-04-03 18:37:44 PDT
I closed Try, Mozilla-Inbound, and Firefox trees to prevent further load getting piled on.
Comment 2 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 18:59:09 PDT
In Nagios, I've ack'd all the alerts about hg mirrors being out of sync. Based on past experience with reviving these mirrors, they will likely need manual revival after hg.m.o is back online.
Comment 3 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 19:03:09 PDT
looping in akeybl, lsblakk because of impact to tonight's FF12.0b4.
Comment 4 Corey Shields [:cshields] 2012-04-03 19:04:27 PDT
Yup..  and just fyi the mirrors falling out of sync happened 25 mins after the load went up on the hg cluster, this is a symptom and not a cause..
Comment 5 Corey Shields [:cshields] 2012-04-03 19:08:34 PDT

We added back the disabled nodes and they immediately ramped up in load to match the others.

Nothing has changed on the boxes (no new rpms, puppet changes at the time, etc..) - logs only show this error when this happens:

Premature end of script headers: hgweb.wsgi
Comment 6 Corey Shields [:cshields] 2012-04-03 20:42:53 PDT
A correlation here: the last time try's pushlog db was touched was about the same time we started having issues. I don't see corruption in that pushlog, though.
Comment 7 Corey Shields [:cshields] 2012-04-03 20:45:20 PDT
There was some kind of corruption.  Removing these entries unclogged hg:



Now, normally this would just hork try and give errors on hg hooks.  For some reason though, this time it caused pain through the json-pushes hits, causing wsgi to become unresponsive.

We'll talk around about this and look at pushlog sooner in the future.  (fwiw, pushlog is probably 75% of the problems we have with hg)  :(
Comment 8 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 20:52:39 PDT
(In reply to Corey Shields [:cshields] from comment #7)
> We'll talk around about this and look at pushlog sooner in the future. 
> (fwiw, pushlog is probably 75% of the problems we have with hg)  :(

Thanks cshields.


1) FF12.0b4 builds started at 20:39 PDT
2) trees being reopened as I type.
Comment 9 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 20:53:27 PDT
grrr... STAY CLOSED (sorry for the accidental reopen... browser cache bug?)
Comment 10 Nick Thomas [:nthomas] 2012-04-03 20:54:04 PDT
Trees are reopened.
Comment 11 Amy Rich [:arr] [:arich] 2012-04-04 07:06:54 PDT
As an added bit of information, the processes that keep things in sync on the hg mirrors in scl1 got completely wedged, and I had to do a lot of housecleaning to kill all of the wedged and defunct processes. I think I have them all at this point, but bug 742233 covers that in more detail.
