Bug 742129 - hg.m.o outage
Summary: hg.m.o outage
Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Classification: Graveyard
Component: Server Operations
Version: other
Platform: x86 All
Importance: -- blocker
Target Milestone: ---
Assigned To: Corey Shields [:cshields]
QA Contact: matthew zeier [:mrz]
Mentors:
Depends on:
Blocks:
Reported: 2012-04-03 18:15 PDT by Nick Thomas [:nthomas]
Modified: 2015-03-12 08:17 PDT
CC List: 8 users
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description Nick Thomas [:nthomas] 2012-04-03 18:15:24 PDT
Nagios is reporting errors with assorted dm-vcviewNN hosts initially, with corresponding 500 errors when I make my own requests. rbryce & cshields are looking into it.
Comment 1 Nick Thomas [:nthomas] 2012-04-03 18:37:44 PDT
I closed Try, Mozilla-Inbound, and Firefox trees to prevent further load getting piled on.
Comment 2 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 18:59:09 PDT
In nagios, I've ack'd all the alerts about hg mirrors being out of sync. Based on past experience with reviving these mirrors, these will likely need manual revival after hg.m.o is back online.
Comment 3 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 19:03:09 PDT
Looping in akeybl and lsblakk because of the impact on tonight's FF12.0b4.
Comment 4 Corey Shields [:cshields] 2012-04-03 19:04:27 PDT
Yup. And just FYI, the mirrors fell out of sync 25 mins after the load went up on the hg cluster, so this is a symptom and not a cause.
Comment 5 Corey Shields [:cshields] 2012-04-03 19:08:34 PDT
Update:

We added back the disabled nodes and they immediately ramped up in load to match the others.

Nothing has changed on the boxes (no new rpms, no puppet changes at the time, etc.). The logs only show this error when this happens:

Premature end of script headers: hgweb.wsgi
Comment 6 Corey Shields [:cshields] 2012-04-03 20:42:53 PDT
One correlation here: the last time try's pushlog db was touched was about the same time we started having issues. I don't see corruption in that pushlog, though.
Comment 7 Corey Shields [:cshields] 2012-04-03 20:45:20 PDT
There was some kind of corruption.  Removing these entries unclogged hg:

26551|112166|17de77dc2913ef4564ddd150051e0e79b79c6250
26551|112165|76e0cfbdaee4817ed567c6bc612f6ba5e3eaa7bf
26551|112164|c32ac80de0a868f404d9c111159fd2acca468759
26551|112163|02a23a6ce874913577ac76f9d698c7ad2ee7df73
26551|112162|2f4fd7a92427fd3bbd2e10a107392eac44326105
26551|112161|a1f945622afd0bc0ffdbc90aa72e743807aead63
26551|112160|d4d45a3aabc43837df276a986bc138279fe5cf83
26551|112159|261fbbaa668135b8de4d3c2da2a5812978e723f8

26551|honzab.moz@firemni.cz|1333500051

Now, normally this would just hork try and give errors on hg hooks.  For some reason though, this time it caused pain through the json-pushes hits, causing wsgi to become unresponsive.

We'll talk around about this and look at pushlog sooner in the future.  (fwiw, pushlog is probably 75% of the problems we have with hg)  :(
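
For anyone reading this later, here is a rough sketch of the kind of check and cleanup described above. It is illustrative only, not the commands that were actually run. It assumes the pushlog is a SQLite file with a `pushlog` table of (id, user, date) and a `changesets` table of (pushid, rev, node), which matches the shape of the rows listed above but is not confirmed in this bug. The function names are made up for the example.

```python
import sqlite3

# Assumed schema (not confirmed by this bug): one row per push in
# `pushlog`, and one row per pushed changeset in `changesets`.
SCHEMA = """
CREATE TABLE pushlog (id INTEGER PRIMARY KEY, user TEXT, date INTEGER);
CREATE TABLE changesets (pushid INTEGER, rev INTEGER, node TEXT);
"""


def orphaned_changesets(conn):
    """Return changesets rows whose pushid has no matching pushlog row."""
    return conn.execute(
        "SELECT c.pushid, c.rev, c.node FROM changesets c "
        "LEFT JOIN pushlog p ON p.id = c.pushid "
        "WHERE p.id IS NULL ORDER BY c.rev"
    ).fetchall()


def empty_pushes(conn):
    """Return pushlog rows that have no changesets attached."""
    return conn.execute(
        "SELECT p.id, p.user, p.date FROM pushlog p "
        "LEFT JOIN changesets c ON c.pushid = p.id "
        "WHERE c.pushid IS NULL ORDER BY p.id"
    ).fetchall()


def remove_push(conn, push_id):
    """Delete one push and its changesets in a single transaction.

    Returns (pushlog rows deleted, changesets rows deleted).
    """
    with conn:  # commits on success, rolls back on an exception
        n_changesets = conn.execute(
            "DELETE FROM changesets WHERE pushid = ?", (push_id,)
        ).rowcount
        n_pushes = conn.execute(
            "DELETE FROM pushlog WHERE id = ?", (push_id,)
        ).rowcount
    return n_pushes, n_changesets


if __name__ == "__main__":
    # In-memory stand-in for a real pushlog file; work on a copy in practice.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO pushlog VALUES (26551, 'honzab.moz@firemni.cz', 1333500051)"
    )
    conn.execute(
        "INSERT INTO changesets VALUES "
        "(26551, 112166, '17de77dc2913ef4564ddd150051e0e79b79c6250')"
    )
    print("orphans:", orphaned_changesets(conn))
    print("empty pushes:", empty_pushes(conn))
    print("removed:", remove_push(conn, 26551))
```

The two queries only catch mismatches between the tables. They would not necessarily have flagged the entries above, since the comment does not say what was wrong with them.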
Comment 8 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 20:52:39 PDT
(In reply to Corey Shields [:cshields] from comment #7)
> We'll talk around about this and look at pushlog sooner in the future. 
> (fwiw, pushlog is probably 75% of the problems we have with hg)  :(

Thanks cshields.

ftr, 

1) FF12.0b4 builds started at 20:39 PDT
2) trees being reopened as I type.
Comment 9 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2012-04-03 20:53:27 PDT
grrr... STAY CLOSED (sorry for the accidental reopen... browser cache bug?)
Comment 10 Nick Thomas [:nthomas] 2012-04-03 20:54:04 PDT
Trees are reopened.
Comment 11 Amy Rich [:arr] [:arich] 2012-04-04 07:06:54 PDT
As an added bit of information, the processes that keep things in sync on the hg mirrors in scl1 got completely wedged, and I had to do a lot of housecleaning to kill all of the wedged and defunct processes. I think I have them all at this point, but bug 742233 covers that in more detail.
