Closed Bug 742129 Opened 12 years ago Closed 12 years ago

hg.m.o outage

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: cshields)

Details

Nagios initially reported errors on assorted dm-vcviewNN hosts, and I got corresponding 500 errors when making my own requests. rbryce & cshields are looking into it.
I closed Try, Mozilla-Inbound, and Firefox trees to prevent further load getting piled on.
In nagios, I've ack'd all the alerts about hg mirrors being out of sync. Based on past experience with reviving these mirrors, these will likely need manual revival after hg.m.o is back online.
Looping in akeybl and lsblakk because of the impact on tonight's FF12.0b4.
Yup. And just fyi, the mirrors fell out of sync 25 mins after the load went up on the hg cluster, so this is a symptom and not a cause.
Update:

We added back the disabled nodes and they immediately ramped up in load to match the others.

Nothing has changed on the boxes (no new rpms, puppet changes at the time, etc..) - logs only show this error when this happens:

Premature end of script headers: hgweb.wsgi
One correlation here: the last time try's pushlog db was touched was about the same time we started having issues. I don't see corruption in that pushlog, though.
There was some kind of corruption.  Removing these entries unclogged hg:

26551|112166|17de77dc2913ef4564ddd150051e0e79b79c6250
26551|112165|76e0cfbdaee4817ed567c6bc612f6ba5e3eaa7bf
26551|112164|c32ac80de0a868f404d9c111159fd2acca468759
26551|112163|02a23a6ce874913577ac76f9d698c7ad2ee7df73
26551|112162|2f4fd7a92427fd3bbd2e10a107392eac44326105
26551|112161|a1f945622afd0bc0ffdbc90aa72e743807aead63
26551|112160|d4d45a3aabc43837df276a986bc138279fe5cf83
26551|112159|261fbbaa668135b8de4d3c2da2a5812978e723f8

26551|honzab.moz@firemni.cz|1333500051

Now, normally this would just hork try and give errors on hg hooks.  For some reason though, this time it caused pain through the json-pushes hits, causing wsgi to become unresponsive.
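For anyone repeating this kind of cleanup, here is a minimal sketch of removing one push and its changesets in a single transaction. It assumes the pushlog db is SQLite with a `pushlog(id, user, date)` table and a `changesets(pushid, rev, node)` table, which is only inferred from the pipe-delimited rows above; the table and column names, the `remove_push` helper, and the second (untouched) push in the demo data are all hypothetical. It is not necessarily how the entries were actually removed here.

```python
import sqlite3

# ASSUMPTION: schema inferred from the pipe-delimited rows in the comment;
# the real pushlog db may use different table or column names.
SCHEMA = """
CREATE TABLE pushlog (id INTEGER PRIMARY KEY, user TEXT, date INTEGER);
CREATE TABLE changesets (pushid INTEGER, rev INTEGER, node TEXT);
"""


def remove_push(conn, push_id):
    """Delete one push and its changesets atomically.

    Returns (pushlog rows removed, changesets rows removed).
    """
    with conn:  # commits on success, rolls back if either DELETE raises
        n_changesets = conn.execute(
            "DELETE FROM changesets WHERE pushid = ?", (push_id,)
        ).rowcount
        n_pushes = conn.execute(
            "DELETE FROM pushlog WHERE id = ?", (push_id,)
        ).rowcount
    return n_pushes, n_changesets


def build_demo_db():
    """In-memory db holding two rows from the comment plus one hypothetical push."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO pushlog VALUES (?, ?, ?)",
        [
            (26551, "honzab.moz@firemni.cz", 1333500051),  # from the comment
            (26550, "someone@example.com", 1333490000),    # hypothetical
        ],
    )
    conn.executemany(
        "INSERT INTO changesets VALUES (?, ?, ?)",
        [
            (26551, 112160, "d4d45a3aabc43837df276a986bc138279fe5cf83"),
            (26551, 112159, "261fbbaa668135b8de4d3c2da2a5812978e723f8"),
            (26550, 112158, "0" * 40),  # hypothetical placeholder node
        ],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    print(remove_push(db, 26551))
```

Doing both deletes inside one transaction means a failure partway through cannot leave changesets rows pointing at a push that no longer exists, or the reverse.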

We'll talk around about this and look at pushlog sooner in the future.  (fwiw, pushlog is probably 75% of the problems we have with hg)  :(
Assignee: rbryce → cshields
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Corey Shields [:cshields] from comment #7)
> We'll talk around about this and look at pushlog sooner in the future. 
> (fwiw, pushlog is probably 75% of the problems we have with hg)  :(

Thanks cshields.

ftr, 

1) FF12.0b4 builds started at 20:39 PDT
2) trees being reopened as I type.
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
grrr... STAY CLOSED (sorry for the accidental reopen... browser cache bug?)
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Trees are reopened.
As an added bit of information, the processes that keep things in sync on the hg mirrors in scl1 got completely wedged, and I had to do a lot of housecleaning to kill all of the wedged and defunct processes. I think I have them all at this point, but bug 742233 covers that in more detail.
Product: mozilla.org → mozilla.org Graveyard