Nagios initially reported errors on assorted dm-vcviewNN hosts, with corresponding 500 errors when I made my own requests. rbryce & cshields are looking into it.
I closed Try, Mozilla-Inbound, and Firefox trees to prevent further load getting piled on.
In nagios, I've ack'd all the alerts about hg mirrors being out of sync. Based on past experience with reviving these mirrors, these will likely need manual revival after hg.m.o is back online.
Looping in akeybl, lsblakk because of impact to tonight's FF12.0b4.
Yup. And just FYI, the mirrors fell out of sync 25 mins after the load went up on the hg cluster, so this is a symptom and not a cause.
We added back the disabled nodes and they immediately ramped up in load to match the others.
Nothing has changed on the boxes (no new rpms, no puppet changes at the time, etc.). The logs only show this error when it happens:
Premature end of script headers: hgweb.wsgi
Correlation here: the last time try's pushlog db was touched was about the same time we started having issues. I don't see corruption in that pushlog, though.
There was some kind of corruption. Removing these entries unclogged hg:
Now, normally this would just hork try and give errors on hg hooks. For some reason though, this time it caused pain through the json-pushes hits, causing wsgi to become unresponsive.
We'll talk around about this and look at pushlog sooner in the future. (fwiw, pushlog is probably 75% of the problems we have with hg) :(
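A minimal sketch of what "looking at pushlog sooner" could mean as a first-pass check. It assumes the pushlog db is an SQLite file, which this bug does not state. The function name and path argument are hypothetical. A clean result only rules out file-level damage; it would not necessarily flag bad entries like the ones removed above.

```python
import sqlite3

def pushlog_integrity(db_path):
    """Run SQLite's built-in integrity check; ['ok'] means no file-level damage.

    Assumption: the pushlog is an SQLite database at db_path (hypothetical).
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # Each row is a one-column tuple holding either 'ok' or a problem description.
    return [row[0] for row in rows]
```

Anything other than a single 'ok' row would be a reason to inspect the db before suspecting the web heads.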
(In reply to Corey Shields [:cshields] from comment #7)
> We'll talk around about this and look at pushlog sooner in the future.
> (fwiw, pushlog is probably 75% of the problems we have with hg) :(
1) FF12.0b4 builds started at 20:39 PDT
2) trees being reopened as I type.
grrr... STAY CLOSED (sorry for the accidental reopen... browser cache bug?)
Trees are reopened.
As an added bit of information, the processes that keep things in sync on the hg mirrors in scl1 got completely wedged, and I had to do a lot of housecleaning to kill all of the wedged and defunct processes. I think I have them all at this point, but bug 742233 covers that in more detail.
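For reference, a rough sketch of how defunct processes like these could be listed on a mirror. This is not the procedure actually used here; the helper names are hypothetical, and it assumes a Linux `ps` that accepts `-eo pid,stat,comm`.

```python
import subprocess

def parse_defunct(ps_output):
    """Return (pid, command) pairs whose state starts with 'Z' (zombie/defunct)."""
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)          # pid, stat, rest of line as command
        if len(parts) == 3 and parts[1].startswith("Z"):
            found.append((int(parts[0]), parts[2]))
    return found

def list_defunct():
    """Hypothetical wrapper: ask ps for every process and keep the defunct ones."""
    out = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_defunct(out)
```

Defunct entries cannot be killed directly; they go away once the parent reaps them or exits, so the wedged parent is the process to look at.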