Nagios initially reported errors with assorted dm-vcviewNN hosts, and my own requests returned corresponding 500 errors. rbryce & cshields are looking into it.
I closed Try, Mozilla-Inbound, and Firefox trees to prevent further load getting piled on.
In nagios, I've ack'd all the alerts about hg mirrors being out of sync. Based on past experience with reviving these mirrors, these will likely need manual revival after hg.m.o is back online.
Looping in akeybl and lsblakk because of the impact on tonight's FF12.0b4.
Yup. And just fyi, the mirrors fell out of sync 25 mins after the load went up on the hg cluster, so this is a symptom and not a cause.
Update: We added back the disabled nodes and they immediately ramped up in load to match the others. Nothing has changed on the boxes (no new rpms, no puppet changes at the time, etc.). The only error the logs show when this happens is: Premature end of script headers: hgweb.wsgi
One correlation here: the last time try's pushlog db was touched was about the same time we started having issues. I don't see corruption in that pushlog, though.
There was some kind of corruption. Removing these entries unclogged hg:

26551|112166|17de77dc2913ef4564ddd150051e0e79b79c6250
26551|112165|76e0cfbdaee4817ed567c6bc612f6ba5e3eaa7bf
26551|112164|c32ac80de0a868f404d9c111159fd2acca468759
26551|112163|02a23a6ce874913577ac76f9d698c7ad2ee7df73
26551|112162|2f4fd7a92427fd3bbd2e10a107392eac44326105
26551|112161|a1f945622afd0bc0ffdbc90aa72e743807aead63
26551|112160|d4d45a3aabc43837df276a986bc138279fe5cf83
26551|112159|261fbbaa668135b8de4d3c2da2a5812978e723f8
firstname.lastname@example.org|1333500051

Now, normally this would just hork try and give errors on hg hooks. For some reason though, this time it caused pain through the json-pushes hits, causing wsgi to become unresponsive.

We'll talk around about this and look at pushlog sooner in the future. (fwiw, pushlog is probably 75% of the problems we have with hg) :(
(In reply to Corey Shields [:cshields] from comment #7)
> We'll talk around about this and look at pushlog sooner in the future.
> (fwiw, pushlog is probably 75% of the problems we have with hg) :(

Thanks cshields. ftr,
1) FF12.0b4 builds started at 20:39 PDT
2) trees being reopened as I type.
grrr... STAY CLOSED (sorry for the accidental reopen... browser cache bug?)
Trees are reopened.
As an added bit of information, the processes that keep things in sync on the hg mirrors in scl1 got completely wedged, and I had to do a lot of housecleaning to kill all of the wedged and defunct processes. I think I have them all at this point, but bug 742233 covers that in more detail.