Closed Bug 997868 Opened 11 years ago Closed 11 years ago

web[2-4].bugs.scl3 down/alerting.

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED INCOMPLETE

People

(Reporter: rwatson, Unassigned)

Details

first machine was:
nagios-scl3 Thu 09:55:08 PDT [5110] web2.bugs.scl3.mozilla.com:SCL3 Zeus - Port 443 is CRITICAL: web2.bugs.scl3.mozilla.com down(2)

Drained it from Zeus and rebooted; it came back online, then the others began alerting.
web3.bugs.scl3.mozilla.com couldn't ping its default router. The Seamicro config looked fine; restarting the network service brought it back to life. web4.bugs.scl3.mozilla.com was out of memory: it looks like there were many instances of sentry.pl, possibly blocked on being unable to connect to errormill.mozilla.org.
(In reply to Ryan Watson [:w0ts0n] from comment #0)
> first machine was:
> nagios-scl3 Thu 09:55:08 PDT [5110] web2.bugs.scl3.mozilla.com:SCL3 Zeus -
> Port 443 is CRITICAL: web2.bugs.scl3.mozilla.com down(2)
>
> Drained it from Zeus and rebooted; it came back online, then the others began
> alerting.

What was the issue here? Same as comment 1?
I'll defer to ashish/gozer, since I was busy on something else, but I believe sentry.pl got wedged trying to communicate with errormill. i.e. they couldn't connect, and the locking caused hundreds of procs to back up, making the machines really unhappy.
Closing this out, not enough data here to investigate further.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INCOMPLETE
(In reply to Kendall Libby [:fubar] from comment #3)
> I'll defer to ashish/gozer, since I was busy on something else, but I
> believe sentry.pl got wedged trying to communicate with errormill. i.e. they
> couldn't connect, and the locking caused hundreds of procs to back up,
> making the machines really unhappy.

Sorry for the delay, but for posterity, here is what I managed to find out; I didn't get to the bottom of the problem.

What I saw was hundreds of sentry.pl processes stuck on the flock() it holds to ensure it runs exclusively; I'm not really sure why. My theory is that it was caused by errormill.mozilla.org, which sentry.pl talks to without specifying a timeout, so it defaulted to 5 minutes. No clue what the root cause was, but I can see at least this behaviour:

+ httpd/cgi invokes system(sentry.pl)
+ sentry.pl takes a flock()
+ sentry.pl takes a long time or stalls against errormill.mozilla.org
+ httpd/cgi invokes system(sentry.pl)
+ sentry.pl waits for the flock()
[...]

Each of these took up an httpd child process and kept it useless to the parent httpd for a long time. Eventually there were enough inbound requests to take up all the child processes and leave the parent httpd apparently stuck from the outside. This also caused a lot of RAM to be used and the OOM killer to start killing things. Overall, not a very happy resulting system.
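As an illustration of the failure mode described above, here is a minimal sketch of the lock-then-report pattern, assuming sentry.pl does something along these lines; the lockfile path, the errormill URL, the LWP::UserAgent usage, and the timeout value are assumptions, not the actual sentry.pl code.

#!/usr/bin/perl
# Hypothetical sketch of the sentry.pl pattern described in comment 5.
# The lockfile path, errormill URL, and LWP::UserAgent usage are assumptions.
use strict;
use warnings;
use Fcntl qw(:flock);
use LWP::UserAgent;

# Exclusivity lock: every new instance blocks here until the previous one finishes.
open my $lock, '>', '/var/tmp/sentry.lock' or die "lockfile: $!";
flock($lock, LOCK_EX) or die "flock: $!";

# Report to errormill. Without an explicit timeout this request can stall for
# minutes, and every instance spawned in the meantime piles up on the flock()
# above, each one holding an httpd child hostage.
my $ua  = LWP::UserAgent->new(timeout => 10);   # the version in this bug had no explicit timeout
my $res = $ua->post('https://errormill.mozilla.org/report', { error => $ARGV[0] // '' });
warn 'errormill report failed: ' . $res->status_line unless $res->is_success;

close $lock;   # releases the lock for the next waiting instance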
(In reply to Philippe M. Chiasson (:gozer) from comment #5)
> What I saw was hundreds of sentry.pl processes stuck on the flock() it holds
> to ensure it runs exclusively; I'm not really sure why.

The exclusivity lock is there to prevent us from flooding errormill with a massive number of simultaneous reports when something major breaks (yes, this has happened).

> My theory is that it was caused by errormill.mozilla.org, which sentry.pl
> talks to without specifying a timeout, so it defaulted to 5 minutes.

The default is 3 minutes, but I agree that's too long. Filed as bug 1002315.
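For contrast, a hypothetical variation (not the actual fix tracked in bug 1002315, which is about the timeout) would be to take the lock non-blocking and drop the report when another instance already holds it, so a stalled errormill can never queue up every httpd child behind the flock(); again, the lockfile path is an assumption.

# Hypothetical variation: non-blocking lock attempt instead of waiting.
use strict;
use warnings;
use Fcntl qw(:flock);

open my $lock, '>', '/var/tmp/sentry.lock' or die "lockfile: $!";
unless (flock($lock, LOCK_EX | LOCK_NB)) {
    # Another sentry.pl is already reporting; skip rather than wait on the lock.
    warn "sentry.pl already running, dropping this report\n";
    exit 0;
}
# ... report to errormill here, with a short timeout ...
close $lock;

Dropping reports trades completeness of error reporting for keeping the web heads responsive, which may or may not be the right trade-off here.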
In any case, never got to the root cause of this. Was there a problem with errormill around that time?
Component: WebOps: Bugzilla → Infrastructure
Product: Infrastructure & Operations → bugzilla.mozilla.org