Status

mozilla.org Graveyard
Server Operations
--
critical
RESOLVED INCOMPLETE
6 years ago
3 years ago

People

(Reporter: sheppy, Assigned: jakem)

Tracking

Details

(Reporter)

Description

6 years ago
Looks like at least one of the hosts is broken; please restart them all and see if they come back to life. Thanks!
(Assignee)

Comment 1

6 years ago
Done.
Assignee: server-ops → nmaul
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
(Reporter)

Comment 2

6 years ago
Something more serious must be afoot; it's already failing to respond reliably again. Lots of "Service unavailable" errors and broken connections. Someone needs to figure out what's wrong.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 3

6 years ago
I'm not having any luck replicating this.

Is it the django or deki portion that's failing? Is it always a certain URL that fails, or can it be any element?

What kind of Service Unavailable page are you getting? Bold red lettering (Zeus), or the more normal Apache kind?
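
If it helps narrow this down, checking the Server response header will show which layer answered the request. A rough Python sketch; the URL is a placeholder, not necessarily the affected host:

import urllib.error
import urllib.request

# Fetch headers only and report which server layer answered, to help
# tell a Zeus (load balancer) error page from an Apache (backend) one.
URL = "https://developer.mozilla.org/"  # placeholder

req = urllib.request.Request(URL, method="HEAD")
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.headers.get("Server", "<no Server header>"))
except urllib.error.HTTPError as e:
    # Error responses (503 etc.) still carry headers identifying the layer.
    print(e.code, e.headers.get("Server", "<no Server header>"))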
(Reporter)

Comment 4

6 years ago
Bold red. It hasn't happened for about 10 minutes now.
(Assignee)

Comment 5

6 years ago
We've added this cluster to our ganglia performance monitoring/graphing system. If this happens again we may have better data on it.
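
As a sanity check that the new nodes are actually reporting, gmond's metric tree can be pulled straight off its XML port (TCP 8649 by default). A rough sketch; the hostname is a placeholder:

import socket
import xml.etree.ElementTree as ET

# gmond serves its full metric tree as XML on TCP port 8649 by default.
HOST, PORT = "dekiwiki1.example.com", 8649  # placeholder hostname

chunks = []
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    while True:
        data = sock.recv(65536)
        if not data:
            break
        chunks.append(data)

root = ET.fromstring(b"".join(chunks))
for host in root.iter("HOST"):
    metric = host.find("METRIC[@NAME='load_one']")
    print(host.get("NAME"), metric.get("VAL") if metric is not None else "n/a")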
Severity: critical → major
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → INCOMPLETE
(Reporter)

Comment 6

6 years ago
Having this happen again right now.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
(Reporter)

Comment 7

6 years ago
This is coming and going in waves, where it'll not work at all for a few minutes, then work fine for a while, then stop working again. It's as if some service is dying and being restarted after a while (that's the feeling I get, not some special knowledge I have).

It's making it very difficult to get work done, so I'm bumping the urgency a bit here.
Severity: major → critical
(Assignee)

Comment 8

6 years ago
In other bugs this was determined to be a problem with the database server tm-b01-master01. Its load got extremely high due to lots of disk I/O wait caused by an unrelated database. Since this is more thoroughly documented in other bugs, I'll close this one back out.
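
For reference, the I/O wait share is easy to watch from /proc/stat if anyone wants to keep an eye on the master. A minimal sketch (field 5 of the "cpu" line is iowait):

import time

def cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait irq ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)
after = cpu_times()
deltas = [b - a for a, b in zip(before, after)]
print("iowait: %.1f%%" % (100.0 * deltas[4] / sum(deltas)))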

The TL;DR is: we're investigating what can be done to mitigate this situation. One (highly recommended) improvement would be to make use of this cluster's slave database server(s) for read queries. The slave was not affected by this issue, and would have been far faster to respond.
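
On the django side, that routing could be done with a database router that sends reads to the slave. This is only a sketch; the "default"/"replica" alias names and module path are assumptions, not our actual config:

# myapp/routers.py (hypothetical module path)
class ReadReplicaRouter:
    def db_for_read(self, model, **hints):
        # Plain reads go to the slave, which was unaffected by the
        # master's I/O wait problem.
        return "replica"

    def db_for_write(self, model, **hints):
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same data, so relations are fine.
        return True

settings.py would then point DATABASE_ROUTERS at it, e.g.:

DATABASE_ROUTERS = ["myapp.routers.ReadReplicaRouter"]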

For the record, I don't see any significant issues reported by ganglia for the actual dekiwiki cluster, so I believe all is well there. This appears to have been purely a database concern.
Status: REOPENED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard