Closed Bug 701489 Opened 14 years ago Closed 14 years ago

Restart devmo wiki

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: sheppy, Assigned: nmaul)

Details

Looks like at least one of the hosts is broken; please restart them all and see if they come back to life. Thanks!
Done.
Assignee: server-ops → nmaul
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Something more serious must be afoot; it's not responding reliably again already. Lots of "Service unavailable" errors and broken connections. Someone needs to figure out what's wrong.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'm not having any luck replicating this. Is it the django or deki portion that's failing? Always a certain URL that fails, or any element? What kind of Service Unavailable page are you getting? Bold red lettering (Zeus), or the more normal Apache kind?
Bold red. It hasn't happened for about 10 minutes now.
We've added this cluster to our ganglia performance monitoring/graphing system. If this happens again we may have better data on it.
Severity: critical → major
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → INCOMPLETE
Having this happen again right now.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
This is coming and going in waves: it won't work at all for a few minutes, then works fine for a while, then stops working again. It's as if some service is dying and being restarted after a while (that's the feeling I get, not some special knowledge I have). It's making it very difficult to get work done, so bumping the urgency a bit here.
Severity: major → critical
In other bugs, this was traced to the database server tm-b01-master01. Its load got extremely high due to heavy disk I/O wait caused by an unrelated database. Since this is more thoroughly documented in those bugs, I'll close this one back out. The TL;DR: we're investigating what can be done to mitigate this situation. One (highly recommended) improvement would be to use the slave database server(s) for this cluster for read queries. The slave was not affected by this issue and would have responded far faster. For the record, I don't see any significant issues reported by ganglia for the actual dekiwiki cluster, so I believe all is well there. This appears to have been purely a database concern.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → INCOMPLETE
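For illustration only, the read-query suggestion in the closing comment could look roughly like the sketch below for the django portion of the site. This is a minimal sketch, not what was actually deployed: the class name and the "default" and "replica" aliases are hypothetical, the method names follow the database router interface of current Django releases, and the deki portion would need its own mechanism.

```python
import random


class ReadReplicaRouter:
    """Hypothetical router: reads go to a replica, writes go to the master."""

    # Assumed aliases; they would have to match the keys of the DATABASES setting.
    primary_alias = "default"
    replica_aliases = ["replica"]

    def db_for_read(self, model, **hints):
        # Pick one of the replicas so read load stays off the master.
        return random.choice(self.replica_aliases)

    def db_for_write(self, model, **hints):
        # All writes must go to the master so replication stays one-way.
        return self.primary_alias

    def allow_relation(self, obj1, obj2, **hints):
        # Master and replicas hold the same data, so relations are fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes run only on the master and replicate from there.
        return db == self.primary_alias
```

A router like this would typically be enabled by listing its dotted path in the DATABASE_ROUTERS setting. One trade-off to keep in mind is replication lag: a page read from the replica right after a write may briefly show stale data.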
Product: mozilla.org → mozilla.org Graveyard