Looks like at least one of the hosts is broken; please restart them all and see if they come back to life. Thanks!
Something more serious must be afoot; it's not responding reliably again already. Lots of "Service unavailable" errors and broken connections. Someone needs to figure out what's wrong.
I'm not having any luck replicating this. Is it the Django or Deki portion that's failing? Is it always a certain URL that fails, or any element? What kind of Service Unavailable page are you getting? Bold red lettering (Zeus), or the more normal Apache kind?
Bold red. It's not happened for about 10 minutes now.
We've added this cluster to our Ganglia performance monitoring/graphing system. If this happens again we may have better data on it.
Having this happen again right now.
This is coming and going in waves, where it'll not work at all for a few minutes, then work fine for a while, then stop working again. It's as if some service is dying and being restarted after a while (that's the feeling I get, not some special knowledge I have). It's making getting work done very difficult, so bumping the urgency a bit here.
In other bugs this was determined to be a problem with the database server tm-b01-master01. Its load got extremely high due to lots of disk I/O wait, caused by an unrelated database. Since this is more thoroughly documented in other bugs, I'll close this one back out. The TL;DR is: we're investigating what can be done to mitigate this situation. One (highly recommended) improvement would be to make use of the slave database server(s) for this cluster for read queries. The slave was not affected by this issue, and would have been far faster to respond. For the record, I don't see any significant issues reported by Ganglia for the actual dekiwiki cluster, so I believe all is well there. This appears to have been purely a database concern.
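For whoever picks up the read-query suggestion: on the Django side, one way to send reads to a slave is a database router. The following is only a rough sketch, not this cluster's actual configuration. It assumes a Django version with multiple-database support and the current router method signatures. The alias names "default" and "replica" are hypothetical and would have to match entries in the DATABASES setting. This covers only the Django portion; the Deki portion would need its own configuration.

```python
import random


class ReadReplicaRouter:
    """Send read queries to a replica alias and writes to the primary."""

    # Hypothetical alias names; they must match keys in Django's DATABASES setting.
    primary_alias = "default"
    replica_aliases = ["replica"]

    def db_for_read(self, model, **hints):
        # Pick one of the configured replicas for each read query.
        return random.choice(self.replica_aliases)

    def db_for_write(self, model, **hints):
        # All writes go to the primary (master) database.
        return self.primary_alias

    def allow_relation(self, obj1, obj2, **hints):
        # Primary and replicas hold the same data, so relations are fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes run only against the primary; replication copies them.
        return db == self.primary_alias
```

The router would be enabled by listing its dotted path in the DATABASE_ROUTERS setting. One caveat: replicas can lag behind the master, so code that writes a row and immediately reads it back may need to read from the primary explicitly.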