Closed
Bug 714048
Opened 12 years ago
Closed 12 years ago
MDN wiki is down
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: teoli, Assigned: nmaul)
References
Details
The wiki is down right now: [14:51:14.463] GET https://developer.mozilla.org/en/CSS/font [HTTP/1.1 500 Internal Server Error 1269ms] (And it isn't a time-out this time). Note that this happens several times a week, for about 10 minutes each time.
Reporter
Updated•12 years ago
Summary: The wiki is down → MDN wiki is down
Reporter
Updated•12 years ago
Severity: critical → blocker
Reporter
Updated•12 years ago
Assignee: nobody → server-ops
Component: Deki Infrastructure → Server Operations
Product: Mozilla Developer Network → mozilla.org
QA Contact: infrastructure → cshields
Version: Deki → other
Updated•12 years ago
Assignee: server-ops → ashish
Comment 1•12 years ago
Confirmed. Django and phpbb are still up.
Comment 2•12 years ago
The zeus health check was returning a 500 and zeus failed out the nodes. I've fixed that and zeus shows the nodes as active and serving traffic.
Reporter
Comment 3•12 years ago
Now I get the following, mixed with intermittent timeout-like "Server Unavailable" errors:
[15:55:09.766] GET https://developer.mozilla.org/en/CSS/font [HTTP/1.1 500 Internal Server Error 15525ms]
Site settings could not be loaded
We were unable to locate the API to request site settings. Please see below for debugging information. If this is a new install, try refreshing - the API is simply taking its time loading up!
HTTP Response Status Code: 500
An exception was thrown: Object reference not set to an instance of an object in MySql.Data
Reporter
Comment 5•12 years ago
It is working again for me. Thank you for the help. Could you explain what the problem was? (It may be useful if it happens again.)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 6•12 years ago
I kicked dekiwiki and most page loads are quite snappy now. I've cc'd jakem on this bug already. He'll take it soon after he's in today.
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
Assignee
Comment 7•12 years ago
Taking this and dropping prio since the actual outage is resolved (for now).
To be honest, I'm not sure what we can do here. We've been fighting MindTouch reliability issues for a long time, and have implemented just about every "fix" the vendor has recommended, including installation of a cron script that's supposed to detect failures and automatically kick deki. Even that doesn't seem to get the job done anymore. I guess whatever the underlying issue is that triggers the problem, the script just doesn't detect it.
The 'checkdeki' script from MindTouch checks the host/test and host/status pages in dekiwiki. We could have a situation where perhaps those pages work fine, but actual wiki pages don't for some reason. I'm not sure what would cause that.
However, we have a django-restart script in place that does this:
$CURL -s --connect-timeout 5 -m 15 -I -L -H "Host: developer.mozilla.org" http://localhost:81/en-US/ > /dev/null
It restarts Apache if that check fails. We could do something similar for Dekiwiki. The main difference between this and MT's "checkdeki" would be checking an actual page instead of the status page. Whether or not that will help, I can't really say.
We could also try just arbitrarily kicking deki every X minutes. This will help if the problem is something that builds up over time: periodically restarting should avoid a major outage, at the cost of a minor one (a few seconds). It won't help much if it's something that suddenly happens and breaks it right away: it'd still be down until the cron job comes by and kicks it, and you'd still get the periodic small downtimes when it gets kicked but nothing is wrong.
The best fix will be migrating away from MindTouch, which is hopefully still on the plate. I know development on this has been underway for some time now, but I don't know the current status. In any case, it's likely not close enough to be a quick fix.
Any thoughts on these options?
Assignee: ashish → nmaul
Severity: blocker → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
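For illustration, the page-level check proposed in comment 7 might look roughly like the sketch below. It mirrors the quoted django-restart curl options, with one deliberate difference: `-f` is added, because without it curl exits 0 even when the server answers with an HTTP 500, so only connection failures and timeouts would be caught. The `DEKI_URL` value and the body of `kick_deki` are placeholders assumed for this sketch; the bug does not say which port dekiwiki listens on or how it is restarted.

```shell
#!/bin/sh
# Sketch only: a page-level health check for dekiwiki, modeled on the
# django-restart check quoted in comment 7.

CURL="${CURL:-curl}"
# ASSUMPTION: hypothetical local URL; the real port/path is not given in the bug.
DEKI_URL="${DEKI_URL:-http://localhost/en/CSS/font}"

# ASSUMPTION: placeholder restart hook; replace with the real restart command.
kick_deki() {
    echo "would restart dekiwiki here" >&2
}

# Succeeds only if the page responds in time with a non-error status.
# -f (not in the original command) makes curl fail on HTTP >= 400.
page_ok() {
    "$CURL" -s -f --connect-timeout 5 -m 15 -I -L \
        -H "Host: developer.mozilla.org" "$1" > /dev/null
}

# Probe one real wiki page and kick dekiwiki if the probe fails.
check_and_kick() {
    if page_ok "$DEKI_URL"; then
        return 0
    fi
    kick_deki
    return 1
}

# Only probe when explicitly asked, e.g. from cron: script.sh --run
if [ "${1:-}" = "--run" ]; then
    check_and_kick
fi
```

Whether a HEAD request (`-I`) exercises dekiwiki the same way a full page load does is not established in this bug; dropping `-I` and adding `-o /dev/null` would fetch the whole page instead.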
Comment 8•12 years ago
We're targeting Q1 for MindTouch -> django migration. It's ambitious but possible. We'll discuss MindTouch migration at next week's Wednesday MDN meeting.
Assignee
Comment 9•12 years ago
Tomorrow I will look into a curl test/kick trigger for deki like django has. Once that's in place, we'll try adding 02 and 03 back into the pool.
Comment 10•12 years ago
Thanks Jake. I'll be out tomorrow until Tuesday. :teoli and :sheppy - which pages are good ones to check with curl?
Reporter
Comment 11•12 years ago
:groovecoder I would say:
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date (an average small page)
https://developer.mozilla.org/en/Mozilla_Quirks_Mode_Behavior (which takes a long time to generate)
https://developer.mozilla.org/User:teoli/Bugzilla_test50 (the test page: 50 templates with links to bmo)
If the first times out, we have a big problem (as we did yesterday).
If the second times out, we have a slowness problem.
If the third times out, there may be a problem with the bmo connection, like earlier this week.
Right now, all three are working (in less than 5 seconds).
Assignee
Comment 12•12 years ago
I have set this up, checking all 3 of those URLs. If any one of them fails (times out), it will kick dekiwiki on that server. This runs every 2 minutes, on the opposite minute from the MT 'checkdeki' script. It's very late in the day now though, right before a 3-day weekend. I will turn 02 and 03 back on on Tuesday, rather than doing it now and leaving any fallout for on-call to handle.
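The actual script is not attached to this bug, so the following is only a guess at its shape: probe the three pages from comment 11 and kick dekiwiki if any one fails. The labels, the 15-second timeout, the `kick_deki` body, the use of the public URLs (the real job presumably probed the local node), and the cron line in the comments are all assumptions made for this sketch.

```shell
#!/bin/sh
# Sketch only: probe the three pages suggested in comment 11 and kick
# dekiwiki if any of them fails.
#
# ASSUMPTION: if MT's checkdeki runs on even minutes, an odd-minute cron
# entry would alternate with it, e.g.:
#   1-59/2 * * * * /path/to/this-script --run

TIMEOUT="${TIMEOUT:-15}"   # ASSUMPTION: the real timeout is not stated in the bug

# label|URL pairs; labels follow the reporter's reading of each failure.
CHECKS='small-page|https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date
slow-page|https://developer.mozilla.org/en/Mozilla_Quirks_Mode_Behavior
bmo-templates|https://developer.mozilla.org/User:teoli/Bugzilla_test50'

# ASSUMPTION: placeholder restart hook; replace with the real restart command.
kick_deki() {
    echo "would restart dekiwiki here" >&2
}

# Succeeds if the URL loads within TIMEOUT seconds with a non-error status.
fetch_ok() {
    curl -s -f -m "$TIMEOUT" -o /dev/null "$1"
}

# Print the label of every check that failed, one per line.
failed_checks() {
    echo "$CHECKS" | while IFS='|' read -r label url; do
        fetch_ok "$url" || echo "$label"
    done
}

# Kick dekiwiki if at least one page failed; log which ones.
run_checks() {
    failed=$(failed_checks | tr '\n' ' ')
    [ -z "$failed" ] && return 0
    echo "failed checks: $failed" >&2
    kick_deki
    return 1
}

if [ "${1:-}" = "--run" ]; then
    run_checks
fi
```

Logging the failed labels keeps the reporter's distinction visible: a `small-page` failure points at a full outage, `slow-page` at slowness, and `bmo-templates` at the Bugzilla connection.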
Comment 13•12 years ago
Thanks Jake!
Assignee
Comment 14•12 years ago
Nodes 2 and 3 are turned back up. The only additional work here is to replace MT/Deki with our own Django app, which is well out of scope for this bug... so I'm closing this out.
Status: NEW → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard