Closed Bug 714048 Opened 12 years ago Closed 12 years ago

MDN wiki is down

Categories

(mozilla.org Graveyard :: Server Operations, task)

Platform: x86, macOS
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: teoli, Assigned: nmaul)


The wiki is down right now:
[14:51:14.463] GET https://developer.mozilla.org/en/CSS/font [HTTP/1.1 500 Internal Server Error 1269ms]

(And it isn't a time-out this time).

Note that this happens several times a week, for about 10 minutes each time.
Summary: The wiki is down → MDN wiki is down
Severity: critical → blocker
Assignee: nobody → server-ops
Component: Deki Infrastructure → Server Operations
Product: Mozilla Developer Network → mozilla.org
QA Contact: infrastructure → cshields
Version: Deki → other
Assignee: server-ops → ashish
Confirmed. Django and phpbb are still up.
The zeus health check was returning a 500 and zeus failed out the nodes. I've fixed that and zeus shows the nodes as active and serving traffic.
Now I get the following, mixed with intermittent timeout-like "Server Unavailable" errors:
[15:55:09.766] GET https://developer.mozilla.org/en/CSS/font [HTTP/1.1 500 Internal Server Error 15525ms]

Site settings could not be loaded

We were unable to locate the API to request site settings. Please see below for debugging information. If this is a new install, try refreshing - the API is simply taking its time loading up!

HTTP Response Status Code: 500

An exception was thrown: Object reference not set to an instance of an object in MySql.Data
It is working again for me. Thank you for the help.

Could you explain what the problem was? (It may be useful if it happens again.)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I kicked dekiwiki and most page loads are quite snappy now.

I've cc'd jakem on this bug already. He'll take it soon after he's in today.
Status: RESOLVED → UNCONFIRMED
Ever confirmed: false
Resolution: FIXED → ---
Taking this and dropping prio since the actual outage is resolved (for now).

To be honest, I'm not sure what we can do here. We've been fighting MindTouch reliability issues for a long time, and have implemented just about every "fix" the vendor has recommended... including installation of a cron script that's supposed to detect failures and automatically kick deki. Even that doesn't seem to get the job done anymore. I guess the script just doesn't detect whatever underlying issue triggers the problem.

The 'checkdeki' script from MindTouch checks the host/test and host/status pages in dekiwiki. We could have a situation where perhaps those pages work fine, but actual wiki pages don't for some reason. I'm not sure what would cause that. However, we have a django-restart script in place that does this:

$CURL -s --connect-timeout 5 -m 15 -I -L -H "Host: developer.mozilla.org" http://localhost:81/en-US/ > /dev/null

It restarts Apache if that check fails. We could do something similar for Dekiwiki... the main difference between this and MT's "checkdeki" would be checking an actual page instead of the status page. Whether or not that will help, I can't really say.
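
A Dekiwiki equivalent of that check might look like the sketch below. The local URL, the `KICK_CMD` placeholder, and the function names are assumptions for illustration; the real restart command on these hosts isn't shown in this bug. Note that without curl's `-f` flag, a check like this only fails on timeouts or connection errors, not on an HTTP 500 response.

```shell
#!/bin/sh
# Sketch only: URL and restart command are hypothetical placeholders.
CURL="${CURL:-curl}"
CHECK_URL="${CHECK_URL:-http://localhost/en/CSS/font}"   # assumed local address of a real wiki page
KICK_CMD="${KICK_CMD:-echo restart-dekiwiki}"             # stand-in for the actual dekiwiki restart

# Request headers for one page with the same timeouts as the django check.
# The exit status is curl's: non-zero on timeout or connection failure.
check_page() {
    $CURL -s --connect-timeout 5 -m 15 -I -L -H "Host: developer.mozilla.org" "$1" > /dev/null
}

# Kick Dekiwiki only when the page check fails.
check_and_kick() {
    if ! check_page "$CHECK_URL"; then
        $KICK_CMD
    fi
}
```

Calling `check_and_kick` from cron would give the same behavior as the django-restart script, just pointed at a wiki page rather than the status page.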


We could also try just arbitrarily kicking deki every X minutes. This will help if the problem is something that builds up over time: periodically restarting should avoid a major outage, at the cost of a minor one (a few seconds). It won't help much if it's something that happens suddenly and breaks it right away. In that case it'd still be down until the cron job comes by and kicks it, and you'd still get the periodic small downtimes when it gets kicked but nothing is wrong.


The best fix will be migrating away from MindTouch, which is hopefully still on the plate. I know development on this has been underway for some time now, but don't know the current status. In any case, it's likely not close enough to be a quick fix.

Any thoughts on these options?
Assignee: ashish → nmaul
Severity: blocker → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
We're targeting Q1 for MindTouch -> django migration. It's ambitious but possible. We'll discuss MindTouch migration at next week's Wednesday MDN meeting.
Tomorrow I will look into a curl test/kick trigger for deki like django has. Once that's in place, we'll try adding 02 and 03 back into the pool.
Thanks Jake. I'll be out tomorrow until Tuesday. :teoli and :sheppy - which pages are good ones to check with curl?
:groovecoder 

I would say:

https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Date (a typical small page)

https://developer.mozilla.org/en/Mozilla_Quirks_Mode_Behavior (which takes a long time to generate)

and the test page (50 templates w/ link to bmo)

https://developer.mozilla.org/User:teoli/Bugzilla_test50

If the first times out, we have a big problem (as we did yesterday).
If the second times out, we have a slowness problem.
If the third times out, there may be a problem w/ the bmo connection, like earlier this week.

Right now, all three are working (in less than 5 seconds).
I have set this up, checking all 3 of those URLs. If any one of them fails (times out), it will kick dekiwiki on that server. This runs every 2 minutes, on the opposite minute from the MT 'checkdeki' script.
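
For reference, the logic is roughly as sketched below. This is not the deployed script: `BASE_URL`, `KICK_CMD`, and the function names are placeholders, and the curl timeouts are borrowed from the django check rather than confirmed for this one.

```shell
#!/bin/sh
# Sketch only: base URL and restart command are hypothetical placeholders.
CURL="${CURL:-curl}"
BASE_URL="${BASE_URL:-http://localhost}"          # assumed local address of the wiki
KICK_CMD="${KICK_CMD:-echo restart-dekiwiki}"     # stand-in for the actual dekiwiki restart

# Paths of the three pages suggested earlier in this bug.
PAGES="/en/JavaScript/Reference/Global_Objects/Date
/en/Mozilla_Quirks_Mode_Behavior
/User:teoli/Bugzilla_test50"

# Succeeds (returns 0) as soon as any one page check fails or times out.
any_page_fails() {
    for page in $PAGES; do
        if ! $CURL -s --connect-timeout 5 -m 15 -I -L -H "Host: developer.mozilla.org" "$BASE_URL$page" > /dev/null; then
            echo "check failed: $page" >&2
            return 0
        fi
    done
    return 1
}

# Kick Dekiwiki once if at least one of the pages failed.
check_all_and_kick() {
    if any_page_fails; then
        $KICK_CMD
    fi
}
```

A crontab schedule such as `1-59/2 * * * *` would run a script like this on odd minutes, assuming the vendor script runs on even ones.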

It's very late in the day now though, right before a 3-day weekend. I will turn 02 and 03 back on on Tuesday, rather than doing it now and leaving any fallout for on-call to handle.
Thanks Jake!
Nodes 2 and 3 are turned back up. The only additional work here is to replace MT/Deki with our own Django app, which is well out of scope for this bug... so I'm closing this out.
Status: NEW → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard