Closed Bug 1029541 Opened 11 years ago Closed 11 years ago

developer.mozilla.org is down again

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: davidwalsh, Assigned: nmaul)

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/422] )

Attachments

(1 file)

At roughly 9am this morning I discovered that MDN appeared down again, showing a "service unavailable" message. Working with :cyliang and :jezdez to fix.
Attached file zlb_event.log
This smells related to bug 1027052. The ZLB reports being unable to connect to the webheads (both production and dev). The HTTP error logs on all three production webheads report "server reached MaxClients setting, consider raising the MaxClients setting". Production began to recover once I forcibly restarted HTTP to clear connections.
What caused the outage: the toolbar gathers info about a document's contributors when the document loads. It loads them one by one, which can be slow. That is a performance problem, but a manageable one when only a small number of contributors have worked on a document. The scenario that caused the crash was multiple people opening, at around the same time, a document that had 209 contributors. It was simply too much work for the webhead to keep that many database connections open, and it brought the service as a whole to its knees. We're working on a fix now. I worked with jezdez to determine the root cause and a potential fix, which he'll implement shortly.
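For anyone reading along, here is a minimal sketch of the "one by one" loading pattern described above, using Python's standard sqlite3 module. It is not the kuma code: the table names (users, revision), the QueryCounter helper and both function names are hypothetical and exist only to show how the number of queries grows with the number of contributors.

```python
import sqlite3


class QueryCounter:
    """Hypothetical helper: runs queries and counts how many were issued."""

    def __init__(self, conn):
        self.conn = conn
        self.count = 0

    def query(self, sql, params=()):
        self.count += 1
        return self.conn.execute(sql, params).fetchall()


def build_db(num_contributors):
    """Create an in-memory database with one document (id 1) and one
    revision per contributor. The schema is an assumption for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute(
        "CREATE TABLE revision ("
        "id INTEGER PRIMARY KEY, document_id INTEGER, creator_id INTEGER)"
    )
    ids = range(1, num_contributors + 1)
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        [(i, f"user{i}") for i in ids],
    )
    conn.executemany(
        "INSERT INTO revision (document_id, creator_id) VALUES (1, ?)",
        [(i,) for i in ids],
    )
    conn.commit()
    return conn


def contributors_one_by_one(conn, document_id):
    """Load contributor names with one extra query per contributor."""
    db = QueryCounter(conn)
    creator_ids = [
        row[0]
        for row in db.query(
            "SELECT DISTINCT creator_id FROM revision WHERE document_id = ?",
            (document_id,),
        )
    ]
    names = []
    for creator_id in creator_ids:
        # One round trip to the database for every single contributor.
        rows = db.query("SELECT username FROM users WHERE id = ?", (creator_id,))
        names.append(rows[0][0])
    return names, db.count


if __name__ == "__main__":
    names, query_count = contributors_one_by_one(build_db(209), 1)
    print(len(names), "contributors loaded with", query_count, "queries")
```

With 209 contributors this issues one query for the IDs plus one per contributor, so every page view pays for 210 round trips, and several simultaneous viewers multiply that again.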
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/422]
https://github.com/mozilla/kuma/pull/2504 has various fixes for those serial queries: it does the ordering by count of revision creators on the database side in two queries (albeit risking large IN queries), and it no longer requires a JOIN on the user_profile table, which was caused by circular code used to generate the user's gravatar URL. Push coming soon.
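A rough sketch of the two-query approach described in the pull request comment, again with Python's standard sqlite3 module. This is not the code from the pull request: the schema (users, revision), the sample data and the function names are hypothetical. The first query lets the database count and order revisions per creator; the second fetches all the matching users at once with an IN clause.

```python
import sqlite3


def build_demo_db():
    """Hypothetical schema and data: three users, six revisions on document 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.execute(
        "CREATE TABLE revision ("
        "id INTEGER PRIMARY KEY, document_id INTEGER, creator_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "carol")],
    )
    conn.executemany(
        "INSERT INTO revision (document_id, creator_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 2), (1, 2), (1, 3), (1, 3), (2, 1)],
    )
    conn.commit()
    return conn


def top_contributors(conn, document_id):
    """Return (username, revision_count) pairs, most active first,
    using exactly two queries regardless of the number of contributors."""
    # Query 1: the database does the counting and the ordering.
    ranked = conn.execute(
        "SELECT creator_id, COUNT(*) AS revisions FROM revision "
        "WHERE document_id = ? "
        "GROUP BY creator_id "
        "ORDER BY revisions DESC, creator_id ASC",
        (document_id,),
    ).fetchall()
    if not ranked:
        return []

    # Query 2: fetch every contributor in one go. The IN list grows with
    # the number of contributors, which is the "large IN query" risk.
    creator_ids = [creator_id for creator_id, _ in ranked]
    placeholders = ", ".join("?" for _ in creator_ids)
    rows = conn.execute(
        f"SELECT id, username FROM users WHERE id IN ({placeholders})",
        creator_ids,
    ).fetchall()
    usernames = dict(rows)

    # Reapply the ranking from query 1, since IN does not preserve order.
    return [
        (usernames[creator_id], revisions)
        for creator_id, revisions in ranked
        if creator_id in usernames
    ]


if __name__ == "__main__":
    print(top_contributors(build_demo_db(), 1))
```

The trade-off is the one noted in the comment: the query count stays constant, but the IN list has one entry per contributor, so a very large document produces a very large statement.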
Assignee: server-ops-webops → nmaul
The pull request is merged and pushed. MDN is back up.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Confirmed and tested with the page that was causing the issues. All resolved. The page load went from 31.06 seconds (New Relic reporting at the time of the incident) to under 1.32 seconds (New Relic reporting after the patch). Big win for performance, and big win for the contributor bar!
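To put those two timings in context, here is a back-of-the-envelope calculation of how response time limits what a fixed pool of workers can serve. It assumes each request holds one worker for its full duration, and the MaxClients value of 256 is a hypothetical figure, not the setting on the MDN webheads.

```python
# Page load times reported in this bug (seconds).
SECONDS_BEFORE = 31.06
SECONDS_AFTER = 1.32  # reported as an upper bound ("< 1.32 seconds")

# Hypothetical worker limit; the real MaxClients value is not in this bug.
ASSUMED_MAX_CLIENTS = 256


def max_requests_per_second(workers, seconds_per_request):
    """Upper bound on throughput if every request occupies one worker
    for its whole duration."""
    return workers / seconds_per_request


def speedup(before, after):
    """How many times faster the page became."""
    return before / after


if __name__ == "__main__":
    before = max_requests_per_second(ASSUMED_MAX_CLIENTS, SECONDS_BEFORE)
    after = max_requests_per_second(ASSUMED_MAX_CLIENTS, SECONDS_AFTER)
    print(f"before the patch: at most {before:.1f} such requests per second")
    print(f"after the patch:  at most {after:.1f} such requests per second")
    print(f"speedup: at least {speedup(SECONDS_BEFORE, SECONDS_AFTER):.1f}x")
```

Under these assumptions, a handful of slow page views per second is enough to occupy every worker, which is consistent with the MaxClients message in the error logs.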
Status: RESOLVED → VERIFIED
Woohoo!
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard