Investigate MDN down-time incident 2015-09-17



4 years ago


(Reporter: groovecoder, Unassigned)






Comment 1

4 years ago
There was a large spike in time spent in $compare transaction(s) just before the outage.

All of it came from a user agent "mozilla" ... so we can't block it with the agent-blocking prevention we put in place for the last down-time.

We've seen this spike precede earlier down-times, so I'm going to take the most expensive part of the $compare transaction (tidying the HTML of the revisions) out of the HTTP request completely. (We had previously moved it to a cache-behind operation, so now I'm making it an asynchronous cache-only operation.)

:jakem - can you dig into why Apache seems to hit max connection limits(?) after long-running transactions like this? Our down-times are always a massive spike of "Request Queuing" in New Relic, and we seem to hit Apache connection limits far too often.


4 years ago
See Also: → bug 1205579
Commits pushed to master at
bug 1205667 - get_tidied_content can return blank

When a $compare request is made for a large revision,
we want to skip tidy_content and return a warning to the user,
so we don't block requests on the expensive tidy operation.
bug 1205667 - tests for get_tidied_content
Merge pull request #3497 from mozilla/never-tidy-in-compare-request-1205667

bug 1205667 - get_tidied_content can return blank
Keywords: in-triage

Comment 3

4 years ago
As with bug 1203528, this downtime was immediately preceded by a spike in $compare transaction CPU time. Based on, I'm going to call both incidents investigated and resolved, knowing that we still haven't quite cleaned up everything.
Last Resolved: 4 years ago
Resolution: --- → FIXED