Investigate MDN down-time incident 2015-09-17

RESOLVED FIXED

Status

Product: developer.mozilla.org
Component: General
Status: RESOLVED FIXED
Opened: 3 years ago
Modified: 2 years ago

People

(Reporter: groovecoder, Unassigned)

Tracking

({in-triage})

(Reporter)

Comment 1

3 years ago
There was a large spike in time spent in $compare transaction(s) before the outage:

https://rpm.newrelic.com/accounts/263620/applications/3172075/transactions#

All from the user agent "mozilla", so we can't block it with the user-agent blocking we put in place for the last down-time.

We've seen this spike ahead of previous down-times, so I'm going to take the most expensive part of the $compare transaction (tidying the HTML of the revisions) out of the HTTP request completely. (We had previously moved it to a cache-behind operation, so now I'm making it an asynchronous cache-only operation.)

:jakem - can you dig into why Apache seems to hit max connection limits(?) after long-running transactions like this? Our down-times are always a massive spike of "Request Queuing" in New Relic, and we seem to hit Apache connection limits far too often.
(Reporter)

Updated

3 years ago
See Also: → bug 1205579

Comment 2

3 years ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/8c546d9348331136cab4ad48f08bd3bfd8addeb0
bug 1205667 - get_tidied_content can return blank

When a $compare request is made for a large revision,
we want to skip tidy_content and return a warning to the user,
so we don't block requests on the expensive tidy operation.

https://github.com/mozilla/kuma/commit/315fbf3a13f3205205eb64e8fc590a53f0666c5a
bug 1205667 - tests for get_tidied_content

https://github.com/mozilla/kuma/commit/4f282be57ddfae08e1cb816857682a8e10eb8860
Merge pull request #3497 from mozilla/never-tidy-in-compare-request-1205667

bug 1205667 - get_tidied_content can return blank
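
For readers following the first commit above, this is a rough sketch of how a caller might cope with blank tidied content. It is an assumption-laden illustration, not the code from the pull request: `compare_revisions` and the warning text are hypothetical, and only the idea of "return a warning instead of blocking on tidy" comes from the commit message.

```python
import difflib

# Hypothetical warning text; the real message shown to users may differ.
NOT_READY_WARNING = "Tidied content is not available yet; please try again shortly."


def compare_revisions(tidied_from, tidied_to):
    """Return (diff_lines, warning); either tidied input may be None (blank)."""
    if tidied_from is None or tidied_to is None:
        # Skip the diff rather than running the expensive tidy in the request.
        return [], NOT_READY_WARNING
    diff = difflib.unified_diff(
        tidied_from.splitlines(),
        tidied_to.splitlines(),
        lineterm="",
    )
    return list(diff), None
```

The view can then render the warning when it is set and the diff otherwise, so the request returns quickly in both cases.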
Keywords: in-triage
(Reporter)

Comment 3

2 years ago
As with bug 1203528, this downtime was immediately preceded by a spike in $compare transaction CPU time. Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1203528#c3, I'm going to call both incidents investigated and resolved, knowing that we still haven't quite cleaned up everything.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED