There was a large spike in time spent in $compare transactions before the outage: https://rpm.newrelic.com/accounts/263620/applications/3172075/transactions#

All of them came from a user agent "mozilla" ... so we can't block it with the agent-blocking prevention we put in place for the last down-time.

We've seen this spike ahead of previous down-times, so I'm going to take the most expensive part of the $compare transaction (tidying the HTML of the revisions) out of the HTTP request completely. (We had previously moved it to a cache-behind operation, so now I'm making it an asynchronous cache-only operation.)

:jakem - can you dig into why Apache seems to hit max connection limits(?) after long-running transactions like this? Our down-times always show up as a massive spike of "Request Queuing" in New Relic, and we seem to hit Apache connection limits far too often.
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/8c546d9348331136cab4ad48f08bd3bfd8addeb0
bug 1205667 - get_tidied_content can return blank

When a $compare request is made for a large revision, we want to skip tidy_content and return a warning to the user, so we don't block requests on the expensive tidy operation.

https://github.com/mozilla/kuma/commit/315fbf3a13f3205205eb64e8fc590a53f0666c5a
bug 1205667 - tests for get_tidied_content

https://github.com/mozilla/kuma/commit/4f282be57ddfae08e1cb816857682a8e10eb8860
Merge pull request #3497 from mozilla/never-tidy-in-compare-request-1205667

bug 1205667 - get_tidied_content can return blank
As with bug 1203528, this downtime was immediately preceded by a spike in $compare transaction CPU time. Based on https://bugzilla.mozilla.org/show_bug.cgi?id=1203528#c3, I'm going to call both incidents investigated and resolved, knowing that we still haven't quite cleaned up everything.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED