Closed Bug 1190171 Opened 9 years ago Closed 9 years ago

Investigate MDN down-time incident 2015-08-02

Categories

(developer.mozilla.org Graveyard :: Code Cleanup, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: jezdez)

References

Details

(Keywords: in-triage)

Attachments

(4 files)

Summary: Investigate MDN down-time incident 2015-07-02 → Investigate MDN down-time incident 2015-08-02
Attached image load on web1
Seems like a high load on the web node.
Attached image full trace details
Looks like this happened somewhere while rendering the template, which could indicate an issue with the revision_diff template macro call we do in there. The user agent of the longest-running transactions is wget on Cygwin.

Sadly we can't see further down the stack, but from looking at the Jinja2 macro code, the heavy lifting of that macro is done in the diff_table template helper.

That helper in turn calls out to tidylib to clean up the HTML of the two revisions being compared, and then builds a diff table with the difflib library.

I don't understand why we're running tidy on those revisions on every request, tbh. We can easily just store that info in the revision table on save and use that during rendering instead.

Additionally I think we should look into other ways to create the diff table, surely there are faster libraries than the one in the stdlib.

Lastly, I think we should rate-limit the compare revisions view to prevent it from becoming a hot spot for scrapers that aren't acting honorably.
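For reference, the hot path described above (tidy both revisions, then diff them line by line) boils down to something like this stdlib-only sketch. `tidy_html` and `diff_table` here are illustrative stand-ins, not kuma's actual helpers; the real code calls pytidylib instead of the no-op below:

```python
import difflib

def tidy_html(html):
    # Stand-in for the tidylib call the real helper makes;
    # pytidylib's tidy_document() would normalize the markup here.
    return html.strip()

def diff_table(content_from, content_to):
    """Build an HTML diff table for two revisions, roughly like the wiki helper."""
    html_diff = difflib.HtmlDiff(wrapcolumn=90)
    from_lines = tidy_html(content_from).splitlines()
    to_lines = tidy_html(content_to).splitlines()
    # make_table() does a full sequence comparison of the two line lists,
    # which is why running it (plus tidy) on every request is expensive
    # for large revisions.
    return html_diff.make_table(from_lines, to_lines)

table = diff_table("<p>old</p>", "<p>new</p>")
```

Doing this per request means every page view pays the tidy-plus-diff cost, which is exactly what the denormalization below avoids.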
Assignee: nobody → jezdez
Status: NEW → ASSIGNED
Blocks: 1177264
We made an early technical decision to maintain complete data fidelity to what the client sends to the back-end to store in the database. It's the same reason we sanitize HTML content on its way out of the database instead of sanitizing it before it goes into the database.

Changing that policy could introduce other - potentially more serious - bugs or issues wherein buggy code (ours or a library) alters HTML content and "corrupts" data in the database. E.g., an html5lib update could trigger passive manipulations of content in the database, making bugs like bug 1156936 require data clean-up in addition to code cleanup.

So, having said all that, I agree we should look for other ways to create the diff table, and rate-limit the compare revisions view. And I suggest we should file a follow-up bug to investigate the consequences of manipulating content on its way **into** the database.
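As a toy illustration of that sanitize-on-the-way-out policy: the raw content is stored exactly as the client sent it, and cleaning only happens at render time, so a sanitizer upgrade (or bug) can never corrupt stored data. The real code uses bleach; `sanitize` and this `Revision` class are simplified stand-ins:

```python
import html

def sanitize(raw):
    # Stand-in for the bleach-based sanitizer kuma actually uses;
    # escaping everything is the crudest possible "clean".
    return html.escape(raw)

class Revision:
    """Toy model: the database keeps exactly what the client sent."""
    def __init__(self, raw_content):
        self.raw_content = raw_content  # full fidelity, never altered

    def rendered_content(self):
        # Sanitization happens on the way *out* of the database.
        return sanitize(self.raw_content)

rev = Revision('<script>alert(1)</script>')
assert rev.raw_content == '<script>alert(1)</script>'  # stored as sent
assert '<script>' not in rev.rendered_content()        # cleaned at render
```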
Okay, I've been working on a patch for this today and can share some findings later in a PR.

As to the notion of always storing the highest-fidelity user content: with all due respect, I think that's short-sighted and ignores the added computational cost (and effectively slower rendering) of relying on that kind of raw data.

The complexity of the code that does this content processing right now, all the time (e.g. sample code extraction, kuma.wiki.content), and the use of *four* different libraries to do it (bleach, html5lib, tidy and lxml) boggles my simplistic mind. I consider this architectural decision to be the source of many of our needs for caching and denormalization, which weren't addressed with a central architecture but with many different techniques layered on over time. All in all, I'm not impressed by that early technical decision and strongly believe we need to overcome it, e.g. by adding a single API for caching and async invalidation via cacheback.

There is no reason not to run tidy on the content from the client during save, since tidy has been stable for a long while. Using write-through techniques to fill a denormalization column with the tidied revision content is easy to implement, and the column can also be backfilled with an async task if needed. Even in the unlikely case that tidy gets a drastic update, or another issue along those lines comes up, we have the ability to mass-update the denormalized fields with Celery, as proven by the mass search-reindex ability.
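The write-through pattern described here can be sketched in a few lines. This is a framework-free illustration under stated assumptions: `tidy_html` stands in for tidylib, and the class and field names are modeled on (but are not) kuma's actual code:

```python
def tidy_html(raw):
    # Stand-in for tidylib; the point is that it's deterministic,
    # so it's safe to run once at save time instead of per request.
    return raw.strip().lower()

class Revision:
    """Toy revision with a denormalized, tidied copy of the content."""
    def __init__(self, content, tidied_content=""):
        self.content = content              # raw, full-fidelity content
        self.tidied_content = tidied_content  # denormalized column

    def save(self):
        # Write-through: fill the denormalized column on every save, so
        # the compare view never has to run tidy at request time.
        self.tidied_content = tidy_html(self.content)

def backfill(revisions):
    # Older rows without the column filled can be handled lazily
    # (in kuma, via an async Celery task).
    for rev in revisions:
        if not rev.tidied_content:
            rev.save()

rev = Revision("<P>Hello</P>")
rev.save()
assert rev.tidied_content == "<p>hello</p>"
```

The raw `content` is never touched, so this denormalization is compatible with the full-fidelity storage policy discussed above.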
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/0ff9858e5205013a6431b92522075ba349b56d8c
Bug 1190171 - Refactor wiki signal handling.

This makes use of Django's new app-loading classes, which are guaranteed to connect the signals at startup; that is the recommended way of doing this.

This also moves the search index config into the wiki app.
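The refactor presumably follows Django's standard AppConfig pattern, roughly like this config fragment (the module path, class name, and `signal_handlers` module are illustrative, not kuma's actual layout):

```python
# wiki/apps.py -- Django 1.7+ style application config
from django.apps import AppConfig

class WikiConfig(AppConfig):
    name = "kuma.wiki"

    def ready(self):
        # Importing the signal handlers here guarantees they are
        # connected exactly once, at application startup, instead of
        # relying on a module import happening as a side effect.
        from . import signal_handlers  # noqa: F401
```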

https://github.com/mozilla/kuma/commit/321e7d07e7d7a408592f6e662e0f4ba0b4532d3d
Bug 1190171 - Ratelimit the compare revisions view by user or IP to once per second.
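The view-level limit likely relies on an existing decorator (django-ratelimit supports keying by user or IP), but semantically "once per second per user or IP" amounts to this minimal, self-contained sketch; `RateLimiter` is a stand-in, not kuma's code:

```python
import time

class RateLimiter:
    """Allow at most one call per key per `interval` seconds."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self.last_seen = {}  # key (user id or IP) -> time of last allowed call

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(key)
        if last is not None and now - last < self.interval:
            return False  # a view would answer this with HTTP 429
        self.last_seen[key] = now
        return True

limiter = RateLimiter(interval=1.0)
assert limiter.allow("10.0.0.1", now=0.0) is True
assert limiter.allow("10.0.0.1", now=0.5) is False  # too soon
assert limiter.allow("10.0.0.1", now=1.5) is True
```

Keying by user *or* IP means authenticated users are throttled individually while anonymous scrapers share a per-IP budget.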

https://github.com/mozilla/kuma/commit/032e743a579d9072379778ed80370f3bb67d3b04
Bug 1190171 - Prefetch document and tags in compare revisions view.

That reduces the number of queries being done when rendering the view.

https://github.com/mozilla/kuma/commit/e588c670aae9afdfc4c063d09ad0a68d4d825bcb
Bug 1190171 - Cleanup of wiki helpers.

https://github.com/mozilla/kuma/commit/6fe6b468b8f4a73cc348315732ee335bd291230f
Bug 1190171 - Work around a bug in Jingo that can't handle new style INSTALLED_APPS items.

https://github.com/mozilla/kuma/commit/5afacc65e75964c42d4b934e2408db241f3d1690
Bug 1190171 - Add tidied_content field to Revision model.

This will be filled on post_save via a Celery task, and occasionally on demand for older revisions (also via Celery). The task is rate-limited at a level that still lets it get through the ~550k revisions rather quickly.
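A backfill over ~550k rows is typically chunked so the rate limit applies per task rather than per row. A rough, hypothetical sketch of the batching arithmetic (function names and chunk size are illustrative, not kuma's actual values):

```python
def chunked(ids, size):
    """Yield successive fixed-size chunks of revision IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def schedule_backfill(revision_ids, chunk_size=1000):
    # Each chunk would become one Celery task; a task-level rate limit
    # spaces the chunks out so the DB isn't hammered, while a chunk
    # size of 1000 still covers ~550k revisions in ~550 tasks.
    return [list(chunk) for chunk in chunked(revision_ids, chunk_size)]

batches = schedule_backfill(list(range(550_000)))
assert len(batches) == 550
assert len(batches[0]) == 1000
```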

https://github.com/mozilla/kuma/commit/9120a26499879eaa1001cca5eb42a9247993e30e
Merge pull request #3386 from mozilla/bug1190171

Bug 1190171 - Revision content denormalization and other cleanups of the compare revision view.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
The revision feed and compare view are not using CPU time anymore.
Product: developer.mozilla.org → developer.mozilla.org Graveyard