Closed Bug 1271509 Opened 8 years ago Closed 7 years ago

Provide an MDN database with sample data

Categories

(developer.mozilla.org Graveyard :: Code Cleanup, enhancement)

Platform: All
OS: Other
Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jwhitlock, Assigned: jwhitlock)

Details

(Whiteboard: [specification][type:feature])

What problem would this feature solve?
======================================
The MDN installation instructions leave a developer with a working but empty database. In order to do useful work, they need to set configuration values, add user accounts, import pages and kumascript macros, and enter other sample data.

This situation also limits automated integration testing of MDN. A server could be automatically provisioned from a PR branch for testing. Without useful data in the database, there won't be anything for a Selenium-driven browser to check.

Who has this problem?
=====================
Core contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Developers are able to follow the install instructions, but start having difficulty when they begin enabling KumaScript, exploring content, or reproducing bugs.

How are the users identified above solving this problem now?
============================================================
An anonymized copy of the production database is periodically generated and distributed on an as-needed basis. This is overkill for most development needs, and it weighs in at 11 GB uncompressed. Bugs in the anonymizing process risk leaking personally identifiable information (PII), so distribution is limited to developers who need it.

Alternatively, data from the MDN website is manually copied into the empty development database, using the APIs or by opening pages for editing to copy the raw data.

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
A script can use the MDN APIs to download select data:

* Key pages from MDN, such as the Firefox landing page and pages linked from the homepage and toolbars
* Historical revisions for key pages
* Some or all translations for key pages
* Public user profile data for related revisions
* Kumascript macros
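
For illustration, the download step could be as simple as hitting the per-document wiki API. This is a minimal sketch, assuming the "$json" (metadata) and "?raw" (unrendered source) endpoints and guessing at the JSON keys; it is not the scraper that was eventually written:

    # Minimal sketch: fetch one document's metadata and raw wiki source.
    # The endpoint suffixes and JSON keys are assumptions, not a spec.
    import requests

    BASE = "https://developer.mozilla.org"

    def fetch_page(locale, slug):
        doc_url = "{}/{}/docs/{}".format(BASE, locale, slug)
        meta = requests.get(doc_url + "$json").json()        # title, tags, translations, ...
        raw = requests.get(doc_url, params={"raw": 1}).text  # wiki source, not rendered HTML
        return meta, raw

    meta, raw = fetch_page("en-US", "Mozilla/Firefox")
    print(meta.get("title"), len(raw))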

Additional scripts can manually add other features, such as:
* Moved and redirected pages
* Zones with custom CSS
* Sample user and IP bans
* Basic configuration
* Waffle flags
* Groups
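
A rough sketch of one such additional script, assuming django-waffle for the flags and Django auth groups; apart from the "kumaediting" flag mentioned in the install docs, the names are made up:

    # Seed non-scraped sample data: a waffle flag and a couple of groups.
    # Group names are hypothetical; "kumaediting" comes from the install docs.
    from django.contrib.auth.models import Group
    from waffle.models import Flag

    def create_sample_settings():
        # Enable wiki editing for everyone in the development database.
        Flag.objects.get_or_create(name="kumaediting", defaults={"everyone": True})

        for name in ("Admins", "Reviewers"):
            Group.objects.get_or_create(name=name)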

The script can run periodically to sync with production changes, and the sample database can be published as a SQLite database or a database dump for developers and automation tools to download, install, and customize.

Alternatively, the sample data can be saved as a collection of Django fixtures, checked into the project, and periodically refreshed via the script.
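
A hedged sketch of that round trip, with hypothetical app labels and fixture path:

    # Dump selected apps to a fixture that can be checked into the repo, then
    # reload it into a fresh database. App labels and the path are placeholders.
    from django.core.management import call_command

    def dump_sample_fixtures():
        call_command("dumpdata", "wiki", "users", "waffle",
                     indent=2, output="kuma/sample_data.json")

    def load_sample_fixtures():
        call_command("loaddata", "kuma/sample_data.json")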

Is there anything else we should know?
======================================
Generating mock data may work for some features, but not for all. For example, KumaScript templates are written in JavaScript, and generating useful scripts would not be worth the coding effort compared to downloading working scripts from MDN.
Commit pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/23b04895f7bdee2b86d4fa0ed5d54c0f8f39ed85
bug 1271509: Update install to use sample database

* Add details for installing Docker on Linux
* Switch database instructions to use the sample database
* Update Front-end instructions to go in order (install node.js, then
  gulp, and optionally install globally), move below viewing the
  homepage.
* Drop instructions for adding the "kumaediting" flag (in sample DB)
* Drop instructions for setting KUMASCRIPT_TIMEOUT (in sample DB)
* Mark GitHub Auth as optional, and move admin password disabling into
  that section.
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/1199dd428fba32ccae2cce07552276629c49a591
bug 1271509: Add scrape_user command and framework

Add a content scraping framework for adding or updating data in a local
Kuma instance. The first user of the framework is the management command
scrape_user, which can mirror a production user locally, using only
public data on the user's profile.
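
For example, mirroring a single production user might look like the following; the positional username argument is an assumption about the command's interface, so check its --help for the real arguments:

    # Hypothetical invocation from a Django shell; the argument form is assumed.
    from django.core.management import call_command

    call_command("scrape_user", "jwhitlock", verbosity=2)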

https://github.com/mozilla/kuma/commit/873363f382c8e9c63b60644089b4b0e758796230
bug 1271509: Fix scraping of banned users

https://github.com/mozilla/kuma/commit/c3cdc7a9964871993fc10065d1af6068a748b66c
bug 1271509: Strong assertion that session is same

https://github.com/mozilla/kuma/commit/ce08787f8c7808f66a7af6931d3a968a15031e8b
Merge pull request #4245 from jwhitlock/scrape_user_1271509

bug 1271509: Add scrape_user command and framework
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/ceeba3df03ea0dfe8c866e1c3029439800cf3835
bug 1271509: Reduce user test fixtures

Any fixture that touches the database has to be destroyed and recreated
with each test, so there is no benefit to capturing HTML or one-time
model setup as a fixture.
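
A sketch of the distinction, assuming pytest-django style fixtures (names and content are illustrative):

    import pytest
    from django.contrib.auth import get_user_model

    @pytest.fixture(scope="session")
    def profile_html():
        # Static HTML never touches the database, so it is safe to build once.
        return "<html><body><h1>jwhitlock</h1></body></html>"

    @pytest.fixture
    def wiki_user(db):
        # Database rows are torn down and recreated for every test, so a
        # wider scope would not help here.
        return get_user_model().objects.create(username="wiki_user")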

https://github.com/mozilla/kuma/commit/7953547884e8948de154cc7fb30bf340f614563d
bug 1271509: Improve scraper reporting, cycling

Use pre-assembled format strings to report progress on a source in the
scraper, and make it a little clearer that repeat=True is being set to
detect whether the scraper should loop again.

Process the sources in reverse order. This makes it more likely that the
new sources will be done, so that the old sources that needed them can
complete. This reduces a full scrape from 18 cycles to 13, but it still
takes about the same time (40 minutes).
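
A simplified sketch of that cycling behavior (class and attribute names are hypothetical, not Kuma's actual scraper API):

    # Keep cycling over the sources until a pass discovers nothing new.
    def scrape(sources, max_cycles=20):
        for cycle in range(1, max_cycles + 1):
            repeat = False
            # Newest sources first, so prerequisites discovered on the last
            # pass tend to finish before the older sources that need them.
            for source in reversed(list(sources)):
                if source.state in ("DONE", "ERROR"):
                    continue
                new_prereqs = source.gather()   # may discover more sources
                sources.extend(new_prereqs)
                if new_prereqs or source.state not in ("DONE", "ERROR"):
                    repeat = True               # at least one more pass needed
            if not repeat:
                break
        return sources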

https://github.com/mozilla/kuma/commit/879734650db50b38575d675b5622f225eae574ff
bug 1271509: Add ./manage.py scrape_document

Add the ability to scrape a document, which also includes scraping:

* The rendered page (to detect zones)
* The ancestor pages (parent and higher)
* Metadata and Translations (optional)
* Revisions (configurable)
* Child documents (optional)
* The top-level zone document

It does not include attachments.

https://github.com/mozilla/kuma/commit/e9045a9985bba8717859827bce8b7e4d576e4c2c
bug 1271509: Remove unneeded .get() for verbosity

Verbosity is included in all Django management commands. Assume it is
available, like scrape_user does.
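
In other words, handle() can index the option directly; a minimal illustration (the command itself is a made-up example):

    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = "Hypothetical example command"

        def handle(self, *args, **options):
            verbosity = options["verbosity"]   # always present; no .get() fallback needed
            if verbosity >= 2:
                self.stdout.write("verbose output enabled")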

https://github.com/mozilla/kuma/commit/c484b30b78a80b409ca01f4407f347b365f3f273
bug 1271509: Refactor load_prereq_parent_topic

Convert the optional parent_topic loader to look more like other
optional loaders.

https://github.com/mozilla/kuma/commit/e07448592b299498f8997fdf9ca0d276f4f643d2
Merge pull request #4248 from jwhitlock/scrape_document_1271509

bug 1271509: Add ./manage.py scrape_document

r=rjohnson
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/4dd193f822af6c82306a2db1bb70efc838eb07bc
bug 1271509: Add ./manage.py scrape_links

Scrape the documents linked from a page, such as the homepage.

https://github.com/mozilla/kuma/commit/e051f21e40779c155bfc110a1b933ddac051430b
bug 1271509: Add ./manage.py sample_mdn

Add a management command to populate a database with fixtures and with
data scraped from MDN.

https://github.com/mozilla/kuma/commit/338113cff7fbf490ac079770dd4a54e53cae690a
bug 1271509: Add script to create sample database

Add a script and specification file to create the sample database used
for MDN development and integration testing.

https://github.com/mozilla/kuma/commit/fae8163b906b9b234a6ec87debfd16a552286c93
bug 1271509: Handle 404s for document_rendered

One common case is that the document metadata refers to a deleted
translation. Instead of halting the scrape with an error, set the
document_rendered source to state=ERROR.

https://github.com/mozilla/kuma/commit/ea9944af9b73d44bda8da865de72a4757304a4dd
bug 1271509: Fix past revision clearing doc.html

When a new revision is created, even with Revision.objects.get_or_create,
the Revision.save() method is called. This surprised me; I thought it was
skipped, but that is only true for bulk creation.

If is_approved=True (the default), the save() method makes the revision
the current revision, which sets Document.html to the revision.content
(default empty string).

I've fixed the storage code to save the initial revision with
is_approved=False, so that we don't accidentally clear doc.html when
adding a historical revision.
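
Roughly, the fix stores historical revisions unapproved. This is a simplified sketch, not the actual scraper storage code, and the data keys are assumptions:

    def store_historical_revision(document, data):
        from kuma.wiki.models import Revision

        # is_approved defaults to True, and Revision.save() (called even via
        # get_or_create) then makes the revision current and copies its content
        # into Document.html. Store historical revisions unapproved so they
        # cannot clobber doc.html.
        revision, _ = Revision.objects.get_or_create(
            document=document,
            id=data["id"],
            defaults={
                "content": data["content"],
                "creator": data["creator"],
                "is_approved": False,
            },
        )
        return revision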

https://github.com/mozilla/kuma/commit/424895e95bf7e0d832c918bf8a994c18c923179d
bug 1271509: Update test fixtures for rev.save()

Fixtures assumed that Revision.objects.create() did not call .save(),
and that document.current_revision had to be set directly. Update the
fixture setup to match my new understanding.

https://github.com/mozilla/kuma/commit/d61bca617d5bc0d3b1a2cc807a1db6c4d910c8e9
bug 1271509: Move error logging to the scraper

Instead of logging errors in Source.gather, capture the error so that it
can be logged by the scraper, and so that a summary of errors can be
printed at the end of the run.

https://github.com/mozilla/kuma/commit/d59069582cbf1fa779480fc63330429b4c80a987
bug 1271509: document_history detects all history

If the document history has fewer revisions than requested, then remember
that all revisions have been scraped.

https://github.com/mozilla/kuma/commit/08818cf5aa53833b5de97a233a1a64c99fe905a4
bug 1271509: Handle doc.current_rev is not latest

At the end of document scraping, check that the document has a
current_revision. If it doesn't, scrape more revisions and/or more
history, until we find a current revision or run out of revisions.

Page moves record the move in the most recent revision, but leave the
n-1 revision as the current_revision. The default for ./scrape_document
and others is to scrape just one document, leaving these stuck with no
content.

https://github.com/mozilla/kuma/commit/e5de46c67ebfa3cd72c3ae74c709908bd624be9c
bug 1271509: Omit profiles when scraping links

When scraping a page for Document URLs, omit the links to profiles, such
as the contributor bar.

https://github.com/mozilla/kuma/commit/d6cec60826b274f7e4af36761abaa4d5d2cdd0eb
bug 1271509: Fix spelling

https://github.com/mozilla/kuma/commit/b8af1b57169414f515603e64953b533b5c451af6
bug 1271509: Use dotted shortcut for get_model

https://github.com/mozilla/kuma/commit/0cd965362e12456363f457105800c2ccf6493e53
bug 1271509: Remove redundant urlparse

https://github.com/mozilla/kuma/commit/3d6abcd60e78bb4b58b21cb58017a36efc423dc3
bug 1271509: Fix docs for when dependency needed

https://github.com/mozilla/kuma/commit/22b7ab29740b82f43b05956deaaf0b0b1b5f0a2b
bug 1271509: Improve scrape management commands

- Update scrape_links docstring for abilities beyond the homepage
- Use options['verbosity'], since always populated
- Use CommandError to report source failures
- Remove unneeded parentheses
- Close specification file after processing it
- Fix pluralization of incomplete sources message

https://github.com/mozilla/kuma/commit/6b29825d3441b95ee96f82509d3e8ee9062f34a2
Merge pull request #4076 from jwhitlock/sample_database_wip_1271509

Bug 1271509: Sample database and production scraper
Assignee: nobody → jwhitlock
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/3e559737dfa02ef82a23fd69283fc1480c6df17b
bug 1271509: Beta style in CKEditor

Use the design style in CKEditor, if the user is in the beta program.

https://github.com/mozilla/kuma/commit/3b5f6a521ad9f2fc5631d6e9501a5a0a29a1e10c
Merge pull request #4307 from jwhitlock/zilla-editor-1271509

bug 1271509: Beta style in CKEditor
Product: developer.mozilla.org → developer.mozilla.org Graveyard