Provide an MDN database with sample data


Code Cleanup
2 years ago
10 months ago


(Reporter: jwhitlock, Assigned: jwhitlock)



(Whiteboard: [specification][type:feature])



2 years ago
What problem would this feature solve?
The MDN installation instructions leave a developer with a working but empty database. In order to do useful work, they need to set configuration values, add user accounts, import pages and kumascript macros, and enter other sample data.

This situation also limits automated integration testing of MDN. A server could be automatically provisioned from a PR branch for testing. Without useful data in the database, there won't be anything for a Selenium-driven browser to check.

Who has this problem?
Core contributors to MDN

How do you know that the users identified above have this problem?
Developers are able to follow the install instructions, but start having difficulty when they begin enabling KumaScript, exploring content, or reproducing bugs.

How are the users identified above solving this problem now?
An anonymized copy of the production database is periodically generated. This is overkill for most development needs, and weighs in at 11 GB uncompressed. Bugs in the anonymizing process risk leaking personally identifiable information (PII), so distribution is limited to a need-to-have-it basis.

Alternatively, data from the MDN website is manually copied to the empty development database, either by using the APIs or by opening pages for editing and copying the raw data.

Do you have any suggestions for solving the problem? Please explain in detail.
A script can use the MDN APIs to download select data:

* Key pages from MDN, such as the Firefox landing page and pages linked from the homepage and toolbars
* Historical revisions for key pages
* Some or all translations for key pages
* Public user profile data for related revisions
* Kumascript macros

Additional scripts can manually add other features, such as:
* Moved and redirected pages
* Zones with custom CSS
* Sample user and IP bans
* Basic configuration
* Waffle flags
* Groups

The script can run periodically to sync with production changes, and the sample database can be published as a SQLite database or a database dump for developers or automation tools to download, install, and customize.

Alternatively, the sample data can be saved as a collection of Django fixtures, checked into the project, and periodically refreshed via the script.

Is there anything else we should know?
Generated mock data may work for some features, but not for all. For example, Kumascript templates are written in JavaScript, and generating useful scripts would not be worth the coding effort compared to downloading working scripts from MDN.

Comment 1

a year ago
Commit pushed to master at
bug 1271509: Update install to use sample database

* Add details for installing Docker on Linux
* Switch database instructions to use the sample database
* Update Front-end instructions to go in order (install node.js, then
  gulp, and optionally install globally), move below viewing the
* Drop instructions from adding "kumaediting" flag (in sample DB)
* Drop instructions for setting KUMASCRIPT_TIMEOUT (in sample DB)
* Mark GitHub Auth as optional, and move admin password disabling into
  that section.

Comment 2

11 months ago
Commits pushed to master at
bug 1271509: Add scrape_user command and framework

Add a content scraping framework, for adding or updating data to a local
Kuma instance. The first user of the framework is the management command
scrape_user, which can mirror a production user locally, using only
public data on the user's profile.
bug 1271509: Fix scraping of banned users
bug 1271509: Strong assertion that session is same
Merge pull request #4245 from jwhitlock/scrape_user_1271509

bug 1271509: Add scrape_user command and framework

Comment 3

11 months ago
Commits pushed to master at
bug 1271509: Reduce user test fixtures

Any fixture that touches the database has to be destroyed and recreated
with each test, so there is no benefit to capturing HTML or one-time
model setup as a fixture.
bug 1271509: Improve scraper reporting, cycling

Use pre-assembled format strings to report the progress on a source in
the scraper, and make it a little clearer that repeat=True is getting
set to detect if the scraper should loop again.

Process the sources in reverse order. This makes it more likely that the
new sources will be done, so that the old sources that needed them can
complete. Reduces a full scrape from 18 cycles to 13, but still about
the same time (40 minutes).
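
The effect of processing order on the number of cycles can be illustrated with a toy model. This is a minimal sketch in Python 3, not Kuma's scraper: the `Source` class, its `needs` list, and `run_scrape` are hypothetical names, and it assumes that a source can only finish once its prerequisites are done and that newly discovered sources are appended to the end of the collection.

```python
class Source:
    """Toy stand-in for a scraper source (hypothetical, not Kuma's class)."""

    def __init__(self, name, needs=()):
        self.name = name
        self.needs = list(needs)  # names of sources that must finish first
        self.done = False

    def gather(self, sources):
        """Finish if every prerequisite is done; otherwise request them."""
        missing = [n for n in self.needs
                   if n not in sources or not sources[n].done]
        if not missing:
            self.done = True
            return []
        # Ask the scraper to add any prerequisite it has not seen yet.
        return [Source(n) for n in missing if n not in sources]


def run_scrape(initial, reverse):
    """Cycle over the sources until a full pass makes no progress."""
    sources = {s.name: s for s in initial}  # dicts keep insertion order
    cycles = 0
    repeat = True
    while repeat:
        repeat = False
        cycles += 1
        order = list(sources.values())
        if reverse:
            order.reverse()  # newest sources first
        for source in order:
            if source.done:
                continue
            new_sources = source.gather(sources)
            for new in new_sources:
                sources[new.name] = new
            if source.done or new_sources:
                repeat = True  # something changed, so loop again
    return cycles


forward = run_scrape([Source("doc", needs=["user", "parent"])], reverse=False)
backward = run_scrape([Source("doc", needs=["user", "parent"])], reverse=True)
print(forward, backward)
```

In this toy case the reversed order finishes in one fewer pass, because the prerequisites added last are completed before the source that was waiting on them is visited again.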
bug 1271509: Add ./ scrape_document

Add the ability to scrape a document, which also includes scraping:

* The rendered page (to detect zones)
* The ancestor pages (parent and higher)
* Metadata and Translations (optional)
* Revisions (configurable)
* Child documents (optional)
* The top-level zone document

It does not include attachments.
bug 1271509: Remove unneeded .get() for verbosity

Verbosity is included in all Django management commands. Assume it is
available, like scrape_user does.
bug 1271509: Refactor load_prereq_parent_topic

Convert the optional parent_topic loader to look more like other
optional loaders.
Merge pull request #4248 from jwhitlock/scrape_document_1271509

bug 1271509: Add ./ scrape_document


Comment 4

10 months ago
Commits pushed to master at
bug 1271509: Add ./ scrape_links

Scrape the documents linked from a page, such as the homepage.
bug 1271509: Add ./ sample_mdn

Add a management command to populate a database with fixtures and with
data scraped from MDN.
bug 1271509: Add script to create sample database

Add a script and specification file to create the sample database used
for MDN development and integration testing.
bug 1271509: Handle 404s for document_rendered

One common case is that the document metadata refers to a deleted
translation. Instead of halting scrape with error, set the
document_rendered source to state=ERROR.
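
A minimal sketch of this behaviour, in Python 3 and under stated assumptions: `DocumentRenderedSource`, its state constants, and the `fetch` callable (returning a status code and a body) are hypothetical stand-ins, not the names or signatures used in Kuma. Raising on other status codes is also an assumption made for the example.

```python
class DocumentRenderedSource:
    """Hypothetical stand-in for a source that fetches a rendered page."""

    STATE_INIT = "INIT"
    STATE_DONE = "DONE"
    STATE_ERROR = "ERROR"

    def __init__(self, path):
        self.path = path
        self.state = self.STATE_INIT
        self.html = None

    def gather(self, fetch):
        """Fetch the page; a 404 marks this source as failed, not the run."""
        status, body = fetch(self.path)
        if status == 404:
            # For example, metadata that points at a deleted translation.
            self.state = self.STATE_ERROR
        elif status == 200:
            self.html = body
            self.state = self.STATE_DONE
        else:
            # Assumption: any other status is unexpected and still halts.
            raise RuntimeError(
                "Unexpected status %s for %s" % (status, self.path))
        return self.state


def fake_fetch(path):
    """Pretend HTTP client that only knows one page."""
    pages = {"/en-US/docs/Web": "<html>Web</html>"}
    if path in pages:
        return 200, pages[path]
    return 404, ""


found = DocumentRenderedSource("/en-US/docs/Web")
deleted = DocumentRenderedSource("/fr/docs/Deleted")
print(found.gather(fake_fetch), deleted.gather(fake_fetch))
```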
bug 1271509: Fix past revision clearing doc.html

When a new revision is created, even w/ Revision.objects.get_or_create,
the method is called. This surprised me; I thought it was
skipped, which is only true for bulk creation.

If is_approved=True (the default), the save() method makes the revision
the current revision, which sets Document.html to the revision.content
(default empty string).

I've fixed the storage code to save the initial revision with
is_approved=False, so that we don't accidentally clear doc.html when
adding a historical revision
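
The problem and the fix can be reproduced without Django. The sketch below is a simplified stand-in written in Python 3, assuming only what the commit message states: saving an approved revision makes it current and copies its content into the document. The `Document` and `Revision` classes here are illustrative, not Kuma's models.

```python
class Document:
    """Simplified stand-in for the document model."""

    def __init__(self, html=""):
        self.html = html
        self.current_revision = None


class Revision:
    """Simplified stand-in for the revision model."""

    def __init__(self, document, content="", is_approved=True):
        self.document = document
        self.content = content          # default: empty string
        self.is_approved = is_approved  # default: approved

    def save(self):
        # An approved revision becomes the current revision and
        # overwrites the document's HTML with its own content.
        if self.is_approved:
            self.document.current_revision = self
            self.document.html = self.content


# Before the fix: the initial save uses the defaults and clears doc.html.
doc = Document(html="<p>Current content</p>")
Revision(doc).save()
print(repr(doc.html))

# After the fix: the initial save is unapproved, so doc.html is untouched.
doc = Document(html="<p>Current content</p>")
Revision(doc, is_approved=False).save()
print(repr(doc.html))
```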
bug 1271509: Update test fixtures for

Fixtures assumed that Revision.objects.create() did not call .save(),
and document.current_revision had to be called directly. Update fixture
setup for my new understanding.
bug 1271509: Move error logging to the scraper

Instead of logging errors in Source.gather, capture the error so that it
can be logged by the scraper, and so that a summary of errors can be
printed at the end of the run.
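
One way to picture this split of responsibilities is sketched below in Python 3. The names `ToySource` and `scrape`, the `problem` argument, and the message formats are hypothetical; the only point taken from the commit message is that the source records its error and the scraper does the logging and the final summary.

```python
class ToySource:
    """Hypothetical source that records its own failure."""

    def __init__(self, name, problem=None):
        self.name = name
        self.problem = problem  # simulated failure message, if any
        self.state = "INIT"
        self.error = None

    def gather(self):
        try:
            if self.problem:
                raise ValueError(self.problem)
            self.state = "DONE"
        except ValueError as exc:
            # Capture the error instead of logging it here.
            self.state = "ERROR"
            self.error = str(exc)


def scrape(sources, log=print):
    """Run every source, log errors as they occur, then summarize them."""
    errors = []
    for source in sources:
        source.gather()
        if source.state == "ERROR":
            log("Error in %s: %s" % (source.name, source.error))
            errors.append(source)
    if errors:
        log("%d source(s) had errors:" % len(errors))
        for source in errors:
            log("  %s: %s" % (source.name, source.error))
    return errors


scrape([ToySource("document"), ToySource("translation", "404 Not Found")])
```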
bug 1271509: document_history detects all history

If the document history has fewer revisions than requested, then remember
that all revisions have been scraped.
bug 1271509: Handle doc.current_rev is not latest

At the end of document scraping, check that the document has a
current_revision. If it doesn't, scrape more revisions and/or more
history, until we find a current revision or run out of revisions.
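
The check described above amounts to a loop over the document's history. This is a rough sketch in Python 3 under assumptions of my own: `scrape_until_current` and the `scrape_revision` callback are hypothetical, revisions are scraped one at a time from newest to oldest, and the callback reports whether the revision it stored is the current one.

```python
def scrape_until_current(revision_ids, scrape_revision):
    """Scrape revisions, newest first, until the current one is found.

    revision_ids: revision IDs from the document history, newest first.
    scrape_revision: callable that stores one revision and returns True
        if that revision is the document's current revision.
    Returns (current revision ID or None, number of revisions scraped).
    """
    scraped = 0
    for rev_id in revision_ids:
        scraped += 1
        if scrape_revision(rev_id):
            return rev_id, scraped
    # Ran out of revisions without finding a current one.
    return None, scraped


# Example: the newest revision (30) is not the current one; 29 is.
def is_current(rev_id):
    return rev_id == 29


print(scrape_until_current([30, 29, 28], is_current))
```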

Page moves mark the page move in the most recent revision, but leave the
n-1 revision as the current_revision.  The default for ./scrape_document
and others is to just scrape one document, leaving these stuck with no
bug 1271509: Omit profiles when scraping links

When scraping a page for Document URLs, omit the links to profiles, such
as the contributor bar.
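
A minimal sketch of this kind of filtering, using only the Python 3 standard library. It assumes that document URLs contain `/docs/` and profile URLs contain `/profiles/`; the `DocLinkParser` class and `document_links` function are hypothetical names, not the code in this commit.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class DocLinkParser(HTMLParser):
    """Collect paths of links that look like document URLs."""

    def __init__(self):
        super().__init__()
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        path = urlparse(href).path  # drops any query string and fragment
        if "/profiles/" in path:
            return  # skip profile links, such as the contributor bar
        if "/docs/" in path and path not in self.paths:
            self.paths.append(path)


def document_links(html):
    """Return the unique document paths linked from the HTML, in order."""
    parser = DocLinkParser()
    parser.feed(html)
    return parser.paths


sample = (
    '<a href="/en-US/docs/Web/HTML">HTML</a>'
    '<a href="/en-US/profiles/jwhitlock">jwhitlock</a>'
    '<a href="/en-US/docs/Web/CSS#syntax">CSS</a>'
)
print(document_links(sample))
```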
bug 1271509: Fix spelling
bug 1271509: Use dotted shortcut for get_model
bug 1271509: Remove redundant urlparse
bug 1271509: Fix docs for when dependency needed
bug 1271509: Improve scrape management commands

- Update scrape_links docstring for abilities beyond homepage
- Use options['verbosity'], since always populated
- Use CommandError to report source failures
- Remove unneeded parentheses
- Close specification file after processing it
- Fix pluralization of incomplete sources message
Merge pull request #4076 from jwhitlock/sample_database_wip_1271509

Bug 1271509: Sample database and production scraper


10 months ago
Assignee: nobody → jwhitlock
Last Resolved: 10 months ago
Resolution: --- → FIXED

Comment 5

10 months ago
Commits pushed to master at
bug 1271509: Beta style in CKEditor

Use the design style in CKEditor, if the user is in the beta program.
Merge pull request #4307 from jwhitlock/zilla-editor-1271509

bug 1271509: Beta style in CKEditor