Closed
Bug 1271509
Opened 8 years ago
Closed 7 years ago
Provide an MDN database with sample data
Categories
(developer.mozilla.org Graveyard :: Code Cleanup, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jwhitlock, Assigned: jwhitlock)
Details
(Whiteboard: [specification][type:feature])
What problem would this feature solve?
======================================
The MDN installation instructions leave a developer with a working but empty database. In order to do useful work, they need to set configuration values, add user accounts, import pages and kumascript macros, and enter other sample data.

This situation also limits automated integration testing of MDN. A server could be automatically provisioned from a PR branch for testing. Without useful data in the database, there won't be anything for a Selenium-driven browser to check.

Who has this problem?
=====================
Core contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Developers are able to follow the install instructions, but start having difficulty when they begin enabling KumaScript, explore content, or reproduce bugs.

How are the users identified above solving this problem now?
============================================================
An anonymized copy of the production database is periodically generated, and distributed on an as-needed basis. This is overkill for most development needs, and weighs in at 11 GB uncompressed. Bugs in the anonymizing process risk leaking personally identifiable information (PII), so distribution is limited on a need-to-have-it basis.

Alternatively, data from the MDN website is manually copied to the empty production database, using the APIs or opening pages for editing to copy the raw data.

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
A script can use the MDN APIs to download select data:

* Key pages from MDN, such as the Firefox landing page and pages linked from the homepage and toolbars
* Historical revisions for key pages
* Some or all translations for key pages
* Public user profile data for related revisions
* Kumascript macros

Additional scripts can manually add other features, such as

* Moved and redirected pages
* Zones with custom CSS
* Sample user and IP bans
* Basic configuration
* Waffle flags
* Groups

The script can run periodically to sync with production changes, and the sample database published as a SQLite database or a database dump for developers or automation tools to download, install, and customize.

Alternatively, the sample data can be saved as a collection of Django fixtures, checked into the project, and periodically refreshed via the script.

Is there anything else we should know?
======================================
Generating mock data may be used for some features, but can't be used for all. For example, Kumascript templates are based in JavaScript, and generating useful scripts will not be worth the coding effort, compared to downloading working scripts from MDN.
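As a rough illustration of the Django fixtures option in the proposal, the sketch below turns public profile data (as a scraper might collect it) into Django's JSON fixture layout, which is a list of objects with `model`, `pk`, and `fields` keys. The model label `users.user`, the field names, and the sample profiles are assumptions made for illustration, not Kuma's actual schema.

```python
import json


def profiles_to_fixture(profiles, model_label="users.user"):
    """Convert scraped public profiles into Django fixture records.

    `model_label` and the field names are hypothetical; a real fixture
    must match the project's actual models.
    """
    records = []
    # Sort by username so primary keys are assigned deterministically.
    ordered = sorted(profiles, key=lambda p: p["username"])
    for pk, profile in enumerate(ordered, start=1):
        records.append({
            "model": model_label,
            "pk": pk,
            "fields": {
                "username": profile["username"],
                # Only public data is kept: no email, no password hash.
                "fullname": profile.get("fullname", ""),
                "is_active": not profile.get("banned", False),
            },
        })
    return records


def dump_fixture(records):
    """Serialize with stable key order so refreshed fixtures diff cleanly."""
    return json.dumps(records, indent=2, sort_keys=True)


# Made-up sample data, standing in for scraped profiles.
sample = [
    {"username": "zed", "fullname": "Zed Example"},
    {"username": "amy", "banned": True},
]
fixture = profiles_to_fixture(sample)
print(dump_fixture(fixture))
```

A file written this way could be loaded with `./manage.py loaddata`, provided the model label and fields match the real models.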
Comment 1•7 years ago
Commit pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/23b04895f7bdee2b86d4fa0ed5d54c0f8f39ed85
bug 1271509: Update install to use sample database

* Add details for installing Docker on Linux
* Switch database instructions to use the sample database
* Update Front-end instructions to go in order (install node.js, then gulp, and optionally install globally), and move them below viewing the homepage
* Drop instructions for adding "kumaediting" flag (in sample DB)
* Drop instructions for setting KUMASCRIPT_TIMEOUT (in sample DB)
* Mark GitHub Auth as optional, and move admin password disabling into that section
Comment 2•7 years ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/1199dd428fba32ccae2cce07552276629c49a591
bug 1271509: Add scrape_user command and framework

Add a content scraping framework, for adding or updating data to a local Kuma instance. The first user of the framework is the management command scrape_user, which can mirror a production user locally, using only public data on the user's profile.

https://github.com/mozilla/kuma/commit/873363f382c8e9c63b60644089b4b0e758796230
bug 1271509: Fix scraping of banned users

https://github.com/mozilla/kuma/commit/c3cdc7a9964871993fc10065d1af6068a748b66c
bug 1271509: Strong assertion that session is same

https://github.com/mozilla/kuma/commit/ce08787f8c7808f66a7af6931d3a968a15031e8b
Merge pull request #4245 from jwhitlock/scrape_user_1271509

bug 1271509: Add scrape_user command and framework
Comment 3•7 years ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/ceeba3df03ea0dfe8c866e1c3029439800cf3835
bug 1271509: Reduce user test fixtures

Any fixture that touches the database has to be destroyed and recreated with each test, so there is no benefit to capturing HTML or one-time model setup as a fixture.

https://github.com/mozilla/kuma/commit/7953547884e8948de154cc7fb30bf340f614563d
bug 1271509: Improve scraper reporting, cycling

Use pre-assembled format strings to report the progress on a source in the scraper, and make it a little clearer that repeat=True is getting set to detect if the scraper should loop again.

Process the sources in reverse order. This makes it more likely that the new sources will be done, so that the old sources that needed them can complete. Reduces a full scrape from 18 cycles to 13, but still about the same time (40 minutes).

https://github.com/mozilla/kuma/commit/879734650db50b38575d675b5622f225eae574ff
bug 1271509: Add ./manage.py scrape_document

Add the ability to scrape a document, which also includes scraping:

* The rendered page (to detect zones)
* The ancestor pages (parent and higher)
* Metadata and Translations (optional)
* Revisions (configurable)
* Child documents (optional)
* The top-level zone document

It does not include attachments.

https://github.com/mozilla/kuma/commit/e9045a9985bba8717859827bce8b7e4d576e4c2c
bug 1271509: Remove unneeded .get() for verbosity

Verbosity is included in all Django management commands. Assume it is available, like scrape_user does.

https://github.com/mozilla/kuma/commit/c484b30b78a80b409ca01f4407f347b365f3f273
bug 1271509: Refactor load_prereq_parent_topic

Convert the optional parent_topic loader to look more like other optional loaders.

https://github.com/mozilla/kuma/commit/e07448592b299498f8997fdf9ca0d276f4f643d2
Merge pull request #4248 from jwhitlock/scrape_document_1271509

bug 1271509: Add ./manage.py scrape_document r=rjohnson
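The cycling behaviour described in "Improve scraper reporting, cycling" can be sketched roughly as follows. This is a toy model rather than the Kuma scraper: the `Source` class, the state names, and the `known_needs` mapping are hypothetical, and each source simply finishes once its prerequisites are done. It shows why visiting the newest sources first can reduce the number of cycles.

```python
class Source:
    """A hypothetical unit of scraping work that may need other sources."""

    def __init__(self, name, needs=()):
        self.name = name
        self.needs = tuple(needs)
        self.state = "INIT"


def scrape(sources, known_needs, reverse=True):
    """Cycle over sources until all are DONE; return the cycle count.

    `sources` maps name -> Source in insertion order. `known_needs`
    says which prerequisites a newly discovered source will have.
    """
    cycles = 0
    repeat = True
    while repeat:
        repeat = False
        cycles += 1
        names = list(sources)  # snapshot; new sources wait for the next cycle
        if reverse:
            names.reverse()  # newest first, so fresh prerequisites finish early
        for name in names:
            source = sources[name]
            if source.state == "DONE":
                continue
            blocked = False
            for need in source.needs:
                if need not in sources:
                    # Discover the prerequisite and queue it as a new source.
                    sources[need] = Source(need, known_needs.get(need, ()))
                    blocked = True
                elif sources[need].state != "DONE":
                    blocked = True
            if blocked:
                repeat = True  # something is unfinished, so loop again
            else:
                source.state = "DONE"
    return cycles


def run_example(reverse):
    """A document needs a revision, which in turn needs a user."""
    sources = {"document": Source("document", ["revision"])}
    known_needs = {"revision": ["user"]}
    return scrape(sources, known_needs, reverse=reverse)


print("reverse order cycles:", run_example(True))
print("forward order cycles:", run_example(False))
```

In this toy chain, reverse order completes in fewer cycles than forward order. The sketch has no guard against circular prerequisites, which would loop forever.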
Comment 4•7 years ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/4dd193f822af6c82306a2db1bb70efc838eb07bc
bug 1271509: Add ./manage.py scrape_links

Scrape the documents linked from a page, such as the homepage.

https://github.com/mozilla/kuma/commit/e051f21e40779c155bfc110a1b933ddac051430b
bug 1271509: Add ./manage.py sample_mdn

Add a management command to populate a database with fixtures and with data scraped from MDN.

https://github.com/mozilla/kuma/commit/338113cff7fbf490ac079770dd4a54e53cae690a
bug 1271509: Add script to create sample database

Add a script and specification file to create the sample database used for MDN development and integration testing.

https://github.com/mozilla/kuma/commit/fae8163b906b9b234a6ec87debfd16a552286c93
bug 1271509: Handle 404s for document_rendered

One common case is that the document metadata refers to a deleted translation. Instead of halting the scrape with an error, set the document_rendered source to state=ERROR.

https://github.com/mozilla/kuma/commit/ea9944af9b73d44bda8da865de72a4757304a4dd
bug 1271509: Fix past revision clearing doc.html

When a new revision is created, even with Revision.objects.get_or_create, the Revision.save() method is called. This surprised me; I thought it was skipped, which is only true for bulk creation. If is_approved=True (the default), the save() method makes the revision the current revision, which sets Document.html to the revision.content (default empty string).

I've fixed the storage code to save the initial revision with is_approved=False, so that we don't accidentally clear doc.html when adding a historical revision.

https://github.com/mozilla/kuma/commit/424895e95bf7e0d832c918bf8a994c18c923179d
bug 1271509: Update test fixtures for rev.save()

Fixtures assumed that Revision.objects.create() did not call .save(), and document.current_revision had to be called directly. Update fixture setup for my new understanding.

https://github.com/mozilla/kuma/commit/d61bca617d5bc0d3b1a2cc807a1db6c4d910c8e9
bug 1271509: Move error logging to the scraper

Instead of logging errors in Source.gather, capture the error so that it can be logged by the scraper, and so that a summary of errors can be printed at the end of the run.

https://github.com/mozilla/kuma/commit/d59069582cbf1fa779480fc63330429b4c80a987
bug 1271509: document_history detects all history

If the document history has fewer revisions than requested, then remember that all revisions have been scraped.

https://github.com/mozilla/kuma/commit/08818cf5aa53833b5de97a233a1a64c99fe905a4
bug 1271509: Handle doc.current_rev is not latest

At the end of document scraping, check that the document has a current_revision. If it doesn't, scrape more revisions and/or more history, until we find a current revision or run out of revisions.

Page moves mark the page move in the most recent revision, but leave the n-1 revision as the current_revision. The default for ./scrape_document and others is to just scrape one document, leaving these stuck with no content.

https://github.com/mozilla/kuma/commit/e5de46c67ebfa3cd72c3ae74c709908bd624be9c
bug 1271509: Omit profiles when scraping links

When scraping a page for Document URLs, omit the links to profiles, such as the contributor bar.

https://github.com/mozilla/kuma/commit/d6cec60826b274f7e4af36761abaa4d5d2cdd0eb
bug 1271509: Fix spelling

https://github.com/mozilla/kuma/commit/b8af1b57169414f515603e64953b533b5c451af6
bug 1271509: Use dotted shortcut for get_model

https://github.com/mozilla/kuma/commit/0cd965362e12456363f457105800c2ccf6493e53
bug 1271509: Remove redundant urlparse

https://github.com/mozilla/kuma/commit/3d6abcd60e78bb4b58b21cb58017a36efc423dc3
bug 1271509: Fix docs for when dependency needed

https://github.com/mozilla/kuma/commit/22b7ab29740b82f43b05956deaaf0b0b1b5f0a2b
bug 1271509: Improve scrape management commands

- Update scrape_links docstring for abilities beyond homepage
- Use options['verbosity'], since always populated
- Use CommandError to report source failures
- Remove unneeded parentheses
- Close specification file after processing it
- Fix pluralization of incomplete sources message

https://github.com/mozilla/kuma/commit/6b29825d3441b95ee96f82509d3e8ee9062f34a2
Merge pull request #4076 from jwhitlock/sample_database_wip_1271509

Bug 1271509: Sample database and production scraper
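The save() side effect described in "Fix past revision clearing doc.html" can be illustrated with plain Python stand-ins. The classes below are simplified assumptions that mimic the described behaviour, not Kuma's Django models, and `store_historical_revision` is a hypothetical helper showing the shape of the fix.

```python
class Document:
    """Stand-in for a document that caches its current revision's HTML."""

    def __init__(self, html=""):
        self.html = html
        self.current_revision = None


class Revision:
    """Stand-in for a revision whose save() promotes approved revisions."""

    def __init__(self, document, content="", is_approved=True):
        self.document = document
        self.content = content
        self.is_approved = is_approved

    def save(self):
        # Mimics the described behaviour: an approved revision becomes
        # current, and the document's HTML is replaced by its content.
        if self.is_approved:
            self.document.current_revision = self
            self.document.html = self.content


def store_historical_revision(document, content):
    """Hypothetical fix: first save with is_approved=False, fill in later."""
    rev = Revision(document, is_approved=False)
    rev.save()  # stands in for the save() that get_or_create triggers
    rev.content = content
    return rev


# Problem: defaults (empty content, is_approved=True) wipe the document HTML.
buggy_doc = Document(html="<p>Current</p>")
Revision(buggy_doc).save()
print(repr(buggy_doc.html))

# Fix: the unapproved initial save leaves the document untouched.
safe_doc = Document(html="<p>Current</p>")
old_rev = store_historical_revision(safe_doc, "<p>Old</p>")
print(repr(safe_doc.html))
```

In Django itself, `get_or_create` and `create` both go through the model's `save()`, whereas `bulk_create` does not, which matches the surprise noted in the commit message.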
Updated•7 years ago
Assignee: nobody → jwhitlock
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 5•7 years ago
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/3e559737dfa02ef82a23fd69283fc1480c6df17b
bug 1271509: Beta style in CKEditor

Use the design style in CKEditor, if the user is in the beta program.

https://github.com/mozilla/kuma/commit/3b5f6a521ad9f2fc5631d6e9501a5a0a29a1e10c
Merge pull request #4307 from jwhitlock/zilla-editor-1271509

bug 1271509: Beta style in CKEditor
Updated•4 years ago
Product: developer.mozilla.org → developer.mozilla.org Graveyard