Closed Bug 1132658 Opened 7 years ago Closed 7 years ago

[Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set


( Graveyard :: General, defect)

Not set


(Not tracked)



(Reporter: fs, Assigned: jwhitlock)



(Whiteboard: [specification][type:bug])

What did you do?
Checked what is imported at

What happened?
Saw that the sub pages of e.g. Web/JavaScript/Reference/Global_Objects/Array/* are not imported.

What should have happened?
All pages of Global_objects/<obj>/<subpages> should be imported.

Is there anything else we should know?
I maintain the JavaScript docs on MDN, and I am interested in how good the importer is so far. The majority of pages (maybe 400+) live under Global_objects/*/*.

A full scrape would help to identify parsing problems.
What serves as the source for pages to crawl?
Blocks: 1132269
We'll have to create features for the sub pages to scrape the data into the API.  Is there a list of URLs for the subpages?

The source of the MDN URLs with compatibility data is the WebPlatform CompaTables project:

They scraped compatibility data from MDN in August 2014.  While the scraped data was unsuitable for redisplay on MDN, it did give a collection of MDN URLs used for scraping, without doing a detailed crawl of MDN.  The list does not reflect pages added or removed after August 2014, and it omits pages the WebPlatform team didn't include in their scrape.
Flags: needinfo?(fscholz)
A lot has happened on MDN since August 2014, and WPD really only provides a subset of our documentation. There is a lot more on MDN. Changing the summary of this bug, as it affects more than the JS docs.

There is a way to get the current (sub) pages in Kuma. Most of the pages under the following trees should provide compat data:$children$children$children$children$children$children$children

We are documenting new APIs and features every day. So, ideally we would use this data directly from MDN to get the most and the latest data into the compat store.
Flags: needinfo?(fscholz)
Summary: [Compat Data][Importer] Scrape whole JavaScript reference docs → [Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set
Thank you, that's a useful resource.  I can build a tool to process the JSON, compare it to the current feature list, and create new features (4-12 hours).  The complication will be removing pages with no specification or browser compatibility section.  This information is available in the page metadata (such as$json), but relies on the h3 ID being set correctly.  Some pages use the id "Browser_compatibility" for the specification section as well.  However, as long as one or the other is present, I can add the page to the scrape list.

I can then use the existing tool to grab the pages and run the scraper against them.  This will probably generate the next list of scraper and page issues to fix :)
fwiw, there is also an expanded JSON children list. Just add "?expand" to the URLs above.$children?expand

There you have a "sections" key and it would contain something like this:
"sections": [
  {
    "id": "Specifications",
    "title": "Specifications"
  },
  {
    "id": "Browser_compatibility",
    "title": "Browser compatibility"
  }
]

So checking if "Browser_compatibility" is in "sections" could help to only get pages that actually provide compat information.
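The check described above could be sketched roughly as follows. This is only an illustration: the "sections" key and the two section IDs come from this thread, while the "slug" and "subpages" keys, the function name, and the sample tree are assumptions about the expanded $children JSON, not confirmed here.

```python
from typing import Iterator

# Section IDs mentioned in this bug as markers for spec or compat data.
WANTED_SECTIONS = {"Specifications", "Browser_compatibility"}


def pages_with_compat(node: dict) -> Iterator[str]:
    """Yield slugs of pages that have at least one wanted section ID.

    Assumes each node has "slug", "sections" and "subpages" keys;
    only "sections" is shown in the thread.
    """
    section_ids = {section.get("id") for section in node.get("sections", [])}
    if section_ids & WANTED_SECTIONS:
        yield node.get("slug", "")
    # Recurse into child pages, depth first.
    for child in node.get("subpages", []):
        yield from pages_with_compat(child)


# Hypothetical sample data, shaped like the snippet quoted above.
sample_tree = {
    "slug": "Web/Example",
    "sections": [{"id": "Summary", "title": "Summary"}],
    "subpages": [
        {
            "slug": "Web/Example/with_compat",
            "sections": [
                {"id": "Specifications", "title": "Specifications"},
                {"id": "Browser_compatibility", "title": "Browser compatibility"},
            ],
            "subpages": [],
        },
        {"slug": "Web/Example/plain", "sections": [], "subpages": []},
    ],
}
```

Accepting either section ID matches the "one or the other is present" rule from the earlier comment.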
An even better approach might be to use the search API that looks for the CompatibilityTable macro that is used to generate the current compat tables.

This also gives you an idea of how many features should be scraped (ca. 2797 at this point).
I've submitted the PR for the code fixing this bug [1].  It is a lot of code, and it may take a few weeks for the PR to be accepted and the bug to be closed.  The code is running on, features have been imported, and the initial scrape ran successfully.

I've created features for all the pages under Web[2], Navigation_timing[3], Server-sent_events[4], WebAPI[5], and WebSockets[6], using the $children API[7]. Additional trees can be added (open a bug), and the tool (tools/ can be re-run to update with MDN changes. I trust the $children API a little more than the search engine.

After the mirroring and scraping, there are 6042 features[8]. I also synced the specifications[9] and maturities[10], and added some versions[11] to reflect changes on MDN since February.  I've tried not to add invalid versions, so that the "unknown_versions" issue will highlight actual issues.  The importer has found 6239 issues[12], including critical errors that may mask additional issues.

The importer search form, either at the top of the imported pages list [13] or at its own URL [14], can be used to find a particular MDN page of interest.

This is really great, John!

Only today for the first time, I spotted something that was not scraped:$children

14 pages under "handler". Is this due to the deep nesting?
These pages don't appear when fetching the Web $children [1].  The MDN API supports a maximum depth of 5, so anything more deeply nested is not imported.  I've added bug 1182542 to fix it.
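One possible way to work around a depth limit like this is to find the pages sitting exactly at the limit and request $children again for each of them. The sketch below only does the first step. It assumes nodes with "slug" and "subpages" keys and counts the root as depth 1; how bug 1182542 actually addressed the problem is not described in this thread.

```python
from typing import List


def slugs_at_depth(node: dict, limit: int, depth: int = 1) -> List[str]:
    """Return slugs of nodes found exactly at the given depth.

    Pages at the depth limit may have children that the API cut off,
    so they are candidates for a follow-up $children request.
    """
    if depth == limit:
        return [node["slug"]]
    found: List[str] = []
    for child in node.get("subpages", []):
        found.extend(slugs_at_depth(child, limit, depth + 1))
    return found
```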

[1]$children (fixed from comment 6)
Assignee: nobody → jwhitlock
Commits pushed to master at
bug 1132658 - Bump requirements

Bump requirements as suggested by
bug 1132658 - Remove redundant resource name

When displaying the diff of a collection, adding the resource name to
the output is redundant, since it appears in the JSON API output.
bug 1132658 - Improve diffs for translated strings

Order the locales in the translated string (en, then in alpha order), so
that the diffs highlight the changes rather than random dict sorting.
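A minimal sketch of the ordering this commit message describes ("en" first, remaining locales alphabetically). The function name is hypothetical and this is not the project's actual code.

```python
from collections import OrderedDict
from typing import Dict


def order_locales(translated: Dict[str, str]) -> "OrderedDict[str, str]":
    """Return the translations with "en" first, then other locales A-Z."""
    # The tuple key sorts "en" (False) ahead of every other locale (True).
    keys = sorted(translated, key=lambda locale: (locale != "en", locale))
    return OrderedDict((key, translated[key]) for key in keys)
```

With a stable key order, two serialized versions of the same translated string differ only where a translation actually changed.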
bug 1132658 - Data ID for section uses subpath

It is likely that the subpath will be set but not the number.
bug 1132658 - Refactor common tool code

Combine common tool code into tools/, to reduce duplication and
simplify command line parsing.
bug 1132658 - Add Collection.load_collection

Collection.load_collection will copy the resources from another
collection. The resources can then be modified, and a
CollectionChangeset used to update the original Collection.
bug 1132658 - Allow 'false' as a canonical name

Previously, "false" was interpreted as the JSON value for False.
bug 1132658 - Add tools/

New tool for gathering MDN pages and creating or updating the branch
features related to them.  Includes converting to canonical names,
better handling of long slugs, and adding URLs to pages w/o
compatibility data.
bug 1132658 - Improve importer search by URL

When searching the importer by MDN URL, drop the querystring and
fragment automatically, rather than returning an error.
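Dropping the query string and fragment could look roughly like this with Python's standard library. The function name and the example.com host in the usage note are placeholders, not the importer's real code or URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and the fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For instance, a search for https://example.com/en-US/docs/Web/CSS/display?raw#Syntax would be normalized to https://example.com/en-US/docs/Web/CSS/display before the lookup.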
bug 1132658 - Improve admin for mdn app

Use readonly_fields to prevent loading the whole database in the admin.
bug 1132658 - Add "No Data" status for pages

Add a page status for "page imported w/o compat data"
bug 1132658 - Fix canonical feature names

In the scrape-constructed view_feature, use name="canonical", rather
than name={"zxx": "canonical"}.
bug 1132658 - Add doc_parse_error issue

Some pages don't have the expected structure, such as <h2> headings,
resulting in the doc rule not matching.  Turn this into a
doc_parse_error issue, instead of an Exception, for further processing.

bug 1132658 - Turn Spec* assertions into issues

Add issues specname_blank_key, spec2_wrong_kumascript, and
spec2_arg_count to replace assertions resulting in an exception issue.
Also, refactor visitor.unknown_kumascript_issue into
visitor.kumascript_issue, so it can be used in more KumaScript issue
bug 1132658 - Fix typos in issue templates
bug 1132658 - Download MDN pages w/o cache
bug 1132658 - Small importer UI fixes

- Issues name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"
bug 1132658 - Fix template for failed_download
bug 1132658 - Handle text with partial quotes
has the text '"max-zoom" descriptor, which doesn't end in quotes.
bug 1132658 - Handle redirect on $json
bug 1132658 - Improve handling of MDN locales
has a Bengali (bn-BD) translation that broke processing, prompting several
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature
  if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the
  human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded
  unicode) on the sample feature page
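The IRI versus URI distinction in the last item can be illustrated with the standard library. This is a simplified sketch with hypothetical helper names: it decodes every percent-escape, including reserved characters, so it is suitable for display only.

```python
from urllib.parse import quote, unquote


def to_display_iri(uri: str) -> str:
    """Decode percent-encoded UTF-8 so the path shows readable Unicode."""
    return unquote(uri)


def to_uri(iri: str) -> str:
    """Percent-encode non-ASCII characters, keeping common URL delimiters."""
    return quote(iri, safe=":/?#&=")
```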
fix bug 1132658 - Convert download failure to issue

When mdn.tasks.fetch_translation gets a non-200 response, report as a
failed_download issue and continue. Previously, an exception was also
raised, which halted tools/
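The change in behavior can be sketched as below: a non-200 response is recorded as a failed_download issue and the loop moves on. The fetch callable, the tuple layout, and the function name are assumptions made for illustration; this is not the actual mdn.tasks.fetch_translation code.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Issue = Tuple[str, str, int]  # (issue slug, URL, HTTP status)


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Tuple[int, str]],
) -> Tuple[Dict[str, str], List[Issue]]:
    """Fetch each URL, collecting failures as issues instead of raising."""
    pages: Dict[str, str] = {}
    issues: List[Issue] = []
    for url in urls:
        status, body = fetch(url)
        if status != 200:
            # Record the failure and continue with the next page.
            issues.append(("failed_download", url, status))
            continue
        pages[url] = body
    return pages, issues
```

Passing the fetcher in as a parameter keeps the sketch free of network access and makes it easy to exercise with a fake.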
Merge pull request #31 from jwhitlock/1132658_more_mdn

Fix bug 1132658 - Scrape all MDN pages
Closed: 7 years ago
Resolution: --- → FIXED
Product: → Graveyard