Closed Bug 1132658 Opened 9 years ago Closed 9 years ago

[Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set

Categories

(developer.mozilla.org Graveyard :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fs, Assigned: jwhitlock)

References

Details

(Whiteboard: [specification][type:bug])

What did you do?
================
Checked what is imported at https://browsercompat.herokuapp.com/importer/?page=20


What happened?
==============
Saw that the sub pages of e.g. Web/JavaScript/Reference/Global_Objects/Array/* are not imported.

What should have happened?
==========================
All pages under Global_Objects/<obj>/<subpages> should be imported.

Is there anything else we should know?
======================================
I maintain the JavaScript docs on MDN and am interested in how good the importer is so far. The majority of pages (maybe 400+) live under Global_Objects/*/*.

A full scrape would help to identify parsing problems.
What serves as the source for pages to crawl?
Blocks: 1132269
We'll have to create features for the sub pages to scrape the data into the API.  Is there a list of URLs for the subpages?

The source of the MDN URLs with compatibility data is the WebPlatform CompaTables project:

https://docs.webplatform.org/wiki/WPD:Projects/CompaTables

They scraped compatibility data from MDN in August 2014.  While the scraped data was unsuitable for redisplay on MDN, it did give a collection of MDN URLs used for scraping, without doing a detailed crawl of MDN.  Pages added or removed after August 2014 are not reflected, nor are pages the WebPlatform team left out of their scrape.
Flags: needinfo?(fscholz)
A lot has happened on MDN since August 2014, and WPD really only provides a subset of our documentation. There is a lot more on MDN. Changing the summary of this bug, as it affects more than just the JS docs.

There is a way to get the current (sub) pages in Kuma. Most of the pages under the following trees should provide compat data:

https://developer.mozilla.org/en-US/docs/Web/API$children
https://developer.mozilla.org/en-US/docs/Web/CSS$children
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference$children
https://developer.mozilla.org/en-US/docs/Web/HTML/Element$children
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes$children
https://developer.mozilla.org/en-US/docs/Web/MathML/Element$children
https://developer.mozilla.org/en-US/docs/Web/SVG/Element$children

We are documenting new APIs and features every day. So, ideally, we would use this data directly from MDN to get the most complete and most recent data into the compat store.
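
As a rough illustration of how such a tree could be consumed, here is a minimal sketch that flattens a $children-style response into a list of page URLs. The "url" and "subpages" key names are assumptions about the JSON shape, and the sample data is hand-written rather than real MDN output.

```python
from typing import Iterator, Tuple


def iter_pages(node: dict, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, url) for a node and all of its descendants.

    Assumes each node is a dict with a "url" string and an optional
    "subpages" list of child nodes (hypothetical key names).
    """
    yield depth, node["url"]
    for child in node.get("subpages", []):
        yield from iter_pages(child, depth + 1)


# Hand-written sample in the assumed shape, not real MDN output.
sample = {
    "url": "/en-US/docs/Web/CSS",
    "subpages": [
        {"url": "/en-US/docs/Web/CSS/color", "subpages": []},
        {
            "url": "/en-US/docs/Web/CSS/@viewport",
            "subpages": [
                {"url": "/en-US/docs/Web/CSS/@viewport/max-zoom", "subpages": []},
            ],
        },
    ],
}

urls = [url for _, url in iter_pages(sample)]
```

The resulting list could then be compared against the existing feature list to find pages that still need a feature.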
Flags: needinfo?(fscholz)
Summary: [Compat Data][Importer] Scrape whole JavaScript reference docs → [Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set
Thank you, that's a useful resource.  I can build a tool to process the JSON, compare it to the current feature list, and create new features (4-12 hours).  The complication will be removing pages with no specification or browser compatibility section - this is available in the page metadata (such as https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model$json), but relies on the h3 ID being set correctly.  Some pages use the id "Browser_compatibility" for the specification section as well.  However, as long as one or the other is present, I can add it to the scrape list.

I can then use the existing tool to grab the pages and run the scraper against them.  This will probably generate the next list of scraper and page issues to fix :)
fwiw, there is also an expanded JSON children list. Just add "?expand" to the URLs above.

https://developer.mozilla.org/en-US/docs/Web/CSS$children?expand

There you have a "sections" key and it would contain something like this:
"sections": [
    {
        "id": "Specifications",
        "title": "Specifications"
    },
    {
        "id": "Browser_compatibility",
        "title": "Browser compatibility"
    },
    ...

So checking if "Browser_compatibility" is in "sections" could help to only get pages that actually provide compat information.
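
A minimal sketch of that check, assuming each expanded page entry is a dict whose "sections" list holds {"id", "title"} dicts as in the excerpt above. The function name and sample pages are hypothetical.

```python
from typing import Iterable


def has_section(page: dict, wanted_ids: Iterable[str] = ("Browser_compatibility",)) -> bool:
    """Return True if any section id on the page is one of wanted_ids."""
    section_ids = {section.get("id") for section in page.get("sections", [])}
    return bool(section_ids & set(wanted_ids))


# Hand-written sample pages, not real MDN output.
with_compat = {
    "sections": [
        {"id": "Specifications", "title": "Specifications"},
        {"id": "Browser_compatibility", "title": "Browser compatibility"},
    ]
}
spec_only = {"sections": [{"id": "Specifications", "title": "Specifications"}]}
no_sections = {}

compat_flags = [has_section(p) for p in (with_compat, spec_only, no_sections)]
# Accept either section id, since some pages only carry one of them.
either_flags = [
    has_section(p, ("Specifications", "Browser_compatibility"))
    for p in (with_compat, spec_only, no_sections)
]
```

Passing both ids keeps pages that have only a specification section on the scrape list, which matches the "one or the other" rule mentioned earlier in this bug.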
An even better approach might be to use the search API that looks for the CompatibilityTable macro that is used to generate the current compat tables.

https://developer.mozilla.org/en-US/search?locale=en-US&kumascript_macros=CompatibilityTable
https://developer.mozilla.org/en-US/search.json?locale=en-US&kumascript_macros=CompatibilityTable

This also gives you an idea of how many features should be scraped (ca. 2797 at this point).
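
For scripting against the JSON endpoint, the query string can be built rather than hand-typed. This sketch only constructs the URL shown above; the helper name is hypothetical, and the shape of the response is not covered here.

```python
from urllib.parse import urlencode

SEARCH_JSON = "https://developer.mozilla.org/en-US/search.json"


def macro_search_url(macro: str, locale: str = "en-US") -> str:
    """Build the search.json URL for pages that use a KumaScript macro."""
    query = urlencode({"locale": locale, "kumascript_macros": macro})
    return SEARCH_JSON + "?" + query


search_url = macro_search_url("CompatibilityTable")
```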
I've submitted the PR for the code fixing this bug [1].  It is a lot of code, and it may take a few weeks for the PR to be accepted and the bug closed.  The code is running on https://browsercompat.herokuapp.com, features have been imported, and the initial scrape ran successfully.

I've created features for all the pages under Web[2], Navigation_timing[3], Server-sent_events[4], WebAPI[5], and WebSockets[6], using the $children API[7]. Additional trees can be added (open a bug), and the tool (tools/mirror_mdn_features.py) can be re-run to update with MDN changes. I trust the $children API a little more than the search engine.

After the mirroring and scraping, there are 6042 features[8]. I also synced the specifications[9] and maturities[10], and added some versions[11] to reflect changes on MDN since February.  I've tried not to add invalid versions, so that the "unknown_versions" issue will highlight actual issues.  The importer has found 6239 issues[12], including critical errors that may mask additional issues.

The importer search form, either at the top of the imported pages list [13] or at its own URL [14], can be used to find a particular MDN page of interest.

[1] https://github.com/mozilla/web-platform-compat/pull/31
[2] https://developer.mozilla.org/en-US/Web
[3] https://developer.mozilla.org/en-US/docs/Navigation_timing
[4] https://developer.mozilla.org/en-US/docs/Server-sent_events
[5] https://developer.mozilla.org/en-US/docs/WebAPI
[6] https://developer.mozilla.org/en-US/docs/WebSockets
[7] https://developer.mozilla.org/en-US/docs/WebSockets$children
[8] https://browsercompat.herokuapp.com/browse/features
[9] https://browsercompat.herokuapp.com/browse/specifications
[10] https://browsercompat.herokuapp.com/browse/maturities
[11] https://browsercompat.herokuapp.com/browse/browsers
[12] https://browsercompat.herokuapp.com/importer/issues
[13] https://browsercompat.herokuapp.com/importer/
[14] https://browsercompat.herokuapp.com/importer/search
This is really great, John!

Only today for the first time, I spotted something that was not scraped: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy/handler$children

14 pages under "handler". Is this due to the deep nesting?
These pages don't appear when fetching the Web $children [1].  There's a maximum depth of 5 supported by the MDN API, so anything more deeply nested is not imported.  I've added bug 1182542 to fix it.

[1] https://developer.mozilla.org/en-US/docs/Web$children (fixed from comment 6)
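
One possible workaround for the depth limit, sketched under assumptions: whenever the walk reaches a node at the maximum depth, issue a fresh $children request rooted at that node. The "slug" and "subpages" keys, the fetch_children callable, and the fake data are all hypothetical, and this is not necessarily how bug 1182542 was resolved.

```python
from typing import Callable, Dict, Iterator, List

MAX_DEPTH = 5  # depth limit quoted in the comment above


def crawl(slug: str, fetch_children: Callable[[str], dict],
          max_depth: int = MAX_DEPTH) -> Iterator[str]:
    """Yield every slug under `slug`, re-requesting at the depth limit."""
    pending = [(fetch_children(slug), 0)]
    while pending:
        node, depth = pending.pop()
        yield node["slug"]
        if depth == max_depth:
            # The response was cut off here, so ask again from this node.
            node, depth = fetch_children(node["slug"]), 0
        for child in node.get("subpages", []):
            pending.append((child, depth + 1))


# Fake four-level hierarchy and a fake fetcher that truncates at 2 levels.
FULL: Dict[str, List[str]] = {
    "a": ["a/b"], "a/b": ["a/b/c"], "a/b/c": ["a/b/c/d"], "a/b/c/d": [],
}


def fake_fetch(slug: str, limit: int = 2, depth: int = 0) -> dict:
    kids = FULL[slug] if depth < limit else []
    return {"slug": slug,
            "subpages": [fake_fetch(k, limit, depth + 1) for k in kids]}


found = sorted(crawl("a", fake_fetch, max_depth=2))
```

Note that this re-requests every node at the limit, including leaves with no children, so a real crawl would trade extra requests for completeness.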
Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED
Commits pushed to master at https://github.com/mozilla/web-platform-compat

https://github.com/mozilla/web-platform-compat/commit/0ac90e3035184ba2bca31b57d45515da59036277
bug 1132658 - Bump requirements

Bump requirements as suggested by requires.io

https://github.com/mozilla/web-platform-compat/commit/2e16684db36d981d9491731102d9cc5bcbd185fe
bug 1132658 - Remove redundant resource name

When displaying the diff of a collection, adding the resource name to
the output is redundant, since it appears in the JSON API output.

https://github.com/mozilla/web-platform-compat/commit/50f754f6542ada032f01adabae6f20b436c162d9
bug 1132658 - Improve diffs for translated strings

Order the locales in the translated string (en, then in alpha order), so
that the diffs highlight the changes rather than random dict sorting.
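
A minimal sketch of the ordering described in this commit message (en first, then the rest alphabetically). The function name and sample data are hypothetical, and this is not the code from the commit itself.

```python
def ordered_locales(translations: dict) -> list:
    """Return locale codes with "en" first, then the rest alphabetically."""
    # False sorts before True, so the "en" key always comes first.
    return sorted(translations, key=lambda locale: (locale != "en", locale))


# Hand-written sample of a translated string.
name = {"ja": "色", "en": "color", "de": "Farbe"}
locales = ordered_locales(name)
```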

https://github.com/mozilla/web-platform-compat/commit/7a568b49a410597c769e9ffd5eb5668be978bee1
bug 1132658 - Data ID for section uses subpath

It is likely that the subpath will be set but not the number.

https://github.com/mozilla/web-platform-compat/commit/cd89c466751fcfbf38ba2a2eff6a479dbc620068
bug 1132658 - Refactor common tool code

Combine common tool code into tools/common.py, to reduce duplication and
simplify command line parsing.

https://github.com/mozilla/web-platform-compat/commit/0dd9fed3df42f61675c406a3a7c48b54f44a2667
bug 1132658 - Add Collection.load_collection

Collection.load_collection will copy the resources from another
collection. The resources can then be modified, and a
CollectionChangeset used to update the original Collection.

https://github.com/mozilla/web-platform-compat/commit/9522c042f25e309e0aa9c2f708e3f42359d37c8c
bug 1132658 - Allow 'false' as a canonical name

Previously, "false" was interpreted as the JSON value for False.

https://github.com/mozilla/web-platform-compat/commit/358aae284ad66e7b4ba33397ad9c1c0ac0aeeda2
bug 1132658 - Add tools/mirror_mdn_features.py

New tool for gathering MDN pages and creating or updating the branch
features related to them.  Includes converting to canonical names,
better handling of long slugs, and adding URLs to pages w/o
compatibility data.

https://github.com/mozilla/web-platform-compat/commit/6b5613aea0b918642d8327a2df9361bc41abf4f2
bug 1132658 - Improve importer search by URL

When searching the importer by MDN URL, drop the querystring and
fragment automatically, rather than returning an error.
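
The normalization described in this commit message can be sketched with the standard library as follows. The function name and sample URL are hypothetical, and this is not the code from the commit itself.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return the URL with its querystring and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


cleaned = strip_query_and_fragment(
    "https://developer.mozilla.org/en-US/docs/Web/CSS/color?raw=1#Browser_compatibility"
)
```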

https://github.com/mozilla/web-platform-compat/commit/8c33f5375ac14fd03599cc5e1c9fdde10d9c8e97
bug 1132658 - Improve admin for mdn app

Use readonly_fields to prevent loading the whole database in the admin.

https://github.com/mozilla/web-platform-compat/commit/0d5827f3dfee97394a96817e7375d3e191ee2a94
bug 1132658 - Add "No Data" status for pages

Add a page status for "page imported w/o compat data"

https://github.com/mozilla/web-platform-compat/commit/ba0911eb8d19f51fbec3dfe9646bfb9646a3e2bc
bug 1132658 - Fix canonical feature names

In the scrape-constructed view_feature, use name="canonical", rather
than name={"zxx": "canonical"}.

https://github.com/mozilla/web-platform-compat/commit/fb516abcd7c479b5e542035e081ab54aed6a9608
bug 1132658 - Add doc_parse_error issue

Some pages don't have the expected structure, such as <h2> headings,
resulting in the doc rule not matching.  Turn this into a
doc_parse_error issue, instead of an Exception, for further processing.

Example:

https://developer.mozilla.org/en-US/docs/Navigation_timing

https://github.com/mozilla/web-platform-compat/commit/34d7b3ebd2471039044cbace6dabb11238358f5c
bug 1132658 - Turn Spec* assertions into issues

Add issues specname_blank_key, spec2_wrong_kumascript, and
spec2_arg_count to replace assertions resulting in an exception issue.
Also, refactor visitor.unknown_kumascript_issue into
visitor.kumascript_issue, so it can be used in more KumaScript issue
reporting.

https://github.com/mozilla/web-platform-compat/commit/5e1e0b2c6fa05a4bad6c41471bd0a80cd2eb87c4
bug 1132658 - Fix typos in issue templates

https://github.com/mozilla/web-platform-compat/commit/c0b79272d868ed00a9d2a6eba97217b85b57167b
bug 1132658 - Download MDN pages w/o cache

https://github.com/mozilla/web-platform-compat/commit/f21935950a8e07d621f09c704638d469a5602387
bug 1132658 - Small importer UI fixes

- Issues name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"

https://github.com/mozilla/web-platform-compat/commit/e8bfa3b79394bef4da6ff6a5255e2a9689a1c742
bug 1132658 - Fix template for failed_download

https://github.com/mozilla/web-platform-compat/commit/ea7d61ccc7da77bbf3df885a377340bbc4e7d7e1
bug 1132658 - Handle text with partial quotes

https://developer.mozilla.org/en-US/docs/Web/CSS/@viewport/max-zoom
has the text '"max-zoom" descriptor, which doesn't end in quotes.

https://github.com/mozilla/web-platform-compat/commit/bf6c3bf156bd8706cc9220c82efd9a07cf105eb1
bug 1132658 - Handle redirect on $json

https://github.com/mozilla/web-platform-compat/commit/c6b82a57c8f3c4292a6a658319cfe1010c7b180e
bug 1132658 - Improve handling of MDN locales

https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_CSS_gradients
has a Bengali (bn-BD) translation that broke processing, prompting several
changes:
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature
  if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the
  human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded
  unicode) on the sample feature page

https://github.com/mozilla/web-platform-compat/commit/a491b967f703c2bc7376e9c8bfe9b16e70c10d4d
fix bug 1132658 - Convert download failure to issue

When mdn.tasks.fetch_translation gets a non-200 response, report as a
failed_download issue and continue. Previously, an exception was also
raised, which halted tools/import_mdn.py.

https://github.com/mozilla/web-platform-compat/commit/42db082a4217b06fa7b878bfc681f369875828d2
Merge pull request #31 from jwhitlock/1132658_more_mdn

Fix bug 1132658 - Scrape all MDN pages
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: developer.mozilla.org → developer.mozilla.org Graveyard