1132658 - [Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set

Reporter

Description

•

10 years ago

What did you do?
================
Checked what is imported at https://browsercompat.herokuapp.com/importer/?page=20

What happened?
==============
Saw that the subpages of e.g. Web/JavaScript/Reference/Global_Objects/Array/* are not imported.

What should have happened?
==========================
All pages of Global_objects/<obj>/<subpages> should be imported.

Is there anything else we should know?
======================================
I am maintaining the JavaScript docs on MDN and I am interested in how good the importer is so far. The majority of pages (maybe 400+) live under Global_objects/*/*. A full scrape would help to identify parsing problems.

What serves as the source for pages to crawl?

Florian Scholz (Open Web Docs)

Reporter

Updated

•

10 years ago

Blocks: 1132269

John Whitlock [:jwhitlock]

Assignee

Comment 1

•

10 years ago

We'll have to create features for the subpages to scrape the data into the API. Is there a list of URLs for the subpages?

The source of the MDN URLs with compatibility data is the WebPlatform CompaTables project:

https://docs.webplatform.org/wiki/WPD:Projects/CompaTables

They scraped compatibility data from MDN in August 2014. While the scraped data was unsuitable for redisplay on MDN, it did give a collection of MDN URLs used for scraping, without doing a detailed crawl of MDN. Changes after August 2014 (pages added or removed) are not reflected, and pages the WebPlatform team left out of their scrape are missing as well.

Flags: needinfo?(fscholz)

Florian Scholz (Open Web Docs)

Reporter

Comment 2

•

10 years ago

A lot has happened on MDN since August 2014, and WPD only covers a subset of our documentation. There is a lot more on MDN. I am changing the summary of this bug, as it does not only affect the JS docs.

There is a way to get the current (sub)pages in Kuma. Most of the pages under the following trees should provide compat data:

https://developer.mozilla.org/en-US/docs/Web/API$children
https://developer.mozilla.org/en-US/docs/Web/CSS$children
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference$children
https://developer.mozilla.org/en-US/docs/Web/HTML/Element$children
https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes$children
https://developer.mozilla.org/en-US/docs/Web/MathML/Element$children
https://developer.mozilla.org/en-US/docs/Web/SVG/Element$children

We are documenting new APIs and features every day. So, ideally, we would use this data directly from MDN to get the most complete and most recent data into the compat store.

Flags: needinfo?(fscholz)

Summary: [Compat Data][Importer] Scrape whole JavaScript reference docs → [Compat Data][Importer] Scrape a more recent (ideally current) MDN URL set

John Whitlock [:jwhitlock]

Assignee

Comment 3

•

10 years ago

Thank you, that's a useful resource. I can build a tool to process the JSON, compare it to the current feature list, and create new features (4-12 hours).

The complication will be removing pages with no specification or browser compatibility section. This is available in the page metadata (such as https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model$json), but relies on the h3 ID being set correctly. Some pages use the id "Browser_compatibility" for the specification section as well. However, as long as one or the other is present, I can add the page to the scrape list.

I can then use the existing tool to grab the pages and run the scraper against them. This will probably generate the next list of scraper and page issues to fix :)
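The compare step described above can be sketched roughly as follows. This is only an illustration, not the tool that was eventually written: it assumes the $children JSON is a nested object with "slug" and "subpages" keys, and the names iter_slugs and missing_slugs and the sample data are hypothetical.

```python
from typing import Dict, Iterator, List, Set


def iter_slugs(node: Dict) -> Iterator[str]:
    """Yield the slug of a page and of every page nested below it."""
    yield node["slug"]
    for child in node.get("subpages", []):
        yield from iter_slugs(child)


def missing_slugs(children_tree: Dict, known_slugs: Set[str]) -> List[str]:
    """Return the slugs in the MDN tree that have no feature yet, sorted."""
    return sorted(set(iter_slugs(children_tree)) - known_slugs)


# Hypothetical sample shaped like a $children response (key names assumed).
tree = {
    "slug": "Web/JavaScript/Reference/Global_Objects/Array",
    "subpages": [
        {"slug": "Web/JavaScript/Reference/Global_Objects/Array/concat", "subpages": []},
        {"slug": "Web/JavaScript/Reference/Global_Objects/Array/map", "subpages": []},
    ],
}
# Slugs that already have a feature in the API (hypothetical).
known = {"Web/JavaScript/Reference/Global_Objects/Array"}

print(missing_slugs(tree, known))
```

The slugs returned are the candidates for new features; whether each one actually carries compat data still has to be checked separately.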

Florian Scholz (Open Web Docs)

Reporter

Comment 4

•

10 years ago

fwiw, there is also an expanded JSON children list. Just add "?expand" to the URLs above.

https://developer.mozilla.org/en-US/docs/Web/CSS$children?expand

There you have a "sections" key and it would contain something like this:

    "sections": [
        {
            "id": "Specifications",
            "title": "Specifications"
        },
        {
            "id": "Browser_compatibility",
            "title": "Browser compatibility"
        },
        ...

So checking if "Browser_compatibility" is in "sections" could help to only get pages that actually provide compat information.
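A minimal sketch of that check, combined with the "one or the other" rule from comment 3. The "sections" shape is taken from the snippet above; the "slug" and "subpages" keys, the function names, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List

# Section IDs that suggest a page carries spec or compat data.
COMPAT_SECTION_IDS = {"Specifications", "Browser_compatibility"}


def has_compat_sections(page: Dict) -> bool:
    """True if the page lists at least one of the expected section IDs."""
    section_ids = {section.get("id") for section in page.get("sections", [])}
    return bool(section_ids & COMPAT_SECTION_IDS)


def pages_to_scrape(node: Dict) -> List[str]:
    """Walk an expanded $children tree and collect slugs worth scraping."""
    found = [node["slug"]] if has_compat_sections(node) else []
    for child in node.get("subpages", []):
        found.extend(pages_to_scrape(child))
    return found


# Hypothetical sample; only "sections" follows the shape quoted above.
sample = {
    "slug": "Web/CSS",
    "sections": [{"id": "Tutorials", "title": "Tutorials"}],
    "subpages": [
        {
            "slug": "Web/CSS/display",
            "sections": [
                {"id": "Specifications", "title": "Specifications"},
                {"id": "Browser_compatibility", "title": "Browser compatibility"},
            ],
            "subpages": [],
        },
        {"slug": "Web/CSS/Tools", "sections": [], "subpages": []},
    ],
}

print(pages_to_scrape(sample))
```

Accepting either ID is deliberately lenient, since some pages are said to use the wrong ID for a section; a stricter filter would test for "Browser_compatibility" alone.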

Florian Scholz (Open Web Docs)

Reporter

Comment 5

•

10 years ago

An even better approach might be to use the search API to look for the CompatibilityTable macro, which is used to generate the current compat tables.

https://developer.mozilla.org/en-US/search?locale=en-US&kumascript_macros=CompatibilityTable
https://developer.mozilla.org/en-US/search.json?locale=en-US&kumascript_macros=CompatibilityTable

This also gives you an idea of how many features should be scraped (ca. 2797 at this point).
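For reference, the JSON search URL above can be built for any macro name as in this small sketch. It only constructs the URL from the two query parameters quoted in this comment; the function name is hypothetical, and nothing is assumed here about the shape of the response.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the comment above.
SEARCH_JSON = "https://developer.mozilla.org/en-US/search.json"


def macro_search_url(macro: str, locale: str = "en-US") -> str:
    """Build the search URL that filters pages by a KumaScript macro."""
    query = urlencode({"locale": locale, "kumascript_macros": macro})
    return f"{SEARCH_JSON}?{query}"


print(macro_search_url("CompatibilityTable"))
```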

John Whitlock [:jwhitlock]

Assignee

Comment 6

•

10 years ago

I've submitted the PR for the code fixing this bug [1]. It is a lot of code, and it may take a few weeks for the PR to be accepted and the bug closed. The code is running on https://browsercompat.herokuapp.com, features have been imported, and the initial scrape ran successfully.

I've created features for all the pages under Web [2], Navigation_timing [3], Server-sent_events [4], WebAPI [5], and WebSockets [6], using the $children API [7]. Additional trees can be added (open a bug), and the tool (tools/mirror_mdn_features.py) can be re-run to pick up MDN changes. I trust the $children API a little more than the search engine.

After the mirroring and scraping, there are 6042 features [8]. I also synced the specifications [9] and maturities [10], and added some versions [11] to reflect changes on MDN since February. I've tried not to add invalid versions, so that the "unknown_versions" issue will highlight actual issues.

The importer has found 6239 issues [12], including critical errors that may mask additional issues. The importer search form, either at the top of the imported pages list [13] or at its own URL [14], can be used to find a particular MDN page of interest.

[1] https://github.com/mozilla/web-platform-compat/pull/31
[2] https://developer.mozilla.org/en-US/Web
[3] https://developer.mozilla.org/en-US/docs/Navigation_timing
[4] https://developer.mozilla.org/en-US/docs/Server-sent_events
[5] https://developer.mozilla.org/en-US/docs/WebAPI
[6] https://developer.mozilla.org/en-US/docs/WebSockets
[7] https://developer.mozilla.org/en-US/docs/WebSockets$children
[8] https://browsercompat.herokuapp.com/browse/features
[9] https://browsercompat.herokuapp.com/browse/specifications
[10] https://browsercompat.herokuapp.com/browse/maturities
[11] https://browsercompat.herokuapp.com/browse/browsers
[12] https://browsercompat.herokuapp.com/importer/issues
[13] https://browsercompat.herokuapp.com/importer/
[14] https://browsercompat.herokuapp.com/importer/search

Florian Scholz (Open Web Docs)

Reporter

Comment 7

•

10 years ago

This is really great, John! Only today for the first time, I spotted something that was not scraped: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Proxy/handler$children 14 pages under "handler". Is this due to the deep nesting?

John Whitlock [:jwhitlock]

Assignee

Comment 8

•

10 years ago

These pages don't appear when fetching the Web $children [1]. The MDN API supports a maximum depth of 5, so anything more deeply nested is not imported. I've added bug 1182542 to fix it.

[1] https://developer.mozilla.org/en-US/docs/Web$children (fixed from comment 6)
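One possible way around the depth limit, sketched under assumptions and not necessarily what bug 1182542 ended up doing: whenever a page sits at the deepest level returned, request $children again starting from that page. The fetch function is injected so no real HTTP call is made; it is assumed to return a nested object with "slug" and "subpages" keys, and depth is counted in levels below the requested page.

```python
from typing import Callable, Dict, List

MAX_DEPTH = 5  # depth limit quoted in the comment above


def crawl(root_slug: str, fetch_children: Callable[[str], Dict],
          max_depth: int = MAX_DEPTH) -> List[str]:
    """Collect every slug under root_slug, re-fetching at the depth limit.

    max_depth must be at least 1.
    """
    slugs: List[str] = []
    pending = [root_slug]
    while pending:
        tree = fetch_children(pending.pop())
        stack = [(tree, 0)]
        while stack:
            node, depth = stack.pop()
            if depth == max_depth:
                # Subpages may have been cut off here: fetch this page again
                # as a new root, which also records its slug.
                pending.append(node["slug"])
                continue
            slugs.append(node["slug"])
            for child in node.get("subpages", []):
                stack.append((child, depth + 1))
    return sorted(slugs)


# Hypothetical page hierarchy and a fake fetcher that truncates at 2 levels.
FULL = {"a": ["a/b"], "a/b": ["a/b/c"], "a/b/c": ["a/b/c/d"], "a/b/c/d": []}


def fake_fetch(slug: str, limit: int = 2, _depth: int = 0) -> Dict:
    kids = FULL[slug] if _depth < limit else []
    return {"slug": slug,
            "subpages": [fake_fetch(k, limit, _depth + 1) for k in kids]}


print(crawl("a", fake_fetch, max_depth=2))
```

The cost is one extra request for every page at the limit, including leaf pages that have no children at all.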

John Whitlock [:jwhitlock]

Assignee

Updated

•

10 years ago

Assignee: nobody → jwhitlock

Status: NEW → ASSIGNED

MDN Team (:mdn-dev)

Comment 9

•

10 years ago

Commits pushed to master at https://github.com/mozilla/web-platform-compat

https://github.com/mozilla/web-platform-compat/commit/0ac90e3035184ba2bca31b57d45515da59036277
bug 1132658 - Bump requirements
Bump requirements as suggested by requires.io

https://github.com/mozilla/web-platform-compat/commit/2e16684db36d981d9491731102d9cc5bcbd185fe
bug 1132658 - Remove redundant resource name
When displaying the diff of a collection, adding the resource name to the output is redundant, since it appears in the JSON API output.

https://github.com/mozilla/web-platform-compat/commit/50f754f6542ada032f01adabae6f20b436c162d9
bug 1132658 - Improve diffs for translated strings
Order the locales in the translated string (en, then in alpha order), so that the diffs highlight the changes rather than random dict sorting.

https://github.com/mozilla/web-platform-compat/commit/7a568b49a410597c769e9ffd5eb5668be978bee1
bug 1132658 - Data ID for section uses subpath
It is likely that the subpath will be set but not the number.

https://github.com/mozilla/web-platform-compat/commit/cd89c466751fcfbf38ba2a2eff6a479dbc620068
bug 1132658 - Refactor common tool code
Combine common tool code into tools/common.py, to reduce duplication and simplify command line parsing.

https://github.com/mozilla/web-platform-compat/commit/0dd9fed3df42f61675c406a3a7c48b54f44a2667
bug 1132658 - Add Collection.load_collection
Collection.load_collection will copy the resources from another collection. The resources can then be modified, and a CollectionChangeset used to update the original Collection.

https://github.com/mozilla/web-platform-compat/commit/9522c042f25e309e0aa9c2f708e3f42359d37c8c
bug 1132658 - Allow 'false' as a canonical name
Previously, "false" was interpreted as the JSON value for False.

https://github.com/mozilla/web-platform-compat/commit/358aae284ad66e7b4ba33397ad9c1c0ac0aeeda2
bug 1132658 - Add tools/mirror_mdn_features.py
New tool for gathering MDN pages and creating or updating the branch features related to them. Includes converting to canonical names, better handling of long slugs, and adding URLs to pages w/o compatibility data.

https://github.com/mozilla/web-platform-compat/commit/6b5613aea0b918642d8327a2df9361bc41abf4f2
bug 1132658 - Improve importer search by URL
When searching the importer by MDN URL, drop the querystring and fragment automatically, rather than returning an error.

https://github.com/mozilla/web-platform-compat/commit/8c33f5375ac14fd03599cc5e1c9fdde10d9c8e97
bug 1132658 - Improve admin for mdn app
Use readonly_fields to prevent loading the whole database in the admin.

https://github.com/mozilla/web-platform-compat/commit/0d5827f3dfee97394a96817e7375d3e191ee2a94
bug 1132658 - Add "No Data" status for pages
Add a page status for "page imported w/o compat data"

https://github.com/mozilla/web-platform-compat/commit/ba0911eb8d19f51fbec3dfe9646bfb9646a3e2bc
bug 1132658 - Fix canonical feature names
In the scrape-constructed view_feature, use name="canonical", rather than name={"zxx": "canonical"}.

https://github.com/mozilla/web-platform-compat/commit/fb516abcd7c479b5e542035e081ab54aed6a9608
bug 1132658 - Add doc_parse_error issue
Some pages don't have the expected structure, such as <h2> headings, resulting in the doc rule not matching. Turn this into a doc_parse_error issue, instead of an Exception, for further processing.
Example: https://developer.mozilla.org/en-US/docs/Navigation_timing

https://github.com/mozilla/web-platform-compat/commit/34d7b3ebd2471039044cbace6dabb11238358f5c
bug 1132658 - Turn Spec* assertions into issues
Add issues specname_blank_key, spec2_wrong_kumascript, and spec2_arg_count to replace assertions resulting in an exception issue. Also, refactor visitor.unknown_kumascript_issue into visitor.kumascript_issue, so it can be used in more KumaScript issue reporting.

https://github.com/mozilla/web-platform-compat/commit/5e1e0b2c6fa05a4bad6c41471bd0a80cd2eb87c4
bug 1132658 - Fix typos in issue templates

https://github.com/mozilla/web-platform-compat/commit/c0b79272d868ed00a9d2a6eba97217b85b57167b
bug 1132658 - Download MDN pages w/o cache

https://github.com/mozilla/web-platform-compat/commit/f21935950a8e07d621f09c704638d469a5602387
bug 1132658 - Small importer UI fixes
- Issues name is a link in importer/issues
- "Download MDN page" rather than "Download MDN pages"

https://github.com/mozilla/web-platform-compat/commit/e8bfa3b79394bef4da6ff6a5255e2a9689a1c742
bug 1132658 - Fix template for failed_download

https://github.com/mozilla/web-platform-compat/commit/ea7d61ccc7da77bbf3df885a377340bbc4e7d7e1
bug 1132658 - Handle text with partial quotes
https://developer.mozilla.org/en-US/docs/Web/CSS/@viewport/max-zoom has the text '"max-zoom" descriptor, which doesn't end in quotes.

https://github.com/mozilla/web-platform-compat/commit/bf6c3bf156bd8706cc9220c82efd9a07cf105eb1
bug 1132658 - Handle redirect on $json

https://github.com/mozilla/web-platform-compat/commit/c6b82a57c8f3c4292a6a658319cfe1010c7b180e
bug 1132658 - Improve handling of MDN locales
https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Using_CSS_gradients has a Bengali (bn-BD) translation that broke processing, prompting several changes:
- MDN paths increased to 1024 characters
- Gather localized titles of MDN pages from metadata
- Migrate featurepage.status and issue.slug choices from previous work
- When scraping, add localized names of the page to the target feature if it is not set as a canonical name
- When a task encounters an unexpected status, assert with the human-friendly name
- Add STATUS_NO_DATA as an 'already fetched' state
- Display IRI (with unicode) instead of URI (with percent-encoded unicode) on the sample feature page

https://github.com/mozilla/web-platform-compat/commit/a491b967f703c2bc7376e9c8bfe9b16e70c10d4d
fix bug 1132658 - Convert download failure to issue
When mdn.tasks.fetch_translation gets a non-200 response, report as a failed_download issue and continue. Previously, an exception was also raised, which halted tools/import_mdn.py.

https://github.com/mozilla/web-platform-compat/commit/42db082a4217b06fa7b878bfc681f369875828d2
Merge pull request #31 from jwhitlock/1132658_more_mdn
Fix bug 1132658 - Scrape all MDN pages
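Two of the changes described in the commit messages above are small enough to illustrate: dropping the querystring and fragment from a searched MDN URL, and ordering locales with "en" first and the rest alphabetically. The sketch below is not taken from the repository; the function names and sample values are hypothetical, and it only shows one plausible way to get the described behavior with the standard library.

```python
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url: str) -> str:
    """Return the URL with its querystring and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def ordered_locales(translations: Dict[str, str]) -> List[str]:
    """Return locale codes with 'en' first, then the rest alphabetically."""
    # False sorts before True, so the tuple key puts "en" ahead of the rest.
    return sorted(translations, key=lambda locale: (locale != "en", locale))


print(strip_query_and_fragment(
    "https://developer.mozilla.org/en-US/docs/Web/CSS/display?raw#Syntax"))
print(ordered_locales({"fr": "Affichage", "en": "Display", "de": "Anzeige"}))
```

A stable locale order means two serializations of the same translated string differ only where the text itself changed, which is the point of the diff improvement.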

MDN Team (:mdn-dev)

Updated

•

10 years ago

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: developer.mozilla.org → developer.mozilla.org Graveyard

Categories

(developer.mozilla.org Graveyard :: General, defect)

Tracking

(Not tracked)

People

(Reporter: fs, Assigned: jwhitlock)

References

Details

(Whiteboard: [specification][type:bug])

Crash Data

Security

(public)
