[Compat Data][Importer] Improve MDN importer, Round 3

RESOLVED WONTFIX

Status

defect
--
major
RESOLVED WONTFIX
4 years ago
2 years ago

People

(Reporter: jwhitlock, Assigned: jwhitlock)

Tracking

({meta})

Details

(Whiteboard: [bc:infra][bc:milestone=motorbike])

What problem would this feature solve?
======================================
MDN pages with compatibility data are not standardized.  Structural differences, such as different section orders or paragraphs of explanatory text, cause the parser to become lost.  This results in general page structure errors and no parsed content.  A more flexible parser would handle structural differences and still parse the data.

Who has this problem?
=====================
Staff contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Writers are filing bugs reporting that the parser can't handle pages with alternate structures.

How are the users identified above solving this problem now?
============================================================
Writers are submitting parsing bugs.  They may also be standardizing the content of pages, but that is harder to track.

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Writers and contributors could standardize the content of pages, by re-arranging sections, moving narrative paragraphs to different sections, etc.

Or, the parser could be refactored to be more flexible.

Is there anything else we should know?
======================================
This is a tracking bug for the next round of MDN importer issues, to be completed in Q3 2015.  This includes a refactor of the parsing code, to:

1. Make the parser more flexible, creating targeted issues for structural problems and continuing to parse the compatibility data,
2. Improve the structure of the parser code, as an aid for code reviews and contributions, and
3. Extract an HTML parsing subset that may be useful for API validation when HTML is allowed in a field.

Other parsing improvements will be delayed until after the refactor.
Blocks: 996570
Depends on: 1171957
Depends on: 1175177
Depends on: 1174808
Depends on: 1180573
Depends on: 1181158
Depends on: 1181161
Depends on: 1182542
Depends on: 1183593
Depends on: 1187927
Depends on: 1188049
Depends on: 1188503
Depends on: 1188546
Depends on: 1183599
Depends on: 1134584
Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED
Component: General → BrowserCompat
Depends on: 1194565
Depends on: 1198746
Depends on: 1198749
Depends on: 1198751
Depends on: 1198753
Depends on: 1198761
Depends on: 1198762
Depends on: 1198767
Depends on: 1198770
Depends on: 1198777
Depends on: 1198781
Depends on: 1198782
Depends on: 1198784
Depends on: 1198788
Depends on: 1198791
Depends on: 1198793
Depends on: 1198799
Depends on: 1198801
Depends on: 1198802
Depends on: 1198806
Depends on: 1198812
Depends on: 1198818
Depends on: 1198822
Depends on: 1198858
Depends on: 1198860
Depends on: 1198862
Depends on: 1198865
Depends on: 1198868
Depends on: 1198870
Depends on: 1198873
Depends on: 1198879
Depends on: 1198881
Depends on: 1198896
Depends on: 1198907
Depends on: 1198910
Depends on: 1198912
Depends on: 1198919
Depends on: 1198977
Depends on: 1198985
Depends on: 1198989
Commits pushed to report_progress_1181140 at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/c8c8782029e00eda2f9fd59ce5bd667c567e2351
bug 1181140 - Whitespace cleanup in templates

Jinja2 documentation appears to prefer whitespace like this:

<p>Hi, {{ username }}!</p>

to the compressed version:

<p>Hi, {{username}}!</p>

https://github.com/mdn/browsercompat/commit/94c02036117182e17fd77f653db371fb50be474a
bug 1181140 - Add issue level to import status

New statuses for FeaturePage:
* STATUS_PARSED_CRITICAL: Worst issue is a critical error
* STATUS_PARSED_ERROR: Worst issue is an error
* STATUS_PARSED_WARNING: Worst issue is a warning

STATUS_PARSED becomes "parsed with no issues". Status is set after
scrape, displayed in list view. Because it depends on mdn/issues.py and
not database data, it has to be updated by reparsing the page.
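
A minimal sketch of how the "worst issue" rule above could map to a status. The numeric severity levels and the function name are assumptions for illustration; only the status names come from the commit message, and this is not the project's actual code.

```python
# Hypothetical numeric severities; a higher number means a worse issue.
WARNING, ERROR, CRITICAL = 1, 2, 3

# Status names from the commit message, keyed by the worst severity found.
STATUS_BY_SEVERITY = {
    CRITICAL: "STATUS_PARSED_CRITICAL",
    ERROR: "STATUS_PARSED_ERROR",
    WARNING: "STATUS_PARSED_WARNING",
}


def status_for_issues(severities):
    """Return the parse status for a page given its issue severities."""
    if not severities:
        # No issues at all: "parsed with no issues".
        return "STATUS_PARSED"
    return STATUS_BY_SEVERITY[max(severities)]
```

Because the status is derived from the issue list rather than stored independently, a page would need to be reparsed for a changed severity table to take effect, which matches the note above.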

https://github.com/mdn/browsercompat/commit/dc189259db6f2ecec22819837faa8ab25f33b215
bug 1181140 - Add helpers for querystring filters

Add new helper template functions "add_filter_to_current_url" and
"drop_filter_from_current_url" to make it easy to add or remove filters
from the querystring. Use for pagination and topic filtering.
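
The idea behind these helpers can be sketched with the standard library. The functions below are hypothetical stand-ins that take the URL explicitly; the real template helpers presumably read the current request from the template context, which is not shown here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def _query_without(parts, name):
    """Return the query pairs of a split URL, minus any pair named `name`."""
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return [(key, value) for key, value in pairs if key != name]


def add_filter_to_url(url, name, value):
    """Set filter `name` to `value`, replacing any existing value."""
    parts = urlsplit(url)
    query = _query_without(parts, name)
    query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def drop_filter_from_url(url, name):
    """Remove filter `name` from the query string, keeping the others."""
    parts = urlsplit(url)
    query = _query_without(parts, name)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For pagination, a "next page" link would call add_filter_to_url with the page parameter, so any topic filter already in the querystring is preserved.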

https://github.com/mdn/browsercompat/commit/7099219958b801da2a60cd67542e61a8937c1e87
bug 1181140 - Add filtering by status

Change "Filter By Topic" section to "Filter" with "By Topic" and "By
Status" button groups.

https://github.com/mdn/browsercompat/commit/309557baa7b65f0e121fbc8422d8093f5936ae54
bug 1181140 - Add import progress bar

The progress bar shows what percentage of pages with data import without
issues, and a breakdown of the issue severity for the other pages. This
makes it easier to see how close we are to the 80% of pages goal, across
all pages or a topic subset.
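
As a rough illustration of the arithmetic behind such a progress bar, the sketch below turns a list of per-page statuses into percentages. The function name and the status labels in the usage note are assumptions, not the importer's actual values.

```python
from collections import Counter


def progress_breakdown(statuses):
    """Return the percentage of pages in each status, rounded to 0.1."""
    counts = Counter(statuses)
    total = sum(counts.values())
    if total == 0:
        # No pages with data: nothing to show.
        return {}
    return {
        status: round(100.0 * count / total, 1)
        for status, count in counts.items()
    }
```

Passing only the statuses of pages in one topic gives the per-topic breakdown; passing all of them gives the overall figure to compare against the 80% goal.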

https://github.com/mdn/browsercompat/commit/4a7c893e753b3b157c85294b5729d00fcff93704
bug 1181140 - Trim topic list

docs/Web/Accessibility - 2/232 pages with 'data', but really just
sections that end in text 'Specifications'.
docs/Web/XPath - 0 pages with data
docs/Web/XSLT - 0 pages with data

https://github.com/mdn/browsercompat/commit/37b00635acca4075decaf9fbd4b4570aaaaf263b
bug 1181140 - Add select params to issue details

When an importer issue includes params, display the issue as a table of
importer pages and select issue parameters.

https://github.com/mdn/browsercompat/commit/4a644baf9ffacea436f0bcf3a4837012f3f565c0
bug 1181140 - Add download of issue CSVs

Replicates functionality of tools/gather_import_issues.py, but much
faster and a little more useful. The CSV of issue counts is unchanged,
but the detailed issue CSV is specific to an issue type, and extracts
the parameters into columns (rather than one column of JSON).
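
A sketch of the "parameters into columns" idea, using only the standard csv module. The input shape (a list of dicts with "page" and "params" keys) is an assumption for illustration and is not taken from the importer's models.

```python
import csv
import io


def issues_to_csv(issues):
    """Write one row per issue, with each issue parameter in its own column."""
    # Collect every parameter name used by any issue of this type.
    param_names = sorted({name for issue in issues for name in issue["params"]})
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["page"] + param_names, lineterminator="\n")
    writer.writeheader()
    for issue in issues:
        row = {"page": issue["page"]}
        row.update(issue["params"])
        # Parameters missing from an issue are written as empty cells.
        writer.writerow(row)
    return out.getvalue()
```

Restricting the detailed CSV to a single issue type keeps the column set small, since issues of one type tend to share the same parameters.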

https://github.com/mdn/browsercompat/commit/81e9e4f5c6dcef4dfc78b8dc426c1af4549d0669
bug 1181140 - Remove gather_import_issues.py

The new CSV views cover the same functionality, but run in seconds
rather than minutes.
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/ad228812ed2c772f8ef775651589d21738857d50
bug 1181140 - Generalize version extraction

Get ready to support CompatChrome, CompatIE, etc., by:
* Extracting mdn.utils.format_version
* Extract CompatBasicKumaScript for simple KumaScript
* Change CompatSupportVisitor to work with CompatKumaScript

https://github.com/mdn/browsercompat/commit/54b867840ee5641826b9bf8f6d790bee548aef03
Merge pull request #41 from mdn/kumascript_compat_1181140

bug 1181140 - Handle {{Compat*}} macros
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/9e5ea4dec8e1b1d5de8f97b853b6145a9d4f7419
bug 1181140 - Whitespace cleanup in templates

Jinja2 documentation appears to prefer whitespace like this:

<p>Hi, {{ username }}!</p>

to the compressed version:

<p>Hi, {{username}}!</p>

https://github.com/mdn/browsercompat/commit/27c906e933622ea66ffc1201c3e443cd09565178
bug 1181140 - Add issue level to import status

New statuses for FeaturePage:
* STATUS_PARSED_CRITICAL: Worst issue is a critical error
* STATUS_PARSED_ERROR: Worst issue is an error
* STATUS_PARSED_WARNING: Worst issue is a warning

STATUS_PARSED becomes "parsed with no issues". Status is set after
scrape, displayed in list view. Because it depends on mdn/issues.py and
not database data, it has to be updated by reparsing the page.

https://github.com/mdn/browsercompat/commit/eb33c1ea6e6359a09bd524b7141d9c3f4f5f32ce
bug 1181140 - Add helpers for querystring filters

Add new helper template functions "add_filter_to_current_url" and
"drop_filter_from_current_url" to make it easy to add or remove filters
from the querystring. Use for pagination and topic filtering.

https://github.com/mdn/browsercompat/commit/2d01467fd8c98b7a674ca14a0cfbe9b7f4c4f9cf
bug 1181140 - Add filtering by status

Change "Filter By Topic" section to "Filter" with "By Topic" and "By
Status" button groups.

https://github.com/mdn/browsercompat/commit/74d918b48dcd803472147467fb01b50dbcbde8db
bug 1181140 - Add import progress bar

The progress bar shows what percentage of pages with data import without
issues, and a breakdown of the issue severity for the other pages. This
makes it easier to see how close we are to the 80% of pages goal, across
all pages or a topic subset.

https://github.com/mdn/browsercompat/commit/694cac76fa5e9e4619d6eade77d31ecad1acf7c7
bug 1181140 - Trim topic list

docs/Web/Accessibility - 2/232 pages with 'data', but really just
sections that end in text 'Specifications'.
docs/Web/XPath - 0 pages with data
docs/Web/XSLT - 0 pages with data

https://github.com/mdn/browsercompat/commit/8e756f6d344ab56618a41c7cef7c0746414f6023
bug 1181140 - Add select params to issue details

When an importer issue includes params, display the issue as a table of
importer pages and select issue parameters.

https://github.com/mdn/browsercompat/commit/624d1ce58a312c1738c29c2ec8e587e77387de90
bug 1181140 - Add download of issue CSVs

Replicates functionality of tools/gather_import_issues.py, but much
faster and a little more useful. The CSV of issue counts is unchanged,
but the detailed issue CSV is specific to an issue type, and extracts
the parameters into columns (rather than one column of JSON).

https://github.com/mdn/browsercompat/commit/4be022087b9548c03f74faac3a2e7bd923154479
bug 1181140 - Remove gather_import_issues.py

The new CSV views cover the same functionality, but run in seconds
rather than minutes.

https://github.com/mdn/browsercompat/commit/c904c90fdba582dcb0a191c130cfda60d184df0f
Merge pull request #40 from mdn/report_progress_1181140

Bug 1181140 - Report importer progress
Depends on: 1208238
Depends on: 1208681
Depends on: 1208686
Commit pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/d0d8bae03281896ccd94e1f4a2eedb323d67a0db
bug 1181140 - Give up importing huge pages

Some pages have the lethal combination of lots of compat data, many
translations, and lots of data issues.  Because we don't have the async
infrastructure in place, processing these pages takes longer than the
requests timeout. This change gives up after 3 tries, but continues to
the next page rather than failing the import.
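
The give-up-and-continue behavior can be sketched as below. The function names and the use of TimeoutError are assumptions; the real import tool talks to the API over HTTP and may detect timeouts differently.

```python
def import_pages(urls, fetch, max_tries=3):
    """Try each page up to max_tries times, skipping pages that keep timing out."""
    imported, skipped = [], []
    for url in urls:
        for _attempt in range(max_tries):
            try:
                imported.append((url, fetch(url)))
                break
            except TimeoutError:
                # Try the same page again, up to max_tries attempts.
                continue
        else:
            # Every attempt timed out: give up on this page, not the whole import.
            skipped.append(url)
    return imported, skipped
```

Returning the skipped pages separately lets the caller report them at the end instead of aborting on the first huge page.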
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/0f151d53d5e60f26563e59bf737c5e68c02f9093
bug 1181140 - Hide reparse button in production

The reparse button is useful when developing locally, to test changes in
the parsing code.  In production, the right thing is almost always to
reset the page to download the latest from MDN. This change shows the
button if DEBUG is True, unless overridden by the MDN_SHOW_REPARSE environment variable.
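
One way such a setting could be resolved is sketched here. The function name and the accepted truthy strings are assumptions; only DEBUG and MDN_SHOW_REPARSE come from the commit message.

```python
import os


def show_reparse(debug, environ=None):
    """Decide whether to show the reparse button."""
    if environ is None:
        environ = os.environ
    value = environ.get("MDN_SHOW_REPARSE")
    if value is None:
        # No override in the environment: follow DEBUG.
        return debug
    # Assumed parsing of the override; the real setting may differ.
    return value.strip().lower() in ("1", "true", "yes", "on")
```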

https://github.com/mdn/browsercompat/commit/a379c5004772eef933caa4aedb282f634c9930f1
Merge pull request #71 from mdn/1181140_hide_reparse

bug 1181140 - Hide reparse button in production
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/adedb7b04109fc711aa91bed434530279c2fa6e4
bug 1181140 - Fix new Section IDs

mdn.utils.is_new_id() requires that new IDs start with an underscore.
Fix new ID generation for Section IDs to ensure they always start with
underscore.
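
A sketch of the convention described above. This is not the body of mdn.utils.is_new_id(), which is not shown in this bug; new_section_id is a hypothetical generator used only to show the leading-underscore rule.

```python
def is_new_id(item_id):
    """Assumed check: new (not yet saved) IDs are strings starting with '_'."""
    return isinstance(item_id, str) and item_id.startswith("_")


def new_section_id(label):
    """Build a temporary Section ID that always starts with one underscore."""
    return "_" + label.lstrip("_")
```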

https://github.com/mdn/browsercompat/commit/491a7482e35d478f8a0832a6ea3d291f63546dba
bug 1181140 - Refactor mdn tests

TestScrapeFeaturePage
* Make good_content a test fixture
* Use "content" instead of "page" for test page content

TestFeaturePageListView
* Expand add_page into a general Feature plus FeaturePage creator
* Refactor filter tests into assert_filter_only_feature

https://github.com/mdn/browsercompat/commit/b280a2090c339528b318bf6a3bf335ce6d33ce3e
bug 1181140 - Drop [meta][scrape][raw] from data

The [meta][scrape][raw] element was for debugging importing, which is
pretty solid now, and was retained during the transition to
database-backed issues, which is complete in production.

https://github.com/mdn/browsercompat/commit/d75897398ec5f9d3fea7e1a65d6411e4ac8e4d54
bug 1181140 - Handle URLs with unsafe characters

FeaturePage.url uses the URL-encoded URLs returned from Kuma's metadata
API. This allows both the URL-encoded and non-encoded forms to be used
in the search box, such as:

https://developer.mozilla.org/en-US/docs/Web/CSS/%3A%3Abefore
https://developer.mozilla.org/en-US/docs/Web/CSS/::before

Previously, only the first form would have loaded the correct page.
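
One way to accept both forms is to normalize the path before lookup, sketched here with the standard library. The function name is hypothetical, and the actual fix may compare URLs differently.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def canonical_page_url(url):
    """Return the URL with its path percent-encoded, whichever form was given."""
    parts = urlsplit(url)
    # Decode first so an already-encoded path is not encoded twice.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(parts._replace(path=path))
```

With this, both example URLs above normalize to the encoded form, so a single lookup against FeaturePage.url would be enough.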

https://github.com/mdn/browsercompat/commit/662e294b51d52b960505ad644a5a39280d85d613
bug 1181140 - Create GetFormView base class

GetFormView is derived from django.views.generic.FormView, but allows
quick redirects using a GET and query string parameters.

https://github.com/mdn/browsercompat/commit/f82367260b902623afcb76455bfd414d66142ec9
bug 1181140 - Add base template for mdn app

The base is a very thin override of webplatform/base.html, and will soon
include CSS specific to the importer.

https://github.com/mdn/browsercompat/commit/6ec123b8e06359728441fd6b2392a50accb74940
bug 1181140 - Search by Feature slug

The search can work with GET and query string parameters, so when a page
is converted to use API backed tables like this:

{{EmbedCompatTable("feature-slug")}}

the MDN page can link back to the importer page with this URL:

https://browsercompat.herokuapp.com/importer/slug_search?slug=feature-slug
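
A small sketch of building that link so the slug is safely encoded. The helper name is hypothetical; the base URL is the one given above.

```python
from urllib.parse import urlencode

SLUG_SEARCH_URL = "https://browsercompat.herokuapp.com/importer/slug_search"


def importer_search_url(slug, base=SLUG_SEARCH_URL):
    """Return the importer search URL for a feature slug."""
    return base + "?" + urlencode({"slug": slug})
```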
Commit pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/da37d9ad05c0a50b1576b815d44fc81a4760e556
bug 1181140 - Use Reset after Commit if no Reparse

If the Reparse form is unavailable, use a Reset after a Commit in order
to load the new API IDs.
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/6f6ef5fe414ddbfdd3ff05570ff7d30882078177
bug 1181140 - Handle unicode feature names

Fix a 500 error when importing a feature with unicode characters, such
as this page:

cubic-bezier() w/ ordinate ∉[0,1]
https://developer.mozilla.org/en-US/docs/Web/CSS/timing-function
https://browsercompat.herokuapp.com/importer/840

https://github.com/mdn/browsercompat/commit/97fbd4a242fb6e84f467988cfed2dbbce924fad1
bug 1181140 - Focus load_compat branch tests

Only test full response on the new/existing load_compat tests, and just
focus on the important part of response for the branch tests.

https://github.com/mdn/browsercompat/commit/21f28c49953900795cd6c033221ba5a8f57dc31d
Merge pull request #79 from mdn/unicode_names_1181140

bug 1181140 - Handle unicode feature names

r=groovecoder
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/107232292c742c2de2f80823f2362be38413a809
bug 1181140 - Omit specification.sections

When constructing the JSON for view_features in the importer, don't
include specification.sections. The relation section.specification is
enough to construct the relationship, and the returned view_features JSON
doesn't set it.

https://github.com/mdn/browsercompat/commit/6986018b2d7c1d5781349e97cae46ae01b18906d
bug 1181140 - Handle <tfoot> elements

Used on:
https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent/keyCode
Depends on: 1180022
Keywords: in-triage → meta
Severity: enhancement → major
OS: Other → All
Whiteboard: [specification][type:feature] → [bc:infra]
Whiteboard: [bc:infra] → [bc:infra][bc:milestone=motorbike]
Status: ASSIGNED → NEW
The BrowserCompat project is canceled.  See https://github.com/mdn/browsercompat for current effort. Bulk status change includes the random word TEMPOTHRONE.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX