Closed Bug 1181140 Opened 10 years ago Closed 8 years ago

[Compat Data][Importer] Improve MDN importer, Round 3

Categories

(developer.mozilla.org Graveyard :: BrowserCompat, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jwhitlock, Assigned: jwhitlock)

References

Details

(Keywords: meta, Whiteboard: [bc:infra][bc:milestone=motorbike])

What problem would this feature solve?
======================================
MDN pages with compatibility data are not standardized. Structural differences, such as different section orders or paragraphs of explanatory text, cause the parser to become lost. This results in general page structure errors and no parsed content. A more flexible parser would handle structural differences and still parse the data.

Who has this problem?
=====================
Staff contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Writers are submitting bugs that the parser can't handle pages with alternate structures.

How are the users identified above solving this problem now?
============================================================
Writers are submitting parsing bugs. They may also be standardizing the content of pages, but that is harder to track.

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Writers and contributors could standardize the content of pages, by re-arranging sections, moving narrative paragraphs to different sections, etc. Or, the parser could be refactored to be more flexible.

Is there anything else we should know?
======================================
This is a tracking bug for the next round of MDN importer issues, to be completed in Q3 2015. This includes a refactor of the parsing code, to:

1. Make the parser more flexible, creating targeted issues for structural problems and continuing to parse the compatibility data,
2. Improve the structure of the parser code, as an aid for code reviews and contributions, and
3. Extract an HTML parsing subset that may be useful for API validation when HTML is allowed in a field.

Other parsing improvements will be delayed until after the refactor.
Blocks: 996570
Depends on: 1171957
Depends on: 1175177
Depends on: 1174808
Depends on: 1180573
Depends on: 1181158
Depends on: 1181161
Depends on: 1182542
Depends on: 1183593
Depends on: 1187927
Depends on: 1188049
Depends on: 1188503
Depends on: 1188546
Depends on: 1183599
Depends on: 1134584
Assignee: nobody → jwhitlock
Status: NEW → ASSIGNED
Component: General → BrowserCompat
Depends on: 1194565
Depends on: 1198746
Depends on: 1198749
Depends on: 1198751
Depends on: 1198753
Depends on: 1198761
Depends on: 1198762
Depends on: 1198767
Depends on: 1198770
Depends on: 1198777
Depends on: 1198781
Depends on: 1198782
Depends on: 1198784
Depends on: 1198788
Depends on: 1198791
Depends on: 1198793
Depends on: 1198799
Depends on: 1198801
Depends on: 1198802
Depends on: 1198806
Depends on: 1198812
Depends on: 1198818
Depends on: 1198822
Depends on: 1198858
Depends on: 1198860
Depends on: 1198862
Depends on: 1198865
Depends on: 1198868
Depends on: 1198870
Depends on: 1198873
Depends on: 1198879
Depends on: 1198881
Depends on: 1198896
Depends on: 1198907
Depends on: 1198910
Depends on: 1198912
Depends on: 1198919
Depends on: 1198977
Depends on: 1198985
Depends on: 1198989
Commits pushed to report_progress_1181140 at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/c8c8782029e00eda2f9fd59ce5bd667c567e2351
bug 1181140 - Whitespace cleanup in templates

Jinja2 documentation appears to prefer whitespace like this:
<p>Hi, {{ username }}!</p>
to the compressed version:
<p>Hi, {{username}}!</p>

https://github.com/mdn/browsercompat/commit/94c02036117182e17fd77f653db371fb50be474a
bug 1181140 - Add issue level to import status

New statuses for FeaturePage:
* STATUS_PARSED_CRITICAL: Worst issue is a critical error
* STATUS_PARSED_ERROR: Worst issue is an error
* STATUS_PARSED_WARNING: Worst issue is a warning

STATUS_PARSED becomes "parsed with no issues". Status is set after scrape, displayed in list view. Because it depends on mdn/issues.py and not database data, it has to be updated by reparsing the page.

https://github.com/mdn/browsercompat/commit/dc189259db6f2ecec22819837faa8ab25f33b215
bug 1181140 - Add helpers for querystring filters

Add new helper template functions "add_filter_to_current_url" and "drop_filter_from_current_url" to make it easy to add or remove filters from the querystring. Use for pagination and topic filtering.

https://github.com/mdn/browsercompat/commit/7099219958b801da2a60cd67542e61a8937c1e87
bug 1181140 - Add filtering by status

Change "Filter By Topic" section to "Filter" with "By Topic" and "By Status" button groups.

https://github.com/mdn/browsercompat/commit/309557baa7b65f0e121fbc8422d8093f5936ae54
bug 1181140 - Add import progress bar

The progress bar shows what percentage of pages with data import without issues, and a breakdown of the issue severity for the other pages. This makes it easier to see how close we are to the 80% of pages goal, across all pages or a topic subset.

https://github.com/mdn/browsercompat/commit/4a7c893e753b3b157c85294b5729d00fcff93704
bug 1181140 - Trim topic list

docs/Web/Accessibility - 2/232 pages with 'data', but really just sections that end in text 'Specifications'.
docs/Web/XPath - 0 pages with data
docs/Web/XSLT - 0 pages with data

https://github.com/mdn/browsercompat/commit/37b00635acca4075decaf9fbd4b4570aaaaf263b
bug 1181140 - Add select params to issue details

When an importer issue includes params, display the issue as a table of importer pages and select issue parameters.

https://github.com/mdn/browsercompat/commit/4a644baf9ffacea436f0bcf3a4837012f3f565c0
bug 1181140 - Add download of issue CSVs

Replicates functionality of tools/gather_import_issues.py, but much faster and a little more useful. The CSV of issue counts is unchanged, but the detailed issue CSV is specific to an issue type, and extracts the parameters into columns (rather than one column of JSON).

https://github.com/mdn/browsercompat/commit/81e9e4f5c6dcef4dfc78b8dc426c1af4549d0669
bug 1181140 - Remove gather_import_issues.py

The new CSV views cover the same functionality, but run in seconds rather than minutes.
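The "Add issue level to import status" commit above derives a page status from the worst issue found while parsing. A minimal sketch of that idea follows. The numeric severity ranks, the string values of the status constants, and the function name status_for_issues are assumptions made for illustration; the real definitions live in the browsercompat repository (the commit mentions mdn/issues.py) and may differ.

```python
# Hypothetical severity ranks; higher means worse.
WARNING, ERROR, CRITICAL = 1, 2, 3

# Status names follow the commit message; the string values are assumptions.
STATUS_PARSED = "parsed"
STATUS_PARSED_WARNING = "parsed_warning"
STATUS_PARSED_ERROR = "parsed_error"
STATUS_PARSED_CRITICAL = "parsed_critical"

STATUS_BY_SEVERITY = {
    WARNING: STATUS_PARSED_WARNING,
    ERROR: STATUS_PARSED_ERROR,
    CRITICAL: STATUS_PARSED_CRITICAL,
}


def status_for_issues(severities):
    """Return the page status implied by the worst issue severity.

    An empty iterable means the page parsed with no issues.
    """
    worst = max(severities, default=None)
    if worst is None:
        return STATUS_PARSED
    return STATUS_BY_SEVERITY[worst]
```

Because the status is computed from the issue list rather than stored independently, it only changes when the page is parsed again, which matches the commit's note that the status has to be updated by reparsing.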
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/ad228812ed2c772f8ef775651589d21738857d50
bug 1181140 - Generalize version extraction

Get ready to support CompatChrome, CompatIE, etc., by:
* Extracting mdn.utils.format_version
* Extract CompatBasicKumaScript for simple KumaScript
* Change CompatSupportVisitor to work with CompatKumaScript

https://github.com/mdn/browsercompat/commit/54b867840ee5641826b9bf8f6d790bee548aef03
Merge pull request #41 from mdn/kumascript_compat_1181140

bug 1181140 - Handle {{Compat*}} macros
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/9e5ea4dec8e1b1d5de8f97b853b6145a9d4f7419
bug 1181140 - Whitespace cleanup in templates

Jinja2 documentation appears to prefer whitespace like this:
<p>Hi, {{ username }}!</p>
to the compressed version:
<p>Hi, {{username}}!</p>

https://github.com/mdn/browsercompat/commit/27c906e933622ea66ffc1201c3e443cd09565178
bug 1181140 - Add issue level to import status

New statuses for FeaturePage:
* STATUS_PARSED_CRITICAL: Worst issue is a critical error
* STATUS_PARSED_ERROR: Worst issue is an error
* STATUS_PARSED_WARNING: Worst issue is a warning

STATUS_PARSED becomes "parsed with no issues". Status is set after scrape, displayed in list view. Because it depends on mdn/issues.py and not database data, it has to be updated by reparsing the page.

https://github.com/mdn/browsercompat/commit/eb33c1ea6e6359a09bd524b7141d9c3f4f5f32ce
bug 1181140 - Add helpers for querystring filters

Add new helper template functions "add_filter_to_current_url" and "drop_filter_from_current_url" to make it easy to add or remove filters from the querystring. Use for pagination and topic filtering.

https://github.com/mdn/browsercompat/commit/2d01467fd8c98b7a674ca14a0cfbe9b7f4c4f9cf
bug 1181140 - Add filtering by status

Change "Filter By Topic" section to "Filter" with "By Topic" and "By Status" button groups.

https://github.com/mdn/browsercompat/commit/74d918b48dcd803472147467fb01b50dbcbde8db
bug 1181140 - Add import progress bar

The progress bar shows what percentage of pages with data import without issues, and a breakdown of the issue severity for the other pages. This makes it easier to see how close we are to the 80% of pages goal, across all pages or a topic subset.

https://github.com/mdn/browsercompat/commit/694cac76fa5e9e4619d6eade77d31ecad1acf7c7
bug 1181140 - Trim topic list

docs/Web/Accessibility - 2/232 pages with 'data', but really just sections that end in text 'Specifications'.
docs/Web/XPath - 0 pages with data
docs/Web/XSLT - 0 pages with data

https://github.com/mdn/browsercompat/commit/8e756f6d344ab56618a41c7cef7c0746414f6023
bug 1181140 - Add select params to issue details

When an importer issue includes params, display the issue as a table of importer pages and select issue parameters.

https://github.com/mdn/browsercompat/commit/624d1ce58a312c1738c29c2ec8e587e77387de90
bug 1181140 - Add download of issue CSVs

Replicates functionality of tools/gather_import_issues.py, but much faster and a little more useful. The CSV of issue counts is unchanged, but the detailed issue CSV is specific to an issue type, and extracts the parameters into columns (rather than one column of JSON).

https://github.com/mdn/browsercompat/commit/4be022087b9548c03f74faac3a2e7bd923154479
bug 1181140 - Remove gather_import_issues.py

The new CSV views cover the same functionality, but run in seconds rather than minutes.

https://github.com/mdn/browsercompat/commit/c904c90fdba582dcb0a191c130cfda60d184df0f
Merge pull request #40 from mdn/report_progress_1181140

Bug 1181140 - Report importer progress
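The querystring helpers mentioned above ("add_filter_to_current_url" and "drop_filter_from_current_url") are template functions that work on the current request. The sketch below shows the same idea on a plain URL string using only the Python standard library. The function names add_filter_to_url and drop_filter_from_url, and the choice to replace an existing filter of the same name, are assumptions for illustration and not the project's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def _query_without(parts, name):
    """Return the query pairs of a split URL, minus any pair called name."""
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    return [(key, value) for key, value in pairs if key != name]


def add_filter_to_url(url, name, value):
    """Set filter name=value in the querystring, replacing any old value."""
    parts = urlsplit(url)
    query = _query_without(parts, name)
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


def drop_filter_from_url(url, name):
    """Remove filter name from the querystring, keeping other parameters."""
    parts = urlsplit(url)
    query = _query_without(parts, name)
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Keeping the other parameters intact is what lets pagination links preserve the active topic and status filters, and filter links preserve everything except the filter being changed.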
Depends on: 1208238
Depends on: 1208681
Depends on: 1208686
Commit pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/d0d8bae03281896ccd94e1f4a2eedb323d67a0db
bug 1181140 - Give up importing huge pages

Some pages have the lethal combination of lots of compat data, many translations, and lots of data issues. Because we don't have the async infrastructure in place, processing these pages takes longer than the requests timeout. This change gives up after 3 tries, but continues to the next page rather than failing the import.
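The "give up after 3 tries, but continue to the next page" behavior described above can be sketched roughly as follows. The function import_pages, the fetch callable, and the use of Python's built-in TimeoutError as the failure signal are all assumptions for illustration; the actual import tool and the exception it handles are not shown in this bug.

```python
def import_pages(urls, fetch, max_tries=3):
    """Import each URL, skipping pages that time out max_tries times.

    fetch is a hypothetical callable that returns the imported content
    for a URL or raises TimeoutError when processing takes too long.
    Returns a list of (url, content) pairs and a list of skipped URLs.
    """
    imported, skipped = [], []
    for url in urls:
        for _attempt in range(max_tries):
            try:
                imported.append((url, fetch(url)))
                break  # success, move on to the next page
            except TimeoutError:
                continue  # try the same page again
        else:
            # Every attempt timed out: record the page and keep going
            # instead of failing the whole import.
            skipped.append(url)
    return imported, skipped
```

Returning the skipped URLs separately leaves room to retry those pages later, for example once asynchronous processing is available.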
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/0f151d53d5e60f26563e59bf737c5e68c02f9093
bug 1181140 - Hide reparse button in production

The reparse button is useful when developing locally, to test changes in the parsing code. In production, the right thing is almost always to reset the page to download the latest from MDN. This change shows the button if DEBUG is True, overridden by environment MDN_SHOW_REPARSE.

https://github.com/mdn/browsercompat/commit/a379c5004772eef933caa4aedb282f634c9930f1
Merge pull request #71 from mdn/1181140_hide_reparse

bug 1181140 - Hide reparse button in production
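One way to express "default to DEBUG, overridden by environment MDN_SHOW_REPARSE" is sketched below. Only the names DEBUG and MDN_SHOW_REPARSE come from the commit message. The function show_reparse and the set of strings treated as true are assumptions, since the bug does not say how the project parses the variable.

```python
import os


def show_reparse(debug, environ=None):
    """Decide whether to show the reparse button.

    debug stands in for the Django DEBUG setting. If MDN_SHOW_REPARSE is
    set in the environment, it wins; otherwise the button follows debug.
    The accepted true values below are an assumption.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("MDN_SHOW_REPARSE")
    if value is None:
        return bool(debug)
    return value.strip().lower() in ("1", "true", "yes", "on")
```

Passing the environment as a parameter keeps the function easy to test without changing the real process environment.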
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/adedb7b04109fc711aa91bed434530279c2fa6e4
bug 1181140 - Fix new Section IDs

mdn.utils.is_new_id() requires that new IDs start with an underscore. Fix new ID generation for Section IDs to ensure they always start with underscore.

https://github.com/mdn/browsercompat/commit/491a7482e35d478f8a0832a6ea3d291f63546dba
bug 1181140 - Refactor mdn tests

TestScrapeFeaturePage
* Make good_content a test fixture
* Use "content" instead of "page" for test page content

TestFeaturePageListView
* Expand add_page into a general Feature plus FeaturePage creator
* Refactor filter tests into assert_filter_only_feature

https://github.com/mdn/browsercompat/commit/b280a2090c339528b318bf6a3bf335ce6d33ce3e
bug 1181140 - Drop [meta][scrape][raw] from data

The [meta][scrape][raw] element was for debugging importing, which is pretty solid now, and was retained during the transition to database-backed issues, which is complete in production.

https://github.com/mdn/browsercompat/commit/d75897398ec5f9d3fea7e1a65d6411e4ac8e4d54
bug 1181140 - Handle URLs with unsafe characters

FeaturePage.url uses the URL-encoded URLs returned from Kuma's metadata API. This allows both the URL-encoded and non-encoded forms to be used in the search box, such as:
https://developer.mozilla.org/en-US/docs/Web/CSS/%3A%3Abefore
https://developer.mozilla.org/en-US/docs/Web/CSS/::before
Previously, only the first form would have loaded the correct page.

https://github.com/mdn/browsercompat/commit/662e294b51d52b960505ad644a5a39280d85d613
bug 1181140 - Create GetFormView base class

GetFormView is derived from django.views.generic.FormView, but allows quick redirects using a GET and query string parameters.

https://github.com/mdn/browsercompat/commit/f82367260b902623afcb76455bfd414d66142ec9
bug 1181140 - Add base template for mdn app

The base is a very thin override of webplatform/base.html, and will soon include CSS specific to the importer.

https://github.com/mdn/browsercompat/commit/6ec123b8e06359728441fd6b2392a50accb74940
bug 1181140 - Search by Feature slug

The search can work with GET and query string parameters, so when a page is converted to use API backed tables like this:
{{EmbedCompatTable("feature-slug")}}
the MDN page can link back to the importer page with this URL:
https://browsercompat.herokuapp.com/importer/slug_search?slug=feature-slug
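The "Handle URLs with unsafe characters" commit above lets the search box accept both the URL-encoded and the non-encoded form of a page URL. A minimal sketch of one way to do that is to normalize whatever the user typed into the encoded form before looking it up. The function normalize_mdn_url is hypothetical, and the assumption that Kuma's metadata API encodes every path character except "/" is inferred only from the ::before example in the commit message.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def normalize_mdn_url(url):
    """Return url with its path percent-encoded in a canonical way.

    Decoding first makes the function idempotent, so an already encoded
    URL and its raw equivalent both map to the same encoded string.
    """
    parts = urlsplit(url)
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit(parts._replace(path=path))
```

With this normalization, both example URLs from the commit message resolve to the %3A%3Abefore form, so a single lookup on the stored URL would find the page.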
Commit pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/da37d9ad05c0a50b1576b815d44fc81a4760e556
bug 1181140 - Use Reset after Commit if no Reparse

If the Reparse form is unavailable, use a Reset after a Commit in order to load the new API IDs.
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/6f6ef5fe414ddbfdd3ff05570ff7d30882078177
bug 1181140 - Handle unicode feature names

Fix a 500 error when importing a feature with unicode characters, such as this page:
cubic-bezier() w/ ordinate ∉[0,1]
https://developer.mozilla.org/en-US/docs/Web/CSS/timing-function
https://browsercompat.herokuapp.com/importer/840

https://github.com/mdn/browsercompat/commit/97fbd4a242fb6e84f467988cfed2dbbce924fad1
bug 1181140 - Focus load_compat branch tests

Only test full response on the new/existing load_compat tests, and just focus on the important part of response for the branch tests.

https://github.com/mdn/browsercompat/commit/21f28c49953900795cd6c033221ba5a8f57dc31d
Merge pull request #79 from mdn/unicode_names_1181140

bug 1181140 - Handle unicode feature names r=groovecoder
Commits pushed to master at https://github.com/mdn/browsercompat

https://github.com/mdn/browsercompat/commit/107232292c742c2de2f80823f2362be38413a809
bug 1181140 - Omit specification.sections

When constructing the JSON for view_features in the importer, don't include specification.sections. The relation section.specification is enough to construct the relationship, and the returned view_features JSON doesn't set it.

https://github.com/mdn/browsercompat/commit/6986018b2d7c1d5781349e97cae46ae01b18906d
bug 1181140 - Handle <tfoot> elements

Used on:
https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent/keyCode
Depends on: 1180022
Keywords: in-triage → meta
Severity: enhancement → major
OS: Other → All
Whiteboard: [specification][type:feature] → [bc:infra]
Whiteboard: [bc:infra] → [bc:infra][bc:milestone=motorbike]
Status: ASSIGNED → NEW
The BrowserCompat project is canceled. See https://github.com/mdn/browsercompat for current effort. Bulk status change includes the random word TEMPOTHRONE.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Product: developer.mozilla.org → developer.mozilla.org Graveyard