Closed Bug 1171957 Opened 10 years ago Closed 10 years ago

Refactor scraper into sub-parsers

Categories

(developer.mozilla.org Graveyard :: BrowserCompat, enhancement)

All
Other
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jwhitlock, Assigned: jwhitlock)

References

Details

(Whiteboard: [bc:infra])

What problem would this feature solve?
======================================
The scraper is about 5000 lines of code + tests, and is difficult to understand or improve.

Who has this problem?
=====================
Staff contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Scraper PRs are hard to code review. Contributors have tried to add features, with limited success.

How are the users identified above solving this problem now?
============================================================
Superficial code reviews and procrastination.

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Here's a better scraping procedure:

1. Perform an initial scrape of the MDN page with a generic HTML parser
2. Use data-specific parsers for specific sections, such as Specification names ({{SpecName(...)}}) and compatibility support cells (which may include version numbers, footnotes, etc.)
3. Use item-specific parsers for data components, such as KumaScript

This will allow true unit-testing of low-level components, such as mapping {{SpecName(...)}} to existing specifications and sections, and allow the page-level parsing to be more flexible and generic. New features will potentially be more limited in scope, with the majority of the new code and tests at the level of the feature rather than the entire process.

Is there anything else we should know?
======================================
I've started the refactor, and planned to do it as part of bug 1134373. I'm a few days in, and I think it will take another 5-10 days, so I didn't want to block the fix or other parser features on the refactor.

I think the data-level parsers will be the foundation of the validators for the HTML subsets needed in bug 1170214, so the code will live on after the MDN parsing is complete.
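To illustrate the kind of low-level unit the description has in mind, here is a minimal sketch of an item-level parser for {{SpecName(...)}} that can be tested without scraping a whole page. It is not the project's code: the regular expressions, the KNOWN_SPECS table, the function names, the assumed 1-3 argument range, and the 'unknown_spec' issue name are all assumptions made for the example. Only the 'unknown_kumascript' and 'kumascript_wrong_args' issue names appear in the commit messages on this bug.

```python
import re

# Hypothetical stand-in for the specification lookup (the real data is in the database).
KNOWN_SPECS = {"CSS3 Backgrounds": 1}

# Matches {{Name}} or {{Name(arg, ...)}}; a simplification of real KumaScript syntax.
MACRO_RE = re.compile(r"\{\{\s*(?P<name>\w+)\s*(?:\((?P<args>.*?)\))?\s*\}\}")
# Matches a single- or double-quoted argument (no escaped quotes in this sketch).
ARG_RE = re.compile(r"""(['"])(.*?)\1""")


def parse_macro(text):
    """Return (name, args) for a KumaScript macro, or None if text is not one."""
    match = MACRO_RE.fullmatch(text.strip())
    if match is None:
        return None
    args = [value for _quote, value in ARG_RE.findall(match.group("args") or "")]
    return match.group("name"), args


def parse_spec_name(text):
    """Return (data, issues) for a SpecName macro; issues is a list of tuples."""
    issues = []
    parsed = parse_macro(text)
    if parsed is None or parsed[0] != "SpecName":
        issues.append(("unknown_kumascript", text))
        return None, issues
    _name, args = parsed
    if not 1 <= len(args) <= 3:  # assumed argument range for this sketch
        issues.append(("kumascript_wrong_args", text))
        return None, issues
    mdn_key = args[0]
    spec_id = KNOWN_SPECS.get(mdn_key)
    if spec_id is None:
        issues.append(("unknown_spec", mdn_key))  # hypothetical issue name
    data = {
        "mdn_key": mdn_key,
        "spec_id": spec_id,
        "subpath": args[1] if len(args) > 1 else "",
        "section_name": args[2] if len(args) > 2 else "",
    }
    return data, issues
```

Because the function takes a string and returns plain data plus a list of issues, each case (known key, unknown key, wrong argument count) can be covered by a short unit test instead of a full-page fixture.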
Blocks: 996570
Blocks: 1181140
No longer blocks: 996570
Blocks: 1188092
Blocks: 1188096
Blocks: 1188106
Blocks: 1188112
Assignee: nobody → jwhitlock
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Component: General → BrowserCompat
Blocks: 1198746
Blocks: 1198749
Blocks: 1198751
Blocks: 1198753
Blocks: 1188049
Blocks: 1187927
Blocks: 1198761
Blocks: 1198762
Blocks: 1198767
Blocks: 1198770
Blocks: 1198777
Blocks: 1198781
Blocks: 1198782
Blocks: 1198784
Blocks: 1198788
Blocks: 1198791
Blocks: 1198793
Blocks: 1198799
Blocks: 1198801
Blocks: 1198802
Blocks: 1198806
Blocks: 1198812
Blocks: 1198818
Blocks: 1198822
Blocks: 1198858
Blocks: 1198860
Blocks: 1198862
Blocks: 1198865
Blocks: 1198868
Blocks: 1198870
Blocks: 1198873
Blocks: 1198879
Blocks: 1198881
Blocks: 1198896
Blocks: 1198907
Blocks: 1198910
Blocks: 1198912
Blocks: 1198919
Blocks: 1198977
Blocks: 1198985
Blocks: 1198989
Commits pushed to report_progress_1181140 at https://github.com/mdn/browsercompat https://github.com/mdn/browsercompat/commit/0e5f39b052968f1987b0de803274016f51d901f9 bug 1171957 - Add HTML fragment parser Add a parser and visitor that work with HTML fragments. https://github.com/mdn/browsercompat/commit/210a8ac6f127789edfe3e41d4c103db3c1d94ae4 bug 1171957 - Add KumaScript parsing KumaScript acts as a special HTML text. This is a skeleton, to be fleshed out as the section-specific parsers are added. https://github.com/mdn/browsercompat/commit/ab36783579fd00d25d8a0095452b1df4d32c5101 bug 1171957 - Add issue collection to Visitor - Issue declarations moved from mdn/models.py to mdn/issues.py - Instances now have .issues property - Visitor._to_cls is now .process, collects issues after processing - Added Visitor.add_issue, .add_raw_issue to aid collecting issues - 'unknown_kumascript' issue is now auto-collected https://github.com/mdn/browsercompat/commit/48b44c0cb643dbbc5593eac5b5400d230e9ec69e bug 1171957 - Make base TestCase for MDN tests Move from test_scrape.py to new base.py https://github.com/mdn/browsercompat/commit/f1baac220feeae02345e9a03c593a9a86364039a bug 1171957 - Add specification lookup to SpecName In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key and add an issue for unknown keys. https://github.com/mdn/browsercompat/commit/05faaca4c6516620c69975dd2288fcb99c727300 bug 1171957 - Use SpecNameVisitor in scrape https://github.com/mdn/browsercompat/commit/0b0308cdc85743e22de95d00a14c69f3b1afdbe4 bug 1171957 - Drop redundant tests https://github.com/mdn/browsercompat/commit/0869940031b4c617635f6b60c6bb4a2cac6969bc bug 1171957 - Renaming specification module It makes sense for the visitors for SpecName, Spec2, and the specification description should be in the same module. 
https://github.com/mdn/browsercompat/commit/a07e26a45f00a6be1d6b9b911e3da648ed56d310 bug 1171957 - Add Spec2 parsing https://github.com/mdn/browsercompat/commit/bcdd07e7bcc5a40fbbcadbb0a407a8e246224e0c bug 1171957 - Flesh out KumaScript handling * Shared argument handing and validation, with generic 'kumascript_wrong_args' issue when the number of arguments are wrong. * Add at least one test for each KumaScript class * Where applicable, implement ks.to_html() * Replace issue 'spec2_arg_count' with 'kumascript_wrong_args' https://github.com/mdn/browsercompat/commit/62a91e8f304092b07c66df2fa8902b133d8df719 bug 1171957 - Refactor specification description Use new parsing method for specification description. Includes: * Parsing HTML elements with no content, like <td></td> * Add issue on using {{SpecName}} or {{Spec2}} in a description, but still convert to HTML. https://github.com/mdn/browsercompat/commit/4f98280a9f9da3c22030be6c3e7161f33ffa01cd bug 1171957 - Drop redundant spec tests Drop content parsing spec tests, but retain those that exercise the visitors, to stay at 100% coverage. 
https://github.com/mdn/browsercompat/commit/db33d288f7f5835fb547eace03cae40aff708201 bug 1171957 - Shuffle element initialization Prepare for adding a data source to all elements: * Change basic args to 'raw' and 'start' with defaults * Initialize parent classes by keyword argument * Change KumaScript naming to use 'canonical_name' class variable, defaulting to the class name, rather than init parameter * Add default for KumaScript params 'args' * Change KumaScript tree to explicit Known/UnknownKumaScript * Update tests for new initialization https://github.com/mdn/browsercompat/commit/67baf1e96bd3b1369ecfc028c72ed1b663719b04 bug 1171957 - Add mdn.utils.join_content Same as mdn.scrape.PageVisitor.join_content, but without the StripNextSpace hack needed because '3D' needs to be parsed one way for compat features (text "3D") and another way for compat support (version "3" followed by inline text "D"). This is the annoyance that made this refactor seem like a good idea, so enjoy. https://github.com/mdn/browsercompat/commit/6f94946c9d0442e0016e3cff645f35d06a18f252 bug 1171957 - Add mdn.data.Data Centralize database access into a Data class, use in loading specifications. https://github.com/mdn/browsercompat/commit/8c4c39ac7d0a3daac5a367af1d2bf34e95cd3ae4 bug 1171957 - Move slugify to mdn.utils https://github.com/mdn/browsercompat/commit/9949472654b07e64ff70bc19910814ca7f932acd bug 1171957 - Create compat feature subparser compat_feature_grammar is used to parse <td> elements containing compatibility feature data, and CompatBaseVisitor is used to extract the data and add feature-specific issues. The feature ID and slug lookup is converted to a method on the Data class. https://github.com/mdn/browsercompat/commit/28e27d2543c3bbae037332d1411fc287c2f3aa19 bug 1171957 - Use CompatFeatureVisitor in scrape PageVisitor.visit_compat_row_cell is still used for support cells, so most is retained. Add the raw contents as cell['raw'], and re-parse when identified as a feature cell. 
Drop a coverage test and some unused scrape code. https://github.com/mdn/browsercompat/commit/2e18435b1e0e2ed503ce63fed2f761beeb8d7b8c bug 1171957 - Remove redundant feature tests https://github.com/mdn/browsercompat/commit/d3f2e1e19dace606fdcd9e87a39e7fee4a2ab7ea bug 1171957 - QA from reinstalling Fix issues from reinstalling project on a new laptop: * Add ignores that were in my global .gitignore * Add link to install documentation on wiki * Fix reference to `make install-jslint` https://github.com/mdn/browsercompat/commit/824adc4ad31274f8dcc439daae477fbe9eac1831 bug 1171957 - Move feature dict to visitor https://github.com/mdn/browsercompat/commit/63d593c83739a7ea0a6ee73f57177bd8a6a97f08 bug 1171957 - Move to mdn.utils.is_new_id Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation to using in mdn.data and mdn.compatibility https://github.com/mdn/browsercompat/commit/b27db340f33bd706742acdc0f98c09ea2d0f2cfb bug 1171957 - Start on compat support subparser Skeleton of compatability support cell subparser: - Custom grammar for components of support cells - Data lookup and normalization for versions and supports - Normalization of version numbers - Parsing of version-only cells https://github.com/mdn/browsercompat/commit/1011e4a6554b057a2c4cebbe077b3dec93ec7b9a bug 1171957 - Complete compat support subparser - Add {{CompatNightly}} KumaScript parser (no args version) - Flesh out rest of support subparser, port over tests from test_scrape https://github.com/mdn/browsercompat/commit/01bc9127700f6e572ddf0fce0ead25dbc80a1cfd bug 1171957 - Use CompatSupportVisitor in scrape Replace visitor.cell_to_support with reparsing and extracting versions and support with CompatSupportVisitor. Drop unused code and invalid tests. https://github.com/mdn/browsercompat/commit/d53698b7f1c959d69bf78864515c4dd8ffb90bee bug 1171957 - Drop redundant code and tests Simplify page_grammar, since it doesn't have to parse support data, and drop related tests. Drop most support cell tests. 
https://github.com/mdn/browsercompat/commit/c8951bc269ccdf9f68520781f8a40009ba0c7316 bug 1171957 - Fixing small annoyances Small fixes with small impact: - Move support grammar outside of test setup - Alpha-sort imports - Move visitor.scope to class-level variable - s/attritbutes/attributes https://github.com/mdn/browsercompat/commit/c636456d3461bcdfc8bb2239d5d06744f1a0e44d bug 1171957 - Add element attribute validation Using the specification in visitor._attribute_validation_by_tag, drop unexpected attributes and raise issues on select unexpected or missing attributes. Use a strict whitelist for MDN content visitors. https://github.com/mdn/browsercompat/commit/91bb2259039d1c493b0477abfcd64d7686535312 bug 1171957 - Add footnote subparser The footnote subparser ends up using the same grammar as compat features. The visitor has slightly different behaviour than the monolithic scraper: - Footnotes split by <br> tags are handled correctly - <span> tags are not dropped - to be fixed. https://github.com/mdn/browsercompat/commit/99324c3394b0ec8792f96ba72be4186a9b30aecb bug 1171957 - Rename HTMLStructure to HTMLElement - Rename HTML elements (<tag>content</tag>) from HTMLStructure to HTMLElement, to more closely match the standard naming. - In the grammars, rename html_tag and *name*_tag elements to html_element and *name*_elementi, to match the classes. - Rename KumaScript class HTMLElement to KumaHTMLElement, to reduce confusion. https://github.com/mdn/browsercompat/commit/aa02cdd94a6c730d6e88470917997b9b9301fd0e bug 1171957 - Move attr validation to HTMLOpenTag Move attribute validation from HTMLVisitor to HTMLOpenTag. The default policy is still defined in HTMLVisitor (_default_attribute_actions), but the interface is a lot simplier. Also, split KumaVisitor into BaseKumaVisitor, which parses KumaScript in text content, and KumaVisitor, which starts adding the policies for scraping data from MDN raw content. 
https://github.com/mdn/browsercompat/commit/bd1f2c32fed1850ba8f1d3ffb809fac8286e4aef bug 1171957 - Add option to drop tags from content HTMLOpenTag and HTMLElement take scope and drop_tag arguments. When these are set, HTMLOpenTag will add a tag_dropped issue, and HTMLElement's to_html() method will not include the tags. By default, the KumaVisitor adds drop_tag to <span> elements. CompatFeatureVisitor adds drop_tag to non-table elements. https://github.com/mdn/browsercompat/commit/e3571c2a84e7dd850c1c040b8c857f4ef70479bd bug 1171957 - Use CompatFootnoteVisitor in scrape Also, adjust the tests for the new scraping behaviour. https://github.com/mdn/browsercompat/commit/8684045cc56e56ab41832f804a63fe7c470582e6 bug 1171957 - Remove redundant footnote tests https://github.com/mdn/browsercompat/commit/ead7d2afb02caae04b97efea1efd916746be652c bug 1171957 - Drop footnotes from scrape grammar https://github.com/mdn/browsercompat/commit/a90677f72e2cb45b1f91448ac623d501d2638a2b bug 1171957 - Add parsing of <h#> and <div> https://github.com/mdn/browsercompat/commit/1d8514e315c73089830211c24d568698473a5734 bug 1171957 - Precompile the grammars https://github.com/mdn/browsercompat/commit/40bda79cf4253669f42666bf5d5caa76d393c6b7 bug 1171957 - Convert get_instance() to use string Instead of passing a model class (Specification) as the first argument to get_instance(), pass a string ('Specification') representing the model class. Reduces imports from the main project. https://github.com/mdn/browsercompat/commit/ee75200b8c478de6f325f42c09f1790ab9d31b5b bug 1171957 - Create base Extractor, Visitor The Extractor will be used after HTML is parsed into elements https://github.com/mdn/browsercompat/commit/330b2e906c5d21b991fd8eaad000f62844da8ea0 bug 1171957 - Rename to Data.lookup_X Use Data.lookup_X naming pattern rather than overly verbose method names. 
https://github.com/mdn/browsercompat/commit/2e87654563b67b312f5b6d9313a6949105c6cb47 bug 1171957 - Add SpecSectionExtractor SpecSectionExtractor takes a parsed HTML Specifciations section and extracts the specification data. Includes: - Validating the header - Warning about non-table content not wrapped in {{WhyNoSpecStart}}/{{WhyNoSpecEnd}} - Calling sub-parsers to extract data from table https://github.com/mdn/browsercompat/commit/741e5d9cc6ea124704b6ad00dd8a6a9c3b70c15d bug 1171957 - Add to_text method to parsed objects Used to strip tags from an HTML element that may include nested tags. https://github.com/mdn/browsercompat/commit/09505f02e99aab78ac788d0da56f198510d0094e bug 1171957 - Add CompatSectionExtractor This extractor takes a sequence of elements representing a Browser Compatability header and content and extracts the browsers, versions, features, supports, and footnotes, along with any content issues. https://github.com/mdn/browsercompat/commit/66a3d1759e21bbbe65b12ed40a3037ecd534b099 bug 1171957 - Use new extractors in main scrape The main scrape uses the HTML+KumaScript grammar and visitor, builds on the Extractor class to divide into sections, and passes sections to SpecSectionExtractor and CompatSectionExtractor to do the data extraction work. This means a lot of code and tests can be dropped. https://github.com/mdn/browsercompat/commit/6df3937c84ad58cfa38a25c3d8bb74217098a4e9 bug 1171957 - Handle whitespace in compat table The test cases had no whitespace around some HTML elements, however real pages like Web/CSS/display include whitespace. This updates the tests to include more whitespace, and the code to handle more whitespace. the https://github.com/mdn/browsercompat/commit/23c30e8c56be725949120dd5d607529786d50499 bug 1171957 - unicode literals Use unicode literals in mdn.data, so that new IDs will be recognized by mdn.utils.is_new_id. 
https://github.com/mdn/browsercompat/commit/69cb06c8ca9c21a6790d11b7f63791f5547d63f7 bug 1171957 - Add more HTML elements Add elements used on real MDN pages: dd, dl, dt, em, li, and ul. https://github.com/mdn/browsercompat/commit/4d4b2e92502fe31ac1c66a319c20efd829833136 bug 1171957 - Add self-closing elements class Convert HTMLSimpleTag to HTMLBaseTag, and instead derive HTMLSelfClosingElement from HTMLOpenTag. Move <br> to the new class, and add processing for <img>. https://github.com/mdn/browsercompat/commit/ab28458562e93a80b2cec4b09b9e78d5f899598c bug 1171957 - Whitespace handling in support cells Remove the trailing whitespace detector from <br> and <img> elements, to make them more like other elements. This required changing whitespace patterns in compatibility cells, to check for whitespace surrounding important elements. Also {{CompatNo}} should now be pre- and post- associative, and adjust merging of trailing post-support items into previous version. https://github.com/mdn/browsercompat/commit/58570b277f00149e358c0b19fc3d4f0a9bf53d81 bug 1171957 - Changes for Web/HTML/Element/input Web/HTML/Element/input required these changes: - Add bracket_text pattern for "[not footnote]" text - Add support for HTML elements dfn and input - Trim underscores from the end of slugs https://github.com/mdn/browsercompat/commit/a3a55275324bb08a6655cd8ddcaca41a35244439 bug 1171957 - Narrow parse error messages Most parse errors appear to be incorrectly nested HTML or unknown HTML tags. These get reported at the top-level tag that contains the problematic element. Adjust the message for halt_import, and try to find the inner element that is causing the parse error. 
https://github.com/mdn/browsercompat/commit/0d6813bc030b0444a6a2c48dc3c8fe158678d23f bug 1171957 - Warn on empty spec cells https://github.com/mdn/browsercompat/commit/556ef22a494cf03acd713a7c77902eda58d4f05b bug 1171957 - Add remaining HTML elements These are the HTML elements used on the remaining MDN feature pages. There may be more on non-feature pages. https://github.com/mdn/browsercompat/commit/8ba8c637f8824c93d5c7b2ae20982e0438862ed3 bug 1171957 - Handle escaped quotes Regex from http://stackoverflow.com/questions/249791/ https://github.com/mdn/browsercompat/commit/991f842a6c134d5eb0b4abbd0db126c567888d91 bug 1171957 - Handle { ... } text Before, rule text_item did not match { ... } because it was avoiding matching KumaScript. Now it uses a negative lookahead to just match single curly braces. https://github.com/mdn/browsercompat/commit/550afb24d67e6fdd58e1c5065258f35cc0dc33bd bug 1171957 - Handle boolean attributes This only appears on: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select as: <option value="value2" selected>Value 2</option> https://github.com/mdn/browsercompat/commit/3a7fd79cf78a85704a31f0d81d43b5bfa364a519 bug 1171957 - Error if no content parsed If the page appears to have content but none was extracted, then add an issue. Was on https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand before I fixed the page. https://github.com/mdn/browsercompat/commit/450693c013413b807e671ab367f4fbe0d54c4bc4 bug 1171957 - Warn on <pre> element w/o footnote Before, code assumed there was always an active footnote ID when a <pre> section was encountered in the footnote area. https://github.com/mdn/browsercompat/commit/9a77697c8b6c6d6be973bef5592b160abac9c8a1 bug 1171957 - Warn if cell extends beyond table On https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent, a colspan of 6 is used when only 5 columns remain in table. 
https://github.com/mdn/browsercompat/commit/fa639c7ddb66c05a5420a918f3ddf0793283c259 bug 1171957 - Warn if range excluded from footnote On https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode (and maybe others), the <div> containing the compatibility table is closed after the footnotes and an <h3>. This warns that something appears to be wrong. https://github.com/mdn/browsercompat/commit/4b968cb8fbafd7322d3ee71ebc53bc4afc93aa86 bug 1171957 - Whitelist HTML elements Instead of blacklisting HTML elements that should not be allowed, default to no elements allowed and whitelist the ones that are allowed. This adds the tag_dropped issue for unexpected elements. Also, handle a nested <table> inside a compatibility table, by moving finalization to .to_feature_dict() and warning about <table>, <tr>, and embedded <td> elements. https://github.com/mdn/browsercompat/commit/6b241b9173d10bd754e139ba846cd25207c4d653 bug 1171957 - Cleanup visitor initialization Optionally pass a Data() object into scrape_page, and visually check that it gets passed into sub-visitors and extractors (it does). Remove some unneeded *args. https://github.com/mdn/browsercompat/commit/732bb88a366887be011d591b4d31aa0411273678 bug 1171957 - Code cleanup in mdn/scrape.py Change to top-down function order, rewrite and add comments, simplify some code. https://github.com/mdn/browsercompat/commit/f68a24f9226e4f67dd538c01ffdb36cd3b8a96e1 bug 1171957 - Minor code cleanup https://github.com/mdn/browsercompat/commit/e1b813da7ae4ded479134d60e61eb17bf0d0ecfe bug 1171957 - Report on unexpected kumascript Add expected scopes to KumaScript processing, and add issue 'unexpected kumascript' if the parsing scope is unexpected. Remove some issue types that are now too specific. Also, change calling signature of KumaScript._make_issue to take a **kwargs argument. 
https://github.com/mdn/browsercompat/commit/0435aca74276e51a241fe445836567c05a27a2e2 bug 1171957 - Push section lookup into SpecName https://github.com/mdn/browsercompat/commit/2756c3befb8f6ca31be81ea9e23a7b4342951e09 bug 1171957 - Generate HTML grammar and handlers https://github.com/mdn/browsercompat/commit/7b96d08b3e2a70437a619c404a6cd0f4c46aafd0 fix bug 1171957 - Remove unused issues
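The "Handle { ... } text" commit above says the text rule now uses a negative lookahead so that a single curly brace is treated as text while a KumaScript opener is not. The following is a rough sketch of that idea using Python's re module rather than the project's grammar; the TEXT_ITEM pattern and the match_text helper are made up for this example.

```python
import re

# Text is any run of characters that are not '<' or '{', plus any '{' that is
# NOT followed by a second '{' (the negative lookahead). '{{' starts KumaScript.
TEXT_ITEM = re.compile(r"(?:[^{<]|\{(?!\{))+")


def match_text(raw):
    """Return the leading plain-text run of raw, or None if it starts with
    an HTML tag or a KumaScript macro."""
    match = TEXT_ITEM.match(raw)
    return match.group(0) if match else None
```

With this pattern, match_text("a { b } c {{CompatNo}}") stops just before the macro, and match_text("{{CompatNo}}") returns None, which leaves the macro for a KumaScript rule to consume.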
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Sorry to add yet more spam today. This bug isn't resolved - I just pushed a branch that includes a 'fix bug' header. This worked when I was a subcontractor and couldn't push branches to the official repo, but now I see the GitHub hook treats a branch push as fixing a bug too. I'll change my process for the future, and leave the "fix bug" until the merge commit. Until then, it seems best to leave this as resolved, and not send another bunch of status spam. The code in question still needs to go through code review (https://github.com/mdn/browsercompat/pull/38) before it is merged and pushed to https://browsercompat.herokuapp.com.
Commits pushed to master at https://github.com/mdn/browsercompat https://github.com/mdn/browsercompat/commit/3c40d434b97c28b0a534d198c767f213948df723 bug 1171957 - Add HTML fragment parser Add a parser and visitor that work with HTML fragments. https://github.com/mdn/browsercompat/commit/7c4a278ec95536e6d001c6ab84dca9e53e44474b bug 1171957 - Add KumaScript parsing KumaScript acts as a special HTML text. This is a skeleton, to be fleshed out as the section-specific parsers are added. https://github.com/mdn/browsercompat/commit/ebc215fbd1e3b018ada9c0534627a6cc0c65b748 bug 1171957 - Add issue collection to Visitor - Issue declarations moved from mdn/models.py to mdn/issues.py - Instances now have .issues property - Visitor._to_cls is now .process, collects issues after processing - Added Visitor.add_issue, .add_raw_issue to aid collecting issues - 'unknown_kumascript' issue is now auto-collected https://github.com/mdn/browsercompat/commit/67a91fe4e09aa7531689ab926cfa376294d19c22 bug 1171957 - Make base TestCase for MDN tests Move from test_scrape.py to new base.py https://github.com/mdn/browsercompat/commit/d049e4efea6acd23f5fdc3649ec3f837bd17eacb bug 1171957 - Add specification lookup to SpecName In SpecName class and the SpecNameVisitor, lookup the referenced mdn_key and add an issue for unknown keys. https://github.com/mdn/browsercompat/commit/fb4fab4672c7ee7f7ff09b635af64d015573437f bug 1171957 - Use SpecNameVisitor in scrape https://github.com/mdn/browsercompat/commit/b3647d5ecedb8eb731cce989e0297137149e7eee bug 1171957 - Drop redundant tests https://github.com/mdn/browsercompat/commit/5e82b2b3ea718fa572abb146cb0c07bd3610f36c bug 1171957 - Renaming specification module It makes sense for the visitors for SpecName, Spec2, and the specification description should be in the same module. 
https://github.com/mdn/browsercompat/commit/d6de559e9ac4feb04c77e6647f327ee25ce46504 bug 1171957 - Add Spec2 parsing https://github.com/mdn/browsercompat/commit/f6eaa09e8c482d753c37cafd6b546bf426b0250f bug 1171957 - Flesh out KumaScript handling * Shared argument handing and validation, with generic 'kumascript_wrong_args' issue when the number of arguments are wrong. * Add at least one test for each KumaScript class * Where applicable, implement ks.to_html() * Replace issue 'spec2_arg_count' with 'kumascript_wrong_args' https://github.com/mdn/browsercompat/commit/d068f5884327c4fff9408ca63ce2c67188afcd6e bug 1171957 - Refactor specification description Use new parsing method for specification description. Includes: * Parsing HTML elements with no content, like <td></td> * Add issue on using {{SpecName}} or {{Spec2}} in a description, but still convert to HTML. https://github.com/mdn/browsercompat/commit/343ce0ad5d00ead4b19f84970b0232dc27a4f8b4 bug 1171957 - Drop redundant spec tests Drop content parsing spec tests, but retain those that exercise the visitors, to stay at 100% coverage. 
https://github.com/mdn/browsercompat/commit/484a9229bc35397d532b2f855616ea90c2ee8b5e bug 1171957 - Shuffle element initialization Prepare for adding a data source to all elements: * Change basic args to 'raw' and 'start' with defaults * Initialize parent classes by keyword argument * Change KumaScript naming to use 'canonical_name' class variable, defaulting to the class name, rather than init parameter * Add default for KumaScript params 'args' * Change KumaScript tree to explicit Known/UnknownKumaScript * Update tests for new initialization https://github.com/mdn/browsercompat/commit/ad85c9fb98ef10a32a719f19372b0d9c7c94459e bug 1171957 - Add mdn.utils.join_content Same as mdn.scrape.PageVisitor.join_content, but without the StripNextSpace hack needed because '3D' needs to be parsed one way for compat features (text "3D") and another way for compat support (version "3" followed by inline text "D"). This is the annoyance that made this refactor seem like a good idea, so enjoy. https://github.com/mdn/browsercompat/commit/f8d060856c943f7d09a64b3306856924b6397604 bug 1171957 - Add mdn.data.Data Centralize database access into a Data class, use in loading specifications. https://github.com/mdn/browsercompat/commit/9a5efb4a2d728ac3a94f512360d7a3324742a5a8 bug 1171957 - Move slugify to mdn.utils https://github.com/mdn/browsercompat/commit/088a8dfc6f99b543a905359429916e6d28507c4c bug 1171957 - Create compat feature subparser compat_feature_grammar is used to parse <td> elements containing compatibility feature data, and CompatBaseVisitor is used to extract the data and add feature-specific issues. The feature ID and slug lookup is converted to a method on the Data class. https://github.com/mdn/browsercompat/commit/61a09406d708aef5a706892cd2cb9ae1a655137a bug 1171957 - Use CompatFeatureVisitor in scrape PageVisitor.visit_compat_row_cell is still used for support cells, so most is retained. Add the raw contents as cell['raw'], and re-parse when identified as a feature cell. 
Drop a coverage test and some unused scrape code. https://github.com/mdn/browsercompat/commit/c7c65d31f4f47b9628ec1b66b8f649cebb2d5495 bug 1171957 - Remove redundant feature tests https://github.com/mdn/browsercompat/commit/be3cff363df79926e0fe64b46ceaafc3a58c09e9 bug 1171957 - QA from reinstalling Fix issues from reinstalling project on a new laptop: * Add ignores that were in my global .gitignore * Add link to install documentation on wiki * Fix reference to `make install-jslint` https://github.com/mdn/browsercompat/commit/33f6d79f454005c6dceefb49ca91205a5b85986f bug 1171957 - Move feature dict to visitor https://github.com/mdn/browsercompat/commit/3f63b8e8aa93a921bc80105b3b3dc1c905447ac1 bug 1171957 - Move to mdn.utils.is_new_id Move mdn.scrape.is_fake_id to utils.is_new_id, in preparation to using in mdn.data and mdn.compatibility https://github.com/mdn/browsercompat/commit/b2f200f452df68c02c1dc21e829c7287a4d1f1a5 bug 1171957 - Start on compat support subparser Skeleton of compatability support cell subparser: - Custom grammar for components of support cells - Data lookup and normalization for versions and supports - Normalization of version numbers - Parsing of version-only cells https://github.com/mdn/browsercompat/commit/691f9518f1bc3808b17a34502f961bf2ad95856e bug 1171957 - Complete compat support subparser - Add {{CompatNightly}} KumaScript parser (no args version) - Flesh out rest of support subparser, port over tests from test_scrape https://github.com/mdn/browsercompat/commit/de57a7f921af5307f8cce5a99de1d67516649e2b bug 1171957 - Use CompatSupportVisitor in scrape Replace visitor.cell_to_support with reparsing and extracting versions and support with CompatSupportVisitor. Drop unused code and invalid tests. https://github.com/mdn/browsercompat/commit/9807fccc9d15c06ee6bf5bb87199e2e4bdb311db bug 1171957 - Drop redundant code and tests Simplify page_grammar, since it doesn't have to parse support data, and drop related tests. Drop most support cell tests. 
https://github.com/mdn/browsercompat/commit/7004326666f33fbe72956156812c849ad63dddf8 bug 1171957 - Fixing small annoyances Small fixes with small impact: - Move support grammar outside of test setup - Alpha-sort imports - Move visitor.scope to class-level variable - s/attritbutes/attributes https://github.com/mdn/browsercompat/commit/0b19d2ac783bc636fc86a49f78fd3e599d76f247 bug 1171957 - Add element attribute validation Using the specification in visitor._attribute_validation_by_tag, drop unexpected attributes and raise issues on select unexpected or missing attributes. Use a strict whitelist for MDN content visitors. https://github.com/mdn/browsercompat/commit/9b58421fafd49b7f1f584a29f54dc508cd3f2765 bug 1171957 - Add footnote subparser The footnote subparser ends up using the same grammar as compat features. The visitor has slightly different behaviour than the monolithic scraper: - Footnotes split by <br> tags are handled correctly - <span> tags are not dropped - to be fixed. https://github.com/mdn/browsercompat/commit/36ed2ccdb20460fa18cd3bdd6204d9a83238c322 bug 1171957 - Rename HTMLStructure to HTMLElement - Rename HTML elements (<tag>content</tag>) from HTMLStructure to HTMLElement, to more closely match the standard naming. - In the grammars, rename html_tag and *name*_tag elements to html_element and *name*_elementi, to match the classes. - Rename KumaScript class HTMLElement to KumaHTMLElement, to reduce confusion. https://github.com/mdn/browsercompat/commit/0e28522101721bd535dd5ba7ad5d9446da3ffc90 bug 1171957 - Move attr validation to HTMLOpenTag Move attribute validation from HTMLVisitor to HTMLOpenTag. The default policy is still defined in HTMLVisitor (_default_attribute_actions), but the interface is a lot simplier. Also, split KumaVisitor into BaseKumaVisitor, which parses KumaScript in text content, and KumaVisitor, which starts adding the policies for scraping data from MDN raw content. 
https://github.com/mdn/browsercompat/commit/2ab20c11e4eb14ddb25f46021126f6decebcefc5 bug 1171957 - Add option to drop tags from content HTMLOpenTag and HTMLElement take scope and drop_tag arguments. When these are set, HTMLOpenTag will add a tag_dropped issue, and HTMLElement's to_html() method will not include the tags. By default, the KumaVisitor adds drop_tag to <span> elements. CompatFeatureVisitor adds drop_tag to non-table elements. https://github.com/mdn/browsercompat/commit/50ca3b1364b8fff95256e10802a2fea85ff78751 bug 1171957 - Use CompatFootnoteVisitor in scrape Also, adjust the tests for the new scraping behaviour. https://github.com/mdn/browsercompat/commit/c55c82133297e880783e94b5b425969b8df48e69 bug 1171957 - Remove redundant footnote tests https://github.com/mdn/browsercompat/commit/d468f0d6162a105ad0c895a9b5c157cf7152ac74 bug 1171957 - Drop footnotes from scrape grammar https://github.com/mdn/browsercompat/commit/17621018efbc54768b7bb8d57bb09e0987cf8ac0 bug 1171957 - Add parsing of <h#> and <div> https://github.com/mdn/browsercompat/commit/dbdf4d254438630f408a6aa5ff07553ab7c4879d bug 1171957 - Precompile the grammars https://github.com/mdn/browsercompat/commit/4673e778bb8304e555905abbc97b0e189bba9ee5 bug 1171957 - Convert get_instance() to use string Instead of passing a model class (Specification) as the first argument to get_instance(), pass a string ('Specification') representing the model class. Reduces imports from the main project. https://github.com/mdn/browsercompat/commit/159f247408d2ce7c586cbd5e593f8b6cf061dc40 bug 1171957 - Create base Extractor, Visitor The Extractor will be used after HTML is parsed into elements https://github.com/mdn/browsercompat/commit/7f294470eeef098e58ac9d82798db3dada836298 bug 1171957 - Rename to Data.lookup_X Use Data.lookup_X naming pattern rather than overly verbose method names. 
https://github.com/mdn/browsercompat/commit/dedc8f631552494a58b5def403e09318f1566ce6
bug 1171957 - Add SpecSectionExtractor

SpecSectionExtractor takes a parsed HTML Specifications section and extracts the specification data. Includes:
- Validating the header
- Warning about non-table content not wrapped in {{WhyNoSpecStart}}/{{WhyNoSpecEnd}}
- Calling sub-parsers to extract data from table

https://github.com/mdn/browsercompat/commit/2c509202e80ca940bbb50b607e0a6637b31295bf
bug 1171957 - Add to_text method to parsed objects

Used to strip tags from an HTML element that may include nested tags.

https://github.com/mdn/browsercompat/commit/a9843e2f7821664d97994130bf6141feae69b83b
bug 1171957 - Add CompatSectionExtractor

This extractor takes a sequence of elements representing a Browser Compatibility header and content and extracts the browsers, versions, features, supports, and footnotes, along with any content issues.

https://github.com/mdn/browsercompat/commit/1399812a9454ae1ef1e591f23c255df3aadea868
bug 1171957 - Use new extractors in main scrape

The main scrape uses the HTML+KumaScript grammar and visitor, builds on the Extractor class to divide into sections, and passes sections to SpecSectionExtractor and CompatSectionExtractor to do the data extraction work. This means a lot of code and tests can be dropped.

https://github.com/mdn/browsercompat/commit/0ea447d0be18e50a76da3a311aea0fc0a2fc4619
bug 1171957 - Handle whitespace in compat table

The test cases had no whitespace around some HTML elements, however real pages like Web/CSS/display include whitespace. This updates the tests to include more whitespace, and the code to handle more whitespace.

https://github.com/mdn/browsercompat/commit/88f083ebd0ef1929bda4a53d6002e01a5abb95b3
bug 1171957 - unicode literals

Use unicode literals in mdn.data, so that new IDs will be recognized by mdn.utils.is_new_id.

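The to_text idea from the "Add to_text method to parsed objects" commit above can be pictured as a recursive walk over the parsed tree. This is only a sketch under assumed names: ParsedText and ParsedElement are hypothetical classes, not the ones in the repository.

```python
class ParsedText:
    """Hypothetical leaf node holding plain text."""

    def __init__(self, raw):
        self.raw = raw

    def to_text(self):
        return self.raw


class ParsedElement:
    """Hypothetical element node with nested children."""

    def __init__(self, tag, children):
        self.tag = tag
        self.children = children

    def to_text(self):
        # Ignore this element's own tags and join the text of all children,
        # recursing into nested elements.
        return ''.join(child.to_text() for child in self.children)


cell = ParsedElement('td', [
    ParsedText('Basic '),
    ParsedElement('code', [ParsedText('display')]),
    ParsedText(' support'),
])
print(cell.to_text())
```
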
https://github.com/mdn/browsercompat/commit/2c9bcf6c043b68f456834908f4b8c4834645f906
bug 1171957 - Add more HTML elements

Add elements used on real MDN pages: dd, dl, dt, em, li, and ul.

https://github.com/mdn/browsercompat/commit/3cfcbd00b32ebd2388fa3fe25b13a76540d4eb97
bug 1171957 - Add self-closing elements class

Convert HTMLSimpleTag to HTMLBaseTag, and instead derive HTMLSelfClosingElement from HTMLOpenTag. Move <br> to the new class, and add processing for <img>.

https://github.com/mdn/browsercompat/commit/92d43b408d6c2e388f0ca9796aebd44be155b9ea
bug 1171957 - Whitespace handling in support cells

Remove the trailing whitespace detector from <br> and <img> elements, to make them more like other elements. This required changing whitespace patterns in compatibility cells, to check for whitespace surrounding important elements.

Also {{CompatNo}} should now be pre- and post- associative, and adjust merging of trailing post-support items into previous version.

https://github.com/mdn/browsercompat/commit/0d3b77876971f1adf83aa5dc71a40b6153306a26
bug 1171957 - Changes for Web/HTML/Element/input

Web/HTML/Element/input required these changes:
- Add bracket_text pattern for "[not footnote]" text
- Add support for HTML elements dfn and input
- Trim underscores from the end of slugs

https://github.com/mdn/browsercompat/commit/75b9ad6840312571ed0667d0e15320db8b3d0bd4
bug 1171957 - Narrow parse error messages

Most parse errors appear to be incorrectly nested HTML or unknown HTML tags. These get reported at the top-level tag that contains the problematic element. Adjust the message for halt_import, and try to find the inner element that is causing the parse error.

https://github.com/mdn/browsercompat/commit/659fba1c21ca0c2f3e1aa5ce0f8593b6b1e63270
bug 1171957 - Warn on empty spec cells

https://github.com/mdn/browsercompat/commit/960b104a750135a103cb887225fd6ead35dea10c
bug 1171957 - Add remaining HTML elements

These are the HTML elements used on the remaining MDN feature pages. There may be more on non-feature pages.

https://github.com/mdn/browsercompat/commit/252dcd24426d4d7b5fdae5a702780fe4851fb893
bug 1171957 - Handle escaped quotes

Regex from http://stackoverflow.com/questions/249791/

https://github.com/mdn/browsercompat/commit/39222e3e1943339da23a9f00aadb55c62296b745
bug 1171957 - Handle { ... } text

Before, rule text_item did not match { ... } because it was avoiding matching KumaScript. Now it uses a negative lookahead to just match single curly braces.

https://github.com/mdn/browsercompat/commit/595bae37c78e46573cba3dc00479f68fb6805eef
bug 1171957 - Handle boolean attributes

This only appears on:
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/select
as:
<option value="value2" selected>Value 2</option>

https://github.com/mdn/browsercompat/commit/edeeada4861a7752468ba709ebe01b53d3ca7e5d
bug 1171957 - Error if no content parsed

If the page appears to have content but none was extracted, then add an issue. Was on https://developer.mozilla.org/en-US/docs/Web/API/Document/execCommand before I fixed the page.

https://github.com/mdn/browsercompat/commit/3717acb026f0d64ea41631177c1fd704d080ba0d
bug 1171957 - Warn on <pre> element w/o footnote

Before, code assumed there was always an active footnote ID when a <pre> section was encountered in the footnote area.

https://github.com/mdn/browsercompat/commit/26292d737e0d1da52b12c06437cf30e2b3e51972
bug 1171957 - Warn if cell extends beyond table

On https://developer.mozilla.org/en-US/docs/Web/API/KeyboardEvent, a colspan of 6 is used when only 5 columns remain in table.

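The negative-lookahead approach mentioned in the "Handle { ... } text" commit above can be illustrated with Python's re module. This is a sketch only: the pattern and the leading_text helper are assumptions made for illustration, and the actual text_item rule lives in the project's grammar, which may be written differently.

```python
import re

# Match a run of characters that are either not "{", or a "{" that is
# NOT followed by a second "{" (the start of a KumaScript macro).
TEXT_ITEM = re.compile(r'(?:[^{]|\{(?!\{))+')


def leading_text(source):
    """Return the plain text at the start of source, stopping before '{{'."""
    match = TEXT_ITEM.match(source)
    return match.group(0) if match else ''


print(repr(leading_text('a { b } c {{CompatNo}}')))
print(repr(leading_text('{{SpecName("CSS3")}}')))
```

In this sketch a single brace is consumed as ordinary text, while a double brace ends the match so that a separate KumaScript rule could take over.
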
https://github.com/mdn/browsercompat/commit/779939b69edb85244bee1d53c592454a2b3e8d2a
bug 1171957 - Warn if range excluded from footnote

On https://developer.mozilla.org/en-US/docs/Web/API/AudioBufferSourceNode (and maybe others), the <div> containing the compatibility table is closed after the footnotes and an <h3>. This warns that something appears to be wrong.

https://github.com/mdn/browsercompat/commit/10a04f55214b2e36f3967bba719bfcc45d9ac9b3
bug 1171957 - Whitelist HTML elements

Instead of blacklisting HTML elements that should not be allowed, default to no elements allowed and whitelist the ones that are allowed. This adds the tag_dropped issue for unexpected elements.

Also, handle a nested <table> inside a compatibility table, by moving finalization to .to_feature_dict() and warning about <table>, <tr>, and embedded <td> elements.

https://github.com/mdn/browsercompat/commit/d56405e8a0cc05662f04255e3bb78601cae81ede
bug 1171957 - Cleanup visitor initialization

Optionally pass a Data() object into scrape_page, and visually check that it gets passed into sub-visitors and extractors (it does). Remove some unneeded *args.

https://github.com/mdn/browsercompat/commit/c41e8f291bfc60f5f5d1a5c16b15753fc7305f74
bug 1171957 - Code cleanup in mdn/scrape.py

Change to top-down function order, rewrite and add comments, simplify some code.

https://github.com/mdn/browsercompat/commit/50231f30c427aeaeedcec220f094ca30c443adda
bug 1171957 - Minor code cleanup

https://github.com/mdn/browsercompat/commit/dce9e006b5d5876d0157d3317f996a7f18571f16
bug 1171957 - Report on unexpected kumascript

Add expected scopes to KumaScript processing, and add issue 'unexpected kumascript' if the parsing scope is unexpected. Remove some issue types that are now too specific.

Also, change calling signature of KumaScript._make_issue to take a **kwargs argument.

https://github.com/mdn/browsercompat/commit/5941e11f4c779b1cc59c2795adec65e71fd46346
bug 1171957 - Push section lookup into SpecName

https://github.com/mdn/browsercompat/commit/de7be5a615a64022f4aeb0c745826995ea23ef1a
bug 1171957 - Generate HTML grammar and handlers

https://github.com/mdn/browsercompat/commit/62f7a19b2f4a111e89aa53186e077cbe167148da
bug 1171957 - Remove unused issues

https://github.com/mdn/browsercompat/commit/c68d6a3ad981ca348744e9817b358a5930bd075f
bug 1171957 - Rename BaseVisitor to Recorder

Use a more neutral name for the base class of Visitor and Extractor

https://github.com/mdn/browsercompat/commit/9a8a193fa39b09afa054dc2de5e47b67e72683fe
fix bug 1171957 - Use NotImplementedError

Ensure that classes derived from Extractor implement required methods

https://github.com/mdn/browsercompat/commit/fdc12cbaaeded87c1d1662e2ae90afe6c973a596
Merge pull request #38 from jwhitlock/1171957_refactor_scraper

Fix bug 1171957 - Rewrite scraper to use section importers

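The NotImplementedError pattern named in the "Use NotImplementedError" commit above is a common way to force subclasses to supply required methods. A minimal Python sketch follows; BaseExtractor, HeadingExtractor, and every method name here are hypothetical and chosen for illustration, not taken from the project's Extractor class.

```python
class BaseExtractor:
    """Hypothetical base class: drives extraction, defers details."""

    def extract(self, elements):
        self.setup_extract()
        for element in elements:
            self.handle(element)
        return self.extracted_data()

    def setup_extract(self):
        raise NotImplementedError('setup_extract is not implemented')

    def handle(self, element):
        raise NotImplementedError('handle is not implemented')

    def extracted_data(self):
        raise NotImplementedError('extracted_data is not implemented')


class HeadingExtractor(BaseExtractor):
    """Hypothetical subclass collecting the text of <h2> elements."""

    def setup_extract(self):
        self.headings = []

    def handle(self, element):
        tag, text = element  # elements are (tag, text) pairs in this sketch
        if tag == 'h2':
            self.headings.append(text)

    def extracted_data(self):
        return self.headings


sections = [('h2', 'Specifications'), ('p', 'Some text'),
            ('h2', 'Browser compatibility')]
print(HeadingExtractor().extract(sections))
```

A subclass that forgets one of the three methods fails loudly on first use instead of silently returning nothing.
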
This second commit dump is the real one. Bug is closed, code is running on https://browsercompat.herokuapp.com, and the re-import should be done in a few hours.
Keywords: in-triage
Summary: [Compat Data][Importer] Refactor scraper into sub-parsers → Refactor scraper into sub-parsers
Whiteboard: [specification][type:feature] → [bc:infra]
Product: developer.mozilla.org → developer.mozilla.org Graveyard