Closed Bug 373864 (html5-parsing) Opened 17 years ago Closed 14 years ago

Replace HTML parser with an HTML5 parser

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a5

People

(Reporter: sayrer, Assigned: hsivonen)

References

(Depends on 2 open bugs, Blocks 4 open bugs, )

Details

Attachments

(4 files, 6 obsolete files)

No idea on a target for this, but we have many failing tests from the html5lib suite.
Summary: convert the html parser web applications 1.0 algorithm → convert the html parser to the web applications 1.0 algorithm
mrbkap had looked into this last summer.

/be
(In reply to comment #1)
>
> mrbkap had looked into this last summer.

I also pinged him on bug 328930, but after I signed off irc, I realized I wasn't clear if he wanted to do that himself, or just wanted it done. This will make a fine tracking bug if working on the parser becomes a priority.
I certainly want to work on this. The problem is that I simply don't have time to do so between school and my other bugs (the new residual style stuff ended up as P3 for Firefox 3).
Depends on: 308145
Depends on: 385776
Summary: convert the html parser to the web applications 1.0 algorithm → Convert the HTML parser to the Web Applications 1.0 (HTML 5) algorithm
Blocks: 449176
Blocks: 436600
Mozilla Corporation has contracted me to translate the Validator.nu HTML parser into Gecko-compatible C++.
Assigning to me and adjusting summary to plan.
Alias: html5-parsing
Assignee: sayrer → hsivonen
Summary: Convert the HTML parser to the Web Applications 1.0 (HTML 5) algorithm → Replace HTML parser with an HTML5 parser
Instead of making this a tracking bug, bugs pertaining to the HTML5 parser repo have "[HTML5] " at the start of their summary.
Henri, if you don't mind, I'm going to make HTML parser bugs that will be resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't get lost.
(In reply to comment #7)
> Henri, if you don't mind, I'm going to make HTML parser bugs that will be
> resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't
> get lost.

OK. Makes sense.
Blocks: 178258
Depends on: html5-keygen
Depends on: html5-parsing-land
Depends on: 501345
Blocks: 501846
Depends on: 503292, 502804
Depends on: 503357
Depends on: 502984
Blocks: 431074
Depends on: 503632
Depends on: 503702
Blocks: 22480
Blocks: 390210
These are 2,722 html5 tokenizer tests from the html5lib suite, converted to mochitest using the parser harness from bug 366936.  There are 18 failures, attached in separate result file.  Please let me know if you'd like these submitted as bugs and/or added to the known failures list. 

I've included most of the tests from the tokenizer suite, but omitted the following:

- xmlViolation.test
- contentModelFlags.test
- tests involving invalid unicode chars
- tests involving multi-byte unicode chars

The latter two classes were excluded because they cause exceptions in the parser sub-harness, in the call to btoa().

I've made a few changes to the parser sub-harness to enable it to work with these files, and have verified that the changes don't interfere with the original treebuilder tests.
Attachment #389048 - Flags: review?(hsivonen)
(In reply to comment #9)
> Created an attachment (id=389048) [details]
> mochitest test cases for html5lib tokenizer tests
> 
> These are 2,722 html5 tokenizer tests from the html5lib suite, converted to
> mochitest using the parser harness from bug 366936.

Great! Thank you!

> There are 18 failures,
> attached in separate result file.  Please let me know if you'd like these
> submitted as bugs and/or added to the known failures list. 

The tests that fail and whose description mentions [R]CDATA (or just CDATA) haven't been converted to tree builder tests correctly. The original tokenizer tests have additional JSON items for initializing the tokenizer to the RCDATA or the CDATA state and for setting up the element name that the tokenizer should look for to end the [R]CDATA run.

To convert these tests (from escapeFlag.test, it seems) to tree builder tests correctly, you need to add an <xmp> start tag. This tests the CDATA case. To test the RCDATA case, you need to replace "xmp" with "textarea" everywhere in the tests.

I don't see why "U+0080 in lookahead region" failed. Odd.

The other failures are genuine bugs that definitely need addressing. I'm particularly surprised at the failures related to CR and and the failures related to bogus entities.
Depends on: 504941
Blocks: 490605
Attachment #389049 - Attachment is obsolete: true
This is a much better implementation of the html5lib tokenizer tests.  It correctly handles unicode characters and other escaped characters.  I omitted only tests for 0xd800 and 0xdfff, as the server-side js that produces the test markup contains an algorithm for generating utf-8 that doesn't correctly handle these invalid characters.

I've made some modifications to a few of the tests (e.g., the escapeFlag and contentModelFlags tests) to account for the fact that the tokenizer output is run through the tree builder, as Henri notes above.

There are 3,375 tests in total with only 4 failures; see next attachment.
Attachment #389048 - Attachment is obsolete: true
Attachment #390501 - Flags: review?(hsivonen)
Attachment #389048 - Flags: review?(hsivonen)
correct attachment
Attachment #390502 - Attachment is obsolete: true
I've added to this patch a refresh of the tree builder tests which already existed in the repo.  I've refreshed the existing tests from html5lib, and added new tests which weren't being run.  I've also updated the parser sub-harness to better handle MathML and SVG elements, and to know to skip #document-fragment tests.  There doesn't seem to be any clean way of running fragment tests without direct access to the parser; if anyone can think of a way, let me know and I'll add them.  Test results attached separately.
Attachment #390501 - Attachment is obsolete: true
Attachment #391186 - Flags: review?(hsivonen)
Attachment #390501 - Flags: review?(hsivonen)
Attached are the tree builder test results obtained from running the updated tree builder tests against the HTML5 parser.  The results can be summarized as follows:

- only 1 known failure still fails (an <isindex> case)
- 7 tests fail from regressions.txt, but they all appear to be correct according to the HTML5 parsing algorithm.  Henri, can you review these?  If they are correct I will update them in regressions.txt
- 1 failure involving <keygen>; should this element appear in the DOM?  Maybe not, see bug 101019.
- 8 failures in MathML and SVG tests involving xlink:href and xml:lang attributes.  Henri, can you review and let me know if the tests need updating or if a bug should be filed?
Blocks: 507093
Here's another update of the html5lib tests.  This update allows Java->C++ tests to be run; test_html5_tree_construction_js_compare.html runs all of the tree construction tests by comparing their output in Gecko vs the JS validator from Henri.  The js_compare mode ignores comments and doctypes, since the JS version of the parser does not emit those.  Results of this are attached separately; there are currently 13 failures.  Most of the failures are due to the fact that the JS parser doesn't seem to be able to handle xml:lang or xlink:href correctly, but the first two failures are potentially interesting.
Attachment #391186 - Attachment is obsolete: true
Attachment #391458 - Flags: review?(hsivonen)
Attachment #391186 - Flags: review?(hsivonen)
Blocks: 507853
Depends on: 508075
Depends on: 508663
Blocks: 189051
Depends on: 508867
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

I feel I'm too inexperienced with Mochitest coding patterns to review this properly. sayrer wrote the code originally, so I think he'd be a better reviewer than me.

However, here are some observations:
 * The dumpTree() method looks a bit odd. Why the global state variable? Is the purpose to defer the action somehow?

 * Is reorderToMatchExpected() for the old parser only? I'd expect it to be unnecessary with the new parser.

 * In principtle, (walker.currentNode.namespaceURI.toLowerCase().indexOf("math") != -1) should be (walker.currentNode.namespaceURI == "http://www.w3.org/1998/Math/MathML") and likewise (walker.currentNode.namespaceURI == "http://www.w3.org/2000/svg") for SVG. Probably doesn't matter in practice now, but if code happened to somehow break in a way that makes it matter, it would be good to have the right namespace tests in the harness.
 
 * html5lib_license.txt should probably be updated to contain a fresh copy of the upstream copyright year and contributor list.
Attachment #391458 - Flags: review?(hsivonen) → review?(sayrer)
Attachment #391458 - Flags: review?(sayrer) → review-
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

we shouldn't have this stuff conditioned on a bunch of global variables. please refactor to avoid them. looks good otherwise.

iirc, reorderToMatchExpected was required because the original test suite from html5lib depended on python's dictionary ordering details.
Blocks: 328930
No longer depends on: 328930
Blocks: 385776
No longer depends on: 385776
Completely refactored code in parser*.js to avoid global variables.  Also, I verified that reorderToMatchExpected() has no effect on current tests so I removed it.
Attachment #391458 - Attachment is obsolete: true
Attachment #393791 - Flags: review?
Attachment #393791 - Flags: review? → review?(sayrer)
Attachment #393791 - Flags: review?(sayrer) → review+
Depends on: 501886
How should we handle the current test failures?  Shall I flag them as 'todo' so that they won't show as failures on tinderbox, and check them in, so that the tests can start running and catch regressions?  Or would you prefer to leave them out of the repo for now?
I'd prefer checking in with todo flags.
I've updated the tests to mark current failures as 'todo' and pushed as http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.
Depends on: 510648
(In reply to comment #24)
> I've updated the tests to mark current failures as 'todo' and pushed as
> http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.

Backed out because of persistent leaks. E.g. http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250318073.1250321057.30450.gz
Blocks: 100474
Blocks: 109152
Blocks: 321484
Blocks: 305242
Blocks: 261892
Blocks: 474670
Blocks: 147566
Blocks: 268442
Depends on: 512639
Depends on: 513194
Depends on: 513923
Depends on: 515816
Blocks: 515819
Blocks: 516666
Depends on: 483015
See Also: → 529503
Depends on: 529587
Depends on: 513383
Depends on: 520581
Depends on: 533085
Blocks: 533149
Depends on: 534458
Depends on: 535499
Depends on: 535740
Blocks: 497846
Blocks: 536383
Depends on: 537683
Depends on: 538193
Blocks: SGMLComment
No longer blocks: 516666
Depends on: 539237
No longer depends on: 539237
Depends on: 431559
Blocks: 541918
Depends on: 507119
Is there a bug on file about turning this on by default?  I can't find one and just wanted to see what the plan was for turning this on by default.  A few users (including myself) agree that we all haven't seen any major issues with the HTML5 parser is a long time and perhaps this could be enabled by default to get wider testing.
(In reply to comment #26)
> Is there a bug on file about turning this on by default?

None other than this one, though one could argue that this bug as currently worded is about more than flipping the pref by default.

> I can't find one and
> just wanted to see what the plan was for turning this on by default.

I have a shared bugzilla query named "HTML5 Parsing Blocker". I'm working on the assumption that the length of that list needs to go to zero before the pref can be flipped by default.

In addition to items on that list, there needs to be a new "slow tp" test suite that tests tp4 over a simulated slow network. I don't know how that item is progressing. Then the parser core needs to pass code review (blocking on me to write more documentation).

> A few
> users (including myself) agree that we all haven't seen any major issues with
> the HTML5 parser is a long time and perhaps this could be enabled by default to
> get wider testing.

I think the HTML5 parser on the trunk is in good enough shape to be turned on by trunk testers. The problem is that it isn't in good enough shape for Mochitest.
Depends on: 483209
Depends on: 543652
Depends on: 525960
Depends on: 516422
Depends on: 546665
Blocks: 547963
No longer depends on: html5-keygen
Depends on: 551047
No longer depends on: 551047
Depends on: 553167
Depends on: 554513
(In reply to comment #27)
> (In reply to comment #26)
> > Is there a bug on file about turning this on by default?
> 
> None other than this one, though one could argue that this bug as currently
> worded is about more than flipping the pref by default.
> 
> I think the HTML5 parser on the trunk is in good enough shape to be turned on
> by trunk testers. The problem is that it isn't in good enough shape for
> Mochitest.

So, I sonder if what we need here is for a bug to be opened on flipping the preference with all the mochitest issues being in bugs that block it.

It seems without that, there is no real direction to developers on what needs ot be done to get this enabled by default.
Depends on: 558302
Blocks: 560064
Blocks: 507701
Blocks: 514122
Blocks: 489178
Blocks: 546289
Blocks: 558640
Enabled by default now:
http://hg.mozilla.org/mozilla-central/rev/129e19d979f0

I'm leaving this open as a tracking bug.
Depends on: 562328
Depends on: 562326
Depends on: 562631
Depends on: 562635
Blocks: 562699
Blocks: 491789
Blocks: 324875
Enabled by default again:
http://hg.mozilla.org/mozilla-central/rev/358113b3642e
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 563536
Blocks: 508867
No longer depends on: 508867
Depends on: 565938
Target Milestone: --- → mozilla1.9.3a5
Depends on: 568228
Depends on: 571389
Depends on: 572208
Blocks: 572215
Depends on: 572713
Blocks: 592219
No longer depends on: 593807
Blocks: 593809
Depends on: 605373
Depends on: 608373
Depends on: 611398
Depends on: 614194
Blocks: 563978
Depends on: 616970
No longer depends on: 616970
Depends on: 620106
Depends on: 620188
Depends on: 627729
Depends on: 595270
Depends on: 638487
Depends on: 639295
Depends on: 643923
Depends on: 644144
Depends on: 645115
Depends on: 644971
Depends on: 650643
Depends on: 651441
Depends on: 644540
Depends on: 655434
Depends on: 659763
Depends on: 659946
Depends on: 644959
Depends on: 689893
Depends on: 693271
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: