Last Comment Bug 373864 - (html5-parsing) Replace HTML parser with an HTML5 parser
(html5-parsing)
: Replace HTML parser with an HTML5 parser
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: HTML: Parser (show other bugs)
: Trunk
: All All
: -- normal with 8 votes (vote)
: mozilla1.9.3a5
Assigned To: Henri Sivonen (:hsivonen)
:
: Andrew Overholt [:overholt]
Mentors:
http://www.whatwg.org/specs/web-apps/...
Depends on: 431559 655434 689893 693271 308145 366936 483015 483209 html5-parsing-land 501345 501886 502804 502984 503292 503357 503632 503702 504941 507119 508075 508663 510648 512639 513194 513383 513923 515816 516422 520581 525960 529587 533085 534458 535499 535740 537683 538193 543652 546665 553167 553795 554513 558302 562326 562328 562631 562635 563536 565938 568228 571389 572208 572713 588462 595270 605373 608373 611398 614194 620106 620188 627729 638487 639295 643923 644144 644540 644959 644971 645115 650643 651441 659763 659946
Blocks: 515819 536383 562699 563978 22480 100474 109152 147566 178258 189051 SGMLComment 261892 268442 305242 321484 324875 328930 353926 385776 390210 431074 436600 449176 474670 477528 485870 489178 490605 491789 497846 501846 507093 507701 507853 508867 514122 533149 535448 541918 546289 547963 558640 560064 569993 572215 577111 592219 593809
  Show dependency treegraph
 
Reported: 2007-03-13 22:14 PDT by Robert Sayre
Modified: 2011-10-10 23:25 PDT (History)
46 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
mochitest test cases for html5lib tokenizer tests (521.73 KB, patch)
2009-07-16 16:54 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
tokenizer test results for tests in comment 9 (272.51 KB, text/html)
2009-07-16 16:56 PDT, Jonathan Griffin (:jgriffin)
no flags Details
mochitest test cases for html5lib tokenizer tests (v2) (397.14 KB, patch)
2009-07-24 11:03 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
tokenizer test results for tests in comment 12 (381 bytes, text/html)
2009-07-24 11:05 PDT, Jonathan Griffin (:jgriffin)
no flags Details
tokenizer test results for tests in comment 12 (1.94 KB, text/html)
2009-07-24 11:06 PDT, Jonathan Griffin (:jgriffin)
no flags Details
mochitest test cases for html5lib parser tests (v3) (585.23 KB, patch)
2009-07-28 14:08 PDT, Jonathan Griffin (:jgriffin)
no flags Details | Diff | Splinter Review
tree builder test results for tests in comment 15 (8.71 KB, text/html)
2009-07-28 14:17 PDT, Jonathan Griffin (:jgriffin)
no flags Details
mochitest test cases for html5lib parser tests (v4) (876.16 KB, patch)
2009-07-29 15:23 PDT, Jonathan Griffin (:jgriffin)
sayrer: review-
Details | Diff | Splinter Review
test results for Java->C++ parser comparison (5.73 KB, text/html)
2009-07-29 15:24 PDT, Jonathan Griffin (:jgriffin)
no flags Details
mochitest test cases for html5lib parser tests (v5) (885.08 KB, patch)
2009-08-11 10:03 PDT, Jonathan Griffin (:jgriffin)
sayrer: review+
Details | Diff | Splinter Review

Description Robert Sayre 2007-03-13 22:14:06 PDT
No idea on a target for this, but we have many failing tests from the html5lib suite.
Comment 1 Brendan Eich [:brendan] 2007-03-14 22:21:28 PDT
mrbkap had looked into this last summer.

/be
Comment 2 Robert Sayre 2007-03-14 22:32:39 PDT
(In reply to comment #1)
>
> mrbkap had looked into this last summer.

I also pinged him on bug 328930, but after I signed off irc, I realized I wasn't clear if he wanted to do that himself, or just wanted it done. This will make a fine tracking bug if working on the parser becomes a priority.
Comment 3 Blake Kaplan (:mrbkap) 2007-03-14 23:11:26 PDT
I certainly want to work on this. The problem is that I simply don't have time to do so between school and my other bugs (the new residual style stuff ended up as P3 for Firefox 3).
Comment 4 Henri Sivonen (:hsivonen) 2008-09-26 00:38:19 PDT
Mozilla Corporation has contracted me to translate the Validator.nu HTML parser into Gecko-compatible C++.
Comment 5 Henri Sivonen (:hsivonen) 2008-12-12 12:34:54 PST
Assigning to me and adjusting summary to plan.
Comment 6 Henri Sivonen (:hsivonen) 2009-03-26 09:03:12 PDT
Instead of making this a tracking bug, bugs pertaining to the HTML5 parser repo have "[HTML5] " at the start of their summary.
Comment 7 Blake Kaplan (:mrbkap) 2009-03-30 16:35:41 PDT
Henri, if you don't mind, I'm going to make HTML parser bugs that will be resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't get lost.
Comment 8 Henri Sivonen (:hsivonen) 2009-03-31 04:34:55 PDT
(In reply to comment #7)
> Henri, if you don't mind, I'm going to make HTML parser bugs that will be
> resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't
> get lost.

OK. Makes sense.
Comment 9 Jonathan Griffin (:jgriffin) 2009-07-16 16:54:37 PDT
Created attachment 389048 [details] [diff] [review]
mochitest test cases for html5lib tokenizer tests

These are 2,722 html5 tokenizer tests from the html5lib suite, converted to mochitest using the parser harness from bug 366936.  There are 18 failures, attached in separate result file.  Please let me know if you'd like these submitted as bugs and/or added to the known failures list. 

I've included most of the tests from the tokenizer suite, but omitted the following:

- xmlViolation.test
- contentModelFlags.test
- tests involving invalid unicode chars
- tests involving multi-byte unicode chars

The latter two classes were excluded because they cause exceptions in the parser sub-harness, in the call to btoa().

I've made a few changes to the parser sub-harness to enable it to work with these files, and have verified that the changes don't interfere with the original treebuilder tests.
Comment 10 Jonathan Griffin (:jgriffin) 2009-07-16 16:56:55 PDT
Created attachment 389049 [details]
tokenizer test results for tests in comment 9
Comment 11 Henri Sivonen (:hsivonen) 2009-07-17 02:58:18 PDT
(In reply to comment #9)
> Created an attachment (id=389048) [details]
> mochitest test cases for html5lib tokenizer tests
> 
> These are 2,722 html5 tokenizer tests from the html5lib suite, converted to
> mochitest using the parser harness from bug 366936.

Great! Thank you!

> There are 18 failures,
> attached in separate result file.  Please let me know if you'd like these
> submitted as bugs and/or added to the known failures list. 

The tests that fail and whose description mentions [R]CDATA (or just CDATA) haven't been converted to tree builder tests correctly. The original tokenizer tests have additional JSON items for initializing the tokenizer to the RCDATA or the CDATA state and for setting up the element name that the tokenizer should look for to end the [R]CDATA run.

To convert these tests (from escapeFlag.test, it seems) to tree builder tests correctly, you need to add an <xmp> start tag. This tests the CDATA case. To test the RCDATA case, you need to replace "xmp" with "textarea" everywhere in the tests.

I don't see why "U+0080 in lookahead region" failed. Odd.

The other failures are genuine bugs that definitely need addressing. I'm particularly surprised at the failures related to CR and and the failures related to bogus entities.
Comment 12 Jonathan Griffin (:jgriffin) 2009-07-24 11:03:51 PDT
Created attachment 390501 [details] [diff] [review]
mochitest test cases for html5lib tokenizer tests (v2)

This is a much better implementation of the html5lib tokenizer tests.  It correctly handles unicode characters and other escaped characters.  I omitted only tests for 0xd800 and 0xdfff, as the server-side js that produces the test markup contains an algorithm for generating utf-8 that doesn't correctly handle these invalid characters.

I've made some modifications to a few of the tests (e.g., the escapeFlag and contentModelFlags tests) to account for the fact that the tokenizer output is run through the tree builder, as Henri notes above.

There are 3,375 tests in total with only 4 failures; see next attachment.
Comment 13 Jonathan Griffin (:jgriffin) 2009-07-24 11:05:16 PDT
Created attachment 390502 [details]
tokenizer test results for tests in comment 12
Comment 14 Jonathan Griffin (:jgriffin) 2009-07-24 11:06:12 PDT
Created attachment 390503 [details]
tokenizer test results for tests in comment 12

correct attachment
Comment 15 Jonathan Griffin (:jgriffin) 2009-07-28 14:08:45 PDT
Created attachment 391186 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v3)

I've added to this patch a refresh of the tree builder tests which already existed in the repo.  I've refreshed the existing tests from html5lib, and added new tests which weren't being run.  I've also updated the parser sub-harness to better handle MathML and SVG elements, and to know to skip #document-fragment tests.  There doesn't seem to be any clean way of running fragment tests without direct access to the parser; if anyone can think of a way, let me know and I'll add them.  Test results attached separately.
Comment 16 Jonathan Griffin (:jgriffin) 2009-07-28 14:17:05 PDT
Created attachment 391187 [details]
tree builder test results for tests in comment 15

Attached are the tree builder test results obtained from running the updated tree builder tests against the HTML5 parser.  The results can be summarized as follows:

- only 1 known failure still fails (an <isindex> case)
- 7 tests fail from regressions.txt, but they all appear to be correct according to the HTML5 parsing algorithm.  Henri, can you review these?  If they are correct I will update them in regressions.txt
- 1 failure involving <keygen>; should this element appear in the DOM?  Maybe not, see bug 101019.
- 8 failures in MathML and SVG tests involving xlink:href and xml:lang attributes.  Henri, can you review and let me know if the tests need updating or if a bug should be filed?
Comment 17 Jonathan Griffin (:jgriffin) 2009-07-29 15:23:09 PDT
Created attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

Here's another update of the html5lib tests.  This update allows Java->C++ tests to be run; test_html5_tree_construction_js_compare.html runs all of the tree construction tests by comparing their output in Gecko vs the JS validator from Henri.  The js_compare mode ignores comments and doctypes, since the JS version of the parser does not emit those.  Results of this are attached separately; there are currently 13 failures.  Most of the failures are due to the fact that the JS parser doesn't seem to be able to handle xml:lang or xlink:href correctly, but the first two failures are potentially interesting.
Comment 18 Jonathan Griffin (:jgriffin) 2009-07-29 15:24:43 PDT
Created attachment 391460 [details]
test results for Java->C++ parser comparison
Comment 19 Henri Sivonen (:hsivonen) 2009-08-07 06:46:14 PDT
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

I feel I'm too inexperienced with Mochitest coding patterns to review this properly. sayrer wrote the code originally, so I think he'd be a better reviewer than me.

However, here are some observations:
 * The dumpTree() method looks a bit odd. Why the global state variable? Is the purpose to defer the action somehow?

 * Is reorderToMatchExpected() for the old parser only? I'd expect it to be unnecessary with the new parser.

 * In principtle, (walker.currentNode.namespaceURI.toLowerCase().indexOf("math") != -1) should be (walker.currentNode.namespaceURI == "http://www.w3.org/1998/Math/MathML") and likewise (walker.currentNode.namespaceURI == "http://www.w3.org/2000/svg") for SVG. Probably doesn't matter in practice now, but if code happened to somehow break in a way that makes it matter, it would be good to have the right namespace tests in the harness.
 
 * html5lib_license.txt should probably be updated to contain a fresh copy of the upstream copyright year and contributor list.
Comment 20 Robert Sayre 2009-08-07 14:11:38 PDT
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

we shouldn't have this stuff conditioned on a bunch of global variables. please refactor to avoid them. looks good otherwise.

iirc, reorderToMatchExpected was required because the original test suite from html5lib depended on python's dictionary ordering details.
Comment 21 Jonathan Griffin (:jgriffin) 2009-08-11 10:03:09 PDT
Created attachment 393791 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v5)

Completely refactored code in parser*.js to avoid global variables.  Also, I verified that reorderToMatchExpected() has no effect on current tests so I removed it.
Comment 22 Jonathan Griffin (:jgriffin) 2009-08-13 13:34:23 PDT
How should we handle the current test failures?  Shall I flag them as 'todo' so that they won't show as failures on tinderbox, and check them in, so that the tests can start running and catch regressions?  Or would you prefer to leave them out of the repo for now?
Comment 23 Henri Sivonen (:hsivonen) 2009-08-14 01:35:26 PDT
I'd prefer checking in with todo flags.
Comment 24 Jonathan Griffin (:jgriffin) 2009-08-14 15:07:05 PDT
I've updated the tests to mark current failures as 'todo' and pushed as http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.
Comment 25 Dão Gottwald [:dao] 2009-08-15 03:24:14 PDT
(In reply to comment #24)
> I've updated the tests to mark current failures as 'todo' and pushed as
> http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.

Backed out because of persistent leaks. E.g. http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250318073.1250321057.30450.gz
Comment 26 u88484 2010-01-30 14:41:57 PST
Is there a bug on file about turning this on by default?  I can't find one and just wanted to see what the plan was for turning this on by default.  A few users (including myself) agree that we all haven't seen any major issues with the HTML5 parser is a long time and perhaps this could be enabled by default to get wider testing.
Comment 27 Henri Sivonen (:hsivonen) 2010-01-31 23:55:00 PST
(In reply to comment #26)
> Is there a bug on file about turning this on by default?

None other than this one, though one could argue that this bug as currently worded is about more than flipping the pref by default.

> I can't find one and
> just wanted to see what the plan was for turning this on by default.

I have a shared bugzilla query named "HTML5 Parsing Blocker". I'm working on the assumption that the length of that list needs to go to zero before the pref can be flipped by default.

In addition to items on that list, there needs to be a new "slow tp" test suite that tests tp4 over a simulated slow network. I don't know how that item is progressing. Then the parser core needs to pass code review (blocking on me to write more documentation).

> A few
> users (including myself) agree that we all haven't seen any major issues with
> the HTML5 parser is a long time and perhaps this could be enabled by default to
> get wider testing.

I think the HTML5 parser on the trunk is in good enough shape to be turned on by trunk testers. The problem is that it isn't in good enough shape for Mochitest.
Comment 28 Bill Gianopoulos [:WG9s] 2010-03-28 13:19:10 PDT
(In reply to comment #27)
> (In reply to comment #26)
> > Is there a bug on file about turning this on by default?
> 
> None other than this one, though one could argue that this bug as currently
> worded is about more than flipping the pref by default.
> 
> I think the HTML5 parser on the trunk is in good enough shape to be turned on
> by trunk testers. The problem is that it isn't in good enough shape for
> Mochitest.

So, I sonder if what we need here is for a bug to be opened on flipping the preference with all the mochitest issues being in bugs that block it.

It seems without that, there is no real direction to developers on what needs ot be done to get this enabled by default.
Comment 29 Henri Sivonen (:hsivonen) 2010-04-28 05:18:56 PDT
Enabled by default now:
http://hg.mozilla.org/mozilla-central/rev/129e19d979f0

I'm leaving this open as a tracking bug.
Comment 30 Henri Sivonen (:hsivonen) 2010-04-28 22:57:39 PDT
Had to revert this yesterday:
http://hg.mozilla.org/mozilla-central/rev/ccb50d524490
Comment 31 Henri Sivonen (:hsivonen) 2010-05-03 03:49:47 PDT
Enabled by default again:
http://hg.mozilla.org/mozilla-central/rev/358113b3642e

Note You need to log in before you can comment on or make changes to this bug.