Note: There are a few cases of duplicates in user autocompletion which are being worked on.
Bug 373864 (html5-parsing)

Replace HTML parser with an HTML5 parser

RESOLVED FIXED in mozilla1.9.3a5

Status

()

Core
HTML: Parser
RESOLVED FIXED
11 years ago
6 years ago

People

(Reporter: Robert Sayre, Assigned: hsivonen)

Tracking

(Depends on: 4 bugs, Blocks: 4 bugs)

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(4 attachments, 6 obsolete attachments)

(Reporter)

Description

11 years ago
No idea on a target for this, but we have many failing tests from the html5lib suite.
(Reporter)

Updated

11 years ago
Summary: convert the html parser web applications 1.0 algorithm → convert the html parser to the web applications 1.0 algorithm
mrbkap had looked into this last summer.

/be
(Reporter)

Comment 2

11 years ago
(In reply to comment #1)
>
> mrbkap had looked into this last summer.

I also pinged him on bug 328930, but after I signed off irc, I realized I wasn't clear if he wanted to do that himself, or just wanted it done. This will make a fine tracking bug if working on the parser becomes a priority.
I certainly want to work on this. The problem is that I simply don't have time to do so between school and my other bugs (the new residual style stuff ended up as P3 for Firefox 3).
(Reporter)

Updated

11 years ago
Depends on: 308145

Updated

9 years ago
Depends on: 385776
Summary: convert the html parser to the web applications 1.0 algorithm → Convert the HTML parser to the Web Applications 1.0 (HTML 5) algorithm

Updated

9 years ago
Blocks: 449176

Updated

9 years ago
Blocks: 436600
Mozilla Corporation has contracted me to translate the Validator.nu HTML parser into Gecko-compatible C++.
Assigning to me and adjusting summary to plan.
Alias: html5-parsing
Assignee: sayrer → hsivonen
Summary: Convert the HTML parser to the Web Applications 1.0 (HTML 5) algorithm → Replace HTML parser with an HTML5 parser
Instead of making this a tracking bug, bugs pertaining to the HTML5 parser repo have "[HTML5] " at the start of their summary.
Henri, if you don't mind, I'm going to make HTML parser bugs that will be resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't get lost.

Updated

8 years ago
Blocks: 485870
(In reply to comment #7)
> Henri, if you don't mind, I'm going to make HTML parser bugs that will be
> resolved (either as FIXED or WONTFIX) as dependencies of this one so they don't
> get lost.

OK. Makes sense.

Updated

8 years ago
Blocks: 178258
Depends on: 101019
Depends on: 487949

Updated

8 years ago
Depends on: 501345
Blocks: 501846

Updated

8 years ago
Depends on: 503292, 502804

Updated

8 years ago
Depends on: 503357

Updated

8 years ago
Depends on: 502984
Blocks: 431074

Updated

8 years ago
Depends on: 503632

Updated

8 years ago
Depends on: 503702

Updated

8 years ago
Blocks: 22480

Updated

8 years ago
Blocks: 390210
Created attachment 389048 [details] [diff] [review]
mochitest test cases for html5lib tokenizer tests

These are 2,722 html5 tokenizer tests from the html5lib suite, converted to mochitest using the parser harness from bug 366936.  There are 18 failures, attached in separate result file.  Please let me know if you'd like these submitted as bugs and/or added to the known failures list. 

I've included most of the tests from the tokenizer suite, but omitted the following:

- xmlViolation.test
- contentModelFlags.test
- tests involving invalid unicode chars
- tests involving multi-byte unicode chars

The latter two classes were excluded because they cause exceptions in the parser sub-harness, in the call to btoa().

I've made a few changes to the parser sub-harness to enable it to work with these files, and have verified that the changes don't interfere with the original treebuilder tests.
Attachment #389048 - Flags: review?(hsivonen)
Created attachment 389049 [details]
tokenizer test results for tests in comment 9
(In reply to comment #9)
> Created an attachment (id=389048) [details]
> mochitest test cases for html5lib tokenizer tests
> 
> These are 2,722 html5 tokenizer tests from the html5lib suite, converted to
> mochitest using the parser harness from bug 366936.

Great! Thank you!

> There are 18 failures,
> attached in separate result file.  Please let me know if you'd like these
> submitted as bugs and/or added to the known failures list. 

The tests that fail and whose description mentions [R]CDATA (or just CDATA) haven't been converted to tree builder tests correctly. The original tokenizer tests have additional JSON items for initializing the tokenizer to the RCDATA or the CDATA state and for setting up the element name that the tokenizer should look for to end the [R]CDATA run.

To convert these tests (from escapeFlag.test, it seems) to tree builder tests correctly, you need to add an <xmp> start tag. This tests the CDATA case. To test the RCDATA case, you need to replace "xmp" with "textarea" everywhere in the tests.

I don't see why "U+0080 in lookahead region" failed. Odd.

The other failures are genuine bugs that definitely need addressing. I'm particularly surprised at the failures related to CR and and the failures related to bogus entities.

Updated

8 years ago
Depends on: 504941

Updated

8 years ago
Blocks: 490605
Attachment #389049 - Attachment is obsolete: true
Created attachment 390501 [details] [diff] [review]
mochitest test cases for html5lib tokenizer tests (v2)

This is a much better implementation of the html5lib tokenizer tests.  It correctly handles unicode characters and other escaped characters.  I omitted only tests for 0xd800 and 0xdfff, as the server-side js that produces the test markup contains an algorithm for generating utf-8 that doesn't correctly handle these invalid characters.

I've made some modifications to a few of the tests (e.g., the escapeFlag and contentModelFlags tests) to account for the fact that the tokenizer output is run through the tree builder, as Henri notes above.

There are 3,375 tests in total with only 4 failures; see next attachment.
Attachment #389048 - Attachment is obsolete: true
Attachment #390501 - Flags: review?(hsivonen)
Attachment #389048 - Flags: review?(hsivonen)
Created attachment 390502 [details]
tokenizer test results for tests in comment 12
Created attachment 390503 [details]
tokenizer test results for tests in comment 12

correct attachment
Attachment #390502 - Attachment is obsolete: true
Created attachment 391186 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v3)

I've added to this patch a refresh of the tree builder tests which already existed in the repo.  I've refreshed the existing tests from html5lib, and added new tests which weren't being run.  I've also updated the parser sub-harness to better handle MathML and SVG elements, and to know to skip #document-fragment tests.  There doesn't seem to be any clean way of running fragment tests without direct access to the parser; if anyone can think of a way, let me know and I'll add them.  Test results attached separately.
Attachment #390501 - Attachment is obsolete: true
Attachment #391186 - Flags: review?(hsivonen)
Attachment #390501 - Flags: review?(hsivonen)
Created attachment 391187 [details]
tree builder test results for tests in comment 15

Attached are the tree builder test results obtained from running the updated tree builder tests against the HTML5 parser.  The results can be summarized as follows:

- only 1 known failure still fails (an <isindex> case)
- 7 tests fail from regressions.txt, but they all appear to be correct according to the HTML5 parsing algorithm.  Henri, can you review these?  If they are correct I will update them in regressions.txt
- 1 failure involving <keygen>; should this element appear in the DOM?  Maybe not, see bug 101019.
- 8 failures in MathML and SVG tests involving xlink:href and xml:lang attributes.  Henri, can you review and let me know if the tests need updating or if a bug should be filed?
Blocks: 507093
Created attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

Here's another update of the html5lib tests.  This update allows Java->C++ tests to be run; test_html5_tree_construction_js_compare.html runs all of the tree construction tests by comparing their output in Gecko vs the JS validator from Henri.  The js_compare mode ignores comments and doctypes, since the JS version of the parser does not emit those.  Results of this are attached separately; there are currently 13 failures.  Most of the failures are due to the fact that the JS parser doesn't seem to be able to handle xml:lang or xlink:href correctly, but the first two failures are potentially interesting.
Attachment #391186 - Attachment is obsolete: true
Attachment #391458 - Flags: review?(hsivonen)
Attachment #391186 - Flags: review?(hsivonen)
Created attachment 391460 [details]
test results for Java->C++ parser comparison
Blocks: 507853

Updated

8 years ago
Depends on: 508075

Updated

8 years ago
Depends on: 508663
Blocks: 189051
Depends on: 508867
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

I feel I'm too inexperienced with Mochitest coding patterns to review this properly. sayrer wrote the code originally, so I think he'd be a better reviewer than me.

However, here are some observations:
 * The dumpTree() method looks a bit odd. Why the global state variable? Is the purpose to defer the action somehow?

 * Is reorderToMatchExpected() for the old parser only? I'd expect it to be unnecessary with the new parser.

 * In principtle, (walker.currentNode.namespaceURI.toLowerCase().indexOf("math") != -1) should be (walker.currentNode.namespaceURI == "http://www.w3.org/1998/Math/MathML") and likewise (walker.currentNode.namespaceURI == "http://www.w3.org/2000/svg") for SVG. Probably doesn't matter in practice now, but if code happened to somehow break in a way that makes it matter, it would be good to have the right namespace tests in the harness.
 
 * html5lib_license.txt should probably be updated to contain a fresh copy of the upstream copyright year and contributor list.
Attachment #391458 - Flags: review?(hsivonen) → review?(sayrer)
(Reporter)

Updated

8 years ago
Attachment #391458 - Flags: review?(sayrer) → review-
(Reporter)

Comment 20

8 years ago
Comment on attachment 391458 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v4)

we shouldn't have this stuff conditioned on a bunch of global variables. please refactor to avoid them. looks good otherwise.

iirc, reorderToMatchExpected was required because the original test suite from html5lib depended on python's dictionary ordering details.

Updated

8 years ago
Blocks: 328930
No longer depends on: 328930

Updated

8 years ago
Blocks: 385776
No longer depends on: 385776
Created attachment 393791 [details] [diff] [review]
mochitest test cases for html5lib parser tests (v5)

Completely refactored code in parser*.js to avoid global variables.  Also, I verified that reorderToMatchExpected() has no effect on current tests so I removed it.
Attachment #391458 - Attachment is obsolete: true
Attachment #393791 - Flags: review?
Attachment #393791 - Flags: review? → review?(sayrer)
(Reporter)

Updated

8 years ago
Attachment #393791 - Flags: review?(sayrer) → review+

Updated

8 years ago
Depends on: 501886
How should we handle the current test failures?  Shall I flag them as 'todo' so that they won't show as failures on tinderbox, and check them in, so that the tests can start running and catch regressions?  Or would you prefer to leave them out of the repo for now?
I'd prefer checking in with todo flags.
I've updated the tests to mark current failures as 'todo' and pushed as http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.

Updated

8 years ago
Depends on: 510648
(In reply to comment #24)
> I've updated the tests to mark current failures as 'todo' and pushed as
> http://hg.mozilla.org/mozilla-central/rev/fc7d931fd75b.

Backed out because of persistent leaks. E.g. http://tinderbox.mozilla.org/showlog.cgi?log=Firefox-Unittest/1250318073.1250321057.30450.gz
Blocks: 100474
Blocks: 109152
Blocks: 321484
Blocks: 305242
Blocks: 261892
Blocks: 474670
Blocks: 147566
Blocks: 268442

Updated

8 years ago
Depends on: 512639

Updated

8 years ago
Depends on: 513194

Updated

8 years ago
Depends on: 513923

Updated

8 years ago
Depends on: 515816

Updated

8 years ago
Blocks: 515819
Blocks: 516666

Updated

8 years ago
Depends on: 483015
See Also: → bug 529503

Updated

8 years ago
Depends on: 529587

Updated

8 years ago
Depends on: 513383

Updated

8 years ago
Depends on: 520581

Updated

8 years ago
Depends on: 533085
Blocks: 533149

Updated

8 years ago
Depends on: 534458

Updated

8 years ago
Depends on: 535499
Depends on: 535740
Blocks: 497846
Blocks: 536383

Updated

8 years ago
Depends on: 537683

Updated

8 years ago
Depends on: 538193

Updated

8 years ago
Blocks: 214476
No longer blocks: 516666

Updated

8 years ago
Depends on: 539237
No longer depends on: 539237
Depends on: 431559
Blocks: 541918

Updated

8 years ago
Depends on: 507119

Comment 26

8 years ago
Is there a bug on file about turning this on by default?  I can't find one and just wanted to see what the plan was for turning this on by default.  A few users (including myself) agree that we all haven't seen any major issues with the HTML5 parser is a long time and perhaps this could be enabled by default to get wider testing.
(In reply to comment #26)
> Is there a bug on file about turning this on by default?

None other than this one, though one could argue that this bug as currently worded is about more than flipping the pref by default.

> I can't find one and
> just wanted to see what the plan was for turning this on by default.

I have a shared bugzilla query named "HTML5 Parsing Blocker". I'm working on the assumption that the length of that list needs to go to zero before the pref can be flipped by default.

In addition to items on that list, there needs to be a new "slow tp" test suite that tests tp4 over a simulated slow network. I don't know how that item is progressing. Then the parser core needs to pass code review (blocking on me to write more documentation).

> A few
> users (including myself) agree that we all haven't seen any major issues with
> the HTML5 parser is a long time and perhaps this could be enabled by default to
> get wider testing.

I think the HTML5 parser on the trunk is in good enough shape to be turned on by trunk testers. The problem is that it isn't in good enough shape for Mochitest.
Depends on: 483209
Depends on: 543652

Updated

8 years ago
Depends on: 525960

Updated

8 years ago
Blocks: 353926

Updated

8 years ago
Depends on: 516422

Updated

8 years ago
Depends on: 546665
Blocks: 547963
No longer depends on: 101019

Updated

8 years ago
Depends on: 551047
No longer depends on: 551047

Updated

7 years ago
Depends on: 553167
Depends on: 553795

Updated

7 years ago
Depends on: 554513
(In reply to comment #27)
> (In reply to comment #26)
> > Is there a bug on file about turning this on by default?
> 
> None other than this one, though one could argue that this bug as currently
> worded is about more than flipping the pref by default.
> 
> I think the HTML5 parser on the trunk is in good enough shape to be turned on
> by trunk testers. The problem is that it isn't in good enough shape for
> Mochitest.

So, I sonder if what we need here is for a bug to be opened on flipping the preference with all the mochitest issues being in bugs that block it.

It seems without that, there is no real direction to developers on what needs ot be done to get this enabled by default.

Updated

7 years ago
Depends on: 558302

Updated

7 years ago
Blocks: 560064

Updated

7 years ago
Blocks: 507701

Updated

7 years ago
Blocks: 514122

Updated

7 years ago
Blocks: 489178

Updated

7 years ago
Blocks: 546289

Updated

7 years ago
Blocks: 558640
Enabled by default now:
http://hg.mozilla.org/mozilla-central/rev/129e19d979f0

I'm leaving this open as a tracking bug.
Had to revert this yesterday:
http://hg.mozilla.org/mozilla-central/rev/ccb50d524490
Depends on: 562328
Depends on: 562326
Depends on: 562631
Depends on: 562635

Updated

7 years ago
Blocks: 562699

Updated

7 years ago
Blocks: 491789

Updated

7 years ago
Blocks: 324875
Enabled by default again:
http://hg.mozilla.org/mozilla-central/rev/358113b3642e
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED

Updated

7 years ago
Depends on: 563536
Blocks: 508867
No longer depends on: 508867

Updated

7 years ago
Depends on: 565938

Updated

7 years ago
Target Milestone: --- → mozilla1.9.3a5

Updated

7 years ago
Depends on: 568228
Blocks: 569993

Updated

7 years ago
Depends on: 571389

Updated

7 years ago
Depends on: 572208

Updated

7 years ago
Blocks: 572215

Updated

7 years ago
Depends on: 572713

Updated

7 years ago
Blocks: 577111
Depends on: 588462

Updated

7 years ago
Blocks: 592219
Depends on: 593807
No longer depends on: 593807

Updated

7 years ago
Blocks: 593809
Depends on: 605373
Depends on: 608373

Updated

7 years ago
Depends on: 611398

Updated

7 years ago
Depends on: 614194

Updated

7 years ago
Blocks: 563978

Updated

7 years ago
Depends on: 616970

Updated

7 years ago
No longer depends on: 616970

Updated

7 years ago
Depends on: 620106

Updated

7 years ago
Depends on: 620188

Updated

7 years ago
Blocks: 535448
Depends on: 627729
Depends on: 595270

Updated

7 years ago
Depends on: 638487

Updated

7 years ago
Depends on: 639295

Updated

6 years ago
Depends on: 643923

Updated

6 years ago
Depends on: 644144

Updated

6 years ago
Depends on: 645115

Updated

6 years ago
Depends on: 644971

Updated

6 years ago
Depends on: 650643

Updated

6 years ago
Depends on: 651441

Updated

6 years ago
Depends on: 644540

Updated

6 years ago
Depends on: 655434

Updated

6 years ago
Depends on: 659763

Updated

6 years ago
Depends on: 659946

Updated

6 years ago
Depends on: 644959
Blocks: 477528

Updated

6 years ago
Depends on: 689893

Updated

6 years ago
Depends on: 693271
You need to log in before you can comment on or make changes to this bug.