Import upstream patch that makes html5lib tokenizer tests runnable without the tree builder
Categories
(Core :: DOM: HTML Parser, task)
Tracking
Version | Tracking | Status
---|---|---
firefox93 | --- | fixed
People
(Reporter: hsivonen, Assigned: hsivonen)
Details
Attachments
(1 file)
Import https://github.com/validator/htmlparser/commit/2f843768d2f10dc56ebce715faaf8dcb411febc4 in order to get the repos in sync.
No Web-exposed changes expected.
Comment 1 (Assignee) • 3 years ago
This change brings the tokenizer’s handling of U+0000 NUL characters in
the DATA state and the CDATA section state into conformance with the
requirements in the HTML spec — for the case where only tokenization is
being performed, without tree construction; that is, the case where the
tokenizer() method is called, rather than parse() or parseFragment().
Specifically, the tokenization steps defined in the spec require that
when a U+0000 NUL is consumed in the DATA state or in the CDATA section
state, the parser must then emit a U+0000 NUL. But when performing tree
construction, the spec requires that when a U+0000 NUL is consumed, the
parser must instead emit a U+FFFD REPLACEMENT CHARACTER.
Without this change, the parser always emits a U+FFFD REPLACEMENT
CHARACTER, even when only tokenization is being performed. That causes
us to fail a number of tests in the html5lib-tests suite.
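As a rough illustration of the two behaviors described above, here is a minimal Python sketch. It is not the htmlparser code: the function name `emitted_text` and the `tokenize_only` flag are hypothetical, and the sketch models only the NUL handling for text consumed in the DATA state or the CDATA section state, not the tokenizer state machine.

```python
NUL = "\u0000"
REPLACEMENT_CHARACTER = "\uFFFD"


def emitted_text(consumed: str, tokenize_only: bool) -> str:
    """Model what is emitted for text consumed in the DATA or CDATA section state.

    Hypothetical helper: it models only the NUL handling described in this
    comment, not the real tokenizer.
    """
    # Tokenization only: U+0000 is emitted unchanged.
    # With tree construction: U+0000 is replaced by U+FFFD.
    substitute = NUL if tokenize_only else REPLACEMENT_CHARACTER
    return consumed.replace(NUL, substitute)
```

Under these assumptions, `emitted_text("a\u0000b", tokenize_only=True)` keeps the NUL, while `tokenize_only=False` yields `"a\uFFFDb"`, which is the difference the tokenizer-only tests depend on.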
For more background on the relevant behavior, see the following:
- https://www.w3.org/Bugs/Public/show_bug.cgi?id=9659
- https://github.com/whatwg/html/commit/d98f83e
- https://github.com/validator/htmlparser/commit/9b9c263
Relates to https://github.com/validator/htmlparser/issues/35