Closed
Bug 634541
Opened 13 years ago
Closed 6 years ago
HTML parser should remove only one BOM
Categories
(Core :: DOM: HTML Parser, defect)
Tracking
RESOLVED
FIXED
mozilla63
| Tracking | Status |
---|---|---|
firefox63 | --- | fixed |
People
(Reporter: hsivonen, Assigned: hsivonen)
References
(Blocks 1 open bug)
Details
Attachments
(6 files)
By code inspection, it seems that asking for a "UTF-16BE" or "UTF-16LE" decoder returns a decoder that has fixed endianness but still swallows the first character if it is U+FEFF. This is wrong, because only "UTF-16" should interpret U+FEFF as a BOM; the explicit LE and BE variants should treat it as a ZWNBSP. It is also harmful in practice, because code that has done its own BOM sniffing and wants a decoder without BOM-swallowing state (the HTML5 parser) can't get one.
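For illustration only, the distinction argued for above can be observed in Python's standard `codecs`, which is unrelated to Gecko's decoders: the generic "utf-16" codec consumes a leading U+FEFF as a BOM, while the explicit "utf-16-le" codec keeps it as a character. The variable names are placeholders for this sketch.

```python
import codecs

# "A" preceded by U+FEFF, encoded as little-endian UTF-16 bytes.
data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

# The generic codec treats the leading U+FEFF as a BOM and drops it.
generic = data.decode("utf-16")

# The explicit little-endian codec keeps U+FEFF as ordinary content (ZWNBSP).
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

A caller that has already sniffed and removed the BOM would want the second behavior, which is the behavior the report says the "UTF-16LE" and "UTF-16BE" decoders did not provide.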
Assignee
Comment 1•13 years ago
Assignee
Comment 2•13 years ago
Assignee
Comment 3•13 years ago
Assignee
Comment 4•13 years ago
Chrome swallows one BOM as well. Opera 11 seems to swallow any number of BOMs. Let's assume the Web might require this and deal with it on the parser side.
Assignee: smontagu → nobody
Component: Internationalization → HTML: Parser
QA Contact: i18n → parser
Summary: UTF-16BE and UTF-16LE decoders swallow an initial BOM → Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow
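The workaround named in the new summary can be sketched as follows. This is a minimal Python model, not the Gecko code: `SwallowingDecoder` is a hypothetical stand-in for a fixed-endianness decoder that drops one leading U+FEFF, and `decode_after_sniffing` is a hypothetical helper showing how a caller that has already removed the BOM could feed a sacrificial BOM first.

```python
import codecs


class SwallowingDecoder:
    """Hypothetical decoder that drops one leading U+FEFF, as described in the bug."""

    def __init__(self, encoding):
        self._inner = codecs.getincrementaldecoder(encoding)()
        self._at_start = True

    def feed(self, chunk):
        text = self._inner.decode(chunk)
        if self._at_start and text:
            self._at_start = False
            if text[0] == "\ufeff":
                text = text[1:]
        return text


def decode_after_sniffing(payload, encoding, bom):
    """Decode bytes whose BOM the caller has already sniffed and removed."""
    decoder = SwallowingDecoder(encoding)
    # Feed a sacrificial BOM so the swallowing state is used up
    # before any real content arrives.
    decoder.feed(bom)
    return decoder.feed(payload)


# Content that begins with a genuine U+FEFF after the already-removed BOM.
payload = "\ufeffhi".encode("utf-16-le")

naive = SwallowingDecoder("utf-16-le").feed(payload)
primed = decode_after_sniffing(payload, "utf-16-le", codecs.BOM_UTF16_LE)

print(repr(naive))
print(repr(primed))
```

In this model the naive path loses the content's own U+FEFF, while the primed path preserves it because the decoder has already spent its one swallow on the sacrificial BOM.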
Comment 5•13 years ago
Comment 6•13 years ago
All browsers seem to swallow the UTF-16LE BOM even for plain text.
Comment 7•7 years ago
Is this still a bug? Per the Encoding Standard the UTF-16 variants work a bit differently and since we now implement that...
Assignee
Comment 8•6 years ago
(In reply to Anne (:annevk) from comment #7)
> Is this still a bug?

The test cases with more than one BOM still demonstrate a bug: Two BOMs are removed from the start of the stream.
Summary: Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow → HTML parser should remove only one BOM
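The behavior the revised summary asks for can be sketched in Python as below. This is an illustrative model under stated assumptions, not the patch itself: `decode_removing_one_bom` and the `_BOMS` table are hypothetical names, and the UTF-8 fallback for BOM-less input is an assumption made only to keep the example self-contained.

```python
import codecs

# BOM byte sequences checked by this sketch, paired with Python codec names
# that do not themselves strip a BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_removing_one_bom(data, fallback="utf-8"):
    """Strip at most one leading BOM, then decode without further BOM handling."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Exactly one BOM is removed; a following U+FEFF stays as content.
            return data[len(bom):].decode(encoding)
    # Assumption for this sketch: input without a BOM is treated as UTF-8.
    return data.decode(fallback)


# Input that starts with two UTF-16LE BOMs, like the test cases in this bug.
double = codecs.BOM_UTF16_LE * 2 + "x".encode("utf-16-le")
result = decode_removing_one_bom(double)

print(repr(result))
```

With two leading BOMs, only the first is consumed and the second reaches the output as U+FEFF, which is the outcome the multi-BOM test cases check for.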
Assignee
Comment 9•6 years ago
MozReview-Commit-ID: 1zoGFxx9MCm
Assignee
Comment 10•6 years ago
Let's see if there are already tests that test this: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8aab8ed5a98fe52ed87698981ee0dbe1d1510526
Assignee
Updated•6 years ago
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Assignee
Comment 11•6 years ago
This patch leaves an unswallowed BOM at the start of reloaded document.open()ed docs: parser/htmlparser/tests/mochitest/test_bug715739.html
Assignee
Comment 12•6 years ago
Trying again: https://treeherder.mozilla.org/#/jobs?repo=try&revision=305408dd3f44a79f9f052fba1d7ffd8c3d92021f
Assignee
Comment 13•6 years ago
WPT lacked a pre-existing test, so adding one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a17225a6a347514f30ebad04cc012df3c41f0c7
Comment 14•6 years ago
Comment on attachment 8998478
Bug 634541 - Make the HTML parser remove only one BOM when the input starts with multiple BOMs.

Boris Zbarsky [:bzbarsky, bz on IRC] (vacation Aug 16-27) has approved the revision.
Attachment #8998478 - Flags: review+
Comment 15•6 years ago
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aeb2e2eaf0c4
Make the HTML parser remove only one BOM when the input starts with multiple BOMs. r=bzbarsky
Comment 16•6 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/aeb2e2eaf0c4
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox63: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/12454 for changes under testing/web-platform/tests
Upstream PR merged