Closed Bug 634541 Opened 13 years ago Closed 6 years ago

HTML parser should remove only one BOM

Categories

(Core :: DOM: HTML Parser, defect)

defect
Not set
normal

Tracking

RESOLVED FIXED
mozilla63
Tracking Status
firefox63 --- fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

Attached file Test case
By code inspection, it seems that asking for a "UTF-16BE" or "UTF-16LE" decoder returns a decoder that has fixed endianness but still swallows the first character if it is a U+FEFF.

This is wrong, because only "UTF-16" should interpret U+FEFF as a BOM; the explicit LE and BE variants should treat it as a ZWNBSP.

This is bad, because code that has done its own BOM sniffing and wants a decoder without BOM swallowing state (the HTML5 parser) can't get one.
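
For illustration only, the distinction described above can be seen in Python's built-in codecs, which are unrelated to Gecko's decoders: the generic "utf-16" codec consumes a leading BOM, while the explicit "utf-16-le" codec keeps U+FEFF as an ordinary character. The variable names are made up for this sketch.

```python
# Illustration with Python's codecs, not Gecko's decoders.
BOM_LE = b"\xff\xfe"  # U+FEFF encoded as UTF-16LE

data = BOM_LE + "A".encode("utf-16-le")

# The generic "utf-16" codec sniffs the BOM to pick endianness and drops it.
generic = data.decode("utf-16")

# The explicit "utf-16-le" codec has fixed endianness and keeps U+FEFF
# in the output as a ZWNBSP character.
explicit = data.decode("utf-16-le")

print(repr(generic), repr(explicit))
```

A caller that has already done its own BOM sniffing would want the second behavior, so that no further U+FEFF is silently dropped.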
Attached file Test case with doctype
Chrome swallows one BOM as well. Opera 11 seems to swallow any number of BOMs. Let's assume the Web might require this and deal with it on the parser side.
Assignee: smontagu → nobody
Component: Internationalization → HTML: Parser
QA Contact: i18n → parser
Summary: UTF-16BE and UTF-16LE decoders swallow an initial BOM → Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow
All browsers seem to swallow a UTF-16LE BOM, even for plain text.
Blocks: encoding
Blocks: 1102679
Is this still a bug? Per the Encoding Standard the UTF-16 variants work a bit differently and since we now implement that...
(In reply to Anne (:annevk) from comment #7)
> Is this still a bug?

The test cases with more than one BOM still demonstrate a bug: Two BOMs are removed from the start of the stream.
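
A minimal sketch of the intended behavior (exactly one BOM removed, any later U+FEFF kept as content), again using Python's codecs rather than the actual parser code. The helper name `decode_removing_one_bom`, the `_BOMS` table, and the `fallback` parameter are hypothetical and exist only for this example.

```python
import codecs

# (BOM bytes, Python codec that does NOT strip a BOM on its own).
# Hypothetical table for this sketch; the real parser's sniffing differs.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_removing_one_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Sniff a leading BOM, remove exactly one, and decode the rest."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Strip only the first BOM; a second one decodes to U+FEFF.
            return data[len(bom):].decode(encoding)
    # No BOM found: decode with the fallback encoding (an assumption here).
    return data.decode(fallback)


two_boms = codecs.BOM_UTF16_LE * 2 + "A".encode("utf-16-le")
print(repr(decode_removing_one_bom(two_boms)))
```

Because the sniffing step removes the BOM itself and the chosen codecs have no BOM-swallowing state, a stream that starts with two BOMs yields one U+FEFF followed by the content.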
Summary: Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow → HTML parser should remove only one BOM
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
This patch leaves an unswallowed BOM at the start of reloaded document.open()ed docs; see parser/htmlparser/tests/mochitest/test_bug715739.html.
Comment on attachment 8998478 [details]
Bug 634541 - Make the HTML parser remove only one BOM when the input starts with multiple BOMs.

Boris Zbarsky [:bzbarsky, bz on IRC] (vacation Aug 16-27) has approved the revision.
Attachment #8998478 - Flags: review+
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aeb2e2eaf0c4
Make the HTML parser remove only one BOM when the input starts with multiple BOMs. r=bzbarsky
https://hg.mozilla.org/mozilla-central/rev/aeb2e2eaf0c4
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/12454 for changes under testing/web-platform/tests