Closed Bug 634541 (opened 13 years ago, closed 6 years ago)

HTML parser should remove only one BOM

Status:

RESOLVED FIXED

Milestone:

mozilla63

Tracking Flags:

Tracking

Status

firefox63

---

fixed

People

(Reporter: hsivonen, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

Test case, 13 years ago, Henri Sivonen (:hsivonen), 112 bytes, text/html; charset=UTF-16LE
Test case with doctype, 13 years ago, Henri Sivonen (:hsivonen), 142 bytes, text/html; charset=UTF-16LE
Two BOMs, no explicit LE label, 13 years ago, Henri Sivonen (:hsivonen), 144 bytes, text/html
Three BOMs no explicit label, 13 years ago, Henri Sivonen (:hsivonen), 146 bytes, text/html
plain text testcase, 13 years ago, Masatoshi Kimura [:emk], 2 bytes, text/plain; charset=UTF-16LE
Bug 634541 - Make the HTML parser remove only one BOM when the input starts with multiple BOMs., 6 years ago, Henri Sivonen (:hsivonen), 46 bytes, text/x-phabricator-request, bzbarsky: review+

Henri Sivonen (:hsivonen)

Assignee

Description

•

13 years ago

Attached file Test case — Details

By code inspection, it seems that asking for a "UTF-16BE" or "UTF-16LE" decoder returns a decoder that has fixed endianness but still swallows the first character if it is a U+FEFF.

This is wrong, because only "UTF-16" should interpret U+FEFF as a BOM and the explicit LE and BE variants should treat it as a ZWNBSP.

This is bad, because code that has done its own BOM sniffing and wants a decoder without BOM swallowing state (the HTML5 parser) can't get one.
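
For illustration only, the distinction described above can be seen in a decoder library that follows this interpretation. The sketch below uses Python's standard codecs as a stand-in. These are not the Gecko decoders discussed in this bug, and the variable names are made up for the example.

```python
# Illustration with Python's standard codecs (an assumption for demonstration;
# these are not the Gecko decoders discussed in this bug).
data = b"\xff\xfe" + "a".encode("utf-16-le")  # FF FE is U+FEFF in little-endian

# The generic "utf-16" label treats a leading U+FEFF as a BOM and removes it.
generic = data.decode("utf-16")

# The explicit "utf-16-le" label has fixed endianness and keeps U+FEFF
# as an ordinary character (ZWNBSP).
explicit = data.decode("utf-16-le")

print(repr(generic))   # 'a'
print(repr(explicit))  # '\ufeffa'
```

A caller that has already done its own BOM sniffing would want the second behavior, so that it alone decides how many BOMs are dropped.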

Henri Sivonen (:hsivonen)

Assignee

Comment 1

•

13 years ago

Attached file Test case with doctype — Details

Henri Sivonen (:hsivonen)

Assignee

Comment 2

•

13 years ago

Attached file Two BOMs, no explicit LE label — Details

Henri Sivonen (:hsivonen)

Assignee

Comment 3

•

13 years ago

Attached file Three BOMs no explicit label — Details

Henri Sivonen (:hsivonen)

Assignee

Comment 4

•

13 years ago

Chrome swallows one BOM as well. Opera 11 seems to swallow any number of BOMs. Let's assume the Web might require this and deal with it on the parser side.

Assignee: smontagu → nobody

Component: Internationalization → HTML: Parser

QA Contact: i18n → parser

Summary: UTF-16BE and UTF-16LE decoders swallow an initial BOM → Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow

Masatoshi Kimura [:emk]

Comment 5

•

13 years ago

Attached file plain text testcase — Details

Masatoshi Kimura [:emk]

Comment 6

•

13 years ago

All browsers seem to swallow a UTF-16LE BOM even for plain text.

Masatoshi Kimura [:emk]

Updated

•

12 years ago

Blocks: encoding

Anne (:annevk)

Updated

•

10 years ago

Blocks: 1102679

Anne (:annevk)

Comment 7

•

7 years ago

Is this still a bug? Per the Encoding Standard the UTF-16 variants work a bit differently and since we now implement that...

Henri Sivonen (:hsivonen)

Assignee

Comment 8

•

6 years ago

(In reply to Anne (:annevk) from comment #7)
> Is this still a bug?

The test cases with more than one BOM still demonstrate a bug: Two BOMs are removed from the start of the stream.
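
As a rough sketch of the intended behavior (not the actual Gecko patch), the following Python removes exactly one BOM during sniffing and then decodes with a decoder that does not swallow U+FEFF itself. The helper names `sniff_and_strip_one_bom` and `decode_after_sniffing`, and the `cp1252` fallback, are assumptions made for this example.

```python
# Hypothetical helpers; a sketch of "remove exactly one BOM", not Gecko code.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

def sniff_and_strip_one_bom(data: bytes):
    """Return (encoding, rest) with at most one leading BOM removed."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data

def decode_after_sniffing(data: bytes, fallback: str = "cp1252") -> str:
    """Decode after sniffing; the fallback encoding is an assumption."""
    encoding, rest = sniff_and_strip_one_bom(data)
    # Python's "utf-8", "utf-16-le" and "utf-16-be" codecs do not remove
    # U+FEFF, so any BOMs after the first survive as U+FEFF characters.
    return rest.decode(encoding or fallback)

two_boms = b"\xff\xfe\xff\xfe" + "a".encode("utf-16-le")
three_boms = b"\xff\xfe" + two_boms

print(repr(decode_after_sniffing(two_boms)))    # '\ufeffa'
print(repr(decode_after_sniffing(three_boms)))  # '\ufeff\ufeffa'
```

With two BOMs in the input, one U+FEFF remains in the decoded text, which is the result the multi-BOM test cases expect.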

Summary: Since UTF-16BE and UTF-16LE decoders swallow an initial BOM, the HTML5 parser should feed them a BOM to swallow → HTML parser should remove only one BOM

Henri Sivonen (:hsivonen)

Assignee

Comment 9

•

6 years ago

Attached file Bug 634541 - Make the HTML parser remove only one BOM when the input starts with multiple BOMs. — Details

MozReview-Commit-ID: 1zoGFxx9MCm

Henri Sivonen (:hsivonen)

Assignee

Comment 10

•

6 years ago

Let's see if there are already tests that test this:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8aab8ed5a98fe52ed87698981ee0dbe1d1510526

Henri Sivonen (:hsivonen)

Assignee

Updated

•

6 years ago

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen)

Assignee

Comment 11

•

6 years ago

This patch leaves an unswallowed BOM at the start of reloaded document.open()ed docs. See parser/htmlparser/tests/mochitest/test_bug715739.html.

Henri Sivonen (:hsivonen)

Assignee

Comment 12

•

6 years ago

Trying again:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=305408dd3f44a79f9f052fba1d7ffd8c3d92021f

Henri Sivonen (:hsivonen)

Assignee

Comment 13

•

6 years ago

WPT lacked a pre-existing test, so adding one:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1a17225a6a347514f30ebad04cc012df3c41f0c7

Boris Zbarsky [:bzbarsky]

Comment 14

•

6 years ago

Comment on attachment 8998478 [details]
Bug 634541 - Make the HTML parser remove only one BOM when the input starts with multiple BOMs.

Boris Zbarsky [:bzbarsky, bz on IRC] (vacation Aug 16-27) has approved the revision.

Attachment #8998478 - Flags: review+

Pulsebot

Comment 15

•

6 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aeb2e2eaf0c4
Make the HTML parser remove only one BOM when the input starts with multiple BOMs. r=bzbarsky

Natalia Csoregi [:nataliaCs]

Comment 16

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/aeb2e2eaf0c4

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox63: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla63

Web Platform Test Sync Bot (Matrix: #interop:mozilla.org)

Comment 17

•

6 years ago

Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/12454 for changes under testing/web-platform/tests

Web Platform Test Sync Bot (Matrix: #interop:mozilla.org)

Comment 18

•

6 years ago

Upstream PR merged
