Closed Bug 811363 Opened 13 years ago Closed 13 years ago

Autodetection doesn't recognize supplementary (four-byte) characters as UTF-8

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla19

People

(Reporter: smontagu, Assigned: smontagu)

Details

Attachments

(3 files, 5 obsolete files)

Testcase 13 years ago Simon Montagu :smontagu 966 bytes, text/html		Details
Patch 13 years ago Simon Montagu :smontagu 2.65 KB, patch		Details \| Diff \| Splinter Review
Patch 2, remove data for UTF-8 beyond F4 8F BF BF (U+10FFFF) 13 years ago Simon Montagu :smontagu 3.58 KB, patch		Details \| Diff \| Splinter Review
Patch 3, clean up and consolidate the tables 13 years ago Simon Montagu :smontagu 5.30 KB, patch		Details \| Diff \| Splinter Review
Tests 13 years ago Simon Montagu :smontagu 3.93 KB, patch		Details \| Diff \| Splinter Review
Get the same results by patching genutf8.pl 13 years ago Simon Montagu :smontagu 16.89 KB, patch	jfkthame : review+	Details \| Diff \| Splinter Review
More comprehensive tests 13 years ago Simon Montagu :smontagu 7.51 KB, patch		Details \| Diff \| Splinter Review
Tests split up further 13 years ago Simon Montagu :smontagu 10.17 KB, patch	jfkthame : review+	Details \| Diff \| Splinter Review

Simon Montagu :smontagu

Assignee

Description

•

13 years ago

Attached file Testcase — Details

Raw characters in data: URIs are normally interpreted as expected. For example, all of the following work correctly: data:text/plain,ᏣᎳᎩ data:text/plain,中文 data:text/plain,עברית However when the raw characters are outside the BMP they are incorrectly decoded. I can't give an example in a bugzilla comment because of bug 405011, but see the attachment.

Jonathan Kew [:jfkthame]

Comment 1

•

13 years ago

I have to explicitly select utf-8 for any of these (BMP or supplementary) to work; I'm not seeing a BMP vs non-BMP distinction.

Simon Montagu :smontagu

Assignee

Comment 2

•

13 years ago

This is apparently a bug in encoding auto-detection

Component: Networking → Internationalization

Simon Montagu :smontagu

Assignee

Updated

•

13 years ago

Summary: Supplementary characters in data URIs corrupted → Autodetection doesn't recognize supplementary (four-byte) characters as UTF-8

Simon Montagu :smontagu

Assignee

Comment 3

•

13 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

This is an error in the data in the UTF-8 detection state machine tables. The second byte after 0xF0 in a four-bye UTF-8 sequence can be 0x90-0xBF, but we were only accepting 0xA0-0xBF.

Assignee: nobody → smontagu

Attachment #681272 - Flags: review?(jfkthame)

Simon Montagu :smontagu

Assignee

Comment 4

•

13 years ago

Attached patch Patch 2, remove data for UTF-8 beyond F4 8F BF BF (U+10FFFF) (obsolete) — Details — Splinter Review

While I'm here....

Attachment #681274 - Flags: review?(jfkthame)

Simon Montagu :smontagu

Assignee

Comment 5

•

13 years ago

Attached patch Patch 3, clean up and consolidate the tables (obsolete) — Details — Splinter Review

Attachment #681275 - Flags: review?(jfkthame)

Simon Montagu :smontagu

Assignee

Comment 6

•

13 years ago

Attached patch Tests (obsolete) — Details — Splinter Review

Simon Montagu :smontagu

Assignee

Updated

•

13 years ago

Attachment #681274 - Attachment is patch: true

Simon Montagu :smontagu

Assignee

Comment 7

•

13 years ago

Attached patch Get the same results by patching genutf8.pl — Details — Splinter Review

Attachment #681272 - Attachment is obsolete: true

Attachment #681274 - Attachment is obsolete: true

Attachment #681275 - Attachment is obsolete: true

Attachment #681272 - Flags: review?(jfkthame)

Attachment #681274 - Flags: review?(jfkthame)

Attachment #681275 - Flags: review?(jfkthame)

Attachment #681620 - Flags: review?(jfkthame)

Simon Montagu :smontagu

Assignee

Comment 8

•

13 years ago

Attached patch More comprehensive tests (obsolete) — Details — Splinter Review

Attachment #681276 - Attachment is obsolete: true

Attachment #681623 - Flags: review?(jfkthame)

Jonathan Kew [:jfkthame]

Comment 9

•

13 years ago

Comment on attachment 681620 [details] [diff] [review] Get the same results by patching genutf8.pl Review of attachment 681620 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, but I believe we can trim a bit more from the state table in genutf8, as noted below. That will of course result in changes to the generated data in nsMBCSSM.cpp, too. r=me with that, assuming it passes tests OK. ::: intl/chardet/tools/genutf8.pl @@ +176,5 @@ > + 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 5 > + 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 6 > + 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 7 > + 1, 1, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 8 > + 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 9 The last 4 columns here are all identical (and the same as column 0), which suggests they're redundant; and sure enough, the highest class actually used is 11. So drop those columns, and we'll get correspondingly smaller generated tables. @@ +181,5 @@ > ); > > > > $utf8_ver = genverifier::GenVerifier("UTF8", "UTF-8", \@utf8_cls, 16, \@utf8_st); (The 16 will then need to change to 12 here, too.) ::: intl/chardet/tools/genverifier.pm @@ +58,5 @@ > + " eIdxSft" . $bits . "bits,\n" . > + " eSftMsk" . $bits . "bits,\n" . > + " eBitSft" . $bits . "bits,\n" . > + " eUnitMsk" . $bits . "bits,\n" . > + " " . $name . $tbl . "\n" . While you're here, might as well remove the trailing spaces here in the tool, too. @@ +140,5 @@ > $ret .= ") "; > } else { > $ret .= "),"; > } > + $ret .= sprintf("//%02x-%02x\n", $i, ($i+7)); Please add spaces to the generated comment, similar to the version in Gen4BitsClass above.

Attachment #681620 - Flags: review?(jfkthame) → review+

Jonathan Kew [:jfkthame]

Comment 10

•

13 years ago

Comment on attachment 681623 [details] [diff] [review] More comprehensive tests Review of attachment 681623 [details] [diff] [review]: ----------------------------------------------------------------- ::: extensions/universalchardet/tests/bug811363-invalid.text @@ +6,5 @@ > +3-byte sequence with last byte missing (U+0000): à > +4-byte sequence with last byte missing (U+0000): ð > +Overlong encodings: À¯ à¯ ð¯ > +Isolated surrogates: í í¿¿ > +Surrogate pairs: í í° í¯¿í¿¿ Wouldn't it be better to split these into multiple files (one for each kind of invalidity), so that they get tested individually?

Simon Montagu :smontagu

Assignee

Comment 11

•

13 years ago

Attached patch Tests split up further — Details — Splinter Review

Attachment #681623 - Attachment is obsolete: true

Attachment #681623 - Flags: review?(jfkthame)

Attachment #682191 - Flags: review?(jfkthame)

Jonathan Kew [:jfkthame]

Comment 12

•

13 years ago

Comment on attachment 682191 [details] [diff] [review] Tests split up further Review of attachment 682191 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, thanks.

Attachment #682191 - Flags: review?(jfkthame) → review+

Simon Montagu :smontagu

Assignee

Comment 13

•

13 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/c5c9d7951b47 https://hg.mozilla.org/integration/mozilla-inbound/rev/2366a926ad3c

Flags: in-testsuite+

Ryan VanderMeulen (PTO, back 6-April)

Comment 14

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/c5c9d7951b47 https://hg.mozilla.org/mozilla-central/rev/2366a926ad3c

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Autodetection doesn't recognize supplementary (four-byte) characters as UTF-8

Categories

(Core :: Internationalization, defect)

Tracking

()

People

(Reporter: smontagu, Assigned: smontagu)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files, 5 obsolete files)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Updated

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Attachment

General

Description

File Name

Content Type