Closed
Bug 811363
Opened 12 years ago
Closed 12 years ago
Autodetection doesn't recognize supplementary (four-byte) characters as UTF-8
Categories
(Core :: Internationalization, defect)
Tracking
()
RESOLVED
FIXED
mozilla19
People
(Reporter: smontagu, Assigned: smontagu)
Details
Attachments
(3 files, 5 obsolete files)
966 bytes,
text/html
|
Details | |
16.89 KB,
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
10.17 KB,
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
Raw characters in data: URIs are normally interpreted as expected. For example, all of the following work correctly:
data:text/plain,ᏣᎳᎩ
data:text/plain,中文
data:text/plain,עברית
However when the raw characters are outside the BMP they are incorrectly decoded. I can't give an example in a bugzilla comment because of bug 405011, but see the attachment.
Comment 1•12 years ago
|
||
I have to explicitly select utf-8 for any of these (BMP or supplementary) to work; I'm not seeing a BMP vs non-BMP distinction.
Assignee | ||
Comment 2•12 years ago
|
||
This is apparently a bug in encoding auto-detection
Component: Networking → Internationalization
Assignee | ||
Updated•12 years ago
|
Summary: Supplementary characters in data URIs corrupted → Autodetection doesn't recognize supplementary (four-byte) characters as UTF-8
Assignee | ||
Comment 3•12 years ago
|
||
This is an error in the data in the UTF-8 detection state machine tables. The second byte after 0xF0 in a four-bye UTF-8 sequence can be 0x90-0xBF, but we were only accepting 0xA0-0xBF.
Assignee: nobody → smontagu
Attachment #681272 -
Flags: review?(jfkthame)
Assignee | ||
Comment 5•12 years ago
|
||
Attachment #681275 -
Flags: review?(jfkthame)
Assignee | ||
Comment 6•12 years ago
|
||
Assignee | ||
Updated•12 years ago
|
Attachment #681274 -
Attachment is patch: true
Assignee | ||
Comment 7•12 years ago
|
||
Attachment #681272 -
Attachment is obsolete: true
Attachment #681274 -
Attachment is obsolete: true
Attachment #681275 -
Attachment is obsolete: true
Attachment #681272 -
Flags: review?(jfkthame)
Attachment #681274 -
Flags: review?(jfkthame)
Attachment #681275 -
Flags: review?(jfkthame)
Attachment #681620 -
Flags: review?(jfkthame)
Assignee | ||
Comment 8•12 years ago
|
||
Attachment #681276 -
Attachment is obsolete: true
Attachment #681623 -
Flags: review?(jfkthame)
Comment 9•12 years ago
|
||
Comment on attachment 681620 [details] [diff] [review]
Get the same results by patching genutf8.pl
Review of attachment 681620 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, but I believe we can trim a bit more from the state table in genutf8, as noted below. That will of course result in changes to the generated data in nsMBCSSM.cpp, too.
r=me with that, assuming it passes tests OK.
::: intl/chardet/tools/genutf8.pl
@@ +176,5 @@
> + 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 5
> + 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 6
> + 1, 1, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 7
> + 1, 1, 5, 5, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 8
> + 1, 1, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # state 9
The last 4 columns here are all identical (and the same as column 0), which suggests they're redundant; and sure enough, the highest class actually used is 11. So drop those columns, and we'll get correspondingly smaller generated tables.
@@ +181,5 @@
> );
>
>
>
> $utf8_ver = genverifier::GenVerifier("UTF8", "UTF-8", \@utf8_cls, 16, \@utf8_st);
(The 16 will then need to change to 12 here, too.)
::: intl/chardet/tools/genverifier.pm
@@ +58,5 @@
> + " eIdxSft" . $bits . "bits,\n" .
> + " eSftMsk" . $bits . "bits,\n" .
> + " eBitSft" . $bits . "bits,\n" .
> + " eUnitMsk" . $bits . "bits,\n" .
> + " " . $name . $tbl . "\n" .
While you're here, might as well remove the trailing spaces here in the tool, too.
@@ +140,5 @@
> $ret .= ") ";
> } else {
> $ret .= "),";
> }
> + $ret .= sprintf("//%02x-%02x\n", $i, ($i+7));
Please add spaces to the generated comment, similar to the version in Gen4BitsClass above.
Attachment #681620 -
Flags: review?(jfkthame) → review+
Comment 10•12 years ago
|
||
Comment on attachment 681623 [details] [diff] [review]
More comprehensive tests
Review of attachment 681623 [details] [diff] [review]:
-----------------------------------------------------------------
::: extensions/universalchardet/tests/bug811363-invalid.text
@@ +6,5 @@
> +3-byte sequence with last byte missing (U+0000): à
> +4-byte sequence with last byte missing (U+0000): ð
> +Overlong encodings: À¯ ௠ð¯
> +Isolated surrogates: í í¿¿
> +Surrogate pairs: í í° í¯¿í¿¿
Wouldn't it be better to split these into multiple files (one for each kind of invalidity), so that they get tested individually?
Assignee | ||
Comment 11•12 years ago
|
||
Attachment #681623 -
Attachment is obsolete: true
Attachment #681623 -
Flags: review?(jfkthame)
Attachment #682191 -
Flags: review?(jfkthame)
Comment 12•12 years ago
|
||
Comment on attachment 682191 [details] [diff] [review]
Tests split up further
Review of attachment 682191 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, thanks.
Attachment #682191 -
Flags: review?(jfkthame) → review+
Assignee | ||
Comment 13•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/c5c9d7951b47
https://hg.mozilla.org/integration/mozilla-inbound/rev/2366a926ad3c
Flags: in-testsuite+
Comment 14•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/c5c9d7951b47
https://hg.mozilla.org/mozilla-central/rev/2366a926ad3c
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
You need to log in
before you can comment on or make changes to this bug.
Description
•