Closed
Bug 1169248
Opened 9 years ago
Closed 9 years ago
GBK encoder doesn't encode U+20AC and U+FFFD correctly
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
mozilla41
Tracking | Status | |
---|---|---|
firefox41 | --- | fixed |
People
(Reporter: emk, Assigned: emk)
References
Details
Attachments
(1 file, 1 obsolete file)
10.23 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
I found this when I'm finding encoding-only mappings. GBK has the following two broken mappings. U+20AC => [0x00,0x80] (not [0x80]) U+FFFD => [0x81,0x7F] GB18030 encoder also misconverts U+FFFD.
Assignee | ||
Comment 1•9 years ago
|
||
gbkuniq2b.uf only contains U+20AC=>0x80 mapping which is a single-byte mapping. gGBKToUnicodeTable contains UCS2_NO_MAPPING, so UCS2_NO_MAPPING must be checked before searching the table.
Assignee | ||
Comment 2•9 years ago
|
||
By the way, GBK/GB18030 encoders will convert unmappable characters to empty. Is this expected?
Assignee | ||
Comment 3•9 years ago
|
||
This patch also fixed one web-platform test failure. https://treeherder.mozilla.org/#/jobs?repo=try&revision=4d67a3a07013
Attachment #8612560 -
Attachment is obsolete: true
Attachment #8612560 -
Flags: review?(smontagu)
Attachment #8613168 -
Flags: review?(smontagu)
Updated•9 years ago
|
Attachment #8613168 -
Flags: review?(smontagu) → review+
Comment 4•9 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #2) > By the way, GBK/GB18030 encoders will convert unmappable characters to > empty. Is this expected? No, they should use a replacement character or return an error depending on SetOutputErrorBehavior, like other encoders. Do you have a testcase for this? And what characters are unmappable in GBK/GB18030 anyway?
Assignee | ||
Comment 5•9 years ago
|
||
(In reply to Simon Montagu :smontagu from comment #4) > No, they should use a replacement character or return an error depending on > SetOutputErrorBehavior, like other encoders. Most legacy encodings does not have the replacement character, so we can't use it for encoders. (In reply to Simon Montagu :smontagu from comment #4) > Do you have a testcase for this? data:text/html;charset=gbk,<script>a=document.createElement('a');a.href='https://example.com/?\ufffd';document.write(a.href)</script> It will display https://example.com/?%EF%BF%BD if "gbk" is changed to a different encoding. > And what characters are unmappable in > GBK/GB18030 anyway? Ah, GB18030 has no unmappable character. But GBK has many unmappable characters such as U+FFFD.
Comment 6•9 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #5) > (In reply to Simon Montagu :smontagu from comment #4) > > No, they should use a replacement character or return an error depending on > > SetOutputErrorBehavior, like other encoders. > > Most legacy encodings does not have the replacement character, so we can't > use it for encoders. *A* replacement character, e.g. '?', not necessarily the Unicode REPLACEMENT CHARACTER. For example, http://mxr.mozilla.org/mozilla-central/source/dom/base/nsDocumentEncoder.cpp#1207. I need to investigate what the caller is doing in your example.
Comment 7•9 years ago
|
||
I'll open a separate bug about the issue in comment 2. I believe that the problem is with assumptions about the length of the output, not with the mapping itself.
Comment 9•9 years ago
|
||
Bug 1170932 GBK encoder silently omits unmapped characters
Comment 10•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/b38c7d19b1d6
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox41:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
You need to log in
before you can comment on or make changes to this bug.
Description
•