Closed Bug 1169248 Opened 9 years ago Closed 9 years ago

GBK encoder doesn't encode U+20AC and U+FFFD correctly

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox41 --- fixed

People

(Reporter: emk, Assigned: emk)

References

Details

Attachments

(1 file, 1 obsolete file)

I found this when I'm finding encoding-only mappings.
GBK has the following two broken mappings.
U+20AC => [0x00,0x80] (not [0x80])
U+FFFD => [0x81,0x7F]
GB18030 encoder also misconverts U+FFFD.
Attached patch Fix GBK/GB18030 encoders (obsolete) — Splinter Review
gbkuniq2b.uf only contains U+20AC=>0x80 mapping which is a single-byte mapping.
gGBKToUnicodeTable contains UCS2_NO_MAPPING, so UCS2_NO_MAPPING must be checked before searching the table.
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Attachment #8612560 - Flags: review?(smontagu)
By the way, GBK/GB18030 encoders will convert unmappable characters to empty. Is this expected?
This patch also fixed one web-platform test failure.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4d67a3a07013
Attachment #8612560 - Attachment is obsolete: true
Attachment #8612560 - Flags: review?(smontagu)
Attachment #8613168 - Flags: review?(smontagu)
Attachment #8613168 - Flags: review?(smontagu) → review+
(In reply to Masatoshi Kimura [:emk] from comment #2)
> By the way, GBK/GB18030 encoders will convert unmappable characters to
> empty. Is this expected?

No, they should use a replacement character or return an error depending on SetOutputErrorBehavior, like other encoders. 

Do you have a testcase for this? And what characters are unmappable in GBK/GB18030 anyway?
(In reply to Simon Montagu :smontagu from comment #4)
> No, they should use a replacement character or return an error depending on
> SetOutputErrorBehavior, like other encoders. 

Most legacy encodings does not have the replacement character, so we can't use it for encoders.

(In reply to Simon Montagu :smontagu from comment #4)
> Do you have a testcase for this?

data:text/html;charset=gbk,<script>a=document.createElement('a');a.href='https://example.com/?\ufffd';document.write(a.href)</script>

It will display https://example.com/?%EF%BF%BD if "gbk" is changed to a different encoding.

> And what characters are unmappable in
> GBK/GB18030 anyway?

Ah, GB18030 has no unmappable character. But GBK has many unmappable characters such as U+FFFD.
(In reply to Masatoshi Kimura [:emk] from comment #5)
> (In reply to Simon Montagu :smontagu from comment #4)
> > No, they should use a replacement character or return an error depending on
> > SetOutputErrorBehavior, like other encoders. 
> 
> Most legacy encodings does not have the replacement character, so we can't
> use it for encoders.

*A* replacement character, e.g. '?', not necessarily the Unicode REPLACEMENT CHARACTER. For example, http://mxr.mozilla.org/mozilla-central/source/dom/base/nsDocumentEncoder.cpp#1207.

I need to investigate what the caller is doing in your example.
I'll open a separate bug about the issue in comment 2. I believe that the problem is with assumptions about the length of the output, not with the mapping itself.
Bug 1170932 GBK encoder silently omits unmapped characters
https://hg.mozilla.org/mozilla-central/rev/b38c7d19b1d6
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: