Investigate Shift_JIS decoder changes of Encoding Standard

RESOLVED FIXED in mozilla56

Status

Core
Internationalization
RESOLVED FIXED
6 years ago
10 months ago

People

(Reporter: emk, Assigned: hsivonen)

Tracking

(Blocks: 1 bug)

Trunk
mozilla56

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [fixed by encoding_rs])

Attachments

(1 attachment)

(Reporter)

Description

6 years ago
Encoding Standard removed some "Gecko quirks" from the Shift_JIS decoder.
http://dvcs.w3.org/hg/encoding/rev/7c876db1159c
1. The fallback code point is no longer U+30FB.
2. 0xA0 and 0xFD to 0xFF no longer emit PUA code points.
3. EUDC code ranges are no longer supported.
Removing 1. and 2. may be fine. WebKit doesn't support them either. But I'm not sure it is acceptable to remove EUDC code ranges. EUDC code ranges are extensively used by web pages for Japanese mobile phones.
(Reporter)

Comment 1

6 years ago
Created attachment 617323 [details]
Test case

Only Opera did not support EUDC code ranges.

Comment 2

6 years ago
Can we do something better than mapping to PUA (e.g. Unicode emoji)? Do Android and iOS have special fonts for the EUDC range?
(Reporter)

Comment 3

6 years ago
> Can we do something better than mapping to PUA (e.g. Unicode emoji)?
It's impossible because the PUA-to-Unicode mapping differs between carriers :(
> Do Android and iOS have special fonts for the EUDC range?
Dunno. Web pages for smartphones should use UTF-8 from the start. My main concern is Japanese feature phones, which support only Shift_JIS.

Comment 4

6 years ago
Is Gecko shipping on those phones? What is the end user scenario here?
(Reporter)

Comment 5

6 years ago
There is a Firefox add-on for developing Japanese mobile-phone sites on a PC.
http://firemobilesimulator.org/
Removing EUDC mappings will affect the add-on.
(Reporter)

Comment 6

6 years ago
BTW, KDDI, a Japanese mobile-phone carrier, provides Opera Mobile on their feature phones (where it is called "PC site viewer"). It seems to support KDDI emoji.

Comment 7

6 years ago
Yeah for select Japanese products it seems Opera has some mappings to PUA, but it's not exhaustive and limited to those products. It seems kind of weird to keep this given that the content can only be consumed on those specific phones.
(Reporter)

Comment 8

6 years ago
(In reply to Anne van Kesteren from comment #2)
> Can we do something better than mapping to PUA (e.g. Unicode emoji)? Do
> Android and iOS have special fonts for the EUDC range?
It looks like Softbank iPhone supports Softbank-emoji in the EUDC range.
(Reporter)

Comment 9

6 years ago
(In reply to Anne van Kesteren from comment #7)
>  It seems kind of
> weird to keep this given that the content can only be consumed on those
> specific phones.
Mobile-phone pages are also viewable from PC browsers. Some people even prefer the mobile version to the full version, which is full of ads.
(Reporter)

Comment 10

6 years ago
Docomo mobile-phone pages can be served as application/xhtml+xml [1]. If those pages contain an emoji, normal browsers (including smartphone browsers) cannot display them at all, because the fallback code point is fatal in XML.
But Opera will not be affected because Opera decided to violate the spec [2].
[1] http://www.nttdocomo.co.jp/service/developer/make/content/browser/xhtml/notice/basis/index.html
[2] http://my.opera.com/ODIN/blog/2011/09/28/no-more-xml-parsing-failed-errors
(Reporter)

Comment 11

6 years ago
I'm inclined to agree with Shawn. Legacy encodings are included to support legacy content that is unlikely to be updated. Any innocent-looking change to an encoding will break some of that content.

Comment 12

6 years ago
As can be seen from e.g. http://www.unicode.org/~scherer/emoji4unicode/snapshot/utc.html the KDDI mapping is different from the algorithm Gecko has employed (search for "U+EB", which is the start of a PUA range that the algorithm you have cannot generate). Each vendor has its own conversion table, including to PUA. Not supporting this at all seems better for the end user since we have no idea what the page meant.
(Reporter)

Comment 13

6 years ago
Some mobile phones do not even support the NEC/IBM extensions. Softbank emoji mappings overlap with the IBM extensions [1].
I don't care about extensions which are incompatible with Microsoft Codepage 932.
Microsoft used to publish their EUDC-to-PUA mappings [2]. Although they removed the document, they did not (and will never) change their implementation. It should be documented elsewhere.
[1] http://d.hatena.ne.jp/NAOI/20120423/1335164541
[2] http://web.archive.org/web/*/http://microsoft.com/typography/unicode/932.txt
(Reporter)

Comment 14

6 years ago
(In reply to Masatoshi Kimura [:emk] from comment #13)
> Microsoft used to publish their EUDC-to-PUA mappings [2]. Although they
> removed the document,
I found that they published the data file again.
https://www.microsoft.com/en-us/download/details.aspx?id=10921
They also published their algorithm.
http://msdn.microsoft.com/en-us/library/cc248976%28v=prot.10%29.aspx
(Reporter)

Comment 15

6 years ago
(In reply to Masatoshi Kimura [:emk] from comment #14)
> They also published their algorithm.
> http://msdn.microsoft.com/en-us/library/cc248976%28v=prot.10%29.aspx
This algorithm does not match what IE actually does after MS11-057. It always eats DBCS second bytes. So we need an algorithm to handle invalid sequences for security reasons.
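
To illustrate why the handling of invalid sequences matters, here is a minimal Python sketch comparing two policies for a lead byte followed by an invalid trail byte. It is a toy, not any browser's implementation: `decode_structure`, its `eat_invalid_trail` flag, and the "#" placeholder for valid double-byte pairs are all hypothetical, and the mapping tables are omitted. The non-eating policy follows the Encoding Standard's Shift_JIS decoder as I understand it, which puts an ASCII byte back into the stream after an invalid sequence.

```python
REPLACEMENT = "\ufffd"


def is_lead(b: int) -> bool:
    # Shift_JIS lead byte ranges.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_trail(b: int) -> bool:
    # Shift_JIS trail byte ranges.
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def decode_structure(data: bytes, eat_invalid_trail: bool) -> str:
    """Toy decoder: valid double-byte pairs become '#' (no mapping tables)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b <= 0x80:
            out.append(chr(b))  # ASCII (and 0x80) pass through
        elif 0xA1 <= b <= 0xDF:
            out.append(chr(0xFF61 - 0xA1 + b))  # half-width katakana
        elif is_lead(b):
            if i == len(data):
                out.append(REPLACEMENT)  # truncated sequence at end of input
                break
            t = data[i]
            if is_trail(t):
                out.append("#")  # placeholder for a mapped character
                i += 1
            else:
                out.append(REPLACEMENT)
                # An ASCII byte is consumed only under the "eating" policy;
                # a non-ASCII invalid trail byte is consumed in both cases.
                if eat_invalid_trail or t >= 0x80:
                    i += 1
        else:
            out.append(REPLACEMENT)  # 0xA0 and 0xFD to 0xFF
    return "".join(out)
```

With the input `b'\x81">'`, the eating policy swallows the quotation mark that follows the stray lead byte, while the other policy keeps it. In HTML, a swallowed `"` or `<` can change where an attribute or tag ends, which is the kind of issue the security concern refers to.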
(Assignee)

Comment 16

a year ago
(In reply to Masatoshi Kimura [:emk] from comment #0)
> Encoding Standard removed some "Gecko quirks" from the Shift_JIS decoder.
> http://dvcs.w3.org/hg/encoding/rev/7c876db1159c
> 1. The fallback code point is no longer U+30FB.
> 2. 0xA0 and 0xFD to 0xFF do no longer emit PUA code points
> 3. EUDC code ranges are no longer supported.
> Removing 1. and 2. may be fine. WebKit doesn't support them either. But I'm
> not sure it is acceptable to remove EUDC code ranges. EUDC code ranges are
> extensively used by Web pages for Japanese mobile phone.

Proceeding with removal of quirks #1 and #2. (Not supported by Blink or Presto, either.)

EUDC was restored in the Encoding Standard and is supported by encoding_rs.
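
For illustration, the EUDC handling as it stands in the Encoding Standard can be sketched in Python as follows. This is a sketch of the pointer arithmetic only, not the encoding_rs code: the function names `shift_jis_pointer` and `eudc_to_pua` are hypothetical, and the index lookup for non-EUDC pointers is left out.

```python
EUDC_FIRST_POINTER = 8836   # pointer for bytes 0xF0 0x40
EUDC_LAST_POINTER = 10715   # pointer for bytes 0xF9 0xFC


def shift_jis_pointer(lead: int, trail: int):
    """Return the pointer for a two-byte sequence, or None if it is invalid."""
    if not (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC):
        return None
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        return None
    lead_offset = 0x81 if lead < 0xA0 else 0xC1
    trail_offset = 0x40 if trail < 0x7F else 0x41
    return (lead - lead_offset) * 188 + trail - trail_offset


def eudc_to_pua(lead: int, trail: int):
    """Return the PUA character for an EUDC byte pair, or None otherwise."""
    pointer = shift_jis_pointer(lead, trail)
    if pointer is None:
        return None
    if not (EUDC_FIRST_POINTER <= pointer <= EUDC_LAST_POINTER):
        return None  # not EUDC; a real decoder would consult the jis0208 index
    return chr(0xE000 - EUDC_FIRST_POINTER + pointer)
```

Under this arithmetic, lead bytes 0xF0 to 0xF9 cover U+E000 through U+E757, so a code point such as U+EB00 is outside what the formula can produce.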
Depends on: 1261841
(Assignee)

Comment 17

10 months ago
Bug 1261841 removed quirks #1 and #2.
Assignee: nobody → hsivonen
Status: NEW → RESOLVED
Last Resolved: 10 months ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Target Milestone: --- → mozilla56