Closed Bug 747762 Opened 12 years ago Closed 7 years ago

Investigate Shift_JIS decoder changes of Encoding Standard

Categories: Core :: Internationalization, defect
Priority: Not set, Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla56
People: Reporter: emk, Assigned: hsivonen
References: Blocks 1 open bug
Details: Whiteboard: [fixed by encoding_rs]
Attachments: 1 file


Encoding Standard removed some "Gecko quirks" from the Shift_JIS decoder.
http://dvcs.w3.org/hg/encoding/rev/7c876db1159c
1. The fallback code point is no longer U+30FB.
2. 0xA0 and 0xFD to 0xFF no longer emit PUA code points.
3. EUDC code ranges are no longer supported.
Removing 1. and 2. may be fine. WebKit doesn't support them either. But I'm not sure it is acceptable to remove EUDC code ranges. EUDC code ranges are extensively used by Web pages for Japanese mobile phones.
Attached file Test case
Only Opera did not support EUDC code ranges.
Can we do something better than mapping to PUA (e.g. Unicode emoji)? Do Android and iOS have special fonts for the EUDC range?
> Can we do something better than mapping to PUA (e.g. Unicode emoji)?
That's impossible because the PUA-to-Unicode mapping differs between carriers :(
> Do Android and iOS have special fonts for the EUDC range?
Dunno. Web pages for smartphones should use UTF-8 from the start. My main concern is Japanese feature phones, which support only Shift_JIS.
Is Gecko shipping on those phones? What is the end user scenario here?
There is a Firefox add-on for developing Japanese mobile-phone sites on a PC.
http://firemobilesimulator.org/
Removing EUDC mappings will affect the add-on.
BTW, KDDI, a Japanese mobile-phone carrier, provides Opera Mobile on their feature phones (where it is called "PC site viewer"). It seems to support KDDI emoji.
Yeah, for select Japanese products it seems Opera has some mappings to PUA, but they are not exhaustive and are limited to those products. It seems kind of weird to keep this given that the content can only be consumed on those specific phones.
(In reply to Anne van Kesteren from comment #2)
> Can we do something better than mapping to PUA (e.g. Unicode emoji)? Do
> Android and iOS have special fonts for the EUDC range?
It looks like the Softbank iPhone supports Softbank emoji in the EUDC range.
(In reply to Anne van Kesteren from comment #7)
>  It seems kind of
> weird to keep this given that the content can only be consumed on those
> specific phones.
Mobile-phone pages are also viewable from PC browsers. Some people even prefer the mobile version to the full version, which is full of ads.
Docomo mobile-phone pages can be served as application/xhtml+xml [1]. If those pages contain an emoji, normal browsers (including smartphone browsers) cannot display them at all, because the fallback code point is fatal in XML.
But Opera will not be affected, because Opera decided to violate the spec [2].
[1] http://www.nttdocomo.co.jp/service/developer/make/content/browser/xhtml/notice/basis/index.html
[2] http://my.opera.com/ODIN/blog/2011/09/28/no-more-xml-parsing-failed-errors
I'm inclined to agree with Shawn. Legacy encodings are included to support legacy content that is unlikely to be updated. Any innocent-looking change to an encoding will break some of it.
As can be seen from e.g. http://www.unicode.org/~scherer/emoji4unicode/snapshot/utc.html, the KDDI mapping is different from the algorithm Gecko has employed (search for "U+EB", which is the start of a PUA range that the algorithm you have cannot generate). Each vendor has its own conversion table, including to PUA. Not supporting this at all seems better for the end user since we have no idea what the page meant.
Some mobile phones do not even support the NEC/IBM extensions. The Softbank emoji mappings overlap with the IBM extensions [1].
I don't care about extensions which are incompatible with Microsoft Codepage 932.
Microsoft used to publish their EUDC-to-PUA mappings [2]. Although they removed the document, they have not changed (and never will change) their implementation. It should be documented elsewhere.
[1] http://d.hatena.ne.jp/NAOI/20120423/1335164541
[2] http://web.archive.org/web/*/http://microsoft.com/typography/unicode/932.txt
(In reply to Masatoshi Kimura [:emk] from comment #13)
> Microsoft used to publish their EUDC-to-PUA mappings [2]. Although they
> removed the document,
I found that they published the data file again.
https://www.microsoft.com/en-us/download/details.aspx?id=10921
They also published their algorithm.
http://msdn.microsoft.com/en-us/library/cc248976%28v=prot.10%29.aspx
(In reply to Masatoshi Kimura [:emk] from comment #14)
> They also published their algorithm.
> http://msdn.microsoft.com/en-us/library/cc248976%28v=prot.10%29.aspx
This algorithm does not match what IE actually does after MS11-057. It always eats DBCS second bytes, so for security reasons we need an algorithm that handles invalid sequences.
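To illustrate why eating the second byte matters: when a lead byte is followed by a byte that cannot be a trail byte, a decoder that consumes both can swallow a quote or `<` and change how the markup parses. The sketch below contrasts the two policies. It is a toy under stated assumptions, not IE's or Gecko's code: `decode_sketch` and its `eat_trail` flag are hypothetical names, no JIS X 0208 table is included (every valid pair becomes a placeholder character), and the non-eating branch follows the rule the Encoding Standard later adopted (emit U+FFFD and reprocess the ASCII byte).

```python
REPLACEMENT = "\ufffd"
PLACEHOLDER = "\u3013"  # stands in for a real JIS X 0208 table lookup


def is_lead(b: int) -> bool:
    """True for Shift_JIS lead bytes (0x81..0x9F, 0xE0..0xFC)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def is_trail(b: int) -> bool:
    """True for Shift_JIS trail bytes (0x40..0x7E, 0x80..0xFC)."""
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC


def decode_sketch(data: bytes, eat_trail: bool) -> str:
    """Toy Shift_JIS decoder showing two policies for invalid pairs."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b <= 0x80:
            out.append(chr(b))  # ASCII (and 0x80) pass through
        elif 0xA1 <= b <= 0xDF:
            out.append(chr(0xFF61 - 0xA1 + b))  # half-width katakana
        elif is_lead(b) and i < len(data):
            trail = data[i]
            if is_trail(trail):
                out.append(PLACEHOLDER)  # table lookup omitted in this sketch
                i += 1
            else:
                out.append(REPLACEMENT)
                # Eating policy: always consume the second byte.
                # Safe policy: leave an ASCII byte in place to be reprocessed.
                if eat_trail or trail >= 0x80:
                    i += 1
        else:
            out.append(REPLACEMENT)  # stray byte or truncated input
    return "".join(out)


markup = b'<a title="\x81">x</a>'
print(decode_sketch(markup, eat_trail=True))   # closing quote is swallowed
print(decode_sketch(markup, eat_trail=False))  # closing quote survives
```

With `eat_trail=True` the attribute's closing quote disappears from the decoded text, which is the kind of behavior an attacker could exploit to break out of an attribute.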
(In reply to Masatoshi Kimura [:emk] from comment #0)
> Encoding Standard removed some "Gecko quirks" from the Shift_JIS decoder.
> http://dvcs.w3.org/hg/encoding/rev/7c876db1159c
> 1. The fallback code point is no longer U+30FB.
> 2. 0xA0 and 0xFD to 0xFF no longer emit PUA code points.
> 3. EUDC code ranges are no longer supported.
> Removing 1. and 2. may be fine. WebKit doesn't support them either. But I'm
> not sure it is acceptable to remove EUDC code ranges. EUDC code ranges are
> extensively used by Web pages for Japanese mobile phones.

Proceeding with removal of quirks #1 and #2. (Not supported by Blink or Presto, either.)

EUDC was restored in the Encoding Standard and is supported by encoding_rs.
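
For reference, the EUDC handling that the Encoding Standard's Shift_JIS decoder ended up with can be sketched as follows. This is a minimal sketch, not encoding_rs code: the function name `eudc_to_pua` is made up for illustration, and it covers only the EUDC lead bytes 0xF0 to 0xF9, which the standard's pointer arithmetic maps linearly onto U+E000 to U+E757.

```python
def eudc_to_pua(lead: int, trail: int):
    """Map a Shift_JIS EUDC byte pair to a PUA code point, or None.

    Hypothetical helper; follows the pointer arithmetic of the Encoding
    Standard's Shift_JIS decoder, restricted to the EUDC rows.
    """
    if not 0xF0 <= lead <= 0xF9:
        return None  # not an EUDC lead byte
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        return None  # not a valid trail byte
    trail_offset = 0x40 if trail < 0x7F else 0x41
    # Lead bytes at or above 0xA0 use a lead offset of 0xC1;
    # each lead byte spans 188 trail values.
    pointer = (lead - 0xC1) * 188 + trail - trail_offset
    # Pointers 8836..10715 form the EUDC range, mapped linearly into the PUA.
    return 0xE000 - 8836 + pointer


print(hex(eudc_to_pua(0xF0, 0x40)))  # first EUDC cell: 0xe000
print(hex(eudc_to_pua(0xF9, 0xFC)))  # last EUDC cell: 0xe757
```

Because the mapping is linear, it can only ever produce the 1880 code points from U+E000 to U+E757, so carrier-specific PUA assignments outside that range are not reachable through it.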
Depends on: encoding_rs
Bug 1261841 removed quirks #1 and #2.
Assignee: nobody → hsivonen
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by encoding_rs]
Target Milestone: --- → mozilla56