Closed Bug 912470 Opened 11 years ago Closed 9 years ago

Merge big5-hkscs and big5 as per the Encoding Standard

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox43 --- fixed
relnote-firefox --- 43+

People

(Reporter: annevk, Assigned: hsivonen)

References

(Blocks 1 open bug)

Details

(Whiteboard: [If you believe this change regressed something, please read comment 88 about intentional changes and patchable XP breakage.])

Attachments

(6 files, 13 obsolete files)

95.78 KB, text/plain
301.02 KB, text/plain
285.35 KB, application/zip
1.72 MB, patch (hsivonen: review+)
384.79 KB, patch (emk: review+)
1.66 KB, patch (emk: review+)
We should reevaluate bug 845743 per https://www.w3.org/Bugs/Public/show_bug.cgi?id=21146 as we did not implement the merge correctly. Data suggests we can try this again and get better results.
Anne, can you summarize what's happened with Gecko recently with regards to Big5? Was there a change in the mapping(s) that still doesn't match your spec?
Many versions ago we aliased the labels I think, but did not change the index, which obviously is a non-starter.
OK, so you've been using Mozilla's Big5-UAO-like mapping for both labels? That sounds really, really bad.
I think the plan is dropping the Big5-UAO mapping. So I said we would need an acknowledgement from MozTW community.
As evidenced by dev.platform threads, localization teams have made many bad judgment calls with respect to encodings. We have some amount of data here, as well as competing implementations not having the Big5-UAO mapping, showing we should move away from it. An acknowledgment would be good, but I don't think it is required or in any way final.
(In reply to Anne (:annevk) from comment #2)
> Many versions ago we aliased the labels I think, but did not change the
> index, which obviously is a non-starter.

We undid the aliasing, because it broke someone's intranet and they threatened to switch to Chrome: bug 845743.

Before trying aliasing again, we should audit the decoders to see that the big5 decoder matches the spec.
Blocks: encoding
Hi Anne, Philip,

I want to thank WitchFire for bringing this bug to attention in the thread.
https://groups.google.com/forum/#!topic/moztw-general/d5nRslb9h1k

As there are no definite statistics on usage of big5-uao or big5-eten on the web, if we drop the B2U (Big5 to Unicode) support of these mappings, the impact might be mitigable.

However, from a practical point of view, there will be serious problems if we use big5-hkscs as the U2B (Unicode to Big5) mapping. The smallest and safest version of big5 is Windows CP950, and we should use that for U2B. If we use hkscs, the hkscs characters represented in big5 bytes will not be shown correctly in any other browsers out there.

Does that make sense?
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #9)
> Hi Anne, Philip,
> 
> I want to thank WitchFire for bringing this bug to attention in the thread.
> https://groups.google.com/forum/#!topic/moztw-general/d5nRslb9h1k
> 
> As there is no definite statistic on usage of big5-uao or big5-eten on the
> web, if we drop the B2U (Big5 to Unicode) support of these mappings, the
> impact might be mitigable.

More research and statistics would be great. Lacking that, in <http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0046.html> I concluded that "Using Big5-UAO for Taiwanese sites would give mixed results. Correctly encoded Big5-UAO is very rare, so the tested mapping (Firefox) introduces almost as many user-visible misencodings as it fixes and masks many others."

> However, from a practical point of view, there will be serious problems if
> we use big5-hkscs as the U2B (Unicode to Big5) mapping. The smallest and
> safest set version of big5 is Windows CP950 and we should use that for U2B.
> If we use hkscs, the hkscs characters represented in big5 bytes will not be
> shown correctly in any other browsers out there.
> 
> Does that make sense?

Does CP950 match what any browser already does for the label "big5" and in what sense is it the safest? If it is (more or less) a subset of Big5-HKSCS then it's unlikely to be an improvement, but one would have to compare how many sites Big5-HKSCS fixes vs how many bogus characters it produces.

Which "big5 bytes will not be shown correctly in any other browsers out there"?
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #9)
> However, from a practical point of view, there will be serious problems if
> we use big5-hkscs as the U2B (Unicode to Big5) mapping. The smallest and
> safest set version of big5 is Windows CP950 and we should use that for U2B.
> If we use hkscs, the hkscs characters represented in big5 bytes will not be
> shown correctly in any other browsers out there.
> 
> Does that make sense?

Per the spec, the big5 encoder will not emit hkscs specific characters.
http://encoding.spec.whatwg.org/#big5-encoder
Oops, I didn't understand that Tim was talking about the encoder, please disregard my comment.
(In reply to philip from comment #12)
> Oops, I didn't understand that Tim was talking about the encoder, please
> disregard my comment.

Sorry, my English is very very poor.

https://bugzilla.mozilla.org/show_bug.cgi?id=310299#c0

Maybe this explanation is clearer.
As a Chinese user in Hong Kong, I recommend maintaining the big5-hkscs and big5 encodings separately.

If you merge big5-hkscs and big5, a few non-UTF-8 Taiwan or Hong Kong web pages will not display properly.
=== from 310299 ===
The problem is, if a user browsing non-Big5 pages (e.g., sjis or utf8) 
copied some characters not in CP950 (e.g, Japanese hitakana) and pasted to 
Big5 websites then other users with pure CP950 environment (e.g, a Japanese
using Japanese Windows and Internet Explorer) cannot see these characters
correctly. They will mostly get blank display. But if we use real CP950 table
then they will be encoded as HTML entity form so that everybody (even with
original CP950+IE) can read it correctly.
===============

If a code point is not in the u2b table, the browser encodes it as an HTML numeric character reference ("&#unicode;"), and any other browser will decode that reference correctly.

If someone enters a character whose code point is in big5-hkscs while the site's character map is hkscs, the u2b table sends out a big5-pua character.

In the end, IE and Opera cannot display those big5-pua characters because they don't support big5-hkscs.

If we drop the hkscs u2b table and use the CP950 u2b table instead, the same situation gives a different result.

If someone enters a character whose code point is in big5-hkscs while the site's character map is hkscs, the u2b table is now CP950, so those hkscs characters have "no correspondence". The browser encodes them as HTML numeric character references and sends out "&#unicode;", which IE and Opera can decode correctly.

The description above is what bug 403564 really wants to do.

This is also the reason why very few TW web pages use big5-uao: bug 310299 has used this method to prevent it since 2006.
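
The fallback described above can be sketched in a few lines of C++. This is only an illustration: `U2BTable`, `EncodeForForm`, and the single sample table entry are hypothetical names and data, not Gecko's actual encoder. The point is that anything missing from the table leaves the browser as a decimal numeric character reference that any other browser can decode.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical U2B table: code point -> Big5 bytes packed as (lead << 8) | trail.
using U2BTable = std::map<char32_t, std::uint16_t>;

// Encode one code point for form submission.
// ASCII passes through, table hits become two Big5 bytes, and everything
// else becomes a decimal numeric character reference ("&#N;").
std::string EncodeForForm(char32_t cp, const U2BTable& table) {
  if (cp < 0x80) {
    return std::string(1, static_cast<char>(cp));
  }
  auto it = table.find(cp);
  if (it != table.end()) {
    std::string out;
    out.push_back(static_cast<char>(it->second >> 8));
    out.push_back(static_cast<char>(it->second & 0xFF));
    return out;
  }
  return "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
}
```

With a CP950-sized table, a pasted hiragana character such as U+3042 would not be found and would go out as "&#12354;" instead of as bytes that only some browsers understand.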
If I understand comment 11 correctly, the Encoding Standard already states that the u2b encoder should only output non-hkscs bytes. I am not sure whether that makes the specified u2b mapping exactly equal to the cp950 mapping we are using right now, but given the reasons above it is worth considering amending the standard rather than the code base.
The Encoding Standard includes the Windows extensions already as far as I know. So it sounds like we are all in agreement it specifies what we should try to implement here.
(In reply to Anne (:annevk) from comment #17)
> The Encoding Standard includes the Windows extensions already as far as I
> know. So it sounds like we are all in agreement it specifies what we should
> try to implement here.

We need to confirm that the big5 encoder does NOT include extensions other than CP950. (The big5 decoder can contain non-CP950 extensions.)
Again, as far as I can tell that's how it is specified per discussion of about a year ago now with Philip et al.
(In reply to Anne (:annevk) from comment #19)
> Again, as far as I can tell that's how it is specified per discussion of
> about a year ago now with Philip et al.

Great, it looks like we have reached an agreement on the big5 encoder (u2b) part.

Is there any further concern?

(In reply to philip from comment #10)
> More research and statistics would be great. Lacking that, in
> <http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0046.html> I
> concluded that "Using Big5-UAO for Taiwanese sites would give mixed results.
> Correctly encoded Big5-UAO is very rare, so the tested mapping (Firefox)
> introduces almost as many user-visible misencodings as it fixes and masks
> many others."
> 

For the big5 decoder (b2u), has anyone reached any conclusion other than what Philip said above?
I took a brief look at what it would take to implement this.

big5.ut says:
>  Big5 to Unicode table is based on Big5-2003 plus UA0
>
>  Mapping tables used to generate the file:
>
>  Big5-2003: http://moztw.org/docs/big5/table/big5_2003-b2u.txt
>  UAO2.41:   http://moztw.org/docs/big5/table/uao241-b2u.txt

It seems to me that neither of those tables maps any Big5 two-byte sequence to a UTF-16 code unit sequence longer than 1. However, http://encoding.spec.whatwg.org/index-big5.txt has plenty of astral mappings. Also, http://encoding.spec.whatwg.org/#big5 specifies four big5 pointers that map to two UTF-16 code units while staying on the BMP by mapping to a base character followed by a combining character. Considering these four mappings and the astral mappings, we need a decoder where a single big5 pointer can result in two UTF-16 code units.
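
To make the two-code-unit requirement concrete, here is a small sketch of how one code point becomes one or two UTF-16 code units, plus the four base-plus-combining sequences. The struct and function names are made up for illustration. The four pointer values are the ones I understand the Encoding Standard's Big5 decoder to special-case, so check them against the spec before relying on them.

```cpp
// Result of decoding one Big5 pointer: one or two UTF-16 code units.
struct Utf16Units {
  char16_t first;
  char16_t second;  // 0 means only |first| is used
};

// Convert a single code point to UTF-16 (a surrogate pair for astral characters).
Utf16Units ToUtf16(char32_t cp) {
  if (cp < 0x10000) {
    return {static_cast<char16_t>(cp), 0};
  }
  cp -= 0x10000;
  return {static_cast<char16_t>(0xD800 + (cp >> 10)),
          static_cast<char16_t>(0xDC00 + (cp & 0x3FF))};
}

// Pointers that decode to a base character followed by a combining mark.
// Returns false for every other pointer.
bool CombiningSequence(unsigned pointer, Utf16Units* out) {
  switch (pointer) {
    case 1133: *out = {0x00CA, 0x0304}; return true;
    case 1135: *out = {0x00CA, 0x030C}; return true;
    case 1164: *out = {0x00EA, 0x0304}; return true;
    case 1166: *out = {0x00EA, 0x030C}; return true;
    default:   return false;
  }
}
```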

nsBIG5ToUnicode doesn't implement the big5 math to get from bytes to a big5 pointer. Instead, it seems to use an abstraction called CreateMultiTableDecoder that maps two-byte byte sequences directly to UTF-16 code units.
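
For reference, the "big5 math" is small. The sketch below follows the pointer computation as I read it in the Encoding Standard's Big5 decoder. `Big5Pointer` is a hypothetical helper name, and returning -1 is just this sketch's way of reporting an invalid byte pair.

```cpp
#include <cstdint>

// Compute the Big5 pointer for a lead/trail byte pair.
// Returns -1 if the pair cannot be a two-byte Big5 sequence.
int Big5Pointer(std::uint8_t lead, std::uint8_t trail) {
  if (lead < 0x81 || lead > 0xFE) {
    return -1;
  }
  bool trailOk = (trail >= 0x40 && trail <= 0x7E) ||
                 (trail >= 0xA1 && trail <= 0xFE);
  if (!trailOk) {
    return -1;
  }
  // Trail bytes come in two runs; the offset closes the gap between them.
  int offset = trail < 0x7F ? 0x40 : 0x62;
  return (lead - 0x81) * 157 + (trail - offset);
}
```

For example, the byte pair 0xA4 0x40 gives (0xA4 - 0x81) * 157 + 0 = 5495.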

Can the existing abstraction deal with two input bytes decoding into two UTF-16 code units? If not, should we just implement the spec (potentially with some custom optimizations to pack the long ranges of single-UTF-16-unit pointer mappings more densely than ranges that contain two-UTF-16-unit pointer mappings) and abandon CreateMultiTableDecoder for big5?

Considering the history of bug 310299, are we now OK with using the new index from the spec for the *encoder*? Does the HKSCS avoidance provision in the encoder algorithm in the spec happen to avoid all cases where two UTF-16 code units would map to a single big5 pointer value? That is, can we use the existing *encoder* abstraction if we regenerate big5.uf and the corresponding C code to avoid UAO?
Needinfoing Tim for the question about CreateMultiTableDecoder in the previous comment.
Flags: needinfo?(timdream)
(In reply to Henri Sivonen (:hsivonen) from comment #21)
> I took a brief look at what it would take to implement this.
> 
> big5.ut says:
> >  Big5 to Unicode table is based on Big5-2003 plus UA0
> >
> >  Mapping tables used to generate the file:
> >
> >  Big5-2003: http://moztw.org/docs/big5/table/big5_2003-b2u.txt
> >  UAO2.41:   http://moztw.org/docs/big5/table/uao241-b2u.txt
> 
> It seems to me that neither of those tables maps any Big5 two-byte sequence
> to a UTF-16 code unit sequence longer than 1. However,
> http://encoding.spec.whatwg.org/index-big5.txt has plenty of astral
> mappings. Also, http://encoding.spec.whatwg.org/#big5 specifies four big5
> pointers that map to two UTF-16 code units while staying on the BMP by
> mapping to a base character followed by a combining character. Considering
> these four mappings and the astral mappings, we need a decoder where a
> single big5 pointer can result in two UTF-16 code units.

Yes, you are talking about bug 162431, mentioned in bug 343129 comment 12. These characters are re-mapped from Unicode PUA to Plane 2 in the HKSCS 2004 update, and we have never been able to support that.

As for the four astral mappings, I have no idea where they come from. Maybe "historians" like But or Witch Five can give us more insight on that, needinfo'ing.

> 
> nsBIG5ToUnicode doesn't implement the big5 math to get from bytes to a big5
> pointer. Instead, it seems to use an abstraction called
> CreateMultiTableDecoder that maps two-byte byte sequences directly to UTF-16
> code units.
> 
> Can the existing abstraction deal with two input bytes decoding into two
> UTF-16 code units? If not, should we just implement the spec (potentially
> with some custom optimizations to pack the long ranges of single-UTF-16-unit
> pointer mappings more densely than ranges that contain two-UTF-16-unit
> pointer mappings) and abandon CreateMultiTableDecoder for big5?
> 
> Considering the history of bug 310299, are we now OK with using the new
> index from the spec for the *encoder*? Does the HKSCS avoidance provision in
> the encoder algorithm in the spec happen to avoid all cases where two UTF-16
> code units would map to a single big5 pointer value? That is, can we use the
> existing *encoder* abstraction if we regenerate big5.uf and the
> corresponding C code to avoid UAO?

I have no understanding of how Gecko works internally so I will not be helpful on implementation discussion. Sorry about that. I didn't check the index itself but the encoder part of the spec looks good (with a note stating avoiding emitting HKSCS extensions literally).

For the decoder it is indeed desirable to map HKSCS characters to Plane 2, in accordance with the spec.
Flags: needinfo?(timdream) → needinfo?(s793016)
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #23)
> Yes, you are talking about bug 162431, mentioned in bug 343129 comment 12.
> These characters are re-mapped from Unicode PUA to Plane 2 in the HKSCS 2004
> update, and we have never been able to support that.
> 

bug 403564 is the HKSCS 2004 mapping bug blocked by bug 162431.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #23)
> Yes, you are talking about bug 162431, mentioned in bug 343129 comment 12.
> These characters are re-mapped from Unicode PUA to Plane 2 in the HKSCS 2004
> update, and we have never been able to support that.

I see. Thanks.

Looking at the Encoding Standard, Big5 is the only encoding that includes astral characters in a non-algorithmic way. (That is, astral characters in UTF-8, UTF-16[BE|LE] and GB18030 don't need a lookup table of any kind.) Since we're not going to support EUC-TW in mozilla-central, I think it doesn't make sense to extend the existing multi-byte encoder and decoder machinery to deal with astral characters in lookup tables.

Instead, I suggest that we implement Big5 as specified in the Encoding Standard as a one-off special case with the data structures suitable for that one case without trying to shoehorn it into the existing encoder and decoder machinery.
Suggested implementation

(Deal with Basic Latin / ASCII in the obvious way.)

Decoder:

Types:

char16_t

struct pair {
  char16_t first;
  char16_t second;
};

Arrays:
static char16_t const NARROW[] = { ... };

static pair const WIDE[] = { ... };

NARROW consists of the UTF-16 code units for the index ranges
5024...11204 11254...18996
packed contiguously so that the code unit for index 5024 is at array index 0 and the code unit for index 11254 comes right after the code unit for index 11204.

WIDE consists of pairs for the index ranges
942...5023 11205...11213 18997...19781
likewise packed contiguously and where each pair is formatted as follows:
 * For BMP characters, the character is in |first| and |second| is zero.
 * For astral characters, the high surrogate is in |first| and the low surrogate in |second|.
 * For the four base character plus combining mark sequences, the base character is in |first| and the combining character in |second|.

Implement the algorithm for getting from two bytes to index from the spec. Check which of the five ranges the index falls into, adjust by the appropriate offset to get an array offset and read from the appropriate array. If |second| in a pair is zero, ignore |second| and only use |first|.
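
The range check could look roughly like this. The five ranges and the packing order come from the description above. `Table`, `Slot` and `Locate` are hypothetical names, and the array contents themselves are left out.

```cpp
#include <cstddef>

enum class Table { None, Narrow, Wide };

// Where a pointer's entry lives: which array, and at which offset.
struct Slot {
  Table table;
  std::size_t index;
};

// Lengths of the ranges that precede later ranges in each packed array.
constexpr std::size_t kNarrowFirstLen = 11204 - 5024 + 1;  // 6181
constexpr std::size_t kWideFirstLen = 5023 - 942 + 1;      // 4082
constexpr std::size_t kWideSecondLen = 11213 - 11205 + 1;  // 9

// Map a Big5 pointer to its slot in NARROW or WIDE.
Slot Locate(std::size_t pointer) {
  if (pointer >= 942 && pointer <= 5023) {
    return {Table::Wide, pointer - 942};
  }
  if (pointer >= 5024 && pointer <= 11204) {
    return {Table::Narrow, pointer - 5024};
  }
  if (pointer >= 11205 && pointer <= 11213) {
    return {Table::Wide, kWideFirstLen + (pointer - 11205)};
  }
  if (pointer >= 11254 && pointer <= 18996) {
    return {Table::Narrow, kNarrowFirstLen + (pointer - 11254)};
  }
  if (pointer >= 18997 && pointer <= 19781) {
    return {Table::Wide, kWideFirstLen + kWideSecondLen + (pointer - 18997)};
  }
  return {Table::None, 0};  // pointer is outside every listed range
}
```

Pointers outside the five listed ranges are reported as `Table::None`, which the caller would treat as a decoding error.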


Encoder:

Discard index entries below 5024 (HKSCS). Separate {index, character} pairs into BMP characters and plane 2 characters. Discard the plane number (subtract 0x20000) from the plane 2 characters.

Sort the BMP pairs by character and the plane 2 pairs by character.

In the pairs, replace index by two bytes that are the Big5 bytes for that index. (Might as well precompute this, since we'd need 16 bits of space for the index if we used that.)

Create four arrays:

 * BMP characters (16-bit array entries)
 * Big5 bytes for those characters in the same order. (16-bit struct array entries consisting of two 8-bit fields)
 * Plane 2 characters without plane number (16-bit array entries)
 * Big5 bytes for those characters in the same order. (16-bit struct array entries consisting of two 8-bit fields)

If the input is a non-surrogate, perform a binary search on the BMP array. If not found, emit error. Otherwise, output the Big5 bytes at the same index from the second array.

If the input is a high surrogate, consume the following low surrogate and do the math into a code point. If not on plane 2, emit error. Otherwise, discard the plane number and perform a binary search on the plane 2 array. If not found, emit error. Otherwise, output the Big5 bytes at the same index from the fourth array.
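
A rough sketch of the encoder-side pieces, under the same caveats. All names are hypothetical, the three table entries are sample data only (a real table would be generated from index-big5.txt), and the plane 2 lookup is omitted because it is the same binary search over the third and fourth arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

struct Big5Bytes {
  std::uint8_t lead;
  std::uint8_t trail;
};

// Index entries below this pointer (the HKSCS area) are not used for encoding.
constexpr unsigned kFirstEncodablePointer = (0xA1 - 0x81) * 157;  // 5024

// Precompute step: turn a pointer into its two Big5 bytes.
Big5Bytes BytesForPointer(unsigned pointer) {
  unsigned lead = pointer / 157 + 0x81;
  unsigned trail = pointer % 157;
  unsigned offset = trail < 0x3F ? 0x40 : 0x62;
  return {static_cast<std::uint8_t>(lead),
          static_cast<std::uint8_t>(trail + offset)};
}

// Sample parallel arrays, sorted by character (illustrative entries only).
static const char16_t kBmpChars[] = {0x4E00, 0x4E01, 0x4E59};
static const Big5Bytes kBmpBytes[] = {{0xA4, 0x40}, {0xA4, 0x42}, {0xA4, 0x41}};

// Binary search the BMP array; on a hit, copy the bytes at the same index.
bool EncodeBmp(char16_t c, Big5Bytes* out) {
  const char16_t* begin = std::begin(kBmpChars);
  const char16_t* end = std::end(kBmpChars);
  const char16_t* it = std::lower_bound(begin, end, c);
  if (it == end || *it != c) {
    return false;  // unmappable: the caller emits an error
  }
  *out = kBmpBytes[it - begin];
  return true;
}

// Combine a surrogate pair; succeed only for plane 2 and return the
// code point with the plane number (0x20000) subtracted.
bool Plane2Key(char16_t high, char16_t low, char16_t* key) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    return false;
  }
  char32_t cp = 0x10000 + ((char32_t(high) - 0xD800) << 10) +
                (char32_t(low) - 0xDC00);
  if ((cp >> 16) != 2) {
    return false;
  }
  *key = static_cast<char16_t>(cp - 0x20000);
  return true;
}
```

Keeping the characters and the bytes in separate parallel arrays means the binary search touches only the 16-bit character array, which is the layout the four-array description above suggests.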

- -

Does that seem reasonable?
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #23)
> As for the four astral mappings, I have no idea where they come from.
> Maybe "historians" like But or Witch Five can give us more insight on that,
> needinfo'ing.

Assuming you mean the 4 combining characters instead, they have been proposed for inclusion in ISO 10646 as combined sequences since day one:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2807.pdf

So they will very likely remain so forever, which I think is odd, since similar characters like U+1EBE do exist as single code point.
On the encoder, there is some more discussion in the local community worth translating and bringing up here.

https://groups.google.com/d/msg/moztw-general/d5nRslb9h1k/Dssjpng70oEJ

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #23)
> I have no understanding of how Gecko works internally so I will not be
> helpful on implementation discussion. Sorry about that. I didn't check the
> index itself but the encoder part of the spec looks good (with a note
> stating avoiding emitting HKSCS extensions literally).

One of the concerns from a Hong Kong user is that websites might not respond well in site search forms. The current Big5-HKSCS mapping in Gecko is symmetric, and encodes HKSCS-chars-in-BMP to their intended bytes. Under the new Encoding Standard, these characters will not be encoded; NCRs will be sent to the server instead, resulting in 0 search results. The STR is as follows:

1. On a Big5-HKSCS encoded news website / archive
2. Search "衞生署" (Department of Health) (Note: not 衛生署, which used to be the agency of the same function in Taiwan).

With the current implementation:

1. The three characters will be encoded to their HKSCS specified bytes, and correct search result will return, e.g. "Search for 衞生署, got N results."

With the new implementation:

1. The first character will be converted to "&#34910;" and no search result will be returned. The search result page will read "Search for 衞生署, got 0 results." but the source code will be different (it will be the ncr we sent)


I did mention that the same issue should already have happened for non-BMP HKSCS chars since we did not fix bug 403564, resulting in these characters being sent out as NCRs, but I think he tried to say that having *some* of the HKSCS chars sent as NCRs is better than having all of them sent as NCRs.

- * - * - * -

What was the original intention behind coming up with a combined, asymmetric Big5 encoding index in the Encoding Standard in the first place? I personally understand that for the decoder it makes sense to move away from UAO since there is no credible usage on the web, but for the encoder this issue is unfortunately bad enough to be considered a blocker.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #28)
> With the current implementation:
> 
> 1. The three characters will be encoded to their HKSCS specified bytes, and
> correct search result will return, e.g. "Search for 衞生署, got N results."

But doesn't it conflict with the expectation of Taiwan people that 0x8F 0xC0 will be converted to "缳"?
We have decided that we will have only one Big5 converter, so the only neutral solution will be to encode the bytes to either PUA or NCR.

FYI, we Japanese people suffered from gazillions of EUC-JP dialects for a long time, so we decided to just encode non-standard characters to NCR. At least Google understands the query.
Hi, Masatoshi,

It does conflict.

Because I saw that you wrote "We have decided that we will have only one Big5 converter", I should make a quick reply first.

According to the new discussion that started in Hong Kong and Taiwan last weekend (Tim posted about it in comment #28), as I heard, it is possible that, once the big5* encoders/decoders are merged, big5-ucs (the Taiwan standard Big5 extension) and big5-hkscs (the Hong Kong standard Big5 extension) will conflict with each other. The merge could also make over 4000 Hong Kong Chinese characters (some of them commonly used in Hong Kong district names) display incorrectly.

In Hong Kong, big5-hkscs is still in use in public services such as the book enquiry system of the public libraries.

And the different Big5* standards are a completely different case from the Japanese encodings.

(In reply to Masatoshi Kimura [:emk] from comment #29)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #28)
> > With the current implementation:
> > 
> > 1. The three characters will be encoded to their HKSCS specified bytes, and
> > correct search result will return, e.g. "Search for 衞生署, got N results."
> 
> But doesn't it conflict with the expectation of Taiwan people that 0x8F 0xC0
> will be converted to "缳"?
> We have decided that we will have only one Big5 converter, so the only
> neutral solution will be to encode the bytes to either PUA or NCR.
> 
> FYI, we Japanese people suffered from gazillions of EUC-JP dialects
> for a long time, so we have decided to just encode non-standard characters
> to NCR. At least Google understands the query.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #28)
> 1. On a Big5-HKSCS encoded news website / archive

To what extent do Big5-HKSCS sites declare Big5-HKSCS?

> What is the original intentions to come up a combined, asymmetric Big5
> encoding index in the Encoding Standard spec at first place?

I suppose Anne can speak about his intentions more precisely, but my understanding is this:

In general, in the Encoding Standard, if one encoding is a superset of another, the subset encoding  is not specified on its own but its labels are made labels of the superset encoding, since bytes that aren't printable characters in the subset virtually always are the result of encoding as the superset. This works very well for single-byte encodings, because in those cases IE already did such aliasing and the additional characters tend to be punctuation and non-letter symbols, so search queries not matching byte-wise is not a problem as long as people search for words rather than punctuation.

The idea of having an asymmetric encoder comes from what Firefox did with EUC-JP: there are dialects of EUC-JP that don't have distinct labels, so even if with TIS-620 and windows-874 you *could* take the position that one isn't merely an alias of the other, with the EUC-JP dialects you don't have that option. The solution is to apply Postel's Law and make the decoder accept more dialects than the encoder produces.

Then those precedents get applied to Big5.

But there are a couple of additional twists. First, IE doesn't properly expose the MS flavor of Big5-HKSCS, Windows-951, as an encoding with its own label. Instead, when a patch is applied to Windows, systems with the patch applied start silently treating Big5/Windows-950 as Big5-HKSCS/Windows-951. Having an encoding mean different things depending on whether the system has been patched seems like a terrible idea to me, but here we are. This handling of the situation in the case of IE suggests that we are dealing with a case similar to EUC-JP: that all Big5/Big5-HKSCS labels really go into one mix and Firefox/Chrome insisting on the labels mapping to two distinct encodings might not be that useful. See http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-March/035232.html in particular for the reasons to believe that we should make HK content work when labeled just "Big5".

The second twist is that Gecko's Big5-HKSCS implementation has been broken for a decade, which suggests that there's an opportunity to change things.

Philip Jägenstedt took a look at content that's out there (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035370.html), and it seemed like it was possible to treat the actually used part of Big5-HKSCS as a superset of Big5 (as opposed to conflicting overlap). See comment 6.

On this basis, it seemed sensible to come up with a unified *decoder*. Since, as noted above, HK content can be labeled just "Big5", it doesn't seem useful to try to maintain separation on the *encoder* side, either.

The actual encoder spec seems less researched to me but the general idea is that it tries to address the problem described in bug 310299 comment 0, which is analogous to the EUC-JP problem.

Maybe we could try a symmetric encoder instead and see what breaks.

> In Hong Kong, big5-hkscs are still in use in public service such as book enquiry system of public libraries. 

URL? What label does that system use? (Considering that Big5-HKSCS implementations aren't currently consistent among browsers, it seems like a very bad idea for an actively maintained site to stick to Big5-HKSCS instead of migrating to UTF-8...)

(In reply to Masatoshi Kimura [:emk] from comment #29)
> so the only
> neutral solution will be to encode the bytes to either PUA or NCR.

Having PUA mappings seems like a bad situation. I think our goal should not be neutrality when having to decide what to do with a byte sequence but maximal success given the legacy content that's out there. (Firefox/Chrome-specific HK intranets that use the label "Big5-HKSCS" are the probable losers here. We know of one such intranet existing.)
File is generated from joining moztw big5-hkscs2004-b2u.txt and uao241-b2u.txt
(In reply to Abel Cheung from comment #32)
> File is generated from joining moztw big5-hkscs2004-b2u.txt and
> uao241-b2u.txt

Have you compared these with how IE decodes those byte sequences without the HKSCS patch? Or how Chrome or Safari decode those byte sequences?
(In reply to Henri Sivonen (:hsivonen) from comment #33)
> Have you compared these with how IE decodes those byte sequences without the
> HKSCS patch? Or how Chrome or Safari decode those byte sequences?

Yup, planning to do that later tonight when back home, my Linux laptop has no Safari / IE. For the CP951 patch, IMHO it's just a fragment of insignificant history. I wonder how many people actually have installed it, not to mention how many such systems survived till today.

Along with those comparisons will come my (longish) comment too.
For now, some little piece to read -- I have written about what CP951 is about in the past, location here: http://me.abelcheung.org/articles/research/what-is-cp951/
Hi Henri,

(In reply to Henri Sivonen (:hsivonen) from comment #31)
> 
> > In Hong Kong, big5-hkscs are still in use in public service such as book enquiry system of public libraries. 
> 
> URL? What label does that system use? (Considering that Big5-HKSCS
> implementations aren't currently consistent among browsers, it seems like a
> very bad idea for an actively maintained site to stick to Big5-HKSCS instead
> of migrating to UTF-8...)

It was reported by others in a discussion among HK Chinese users.

BTW, there are a number of examples as well. Hong Kong Observatory is still using Big5-HKSCS in their hourly weather report.

http://www.weather.gov.hk/textonly/forecast/chinesewx.htm

You may find more examples from:

http://www.w3.org/html/ig/zh/wiki/Big5-hkscs-vs-uao-in-hk
Environment: Windows 7, with "language for non-Unicode program" set to Chinese (Traditional, Hong Kong).

3 characters shown are among the most frequently used HKSCS chars. All pages listed do claim content charset = Big5. To determine how the chars are decoded, they are copied and pasted to Notepad++, then converted to hex to show their individual UTF-8 bytes.

IE shows all the correct glyphs, though the HKSCS chars are all mapped to Unicode PUA. WebKit-based browsers are consistent but fail to render with the correct font. I'm surprised at how Firefox arrives at U+57D7 for the 3rd test, since the moztw UAO table text file says it should map to U+E88C.
Comment on attachment 8356928 [details]
Comparison of real world data: HKSCS char decoding & rendering in 5 browsers in Win7

Please disregard the comparison image for now, looks like some websites generate different pages with different Accept-Language order.
Attachment #8356928 - Attachment is obsolete: true
The decoding of HKSCS characters in Firefox/IE/Webkit shows interesting results.
Result for content charsets (charset=Big5 and charset=Big5-HKSCS) are listed here.

Looks like IE is the only browser having done the unifying of both charsets. Result is identical for both charset names. All except 41 HKSCS characters are mapped to Unicode PUA.

Webkit maintains a separate decoder for each charset. The decoder for the charset=Big5 case is identical to IE's. The Big5HKSCS decoder closely resembles the official HKSCS mapping (even all astral chars are done), except for 36 chars (4 composite sequences + 32 BMP chars) which are mapped to PUA.

Suppose I don't need to say much about the Firefox Big5 decoders -- Big5 one decodes HKSCS chars to something else, and Big5HKSCS one is approximately 2004 version with astral characters mapped to PUA.
Here are approx steps to generate testing result:

(1) Use gen-hkscs-test-in-b5.pl to generate 2 HTML files (with different content charset, I'm lazy), then upload the HTML files to web server
(2) Cut and paste rendered page to Notepad++, and save them as UTF-8 text without BOM
    (results might need a little format fixes, some lacks ending LF at EOF)
(3) The official-2004.txt is stripped down version of http://moztw.org/docs/big5/table/hkscs2004.txt which in turn comes from Hong Kong Government. It contains official char list of HKSCS 2004.
(4) Use print-unify-result.pl to merge results together.
(In reply to Masatoshi Kimura [:emk] from comment #29)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #28)
> > With the current implementation:
> > 
> > 1. The three characters will be encoded to their HKSCS specified bytes, and
> > correct search result will return, e.g. "Search for 衞生署, got N results."
> 
> But doesn't it conflict with the expectation of Taiwan people that 0x8F 0xC0
> will be converted to "缳"?

It does, but in this case the mapping comes from UAO, which is falling out of use on the Web, so I am not too concerned if we no longer support it (and in return come up with a unified big5 decoder).

> We have decided that we will have only one Big5 converter, so the only
> neutral solution will be to encode the bytes to either PUA or NCR.

I would think so too, but again, such an encoder *might* provide a worse experience on sites that label themselves as big5-hkscs.

For sites that label themselves as Big5, the characters are already decoded incorrectly unless the user manually switches the encoding to big5-hkscs. The new unified implementation actually provides a better experience here.

> FYI, we Japanese people suffered from gazillions of EUC-JP dialects
> for a long time, so we have decided to just encode non-standard characters
> to NCR. At least Google understands the query.

Yeah, and arguably no (Unicode) information is lost in the process. Live sites that still want to use Big5-HKSCS instead of switching to UTF-8 could provide a band-aid fix in their server-side code.

(In reply to Henri Sivonen (:hsivonen) from comment #31)
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> comment #28)
> > 1. On a Big5-HKSCS encoded news website / archive
> 
> To what extent do Big5-HKSCS sites declare Big5-HKSCS?
> 
> > What is the original intentions to come up a combined, asymmetric Big5
> > encoding index in the Encoding Standard spec at first place?
> 
> I suppose Anne can speak about his intentions more precisely, but my
> understanding is this:
> 

Thanks, Henri, for the recap. I strongly recommend reading it if the rationale is not clear to you.

> On this basis, it seemed sensible to come up with a unified *decoder*.
> Since, as noted above, HK content can be labeled just "Big5", it doesn't
> seem useful to try to maintain separation on the *encoder* side, either.

Even though I translated the post in comment 28, I do not know what impact it would have on live Big5-HKSCS websites.

> > In Hong Kong, big5-hkscs are still in use in public service such as book enquiry system of public libraries. 
> 
> URL? What label does that system use? (Considering that Big5-HKSCS
> implementations aren't currently consistent among browsers, it seems like a
> very bad idea for an actively maintained site to stick to Big5-HKSCS instead
> of migrating to UTF-8...)
> 
> (In reply to Masatoshi Kimura [:emk] from comment #29)
> > so the only
> > neutral solution will be to encode the bytes to either PUA or NCR.
> 
> Having PUA mappings seems like a bad situation. I think our goal should not
> be neutrality when having to decide what to do with a byte sequence but
> maximal success given the legacy content that's out there.
> (Firefox/Chrome-specific HK intranets that use the label "Big5-HKSCS" are
> the probable losers here. We know of one such intranet existing.)

I agree. PUA mapping is bad, and the HK government fixed that a decade ago in their spec. We shouldn't support it anymore.

(In reply to Sammy Fung from comment #36)
> 
> BTW, there are a number of examples as well. Hong Kong Observatory is still
> using Big5-HKSCS in their hourly weather report.
> 
> http://www.weather.gov.hk/textonly/forecast/chinesewx.htm
> 
> You may find more examples from:
> 
> http://www.w3.org/html/ig/zh/wiki/Big5-hkscs-vs-uao-in-hk

Can we find any of these sites that still search with bytes in their databases?

I played around with the Hong Kong Observatory website: its search page is in UTF-8, and it could search its own Big5-HKSCS pages correctly.

Is comment 28 a real concern?
(In reply to Henri Sivonen (:hsivonen) from comment #31)
> But there are a couple of additional twists. First, IE doesn't properly
> expose the MS flavor of Big5-HKSCS, Windows-951, as an encoding with its own
> label. Instead, when a patch is applied to Windows, systems with the patch
> applied start silently treating Big5/Windows-950 as Big5-HKSCS/Windows-951.
> Having an encoding mean different things depending on whether the system has
> been patched seems like a terrible idea to me, but here we are.

That is entirely correct about CP951, up to the point in 2004 when Microsoft effectively threw up their hands and decided that all Big5-HKSCS users should migrate to Unicode instead. That may make sense, since the Big5-HKSCS 2004 characters were all included in Unicode 4.1 at that time ( http://www.microsoft.com/hk/hkscs/ ). The CP951 patch mentioned above, which implements Big5HKSCS 2001, only applies to Windows 2000 / XP.


> Philip Jägenstedt took a look at content that's out there
> (http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035370.html),
> and it seemed like it was possible to treat the actually used part of
> Big5-HKSCS as a superset of Big5 (as opposed to conflicting overlap). See
> comment 6.

One more twist on the Big5-HKSCS side: the HKSCS spec does indeed say it is a superset of Big5 that does not touch the core Big5 part at all, but it never says *which* Big5 is the basis. Merging them would prevent a problem seen on all Linux systems, where the two are actually incompatible because the Big5 converters received patches while the same changes were not applied to their Big5HKSCS counterparts.


> On this basis, it seemed sensible to come up with a unified *decoder*.
> Since, as noted above, HK content can be labeled just "Big5", it doesn't
> seem useful to try to maintain separation on the *encoder* side, either.

Is the encoder most frequently used for cases like sending content in a specific charset to a backend system such as a database? If necessary, I can do more testing on that as well.

Though I suppose the decoder is of higher importance here, since it affects the majority of browser users.


> > In Hong Kong, big5-hkscs are still in use in public service such as book enquiry system of public libraries. 
> 
> URL? What label does that system use? (Considering that Big5-HKSCS
> implementations aren't currently consistent among browsers, it seems like a
> very bad idea for an actively maintained site to stick to Big5-HKSCS instead
> of migrating to UTF-8...)

The URL is https://webcat.hkpl.gov.hk/?theme=WEB&locale=en ; however, it looks like it has actually migrated to UTF-8 recently, along with some other government departments. The Hong Kong public library web pages were still using Big5 as of August (according to the Internet Archive). I'd like to do all the blaming and finger-pointing at the Hong Kong Government, but it seems I can't now :)
(In reply to Henri Sivonen (:hsivonen) from comment #26)

I'll comment on the decoder part only here.

> NARROW consists of the UTF-16 code units for the index ranges
> 5024...11204 11254...18996
> WIDE consists of pairs for the index ranges
> 942...5023 11205...11213 18997...19781
> likewise packed contiguously

Is the intention to place the three Big5 private areas into the wide array
and the two Big5 defined areas into the char16_t array? If this is the case,
my calculation of the indexes is almost exactly the same. Is the start of the 4th
range (11254) a typo for 11214?


>  * For BMP characters, the character is in |first| and |second| is zero.
>  * For astral characters, the high surrogate is in |first| and the low
> surrogate in |second|.
>  * For the four base character plus combining mark sequences, the base
> character is in |first| and the combining character in |second|.
> 
> Implement the algorithm for getting from two bytes to index from the spec.
> Check which of the five ranges the index falls into, adjust by the
> appropriate offset to get an array offset and read from the appropriate
> array. If |second| in a pair is zero, ignore |second| and only use |first|.

For the 4 combining sequences, I wonder if mapping them to the PUA is a possibility,
considering that no font I have seen so far can display the resulting glyph
correctly (either the diacritical marks overlap, or the extra diacritical mark
runs off). But then it's not of high importance: personally I've never
seen those characters used beyond test pages, and it's more of a glyph
rendering problem than a decoding issue.
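
To make the index arithmetic concrete, here is a minimal Python sketch of the lookup described in the quoted comment 26. The pointer formula is the one from the Encoding Standard's Big5 decoder; the range lists are copied from the quote; the function names and the ("NARROW" | "WIDE", offset) return shape are assumptions for illustration, not the layout of the actual patch. As quoted, pointers 11214 to 11253 fall in neither list.

```python
# Inclusive index pointer ranges, copied from the quoted comment 26.
NARROW_RANGES = [(5024, 11204), (11254, 18996)]
WIDE_RANGES = [(942, 5023), (11205, 11213), (18997, 19781)]


def big5_pointer(lead, trail):
    """Index pointer for a lead/trail byte pair, or None if invalid."""
    if not 0x81 <= lead <= 0xFE:
        return None
    if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
        return None
    offset = 0x40 if trail < 0x7F else 0x62
    return (lead - 0x81) * 157 + (trail - offset)


def locate(pointer):
    """Map a pointer to ("NARROW" | "WIDE", array offset), else None.

    For WIDE the offset counts (first, second) pairs, not code units.
    """
    for name, ranges in (("NARROW", NARROW_RANGES), ("WIDE", WIDE_RANGES)):
        base = 0  # entries already packed before the current range
        for first, last in ranges:
            if first <= pointer <= last:
                return name, base + (pointer - first)
            base += last - first + 1
    return None
```

For example, the bytes 0xA4 0x40 give pointer 5495, which lands at offset 471 of the narrow array.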
(In reply to Henri Sivonen (:hsivonen) from comment #26)

For the encoder side, the algorithm sounds ok in general (without delving into efficiency questions). But there is one point I can't understand:

> Discard index entries below 5024 (HKSCS).

Why?
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from comment #41)
> (In reply to Masatoshi Kimura [:emk] from comment #29)
> > (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) (please ni?) from
> > comment #28)
> > > With the current implementation:
> > > 
> > > 1. The three characters will be encoded to their HKSCS specified bytes, and
> > > correct search result will return, e.g. "Search for 衞生署, got N results."
> > 
> > But doesn't it conflict with the expectation of Taiwan people that 0x8F 0xC0
> > will be converted to "缳"?
> 
> It is, but in this case, the mapping come from UAO and is falling out of use
> on the Web, so I am not too concerned we no longer support that (in return,
> come up with a unified big5 decoder).

Great. My understanding (again, Anne would need to confirm) is that UAO didn't seem to be in much use on the Web, so it seemed sane to interpret the Big5 byte sequences where HKSCS and UAO overlap as HKSCS.

> I agree. PUA mapping is bad and HK govt fix that a decade ago in their spec.
> We shouldn't support that anymore.

Great.

However, in the *encoder* it might make sense to support encoding the old PUA mappings to Big5 byte sequences in case input methods can produce PUA code points. Can they? That is, two Unicode code points, one PUA and one astral, would map to the same Big5 bytes. Anne, is there a reason why the Big5 *encoder* in the Encoding Standard doesn't know about the old PUA stuff? (There are astral mappings above the 0xA1 Big5 lead, after all.)

(In reply to Abel Cheung from comment #42)
> > On this basis, it seemed sensible to come up with a unified *decoder*.
> > Since, as noted above, HK content can be labeled just "Big5", it doesn't
> > seem useful to try to maintain separation on the *encoder* side, either.
> 
> Is encoder most frequently used for cases like sending content with specific
> charset to backend system such as database? If necessary, I can do more
> testing on that as well.

The encoder is used for submitting forms, so the main problem cases are:
 1) The form is a search query and the server uses the bytes as submitted by the browser to search a database of Big5-ish data.
 2) The form is a discussion forum posting form and the server takes the bytes as submitted by Firefox and echoes them to IE. If there are byte sequences that IE doesn't understand, sending NCRs instead might have worked better.
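
For readers unfamiliar with the NCR fallback mentioned in case 2, here is a minimal Python sketch. It uses Python's built-in big5 codec as a stand-in for a browser encoder (an assumption: its table is not the Encoding Standard index, and the function name is hypothetical). Characters the codec cannot map are replaced by decimal numeric character references instead of legacy bytes.

```python
def encode_form_value(text, label="big5"):
    """Encode text for submission, using NCRs for unmappable characters."""
    # "xmlcharrefreplace" turns each unmappable character into &#NNNN;
    return text.encode(label, errors="xmlcharrefreplace")
```

A server that echoes such bytes back to another browser sends plain ASCII for the unmappable parts, which is why NCRs can behave better than byte sequences the other browser does not understand.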

> The URL is https://webcat.hkpl.gov.hk/?theme=WEB&locale=en , however looks
> like it has actually migrated to UTF-8 recently, along with some other
> government departments.

Nice!

(In reply to Abel Cheung from comment #43)
> (In reply to Henri Sivonen (:hsivonen) from comment #26)
> 
> I'll comment on the decoder part only here.
> 
> > NARROW consists of the UTF-16 code units for the index ranges
> > 5024...11204 11254...18996
> > WIDE consists of pairs for the index ranges
> > 942...5023 11205...11213 18997...19781
> > likewise packed contiguously
> 
> Is the intention to place the three Big5 private areas into wide arrays
>  and two Big5 defined areas into char16_t arrays? If this is the case,
> my calculation of indexes is almost exactly the same. Is the start of 4th
> range (11254) a typo of 11214?

Not a typo. http://encoding.spec.whatwg.org/index-big5.txt is discontinuous at that point.

The relevant lines in the index are:
11213	0x27607	
Flags: needinfo?(annevk)
(In reply to Henri Sivonen (:hsivonen) from comment #45)
> > I agree. PUA mapping is bad and HK govt fix that a decade ago in their spec.
> > We shouldn't support that anymore.
> 
> However, in the *encoder* it might make sense to support encoding the old
> PUA mappings to Big5 byte sequences in case input methods can produce PUA
> code points. Can they? That is, two Unicode code points, one PUA and one
> astral, would map to the same Big5 bytes. Anne, is there a reason why the
> Big5 *encoder* in the Encoding Standard doesn't know about the old PUA
> stuff? (There are astral mappings above the 0xA1 Big5 lead, after all.)

FYI, there is a spec containing a well-defined mapping between PUA code points,
official Unicode code points and Big5 bytes.

Text version: http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/2003cmp_2008.txt
PDF version: http://www.ogcio.gov.hk/en/business/tech_promotion/ccli/terms/doc/e_annex1_2008.pdf

The PDF version contains extra characters that the Hong Kong government failed to submit to ISO 10646, so the text version is the right one to implement if that is what gets decided.


> > my calculation of indexes is almost exactly the same. Is the start of 4th
> > range (11254) a typo of 11214?
> 
> Not a typo. http://encoding.spec.whatwg.org/index-big5.txt is discontinuous
> at that point.
> 
> The relevant lines in the index are:
> 11213	0x27607

You're right, I skimmed through that part too quickly.
Yes, Philip Jägenstedt and I researched UAO usage. Combined with non-support from all other browsers that hold equivalent or better market share, we determined UAO to be bad from an interoperability point of view.

Philip's research is located here: https://gitorious.org/whatwg/big5 Searching for his full name and "big5" will also yield emails containing other details.

I do not think we looked into supporting PUA for the encoder.
Flags: needinfo?(annevk) → needinfo?(philip)
I've been following this thread with some interest, since I researched how to decode Big5 in April 2012. The Git repository contains all of the scripts used, but some of the mails I wrote are easier to follow:

http://lists.w3.org/Archives/Public/public-whatwg-archive/2012Apr/0082.html
http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0046.html

Some key quotes:

"Not treating big5 and big5-hkscs as aliases is clearly breaking pages, so I would recommend a single mapping for both."

"Using Big5-HKSCS would be a net improvement for Hong Kong sites."

"Using Big5-UAO for Taiwanese sites would give mixed results. Correctly encoded Big5-UAO is very rare, so the tested mapping (Firefox) introduces almost as many user-visible misencodings as it fixes and masks many others."

Obviously, some content will be decoded differently if Firefox aligns with the spec, some better and some worse. However, so far I haven't seen any suggestion put forward for a mapping, or separate mappings with whatever heuristic to switch between them, that would break less content than the spec.

For the encoder, the question is whether or not the "Avoid emitting Hong Kong Supplementary Character Set extensions literally" step is warranted. I haven't done any research on this, but "Maybe we could try a symmetric encoder instead and see what breaks" sounds very appealing. If it does break something, then the second best thing would be skipping the step for Big5-HKSCS-labeled content.
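
As a minimal Python sketch of what that step changes: the byte arithmetic below follows the Encoding Standard's Big5 encoder, and skipping pointers below 5024 (lead bytes under 0xA1) is the "avoid emitting HKSCS extensions" behavior, while skip_hkscs=False gives the symmetric variant. The function name, the dict-based index and the three-entry excerpt are assumptions for illustration (the pointer 942 entry in particular is assumed), and the spec's special case that picks the last pointer for a few code points is left out.

```python
HKSCS_LIMIT = (0xA1 - 0x81) * 157  # 5024: first pointer with lead >= 0xA1


def encode_big5(code_point, index, skip_hkscs=True):
    """Return the two Big5 bytes for code_point, or None if unmappable.

    `index` maps pointer -> code point. With skip_hkscs=True, pointers
    below HKSCS_LIMIT are ignored, so those byte pairs are never emitted.
    """
    candidates = [
        pointer
        for pointer, value in index.items()
        if value == code_point and not (skip_hkscs and pointer < HKSCS_LIMIT)
    ]
    if not candidates:
        return None  # a caller would fall back to an NCR here
    pointer = min(candidates)
    lead = pointer // 157 + 0x81
    trail = pointer % 157
    offset = 0x40 if trail < 0x3F else 0x62
    return bytes([lead, trail + offset])


# Tiny excerpt for illustration only; the pointer 942 entry is an assumption.
SAMPLE_INDEX = {942: 0x43F0, 5495: 0x4E00, 11213: 0x27607}
```

With this excerpt, U+27607 (pointer 11213) still encodes to 0xC8 0xA4 because its lead byte is above 0xA1, while the pointer 942 entry only encodes when skip_hkscs is False.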

Regarding PUA in the encoder, I'm not sure what problem that's trying to solve. How do the PUA code points get into the textfield or whatever is being encoded? Copy-pasted or from an IME, maybe? Just not doing that and seeing what breaks sounds reasonable, but then Gecko isn't my browser engine to break...
Flags: needinfo?(philip)
(In reply to philip from comment #48)
> to decode Big5 in April 2012. The Git repository contains all of the scripts
> used, but some of the mails I wrote are easier to follow:
> 
> http://lists.w3.org/Archives/Public/public-whatwg-archive/2012Apr/0082.html
> http://lists.w3.org/Archives/Public/public-html-ig-zh/2012Apr/0046.html

I have been digging feverishly these days into the tools and research you all did back then. I hope I'm not too late to the party.


> Obviously, some content will be decoded differently if Firefox aligns with
> the spec, some better and some worse. However, so far I haven't seen any
> suggestion put forward for a mapping, or separate mappings with whatever
> heuristic to switch between them, that would break less content than the spec.

That's hard. On the W3C mailing list you will have already noticed that people had to rely on sentence context to resolve some of the ambiguous cases. I have seen a discussion in the Mozilla Taiwan forum suggesting constructing a word list and decoding twice (UAO and HKSCS) to see which matches. While that may make sense as an extension, I suspect it's a bad idea to put that inside the core decoding engine.


> Regarding PUA in the encoder, I'm not sure what problem that's trying to
> solve. How do the PUA code points get into the textfield or whatever is
> being encoded? Copy-pasted or from an IME, maybe? Just not doing that and
> seeing what breaks sounds reasonable, but then Gecko isn't my browser engine
> to break...

My bet is that the majority of PUA usage is due to IMEs. For example, while Windows 7 has native support for HKSCS characters in its IME tables, few people are aware of the setting, not to mention that the setting is in fact affected by a showstopping bug, so people end up downloading 3rd-party IMEs (some of which simply abandoned HKSCS characters). And the IME I'm using on Linux emits PUA code points for 100% of plane 2 characters.

However, it might be better to do some research on how widely PUA usage has spread. This could be done in a similar fashion to your crawl of web pages in 2012. Yet this topic is starting to grow out of the scope of this Bugzilla report. Maybe the discussion of whether to include PUA could continue on the html-ig-zh list?
(In reply to Abel Cheung from comment #49)
> I have
> seen discussion forum in mozilla taiwan suggesting to construct a word list,
> and try decoding twice (UAO and HKSCS) to see which matches. While that may
> make sense as an extension, I suspect it's a bad idea seeing that inside
> core decoding engine.

We're not going to have heuristics like that in Gecko.

> My bet is, majority of PUA usage is due to IME. For example, while Windows 7
> has native support of HKSCS characters in IME tables, few are aware of the
> setting, not to mention the setting is in fact affected by showstopping bug,
> so people end up downloading other 3rd party IMEs (some simply abandoned
> HKSCS characters). And the IME I'm using on Linux emits PUA code for 100% of
> plane 2 characters.

That's bad. Addressing that problem in the Big5 encoder would be the wrong solution, though, since PUA-using data would continue to spread via UTF-8 and GB18030. If we want to address the problem, we should address it closer to the IME integration.

At this point, it looks pretty clear that we should implement the decoder as specified in the Encoding Standard both for Big5-labeled and Big5-HKSCS-labeled pages.

For the encoder, we have three options:

1) Treat Big5 and Big5-HKSCS as one encoding, never emit HKSCS byte sequences in the encoder.

2) Treat Big5 and Big5-HKSCS as one encoding, emit HKSCS byte sequences in the encoder.

3) Treat Big5 and Big5-HKSCS as separate encodings that decode alike, make the encoder for Big5 never emit HKSCS byte sequences (not UAO either) and make the encoder for Big5-HKSCS emit HKSCS byte sequences.

Bug 310299 comment 0 goes against #2. How does the encoder behave in IE?
IE only has big5 support, not big5-hkscs.
(In reply to Anne (:annevk) from comment #51)
> IE only has big5 support, not big5-hkscs.

OK. Let's implement option #1 from comment 50, which is what the current draft of the Encoding Standard requires.

emk, are you interested in implementing this?
Flags: needinfo?(VYV03354)
I called a quick and small Mozilla community meeting last night in Hong Kong to discuss this bug with Abel Cheung and others.

For Hong Kong, the argument is that Big5-HKSCS has been submitted to ISO 10646, so it is a standard, whereas Big5-uao isn't a global standard. If developers prefer to follow the standard, Big5-HKSCS should be fully supported instead of Big5-uao. Big5-HKSCS is still in popular use on Hong Kong websites.

As previously discussed in this bug, it is known that there are conflicts between different Chinese character encodings, e.g. Big5-HKSCS, Big5-uao, GB18030, etc. So merging different Chinese character encodings into one single encoder/decoder can't solve all cases.

And lastly, at the meeting last night, we thought that other browsers will look at and follow what Firefox does in this case/bug. So it may make the situation worse.

But it is a good alert to Hong Kong that web browsers won't support Big5-HKSCS anymore in the future. This is one of our conclusions from the meeting last night.

And I should thank Abel Cheung for looking at the different Chinese character encodings and fonts over the past few weeks, and the local Mozillians in Hong Kong for their other contributions.
(In reply to Sammy Fung from comment #53)
> As previous discussion at this bug, it is known that it is some conflict
> between different Chinese character coding, eg. Big5-HKSCS, Big5-uao,
> GB18030, etc. So, merging different Chinese character coding into one single
> encoder/decoder isn't possible to solve all cases.

Our (Japanese) experience has shown that adding more encodings will add more problems than it solves. Honestly, I have half a mind to disagree with the Encoding Standard inventing yet another set of character encoding dialects which is the perfect example of <https://xkcd.com/927/>.

> But it is a good alert to Hong Kong that web broswers won't support
> Big5-HKSCS anymore in the future.

To be precise, browsers have never supported Big5-HKSCS strictly as defined in the ISO standard (e.g. we have never supported astral code points), just like SGML-based HTML, ISO HTML, XHTML, and so on.
(In reply to Masatoshi Kimura [:emk] from comment #54)
> Our (Japanese) experience has shown that adding more encodings will add more
> problems than it solves. Honestly, I have half a mind to disagree with the
> Encoding Standard inventing yet another set of character encoding dialects
> which is the perfect example of <https://xkcd.com/927/>.

I am afraid the Japanese encoding experience doesn't apply to Chinese encodings, because Chinese encodings are much more complicated from the angle of standards/schemas, politics, culture, etc. Japan is one country with no difference in how the language is written or used across the country, but mainland China, Taiwan and Hong Kong are a completely different case, which non-Chinese people don't easily understand.

> In precise, browsers have never supported Big5-HKSCS strictly as defined in
> the ISO standard (e.g. we have never supported astral code points), just
> like SGML-based HTML, ISO HTML, XHTML, and so on.

That's right, but the current proposal will make it worse. ;)
(In reply to Sammy Fung from comment #55)
> I am afraid, I think Japanese encoding experience doesn't apply to Chinese
> encoding because Chinese encoding is much complicated from the angle of
> standards/schema, politics, culture, etc. Japan is a country and no
> difference on language writing/use in whole country, but mainland China,
> Taiwan and Hong Kong are completely different case which non-Chinese doesn't
> understand it easily.

Then adding a new Chinese encoding will add even more complicated problems than adding a Japanese one.

> > In precise, browsers have never supported Big5-HKSCS strictly as defined in
> > the ISO standard (e.g. we have never supported astral code points), just
> > like SGML-based HTML, ISO HTML, XHTML, and so on.
> 
> It's right, but current proposal will make it worse. ;)

Do you have any concrete example of Hong Kong pages which will be broken by the new Big5 decoder?
FYI the new decoder would have at least one improvement. It will support astral code points for Big5-HKSCS.
Flags: needinfo?(VYV03354)
(In reply to Sammy Fung from comment #53)
> As previous discussion at this bug, it is known that it is some conflict
> between different Chinese character coding, eg. Big5-HKSCS, Big5-uao,
> GB18030, etc. So, merging different Chinese character coding into one single
> encoder/decoder isn't possible to solve all cases.

This is misleading. Browsers support both big5 and gb18030. There is no proposal to merge these. big5-uao is only supported by Gecko. It is not helping improve the situation for web developers or interoperability among user agents.


> And lastly, in meeting last night, we think that other browser will look and
> follow what does Firefox do in this case/bug. So, it may makes the case
> worse.

No, we will align with other browsers (by not having big5-uao) and provide more compatibility with HKSCS than we have today. This should improve the situation.


> But it is a good alert to Hong Kong that web broswers won't support
> Big5-HKSCS anymore in the future. This is one of our conculsions from the
> meeting last night.

They only support it if you explicitly declare it, which most sites don't do today. Going forward we will actually decode existing HKSCS content (typically labeled with "big5") better.
Adding Anthony, who contributed the Big5-HKSCS converters to Gecko back in the early 2000s.
I'm planning to work on this.
Assignee: nobody → hsivonen
Attached patch Reimplement big5 decoding (obsolete)
TODO: Unit tests, encoder
nsIUnicodeDecoder documentation says:
"When a decoding error is returned to the caller, it is the caller's responsibility to advance over the bad byte (unless aSrcLength is -1 in which case the caller should call the decoder with 0 offset again) and reset the decoder before trying to call the decoder again."

emk, do we still have callers that set kOnError_Signal and still expect to continue decoding? That is, could we change the API to say that the caller either has to request the converter not to report errors or has to stop decoding upon receiving an error?
Flags: needinfo?(VYV03354)
Attached patch Reimplement big5 decoding with tests (obsolete)
This patch now has more compact data and test cases. However, I haven't run the test cases yet, because running them locally is broken at the moment. Let's see what happens with the try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d9b3097f28b8

Again, this patch lacks the encoding, which is necessary before the decoder can land.
Attachment #8617916 - Attachment is obsolete: true
Attachment #8622966 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) from comment #63)
> Again, this patch lacks the encoding, which is necessary before the decoder
> can land.

s/encoding/encoder/
When testing on an Ubuntu system that was originally installed with a non-Chinese language choice, it's a good idea to manually install the fonts-arphic-uming and fonts-arphic-ukai packages before testing.
(In reply to Henri Sivonen (:hsivonen) from comment #62)
> emk, do we still have callers that set kOnError_Signal and still expect to
> continue decoding? That is, could we change the API to say that the caller
> either has to request the converter not to report errors or has to stop
> decoding upon receiving an error?

nsConverterInputStream and the legacy parser still depend on this flag to customize the replacement character.
Flags: needinfo?(VYV03354)
(In reply to Masatoshi Kimura [:emk] from comment #67)
> (In reply to Henri Sivonen (:hsivonen) from comment #62)
> > emk, do we still have callers that set kOnError_Signal and still expect to
> > continue decoding? That is, could we change the API to say that the caller
> > either has to request the converter not to report errors or has to stop
> > decoding upon receiving an error?
> 
> nsConverterInputStream and the legacy parser still depends on this flag to
> customize the replacement character.

OK. Thanks.

What methodology was used to generate http://mxr.mozilla.org/mozilla-central/source/dom/encoding/test/unit/test_big5.js ? Should I just take the actual output, trust that it's right given that it looks right, and use it as the new expectation?
Flags: needinfo?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #69)
> What methodology was used to generate
> http://mxr.mozilla.org/mozilla-central/source/dom/encoding/test/unit/
> test_big5.js ? Should I just take the actual output, trust that it's right
> given that it looks right, and use it as the new expectation?

I imported the test from [1]. (Surprisingly, the test didn't fail on our old big5 decoder.)
Probably we should update the test to the latest one.

[1] https://github.com/inexorabletash/text-encoding
Flags: needinfo?(VYV03354)
Attached patch Reimplement big5 decoding (obsolete)
(In reply to Masatoshi Kimura [:emk] from comment #70)
> (In reply to Henri Sivonen (:hsivonen) from comment #69)
> > What methodology was used to generate
> > http://mxr.mozilla.org/mozilla-central/source/dom/encoding/test/unit/
> > test_big5.js ? Should I just take the actual output, trust that it's right
> > given that it looks right, and use it as the new expectation?
> 
> I imported the test from [1]. (Surprisingly, the test didn't fail on our old
> big5 decoder.)
> Probably we should update the test to the latest one.
> 
> [1] https://github.com/inexorabletash/text-encoding

It still passes now that I fixed the new decoder. Clearly, it can't be testing the interesting code points. (I added tests for interesting code points to test_TextDecoder.js.)

The reason for the failure was that TextDecoder.cpp appended a U+FFFD, because it wanted to see NS_OK instead of NS_PARTIAL_MORE_INPUT when the input doesn't end in the middle of an input code unit sequence. I took a look at the UTF-8 decoder and saw this behavior indeed being implemented there. I documented this in nsIUnicodeDecoder.h.
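
The end-of-input rule can be illustrated outside Gecko. The sketch below uses Python's built-in big5 incremental decoder as a stand-in (an assumption: it is not nsIUnicodeDecoder, and the NS_* status codes have no equivalent here; the function name is hypothetical). A lead byte with no trail byte yields U+FFFD only when the stream really ends there, while input that ends on a character boundary yields no extra U+FFFD.

```python
import codecs


def decode_stream(chunks, label="big5"):
    """Decode byte chunks, flushing the decoder only at the very end."""
    decoder = codecs.getincrementaldecoder(label)(errors="replace")
    out = [decoder.decode(chunk) for chunk in chunks]
    # final=True tells the decoder that no more input will follow.
    out.append(decoder.decode(b"", final=True))
    return "".join(out)
```

Splitting a two-byte sequence across chunks gives the same result as passing it in one piece; only a dangling lead byte at the true end of the stream is replaced.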

Here's a new, hopefully final, patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=def5754dcba9
Attachment #8622992 - Attachment is obsolete: true
Comment on attachment 8623620 [details] [diff] [review]
Reimplement big5 decoding

(This can't land without a new encoder. I intend to write that one as well.)
Attachment #8623620 - Flags: review?(VYV03354)
Comment on attachment 8623620 [details] [diff] [review]
Reimplement big5 decoding

Review of attachment 8623620 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the following GetMaxLength problem fixed.

::: intl/uconv/ucvtw/nsBIG5ToUnicode.cpp
@@ +30,5 @@
> +  if (mPendingTrail) {
> +    if (out == outEnd) {
> +      *aSrcLength = 0;
> +      *aDestLength = 0;
> +      return NS_PARTIAL_MORE_OUTPUT;

NS_PARTIAL_* is deprecated.
https://mxr.mozilla.org/mozilla-central/source/xpcom/base/ErrorList.h#424
Please use NS_OK_UDEC_* instead.

@@ +154,5 @@
> +                              int32_t* aDestLength)
> +{
> +  // The length of the output in UTF-16 code units never exceeds the length
> +  // of the input in bytes.
> +  *aDestLength = aSrcLength;

The output length may be one more than the input length if mBig5Lead is non-zero and [mBig5Lead, the first input] is converted to an astral character, or if mPendingTrail is non-zero. (This kind of issue actually caused a security bug in the past.)
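
A toy model of the first case, as a hedged Python sketch: ToyDecoder and its one-entry table are hypothetical (the entry reuses the 11213 → 0x27607 index line quoted earlier in this bug, which corresponds to the bytes 0xC8 0xA4), and max_utf16_length shows the bound this review comment implies rather than the code that landed.

```python
def max_utf16_length(src_length):
    """Upper bound on UTF-16 code units produced by one convert call."""
    # +1 covers a carried-over lead byte whose pair decodes to a surrogate
    # pair, or a pending second code unit left over from the previous call.
    return src_length + 1


class ToyDecoder:
    """Hypothetical streaming decoder with a single table entry."""

    TABLE = {(0xC8, 0xA4): "\U00027607"}  # pointer 11213 in the index

    def __init__(self):
        self.lead = 0  # plays the role of mBig5Lead

    def convert(self, data):
        out = []
        for byte in data:
            if self.lead:
                out.append(self.TABLE.get((self.lead, byte), "\ufffd"))
                self.lead = 0
            elif byte >= 0x81:
                self.lead = byte  # wait for the trail byte
            else:
                out.append(chr(byte))
        return "".join(out)


def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode("utf-16-le")) // 2
```

Feeding b"\xc8" and then b"\xa4" in separate calls makes the second call emit two code units from a single input byte, which is exactly the case where a bound equal to the input length is one too small.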
Attachment #8623620 - Flags: review?(VYV03354) → review+
Blocks: 1200152
Thanks. Attaching the patch with review comments addressed for the record.
Attachment #8623620 - Attachment is obsolete: true
Attachment #8655910 - Flags: review+
As for encoding:

 * The API has no explicit support for malformed input (unpaired surrogates). Some existing encoders report unpaired surrogates as unmappable, some generate a replacement without reporting. emk, what should be done here?

 * The API is unclear on the behavior of the "consumed" number of code units when a character is unmappable. The existing code is remarkably unclear on what behavior is expected. emk, do you happen to recall what the expected behavior is or should I just take a harder look at the unclear code?
Flags: needinfo?(VYV03354)
(In reply to Henri Sivonen (:hsivonen) from comment #75)
> As for encoding:
> 
>  * The API has no explicit support for malformed input (unpaired
> surrogates). Some existing encoders report unpaired surrogates as
> unmappable, some generate a replacement without reporting. emk, what should
> be done here?

Replying to self: Report as unmappable to give the caller the *opportunity* to do the right thing. (Whether nsSaveAsCharset does the right thing is a follow-up investigation.)
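To illustrate what "unpaired surrogate" means for the encoder input, here is a small sketch. The function name is hypothetical and this is not the Gecko encoder code; it only locates the first code unit that an encoder following the decision above would report as unmappable. It also simplifies by treating a lead surrogate at the very end of the buffer as unpaired, whereas a streaming encoder would normally hold it until the next call.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: index of the first unpaired surrogate in a UTF-16
// buffer, or `length` if every surrogate is part of a valid pair.
size_t FirstUnpairedSurrogate(const char16_t* units, size_t length)
{
  for (size_t i = 0; i < length; ++i) {
    char16_t u = units[i];
    if (u >= 0xD800 && u <= 0xDBFF) {
      // Lead surrogate: valid only if immediately followed by a trail.
      if (i + 1 < length && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
        ++i;  // skip the trail of a well-formed pair
        continue;
      }
      return i;
    }
    if (u >= 0xDC00 && u <= 0xDFFF) {
      return i;  // trail surrogate with no preceding lead
    }
  }
  return length;
}
```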
Attached patch Part 2: Encoder WIP (obsolete)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=844cac0b49b1
Attachment #8655910 - Attachment is obsolete: true
Attachment #8657244 - Flags: review+
Attached patch Part 2: Encoder WIP, v2 (obsolete)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=80e8714de52e
Attachment #8657069 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) from comment #75)
>  * The API is unclear on the behavior of the "consumed" number of code units
> when a character is unmappable. The existing code is remarkably unclear on
> what behavior is expected. emk, do you happen to recall what the expected
> behavior is or should I just take a harder look at the unclear code?

https://mxr.mozilla.org/mozilla-central/source/dom/base/nsDocumentEncoder.cpp#675 strongly suggests that the encoding API does not work like the decoding API...
(In reply to Henri Sivonen (:hsivonen) from comment #81)
> (In reply to Henri Sivonen (:hsivonen) from comment #75)
> >  * The API is unclear on the behavior of the "consumed" number of code units
> > when a character is unmappable. The existing code is remarkably unclear on
> > what behavior is expected. emk, do you happen to recall what the expected
> > behavior is or should I just take a harder look at the unclear code?
> 
> https://mxr.mozilla.org/mozilla-central/source/dom/base/nsDocumentEncoder.
> cpp#675 strongly suggests that the encoding API does not work like the
> decoding API...

Experimentally verified that the encode API indeed does not work the same way as the decode API!
Flags: needinfo?(VYV03354)
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/3627
Oops. I previously addressed the review comments by editing a stale version of the patch. That explained the try failures. Now addressing the review comments by first downloading the right patch from Bugzilla.
Attachment #8657244 - Attachment is obsolete: true
Attachment #8657469 - Flags: review+
Attachment #8655910 - Flags: review+ → review-
Attachment #8657244 - Flags: review+ → review-
Attached patch Part 2: Reimplement Big5 encoder, v3 (obsolete)
TODO:
 * Mochitestify the encoder test case from comment 83.
 * Remove the expected failure annotations from Web Platform Tests, since the tests now pass.
Attachment #8657247 - Attachment is obsolete: true
Try results for reference: https://treeherder.mozilla.org/#/jobs?repo=try&revision=82bc5faa48de
Attached patch Part 2: Reimplement Big5 encoder, v4 (obsolete)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=57186d004d9b
Attachment #8657471 - Attachment is obsolete: true
OS: Mac OS X → All
Hardware: x86 → All
Once these changes make into a Firefox release, I suggest including the following in the release notes:

Made support for legacy-encoded Traditional Chinese pages (Big5, including HKSCS) more compatible with other browsers and with the fonts and input methods of Windows Vista and later, Mac OS X and Linux systems. Users of Windows XP who read content that might include Hong Kong supplementary characters and who are unable to upgrade away from Windows XP should install <a href="https://www.microsoft.com/en-us/download/details.aspx?id=10109">Windows XP Font Pack for ISO 10646:2003 + Amendment 1 Traditional Chinese Support</a>.

(End suggested release note.)

For people who come here for regressions, here's an explanation of what these patches do, why they make things better and what might break:

=Changes=

==Big5-HKSCS and Big5 are no longer treated as distinct encodings==

Windows and, by extension, IE have had a strange way of supporting Big5-HKSCS: Users would install a patch to change Big5 into Big5-HKSCS. Thus, Web pages labeled as "big5" (not "big5-hkscs") but using HKSCS byte sequences would work in IE for users who have installed the patch.

*By supporting HKSCS byte sequences on pages labeled "big5", we gain compatibility with such IE-oriented legacy content.*

When the big5 decoder supports HKSCS byte sequences, "big5-hkscs" can become a mere alias (label) for big5.

==Unicode-compliant code points used instead of the Unicode Private Use Area==

Back when support for HKSCS was first implemented in Gecko and back when support for HKSCS was first implemented in Windows, many of the characters in HKSCS lacked corresponding characters in Unicode. To work around this, the byte sequences for these characters were mapped to the Unicode Private Use Area. By Unicode 4.1, these characters had gotten official code points in Unicode and Windows Vista started using these new official Unicode code points. Mac and Linux migrated to the official Unicode code points, too.

*By using the official Unicode code points, Firefox becomes more compatible with the current operating systems and stops perpetuating the use of the Private Use Area for public interchange.*

==Even on pages labeled "big5-hkscs", the encoder doesn't produce HKSCS byte sequences==

To avoid sending HKSCS byte sequences in form submissions to servers that don't expect them and cannot deal with them, the encoder (there is only one, since "big5-hkscs" is now an alias for big5 for reasons given above) never produces HKSCS byte sequences. This means that HKSCS characters are submitted as HTML decimal character references, e.g. &#123456;. Since operating systems other than Windows XP have had input methods that produce the official Unicode code points instead of the Private Use Area code points for HKSCS characters and Firefox has been unable to encode the official code points even when the page has been declared "big5-hkscs", even Web sites declared as "big5-hkscs" must have been receiving HKSCS characters submitted as HTML decimal character references in form submissions for many years.

*For this reason, encoding HKSCS characters as HTML decimal character references in form submissions is not a new phenomenon for sites that accept form submissions from pages declared as "big5-hkscs" and it seems reasonable not to have an encoder mode that would produce HKSCS byte sequences.*
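For readers unfamiliar with the fallback mentioned above, a decimal character reference is just the code point's decimal value wrapped in "&#" and ";". The following is an illustrative sketch with a hypothetical helper name; it is not the code path Firefox uses for form submission.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: format a Unicode code point as an HTML decimal
// character reference, the form in which unmappable characters end up
// in a form submission.
std::string DecimalCharacterReference(uint32_t codePoint)
{
  assert(codePoint <= 0x10FFFF);
  return "&#" + std::to_string(codePoint) + ";";
}
```

For example, a supplementary-plane code point such as U+20021 would be submitted as the ASCII text `&#131105;`, which any Big5 server can carry even if it cannot interpret it.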

=Potential breakage=

==Breaking Firefox-only Taiwanese pages==

Previously, Firefox, and Firefox alone, supported a big5 extension called Unicode-at-on that conflicts with HKSCS. Therefore, it is possible that there exist Firefox-only Taiwanese pages that made use of the Unicode-at-on feature. However, we should expect this to be rare, since the market share of Firefox has never been high in Taiwan and it was a Firefox-only feature.

*This change intentionally trades off compatibility with potentially-existing Taiwanese Firefox-only content to gain compatibility with IE-only Hong Kong content.* (We prefer interoperability over Firefox-only features.)

==Breaking rendering on Windows XP==

Since Windows XP didn't originally ship with fonts that support all the Unicode code points that the new big5 decoder can emit and instead mapped the corresponding glyphs to Private Use Area code points, rendering on Windows XP can break for users who haven't followed the advice given in the proposed release note at the start of this Bugzilla comment. Since Windows XP is no longer supported by Microsoft and since Microsoft produced a patch (linked from the proposed release note) that backports the Traditional Chinese fonts from Windows 7 to XP, it seems reasonable to cater to standards and to more recent operating systems and to require XP users to install the patch from Microsoft.

==Breaking form submissions on XP-only intranets in Hong Kong==

It is possible that there exist intranets in Hong Kong where 1) intranet sites accept form submissions, 2) the pages that contain the submitted forms are labeled as "big5-hkscs", 3) users actually need to enter HKSCS characters into the forms and 4) only Windows XP computers access them.

In this scenario, the patches here could break the use of HKSCS characters in such an XP-only intranet case, since previously Firefox produced HKSCS byte sequences for the kind of Private Use Area characters that the Hong Kong Traditional Chinese input method on Windows XP would produce.

However, intranets that have had Firefox running on systems other than XP must have encountered HKSCS encoded as HTML decimal character references already, since the input methods on more recent systems generate the official Unicode code points that Firefox was already previously unable to encode as HKSCS byte sequences.
Whiteboard: [If you believe this change regressed something, please read comment 88 before replying.]
Attached patch Part 2: Reimplement Big5 encoder, v5 (obsolete)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fe32cbb52e8

Made a copy-paste error in the Web Platform Test expectations.
Attachment #8657726 - Attachment is obsolete: true
Comment on attachment 8657777 [details] [diff] [review]
Part 2: Reimplement Big5 encoder, v5

The mochitest is not granular, because the Web Platform Test is the real test. The mochitest is there only because our URL implementation isn't compliant yet in the case of unmappables.
Attachment #8657777 - Flags: review?(VYV03354)
Oh, and the encoder deliberately uses the decode-optimized data structure on the assumption that non-UTF-8 encode doesn't need to be super-fast, so it makes sense to optimize the binary size instead.
Whiteboard: [If you believe this change regressed something, please read comment 88 before replying.] → [If you believe this change regressed something, please read comment 88 about intentional changes and patchable XP breakage.]
Comment on attachment 8657777 [details] [diff] [review]
Part 2: Reimplement Big5 encoder, v5

Review of attachment 8657777 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/uconv/tools/gen-big5-data.py
@@ +121,4 @@
>  
>  // static
>  char16_t
> +nsBIG5Data::LowBits(size_t aPointer)

Why don't you use uint32_t? (Same for all occurrences of size_t)

::: intl/uconv/ucvtw/nsBIG5Data.h
@@ +3,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#ifndef nsBIG5Data_h___
> +#define nsBIG5Data_h___

Do not use double underscores; identifiers containing them are always reserved by the C++ spec.

::: intl/uconv/ucvtw/nsUnicodeToBIG5.cpp
@@ +85,5 @@
> +        }
> +        *out++ = '?';
> +        continue;
> +      }
> +      size_t codePoint = (mUtf16Lead << 10) + codeUnit - 56613888;

Please use
  (((0xD800 << 10) - 0x10000) + 0xDC00)
instead of the cryptic magic number. Compilers should be smart enough to fold this.
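As a check on the arithmetic, the suggested expression can be written as a named constant and verified against the original literal at compile time. The constant and function names below are hypothetical and not taken from the patch.

```cpp
#include <cstdint>

// Subtracting this one offset removes the 0xD800 lead base (shifted by 10),
// removes the 0xDC00 trail base, and adds the 0x10000 start of the
// supplementary planes.
constexpr uint32_t kSurrogateOffset = ((0xD800u << 10) - 0x10000u) + 0xDC00u;
static_assert(kSurrogateOffset == 56613888u,
              "expression equals the literal used in the reviewed patch");

// Hypothetical helper: combine a valid lead/trail surrogate pair into a
// code point. The caller must have validated both code units.
constexpr uint32_t CombineSurrogates(uint16_t lead, uint16_t trail)
{
  return (static_cast<uint32_t>(lead) << 10) + trail - kSurrogateOffset;
}
```

Because every operand is a constant, the compiler folds the expression, so the readable form costs nothing at run time.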

::: testing/web-platform/meta/encoding/big5-encoder.html.ini
@@ +1,1 @@
> +[big5-encoder.html]

Why do these tests fail?
(In reply to Masatoshi Kimura [:emk] from comment #92)
> Why don't you use uint32_t? (Same for all occurrences of size_t)

I'm assuming that using register-sized types on 64-bit CPUs has a non-zero perf benefit thanks to the avoidance of useless truncation. I don't know if this is actually a reasonable perf trick or superstition. That is, I didn't actually check if it makes a real difference in this case. (But IIRC, it did in the past in other Gecko code.)

> ::: intl/uconv/ucvtw/nsBIG5Data.h
> @@ +3,5 @@
> > + * License, v. 2.0. If a copy of the MPL was not distributed with this
> > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> > +
> > +#ifndef nsBIG5Data_h___
> > +#define nsBIG5Data_h___
> 
> Do not use double underscores; identifiers containing them are always reserved by the C++ spec.

Oops. Copy-paste from ancient code. I'll fix.

> ::: intl/uconv/ucvtw/nsUnicodeToBIG5.cpp
> @@ +85,5 @@
> > +        }
> > +        *out++ = '?';
> > +        continue;
> > +      }
> > +      size_t codePoint = (mUtf16Lead << 10) + codeUnit - 56613888;
> 
> Please use
>   (((0xD800 << 10) - 0x10000) + 0xDC00)
> instead of the cryptic magic number. Compilers should be smart enough to
> fold this.

OK.

> ::: testing/web-platform/meta/encoding/big5-encoder.html.ini
> @@ +1,1 @@
> > +[big5-encoder.html]
> 
> Why do these tests fail?

The tests for unmappables fail, because our URL implementation doesn't do the spec-compliant thing with unmappables. For this reason, I also added a Gecko-specific mochitest that uses the form submission code path to reach the encoder instead.
Now using less magical numbers for UTF-16 math and not using double underscores.
Attachment #8657777 - Attachment is obsolete: true
Attachment #8657777 - Flags: review?(VYV03354)
Attachment #8659272 - Flags: review?(VYV03354)
Comment on attachment 8659272 [details] [diff] [review]
Part 2: Reimplement Big5 encoder, v6

LGTM
Attachment #8659272 - Flags: review?(VYV03354) → review+
Release Note Request (optional, but appreciated)
[Why is this notable]:
See comment 88. (Requires Hong Kong Windows XP users to install a patch that backports fonts from Windows 7.)

[Suggested wording]:
See the second paragraph of comment 88.

[Links (documentation, blog post, etc)]:
https://www.microsoft.com/en-us/download/details.aspx?id=10109

(In reply to Masatoshi Kimura [:emk] from comment #95)
> LGTM

Thank you. Landed.
relnote-firefox: --- → ?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5073378ac07
Attachment #8659415 - Flags: review?(VYV03354)
Attachment #8659415 - Flags: review?(VYV03354) → review+
For the record, these patches made the Android ARMv7 (API 11) libxul (and also the apk) 52 KB smaller.
https://hg.mozilla.org/mozilla-central/rev/5cb4f84b2b86
https://hg.mozilla.org/mozilla-central/rev/10a1e7a746b1
https://hg.mozilla.org/mozilla-central/rev/01d4b53ea438
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
The release notes are usually quite short, just one line. We can link to a longer blog post or MDN documentation though. Henri, can you write something that I can link to other than comment 88? 

Suggested wording: Improved support for legacy-encoded Traditional Chinese pages (Big5, including HKSCS)
Flags: needinfo?(hsivonen)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #103)
> The release notes are usually quite short, just one line. We can link to a
> longer blog post or MDN documentation though. Henri, can you write something
> that I can link to other than comment 88? 

To be clear, I didn't suggest comment 88 be used as a release note in its entirety. I suggested using just the second paragraph as the release note. I.e. the paragraph that read:

Made support for legacy-encoded Traditional Chinese pages (Big5, including HKSCS) more compatible with other browsers and with the fonts and input methods of Windows Vista and later, Mac OS X and Linux systems. Users of Windows XP who read content that might include Hong Kong supplementary characters and who are unable to upgrade away from Windows XP should install <a href="https://www.microsoft.com/en-us/download/details.aspx?id=10109">Windows XP Font Pack for ISO 10646:2003 + Amendment 1 Traditional Chinese Support</a>.

> Suggested wording: Improved support for legacy-encoded Traditional Chinese
> pages (Big5, including HKSCS)

That we improved Big5 support is not worth a release note in itself. The key release note-worthy point is that users who are still on Hong Kong-localized XP should install a patch from Microsoft.

Suggested shorter wording:

Due to improvements in Firefox's Big5 support (including HKSCS), users of Windows XP who read sites that use Hong Kong supplementary characters should install <a href="https://www.microsoft.com/en-us/download/details.aspx?id=10109">Windows XP Font Pack for ISO 10646:2003 + Amendment 1 Traditional Chinese Support</a> from Microsoft.
Flags: needinfo?(hsivonen)
Blocks: 1231078
Flags: needinfo?(s793016)