Closed Bug 951691 Opened 11 years ago Closed 10 years ago

Make gbk a label of gb18030

Categories: Core :: Internationalization (enhancement, P4)
Status: RESOLVED WONTFIX
People: Reporter: hsivonen; Assignee: Unassigned
References: Blocks 1 open bug
The fix for https://www.w3.org/Bugs/Public/show_bug.cgi?id=16862 made gbk a label of gb18030 in the Encoding Standard. Therefore, we should make gbk a label of gb18030 in Gecko as well.

Risk:
Characters previously submitted as &#....; in forms become valid GB18030 byte sequences, and it is imaginable that there could exist a site that expects &#....; and can't deal with the new byte sequences.
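Python's codecs can illustrate this risk, assuming its `gbk` and `gb18030` codecs approximate the browser encoders (the `xmlcharrefreplace` error handler stands in for the NCR fallback browsers use in form submission):

```python
# U+20000 is outside GBK's two-byte repertoire but has a
# four-byte GB18030 encoding.
ch = "\U00020000"

# A form submitter using a plain GBK encoder falls back to a
# numeric character reference for unencodable characters.
as_gbk = ch.encode("gbk", "xmlcharrefreplace")
print(as_gbk)  # b'&#131072;'

# After relabeling gbk as gb18030, the same character would be
# submitted as a four-byte sequence instead.
as_gb18030 = ch.encode("gb18030")
print(len(as_gb18030))  # 4
```

A site that parses the submitted value expecting `&#...;` would see raw four-byte sequences after the change.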
Blocks: encoding
(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Characters previously submitted as &#....; in forms become valid GB18030
> byte sequences  and it is imaginable that there could exist a site that
> expects &#....; and can't deal with the new byte sequences.

Also, the way the euro sign is submitted in forms changes.
Severity: normal → enhancement
Priority: -- → P4
We could support 4-byte sequences only in the decoder to mitigate the compatibility issue. I'll file a spec bug later. (Sorry that I couldn't point this out before the spec text was changed.)
(In reply to Masatoshi Kimura [:emk] from comment #2)
> We could support 4-byte sequences only in the decoder to mitigate the
> compatibility issue. I'll file a spec bug later. (Sorry that I couldn't
> point this out before the spec text was changed.)

At this point, we don't even know if there is a real compatibility issue worth addressing. Should we just try doing this and see if anyone complains, or should we do something more complicated just in case there might be compatibility issues?

FWIW, submitting bytes that in the de jure sense don't belong to a label is nothing new considering Windows supersets of ISO-8859-n. Maintaining the separation of GBK and GB18030 just for the euro sign, which isn't even a Chinese symbol, in form submissions would be kinda sad.
(In reply to Henri Sivonen (:hsivonen) from comment #3)
>  Should we just try doing this and see if anyone complains,

It actually caused compatibility problems for EUC-JP and Big5, so these encodings adopted asymmetric mappings. But I'm not sure about Simplified Chinese, so I'm fine with the approach. Given that nobody complained when Firefox stopped decoding 4 byte sequences for "gb2312" label, the usage rate of 4 byte sequences would be pretty low anyway.
(In reply to Masatoshi Kimura [:emk] from comment #4)
> (In reply to Henri Sivonen (:hsivonen) from comment #3)
> >  Should we just try doing this and see if anyone complains,
> 
> It actually caused compatibility problems for EUC-JP and Big5, so these
> encodings adopted asymmetric mappings.

Where can I read more about this?
(In reply to Masatoshi Kimura [:emk] from comment #6)
> Bug 600715 and bug 310299.

Thank you.

Annoying that neither bug lists URLs for broken sites and neither bug mentions concrete characters that could be used to test what other browsers are doing these days. :-(
OK, let me explain about EUC-JP. All JIS X 0212 characters were encoded to the jure JIS X 0212 code points only by Firefox, unless the character happened to be present in the IBM extension.
For example, 鷗 was encoded to 8F EC BF by older Firefox, and is encoded as &#40407; by other browsers and later Firefox. The former is unreadable by Internet Explorer, so people complained. The affected characters are not only exotic Japanese kanji but also many accented Latin letters, because JIS X 0212 has accented letters. Two major Japanese sites, Mixi[1] and Hatena Diary[2], still use EUC-JP. This is unresolvable unless Internet Explorer supports decoding JIS X 0212 code points in EUC-JP.
[1] http://mixi.jp/
[2] http://d.hatena.ne.jp/
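Python's `euc_jp` codec still maps JIS X 0212 symmetrically (unlike a WHATWG-conformant encoder, which returns an error for these characters), so it can reproduce the old Firefox behavior described above, assuming its tables match:

```python
# 鷗 (U+9DD7) lives in JIS X 0212; its EUC-JP form is
# SS3 (0x8F) followed by two bytes.
ch = "\u9dd7"
encoded = ch.encode("euc_jp")
print(encoded.hex())  # '8fecbf', matching the bytes cited above

# Decoding the SS3 sequence works in any JIS X 0212-aware
# decoder; IE's decoder lacked this, which caused the breakage.
assert encoded.decode("euc_jp") == ch
```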

Regarding GB18030, the situation would be slightly better because IE supports GB18030: people can switch the encoding to GB18030 manually to view the garbled characters. But it is bad UX to force users to switch the encoding manually. I don't know how often GB18030 4-byte sequences are used, but the labels GB2312 and GBK are used very broadly.

Please ask Moztw folks about Big5.
> jure JIS X 0212 code points
de jure JIS X 0212 code points
In general, it would be problematic to add the mappings to the encoder. It was not an issue for Windows code pages such as windows-1252, because Internet Explorer (of course) supports the Windows code pages. However, EUC-JP and Big5-UAO are not supported.
In the first place, people should migrate to UTF-8. Do we really want to add any new encoder mappings for non-UTF-8 encodings at all?
(In reply to Masatoshi Kimura [:emk] from comment #8)
> OK, let me explain about EUC-JP.

Thank you.

> Regarding to GB18030

Data point: Until bug 844082, the Simplified Chinese localization handled unlabeled pages as GB18030 and, hence, submitted GB18030 despite IE handling the same sites as gbk, and that didn't bother people enough to result in bug reports from China.

> Please ask Moztw folks about Big5.

OK.
I think everyone agrees on the decoder. That can be gb18030 for all.

So the question is whether we need to distinguish labels for the encoder and then what the behavior of that encoder should be. The simplest would be to not use a different encoder. If we do decide to use a different encoder for gbk we need to figure out if it needs a two-byte index that is distinct from the two-byte index gb18030 uses.

I would personally favor trying out the simplest solution. Should we do it behind a pref so we can change it just before release if it turns out to be problematic?
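The decoder half of this is low-risk because GB18030 is byte-wise a superset of GBK; a quick check with Python's codecs (assuming they track the respective specs):

```python
# "你好" in GB2312/GBK two-byte encoding; GB18030 is a superset,
# so the same bytes decode identically under both decoders.
gbk_bytes = b"\xc4\xe3\xba\xc3"
assert gbk_bytes.decode("gbk") == gbk_bytes.decode("gb18030") == "你好"

# Four-byte sequences decode only under gb18030, so routing
# gbk-labeled pages through the gb18030 decoder strictly adds
# coverage rather than changing any existing result.
four_byte = "\U00020000".encode("gb18030")
assert four_byte.decode("gb18030") == "\U00020000"
try:
    four_byte.decode("gbk")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```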
(In reply to Masatoshi Kimura [:emk] from comment #10)
> In the first place, people should migrate to UTF-8.

Yes. There's also the risk, giving unlabeled pages an UTF (GB18030) automagically (as the fallback) might lower the pressure for proper UTF-8 migration.

> Do we really want to add
> any new encoder mappings for non-UTF-8 encodings at all?

Good question. It's very tempting to try to make the current labels of gbk labels of gb18030 and then have just gb18030. Since we were already able to use gb18030 as the fallback for the Simplified Chinese, we might get away with just doing the aliasing the current draft of the Encoding Standard says. Still, I'm a bit worried of potentially causing users grief over something that's almost in the theoretical elegance bucket as far as implementation burden goes.

Anyway, it seems clear that having two options for Simplified Chinese in the menu doesn't provide real value to the user. Also, it's hard to see why the user would benefit from not using the full GB18030 *decoder* for GBK-labeled pages that contain GB18030 byte sequences.

This would be safe as a first step:
 * Keep gbk and gb18030 as separate encodings.
 * Make the decoder for gbk identical to the decoder for gb18030.
 * Keep gbk as the fallback.
 * Put only gb18030 in the menu.
 * Label the one menu item as just "Simplified Chinese".
 * Make the menu show the checkmark even if the encoding of the current page is gbk.
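The first-step proposal above amounts to a table like this (a hedged illustration only, with invented names, not Gecko's actual data structures):

```python
# Hypothetical label table: gbk and gb18030 stay distinct
# encodings with distinct encoders, but both route to the
# gb18030 decoder, and only gb18030 appears in the menu.
ENCODINGS = {
    "gbk":     {"decoder": "gb18030", "encoder": "gbk",     "in_menu": False},
    "gb18030": {"decoder": "gb18030", "encoder": "gb18030", "in_menu": True},
}

def decoder_for(label):
    # Both labels share one decoder, so GBK-labeled pages
    # containing GB18030 four-byte sequences still decode.
    return ENCODINGS[label]["decoder"]

assert decoder_for("gbk") == decoder_for("gb18030") == "gb18030"
```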

If all browsers follow, maybe later we could make the gbk labels be labels of gb18030, but at that point changing the alias table might not be worthwhile anymore. But going all the way up front worries me a bit, considering how popular IE is in China and that the form submission problem has already been seen with two encodings (EUC-JP and Big5).

Anne, do you have more arguments in favor of unification?

(In reply to Anne (:annevk) from comment #12)
> I would personally favor trying out the simplest solution.

Simplicity is appealing, yes.

> Should we do it
> behind a pref so we can change it just before release if it turns out to be
> problematic?

Do you mean keeping the encodings distinct, having both instantiate the GB18030 decoder, and having the encoder that GBK instantiates be controlled by a pref?

Or do you mean making the pref even control the label handling?
(In reply to Henri Sivonen (:hsivonen) from comment #13)
> Or do you mean making the pref even control the label handling?

Hmm. Actually, that would be fairly doable, too.
Not unifying would be okay with me, provided that we use a single two-byte index; it seems like unnecessary complexity and footprint to have two of those. Other than that I do not feel strongly here. It has mostly been you and Philip that have asked me whether we could do a merge here. And I think emk is right in thinking that doing it on the encoder side too is probably more risk than we want.
I thought about this over the weekend, and I came to the conclusion that we should prefer interop over elegance and, sadly, that means not unifying the encoders.

(In reply to Anne (:annevk) from comment #15)
> Not unifying would be okay to me provided that we use a single two-byte
> index.

If the euro sign were special-cased in the gbk encoder in the spec, would having a single two-byte index match what Firefox already does?

> And I think emk
> is right in thinking that doing it on the encoder side too is probably more
> risk than we want.

OK. I guess this means this is WONTFIX, then. I'll file new bugs about decoding and the menu.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
See Also: → 959058
Filed bug 959058.
I don't think it would match as gbk does not have the PUA entries gb18030 has. In Gecko gbk is not layered on top of gb18030 as far as I know.
(In reply to Anne (:annevk) from comment #18)
> I don't think it would match as gbk does not have the PUA entries gb18030
> has.

Wouldn't it be pretty simple, spec-wise, to say "If the encoder is in the GBK mode and the candidate character is in the Private Use Area, return error."?

> In Gecko gbk is not layered on top of gb18030 as far as I know.

nsUnicodeToGB18030 inherits from nsUnicodeToGBK but the two end up using different tables for the 2-byte sequences.
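The suggested "return error for PUA in GBK mode" rule could be sketched as a thin wrapper over one shared index (a hypothetical illustration; Python's gb18030 codec stands in for the shared two-byte/four-byte index lookup):

```python
# Hypothetical sketch of the proposed spec rule: one shared
# index, with the encoder's GBK mode returning error for
# Private Use Area characters.
def encode_char(ch, is_gbk):
    cp = ord(ch)
    if is_gbk and 0xE000 <= cp <= 0xF8FF:
        return None  # "return error" in spec terms
    # Shared index lookup; GB18030 can represent any code point.
    return ch.encode("gb18030")

assert encode_char("\ue000", is_gbk=True) is None
assert encode_char("\ue000", is_gbk=False) is not None
```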
We could do that I suppose.