Closed Bug 844114 Opened 7 years ago Closed 7 years ago

The Traditional Chinese localization should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal

Categories

(Mozilla Localizations :: zh-TW / Chinese (Traditional), defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hsivonen, Assigned: petercpg)

In bug 536506, the Traditional Chinese localization started defaulting to UTF-8 and the "universal" detector.

The change to UTF-8 was motivated by its effect on the Accept-Charset header, which we no longer send.

The enablement of a detector (previously none was enabled by default) was motivated by the need to mitigate the impact of the change to UTF-8.

Since Accept-Charset is no longer an issue, I think we should revert the fallback encoding to big5. If this also removes the need for a detector, we should default to not having a detector enabled (for performance and for predictable behavior). If users of the Traditional Chinese localization encounter unlabeled Simplified Chinese content so often that some detector is needed, the pan-Chinese detector (pref value: zh_parallel_state_machine) should be used instead of universal.
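To make the proposal concrete, here is a minimal sketch of the order in which an encoding could be chosen for a page. It is an illustration only, not Gecko code: the function `pick_encoding`, its parameters and the `Detector` type are hypothetical names, and the order (declared label first, then the detector if one is enabled, then the fallback) is an assumption drawn from this discussion.

```python
from typing import Callable, Optional

# Hypothetical detector type: takes raw bytes, returns a guess or None.
Detector = Callable[[bytes], Optional[str]]

def pick_encoding(declared: Optional[str],
                  detector: Optional[Detector],
                  fallback: str,
                  data: bytes) -> str:
    """Choose an encoding for `data` (assumed order, not Gecko's code)."""
    if declared:                 # HTTP header or <meta> wins
        return declared
    if detector is not None:     # only consulted when a detector is enabled
        guess = detector(data)
        if guess:
            return guess
    return fallback              # locale-specific default

# Proposed zh-TW defaults: big5 fallback, no detector.
unlabeled = pick_encoding(None, None, "big5", b"\xa4\xa4\xa4\xe5")
labeled = pick_encoding("utf-8", None, "big5", b"abc")
print(unlabeled, labeled)
```

With no detector enabled, the result for unlabeled content depends only on the fallback, which is the predictable behavior argued for above.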
Summary: The Traditional Chinese should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal → The Traditional Chinese localization should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal
Blocks: 844115
Henri, 

One foreseeable situation: if a user opens an old file encoded in Shift_JIS with no encoding declared in <meta>, it will show up as garbled text. Do I understand that correctly?

Does it still make a difference in performance to set intl.charset.detector to zh_parallel_state_machine or cjk_parallel_state_machine rather than universal?
(In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> Does it still make a difference in performance to set intl.charset.detector
> to zh_parallel_state_machine or cjk_parallel_state_machine rather than
> universal?

Now every detector is just the universal detector with a language filter. The detector names "zh_*" and "cjk_parallel_state_machine" were kept unchanged for compatibility.
(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Since Accept-Charset is no longer an issue, I think we should revert the
> fallback encoding to big5. 

IMHO, even though Accept-Charset is not an issue, setting the fallback encoding back to big5 will still contribute to the different-than-en-US behavior described in bug 536506 comment 0 point b.

I don't see the reason not to use the detector if it's already there in the code base. Unless we have decided to remove it ...
We want to see if we can do with less heuristics and over time at least disable the universal detector. We also want to more closely align with what other shipping browsers default to.
(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> I don't see the reason not to use the detector if it's already there in the
> code base. Unless we have decided to remove it ...

Ah, bug 844115 is about removing the detector ... :-/

(In reply to Anne van Kesteren from comment #4)
> We want to see if we can do with less heuristics and over time at least
> disable the universal detector. We also want to more closely align with what
> other shipping browsers default to.

Appreciate the speedy reply. I tend to think that "do what others do" is not a good argument; switching back to big5 essentially breaks interoperability between the different Firefox locales. For example, a plain text file on the hard drive (thus without an HTTP header or <meta>) will be decoded as big5 in zh-TW Firefox, but as other encodings in Firefox builds of other locales.
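The plain-text-file case can be sketched in a few lines. This is only an illustration: the `fallbacks` table is a hypothetical stand-in for per-locale defaults, using big5 for zh-TW and ISO-8859-1 for en-US as mentioned in this bug.

```python
# The same unlabeled bytes, decoded with two different locale fallbacks.
original = "中文網頁"
raw = original.encode("big5")    # file saved as Big5, no label anywhere

# Hypothetical per-locale fallback table (values taken from this thread).
fallbacks = {"zh-TW": "big5", "en-US": "iso-8859-1"}

results = {locale: raw.decode(enc, errors="replace")
           for locale, enc in fallbacks.items()}

for locale, text in results.items():
    print(locale, repr(text), text == original)
```

Only the locale whose fallback matches the file's real encoding reproduces the original text; the other one shows mojibake.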

That said, there are still outdated government websites that use big5 without declaring anything. Without the universal detector, switching back to big5 would be the only way to prevent those websites from breaking.

Bye, chardet, you will be missed.
(In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> One foreseeable situation: if a user opens an old file encoded in Shift_JIS
> with no encoding declared in <meta>, it will show up as garbled text. Do I
> understand that correctly?

Yes. It is worth noting that unlabeled Shift_JIS shows up garbled by default in all non-Japanese Firefox localizations other than zh-TW. Why would zh-TW need to be different?
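A small sketch of what "garbled" means here, assuming a big5 fallback and no detector. The sample string is arbitrary, and Python's codecs are used purely for illustration; they are not what the browser uses.

```python
# Japanese text saved as Shift_JIS, then read with no encoding label.
original = "日本語のテキスト"
raw = original.encode("shift_jis")

as_shift_jis = raw.decode("shift_jis")            # correctly labeled
as_big5 = raw.decode("big5", errors="replace")    # big5 fallback applied

print(as_shift_jis == original)   # the labeled case round-trips
print(as_big5 == original)        # the fallback case does not
print(repr(as_big5))
```

The same mismatch happens with any fallback other than Shift_JIS, which is why the zh-TW case is no different from other non-Japanese localizations.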

> Does it still make a difference in performance to set intl.charset.detector
> to zh_parallel_state_machine or cjk_parallel_state_machine rather than
> universal?

Not sure. Do zh-TW users read unlabeled Japanese and Korean content so much more commonly than zh-CN users that detecting Japanese and Korean is necessary by default?

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #0)
> > Since Accept-Charset is no longer an issue, I think we should revert the
> > fallback encoding to big5. 
> 
> IMHO, even though Accept-Charset is not an issue, setting the fallback
> encoding back to big5 will still contribute to the different-than-en-US
> behavior described in bug 536506 comment 0 point b.

If you have a detector enabled, you have different-than-en-US behavior. If you want same-as-en-US behavior, you need to set the fallback to ISO-8859-1 and turn the detector off.

> I don't see the reason not to use the detector if it's already there in the
> code base.

Reasons for not using a detector:
 1) When pages rely on the detector and the first 1024 bytes are not enough to make the detector fire, they get reloaded in mid-stream which is bad for performance, bad for user experience and bad because scripts may run twice.
 2) Enabling the detector leads to new pages that rely on the detector to be authored.
 3) When pages rely on the detector, by default, they only work with browser localizations that have the detector they rely on enabled by default. Therefore, the pages are broken when viewed in other localizations, which is bad for users who have a reason to view pages from outside their home locale.
 4) #3 won't be solved by enabling the universal detector in all locales. The universal detector won't be enabled for any new locale, because it's not really universal (i.e. would break badly in some locales in its current state) and because of point #1.
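Point 1 can be illustrated with a toy model. This is not how the real detector works: the function `needs_late_reload` is a hypothetical name, and "the prefix contains a non-ASCII byte" is used as a crude stand-in for "the detector has enough evidence to fire". Only the 1024-byte figure comes from the point above.

```python
PREFIX_LEN = 1024  # number of leading bytes examined, per point 1

def needs_late_reload(page: bytes, prefix_len: int = PREFIX_LEN) -> bool:
    """True if the prefix gives no evidence but later bytes do (toy model)."""
    prefix, rest = page[:prefix_len], page[prefix_len:]
    prefix_is_ascii = all(b < 0x80 for b in prefix)
    rest_has_non_ascii = any(b >= 0x80 for b in rest)
    return prefix_is_ascii and rest_has_non_ascii

boilerplate = b"<!-- " + b"x" * 2000 + b" -->"   # long ASCII-only markup
body = "中文".encode("big5")

early = body + boilerplate    # non-ASCII bytes inside the first 1024 bytes
late = boilerplate + body     # non-ASCII bytes only after the prefix

print(needs_late_reload(early), needs_late_reload(late))
```

In this model, a page with a long ASCII-only head is the one that would be decided late and reloaded mid-stream.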

Since no detector will be enabled for the en-US locale, you can't have same-as-en-US behavior except by setting the fallback to ISO-8859-1 and turning off the detector. If zh-TW users encounter so little unlabeled big5 content that shipping same-as-en-US defaults would be feasible, that would be totally awesome. We try to ship same-as-en-US defaults for fallback and detector state for all locales where feasible.

http://gs.statcounter.com/#browser-TW-monthly-201201-201301 indicates that IE and Chrome have way more market share than Firefox in Taiwan. What are their defaults for zh-TW?
(In reply to Henri Sivonen (:hsivonen) from comment #6)
> (In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> > One foreseeable situation: if a user opens an old file encoded in
> > Shift_JIS with no encoding declared in <meta>, it will show up as garbled
> > text. Do I understand that correctly?
> 
> Yes. It is worth noting that unlabeled Shift_JIS shows up garbled by default
> in all non-Japanese Firefox localizations other than zh-TW. Why would zh-TW
> need to be different?

Agreed. I just used Japanese as an example to make sure I understood it correctly.

> 
> > Does it still make a difference in performance to set
> > intl.charset.detector to zh_parallel_state_machine or
> > cjk_parallel_state_machine rather than universal?
> 
> Not sure. Do zh-TW users read unlabeled Japanese and Korean content so much
> more commonly than zh-CN users that detecting Japanese and Korean is
> necessary by default?
> 
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> > (In reply to Henri Sivonen (:hsivonen) from comment #0)
> > > Since Accept-Charset is no longer an issue, I think we should revert the
> > > fallback encoding to big5. 
> > 
> > IMHO, even though Accept-Charset is not an issue, setting the fallback
> > encoding back to big5 will still contribute to the different-than-en-US
> > behavior described in bug 536506 comment 0 point b.
> 
> If you have a detector enabled, you have different-than-en-US behavior. If
> you want same-as-en-US behavior, you need to set the fallback to ISO-8859-1
> and turn the detector off.
> 
> > I don't see the reason not to use the detector if it's already there in the
> > code base.
> 
> Reasons for not using a detector:
>  1) When pages rely on the detector and the first 1024 bytes are not enough
> to make the detector fire, they get reloaded in mid-stream which is bad for
> performance, bad for user experience and bad because scripts may run twice.
>  2) Enabling the detector leads to new pages that rely on the detector to be
> authored.
>  3) When pages rely on the detector, by default, they only work with browser
> localizations that have the detector they rely on enabled by default.
> Therefore, the pages are broken when viewed in other localizations, which is
> bad for users who have a reason to view pages from outside their home locale.
>  4) #3 won't be solved by enabling the universal detector in all locales.
> The universal detector won't be enabled for any new locale, because it's not
> really universal (i.e. would break badly in some locales in its current
> state) and because of point #1.
> 
> Since no detector will be enabled for the en-US locale, you can't have
> same-as-en-US behavior except by setting the fallback to ISO-8859-1 and
> turning off the detector. If zh-TW users encounter so little unlabeled big5
> content that shipping same-as-en-US defaults would be feasible, that would
> be totally awesome. We try to ship same-as-en-US defaults for fallback and
> detector state for all locales where feasible.

My opinion is that we can change the fallback encoding to Big5 to keep compatibility for some more time, then turn chardet off, since the Accept-Charset issue is gone, as you mentioned.

> 
> http://gs.statcounter.com/#browser-TW-monthly-201201-201301 indicates that
> IE and Chrome have way more market share than Firefox in Taiwan. What are
> their defaults for zh-TW?

I just tested IE10 and Chrome 25 on Win7 (all zh-TW versions). Both of them fall back to Big5 if no encoding is set, and their encoding auto-detection is disabled by default. Chrome still sends Accept-Charset, though.
Great. Let's set the fallback to big5 and turn off the detector.
Patch landed to -central and -aurora, changesets:

https://hg.mozilla.org/l10n-central/zh-TW/rev/c2c725b2f173
https://hg.mozilla.org/releases/l10n/mozilla-aurora/zh-TW/rev/4a58c216f82e

Need uplift to beta?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Verified on latest nightly.
Status: RESOLVED → VERIFIED
Awesome. Thank you. I think this doesn't need a beta uplift and can ride the trains.
Status: VERIFIED → RESOLVED
Closed: 7 years ago
Note for future Bugzilla archeologists:
Today, I realized that while this patch successfully turned off the detector, it failed to change the fallback encoding, because the platform-specific intl.properties files were not updated. However, at this point, no action is needed, because bug 910192 fixed the fallback for real for Firefox 28 and it's too late to fix Firefox 27.