844114 - The Traditional Chinese localization should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Description

12 years ago

In bug 536506, the Traditional Chinese localization started defaulting to UTF-8 and the "universal" detector. The change to UTF-8 was motivated by its effect on the Accept-Charset header, which we no longer send. The enablement of a detector (previously none was enabled by default) was motivated by the need to mitigate the impact of the change to UTF-8.

Since Accept-Charset is no longer an issue, I think we should revert the fallback encoding to big5. If this also removes the need for a detector, we should default to not having a detector enabled (for performance and for predictable behavior). If users of the Traditional Chinese localization encounter unlabeled Simplified Chinese content so often that some detector is needed, the pan-Chinese detector (pref value: zh_parallel_state_machine) should be used instead of universal.
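To illustrate what the fallback encoding controls, here is a minimal Python sketch. It is not Gecko code. It assumes a simplified model in which a declared label always wins and the fallback is consulted only for unlabeled content. The function name `decode_page` and its parameters are hypothetical.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str], fallback: str) -> str:
    """Decode page bytes with the declared encoding if any, else the fallback.

    Simplified model for illustration; a real browser does much more.
    """
    encoding = declared if declared is not None else fallback
    # errors="replace" mimics showing U+FFFD instead of failing outright.
    return raw.decode(encoding, errors="replace")


# An unlabeled legacy page authored in Big5.
legacy_page = "繁體中文".encode("big5")

as_big5 = decode_page(legacy_page, declared=None, fallback="big5")
as_utf8 = decode_page(legacy_page, declared=None, fallback="utf-8")

print(as_big5)             # readable with the big5 fallback
print(as_utf8 == as_big5)  # False: the UTF-8 fallback garbles the text
```

With no detector enabled, the fallback alone decides how such a page is shown, which is why the choice between UTF-8 and big5 matters for unlabeled legacy content.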

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Updated

12 years ago

Summary: The Traditional Chinese should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal → The Traditional Chinese localization should use big5 as the fallback encoding and the pan-Chinese detector or no detector instead of universal

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Updated

12 years ago

Blocks: 844115

Peter Pin-Guang Chen [:petercpg] (MozTW.org)

Assignee

Comment 1

12 years ago

Henri, one foreseeable situation is that if a user opens an ancient file encoded in Shift_JIS with no encoding declared in <meta>, it would show up as garbled text. Do I understand that correctly?

Does it still make a difference in performance to set intl.charset.detector to zh_ or cjk_parallel_state_machine rather than universal?
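The Shift_JIS scenario can be reproduced outside the browser. The short Python sketch below only demonstrates the byte-level effect of decoding unlabeled Shift_JIS as big5. The sample string is made up, and Python's codecs are used as a stand-in for the browser's decoders.

```python
# A made-up Japanese page body, saved as Shift_JIS with no encoding label.
original = "日本語のページ"
raw = original.encode("shift_jis")

# What a big5 fallback with no detector would show (U+FFFD for bad bytes).
shown = raw.decode("big5", errors="replace")

# What a correct label (or a correct detector guess) would show.
labeled = raw.decode("shift_jis")

print(shown == original)    # False: garbled under the big5 fallback
print(labeled == original)  # True
```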

Masatoshi Kimura [:emk]

Comment 2

12 years ago

(In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> Does it still make differences in performance while setting
> intl.charset.detector to zh_ or cjk_parallel_state_machine rather than
> universal?

Now every detector is just the universal detector with a language filter. The detector names "zh_*" and "cjk_parallel_state_machine" were left unchanged for compatibility.
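A rough way to picture "the universal detector with a language filter" is the Python sketch below. Everything in it is an assumption made for illustration: the contents of `LANGUAGE_FILTERS`, the function names, and the toy detection method (strict decoding) are hypothetical and do not reflect the actual chardet implementation.

```python
from typing import List, Optional

# Hypothetical filter table (illustrative values only): detector pref value
# -> encodings that detector is allowed to report.
LANGUAGE_FILTERS = {
    "universal": {"utf-8", "shift_jis", "euc-kr", "big5", "gb18030"},
    "cjk_parallel_state_machine": {"shift_jis", "euc-kr", "big5", "gb18030"},
    "zh_parallel_state_machine": {"big5", "gb18030"},
}


def universal_candidates(data: bytes) -> List[str]:
    """Toy 'universal' detector: encodings under which the bytes decode strictly."""
    known = ["utf-8", "shift_jis", "euc-kr", "big5", "gb18030"]
    candidates = []
    for enc in known:
        try:
            data.decode(enc)
            candidates.append(enc)
        except UnicodeDecodeError:
            pass
    return candidates


def detect(data: bytes, detector: str) -> Optional[str]:
    """Run the same toy detector, then keep only what the filter allows."""
    allowed = LANGUAGE_FILTERS[detector]
    for enc in universal_candidates(data):
        if enc in allowed:
            return enc
    return None


sjis_page = "日本語".encode("shift_jis")
print(detect(sjis_page, "universal"))                  # shift_jis
print(detect(sjis_page, "zh_parallel_state_machine"))  # never shift_jis
```

In this model the same detection work runs regardless of the pref value, and the filter only narrows which answers can be reported.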

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 3

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #0)
> Since Accept-Charset is no longer an issue, I think we should revert the
> fallback encoding to big5.

IMHO, even though Accept-Charset is not an issue, setting the fallback encoding back to big5 will still contribute to the different-than-en-US behavior described in bug 536506 comment 0 point b.

I don't see the reason not to use the detector if it's already there in the code base. Unless we have decided to remove it ...

Anne (:annevk)

Comment 4

12 years ago

We want to see if we can do with less heuristics and over time at least disable the universal detector. We also want to more closely align with what other shipping browsers default to.

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 5

12 years ago

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> I don't see the reason not to use the detector if it's already there in the
> code base. Unless we have decided to remove it ...

Ah, bug 844115 is about removing the detector ... :-/

(In reply to Anne van Kesteren from comment #4)
> We want to see if we can do with less heuristics and over time at least
> disable the universal detector. We also want to more closely align with what
> other shipping browsers default to.

Appreciate the speedy reply. I tend to think that "do what others do" is not a good argument; switching back to big5 essentially breaks interoperability between the different locales of Firefox, e.g. a plain text file on the hard drive (thus without an HTTP header or <meta>) will be decoded as big5 in zh-TW Firefox, but as other encodings in Firefox of other locales.

That said, there are still outdated government websites that use big5 without declaring anything. Without the universal detector, switching back to big5 would be the only way to prevent those websites from breaking.

Bye, chardet, you will be missed.

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Comment 6

12 years ago

(In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> One foreseeable situation is that if user opens an ancient file encoded in
> Shift_JIS with no encoding defined in <meta> would fall into garbled text,
> do I realize it that correctly?

Yes. It is worth noting that unlabeled Shift_JIS shows up garbled by default in all non-Japanese Firefox localizations other than zh-TW. Why would zh-TW need to be different?

> Does it still make differences in performance while setting
> intl.charset.detector to zh_ or cjk_parallel_state_machine rather than
> universal?

Not sure. Do zh-TW users read unlabeled Japanese and Korean content so much more commonly than zh-CN users that detecting Japanese and Korean is necessary by default?

(In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> (In reply to Henri Sivonen (:hsivonen) from comment #0)
> > Since Accept-Charset is no longer an issue, I think we should revert the
> > fallback encoding to big5.
>
> IMHO, even though Accept-Charset is not an issue, set the fallback encoding
> back to big-5 will still contribute different-than-en-US behavior described
> in bug 536506 comment 0 point b.

If you have a detector enabled, you have different-than-en-US behavior. If you want same-as-en-US behavior, you need to set the fallback to ISO-8859-1 and turn the detector off.

> I don't see the reason not to use the detector if it's already there in the
> code base.

Reasons for not using a detector:
1) When pages rely on the detector and the first 1024 bytes are not enough to make the detector fire, they get reloaded in mid-stream which is bad for performance, bad for user experience and bad because scripts may run twice.
2) Enabling the detector leads to new pages that rely on the detector to be authored.
3) When pages rely on the detector, by default, they only work with browser localizations that have the detector they rely on enabled by default. Therefore, the pages are broken when viewed in other localizations, which is bad for users who have a reason to view pages from outside their home locale.
4) #3 won't be solved by enabling the universal detector in all locales. The universal detector won't be enabled for any new locale, because it's not really universal (i.e. would break badly in some locales in its current state) and because of point #1.

Since no detector will be enabled for the en-US locale, you can't have same-as-en-US behavior except by setting the fallback to ISO-8859-1 and turning off the detector. If zh-TW users encounter so little unlabeled big5 content that shipping same-as-en-US defaults would be feasible, that would be totally awesome. We try to ship same-as-en-US defaults for fallback and detector state for all locales where feasible.

http://gs.statcounter.com/#browser-TW-monthly-201201-201301 indicates that IE and Chrome have way more market share than Firefox in Taiwan. What are their defaults for zh-TW?
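Reason 1 above (the mid-stream reload) can be sketched with a toy model in Python. This is not how the real detector works: `toy_detect` simply picks the first candidate encoding that decodes the bytes strictly, and `load`, `SNIFF_LIMIT` and the sample pages are hypothetical names and data chosen for illustration. Only the 1024-byte figure comes from the comment.

```python
from typing import Optional, Tuple

SNIFF_LIMIT = 1024  # prefix size mentioned in reason 1


def toy_detect(data: bytes, candidates=("utf-8", "big5")) -> Optional[str]:
    """Hypothetical stand-in for a detector: first candidate that decodes strictly.

    Returns None for pure ASCII, since ASCII gives no evidence either way.
    """
    if data.isascii():
        return None
    for enc in candidates:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None


def load(page: bytes, fallback: str) -> Tuple[str, bool]:
    """Return (final_encoding, reloaded) for an unlabeled page."""
    guess = toy_detect(page[:SNIFF_LIMIT])
    if guess is not None:
        return guess, False      # decided within the prefix, no reload
    # Parsing starts with the fallback; the detector keeps watching the stream.
    late_guess = toy_detect(page)
    if late_guess is not None and late_guess != fallback:
        return late_guess, True  # mid-stream reload; scripts may run twice
    return fallback, False


body = "中文".encode("big5")
late_page = b"<!-- " + b"x" * 2000 + b" -->" + body  # non-ASCII after 1024 bytes
early_page = body + b"x" * 2000                       # non-ASCII in the prefix

print(load(late_page, "utf-8"))   # ('big5', True): reload needed
print(load(late_page, "big5"))    # ('big5', False): fallback was already right
print(load(early_page, "utf-8"))  # ('big5', False): detector fired in time
```

In this model, a fallback that matches the content avoids the reload even when the detector cannot decide within the prefix.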

Peter Pin-Guang Chen [:petercpg] (MozTW.org)

Assignee

Comment 7

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #6)
> (In reply to Peter Pin-Guang Chen [:petercpg] (MozTW.org) from comment #1)
> > One foreseeable situation is that if user opens an ancient file encoded in
> > Shift_JIS with no encoding defined in <meta> would fall into garbled text,
> > do I realize it that correctly?
>
> Yes. It is worth noting that unlabeled Shift_JIS shows up garbled by default
> in all non-Japanese Firefox localizations other than zh-TW. Why would zh-TW
> need to be different?

Agreed. I just took Japanese as an example to make sure I hadn't misunderstood.

> > Does it still make differences in performance while setting
> > intl.charset.detector to zh_ or cjk_parallel_state_machine rather than
> > universal?
>
> Not sure. Do zh-TW users read unlabeled Japanese and Korean content so much
> more commonly than zh-CN users that detecting Japanese and Korean is
> necessary by default?
>
> (In reply to Tim Guan-tin Chien [:timdream] (MoCo-TPE) from comment #3)
> > (In reply to Henri Sivonen (:hsivonen) from comment #0)
> > > Since Accept-Charset is no longer an issue, I think we should revert the
> > > fallback encoding to big5.
> >
> > IMHO, even though Accept-Charset is not an issue, set the fallback encoding
> > back to big-5 will still contribute different-than-en-US behavior described
> > in bug 536506 comment 0 point b.
>
> If you have a detector enabled, you have different-than-en-US behavior. If
> you want same-as-en-US behavior, you need to set the fallback to ISO-8859-1
> and turn the detector off.
>
> > I don't see the reason not to use the detector if it's already there in the
> > code base.
>
> Reasons for not using a detector:
> 1) When pages rely on the detector and the first 1024 bytes are not enough
> to make the detector fire, they get reloaded in mid-stream which is bad for
> performance, bad for user experience and bad because scripts may run twice.
> 2) Enabling the detector leads to new pages that rely on the detector to be
> authored.
> 3) When pages rely on the detector, by default, they only work with browser
> localizations that have the detector they rely on enabled by default.
> Therefore, the pages are broken when viewed in other localizations, which is
> bad for users who have a reason to view pages from outside their home locale.
> 4) #3 won't be solved by enabling the universal detector in all locales.
> The universal detector won't be enabled for any new locale, because it's not
> really universal (i.e. would break badly in some locales in its current
> state) and because of point #1.
>
> Since no detector will be enabled for the en-US locale, you can't have
> same-as-en-US behavior except by setting the fallback to ISO-8859-1 and
> turning off the detector. If zh-TW users encounter so little unlabeled big5
> content that shipping same-as-en-US defaults would be feasible, that would
> be totally awesome. We try to ship same-as-en-US defaults for fallback and
> detector state for all locales where feasible.

My opinion is that we can change the fallback encoding to Big5 to keep compatibility for some more time, and then turn chardet off, since the Accept-Charset issue is gone, as you mentioned.

> http://gs.statcounter.com/#browser-TW-monthly-201201-201301 indicates that
> IE and Chrome have way more market share than Firefox in Taiwan. What are
> their defaults for zh-TW?

Just tested IE10 and Chrome 25 on Win7 (all zh-TW versions). Both of them fall back to Big5 if no encoding is set, and by default their encoding auto-detection is disabled. It is observed that Chrome keeps sending Accept-Charset, though.

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Comment 8

12 years ago

Great. Let's set the fallback to big5 and turn off the detector.

Peter Pin-Guang Chen [:petercpg] (MozTW.org)

Assignee

Comment 9

12 years ago

Patch landed to -central and -aurora, changesets:
https://hg.mozilla.org/l10n-central/zh-TW/rev/c2c725b2f173
https://hg.mozilla.org/releases/l10n/mozilla-aurora/zh-TW/rev/4a58c216f82e

Need uplift to beta?

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Peter Pin-Guang Chen [:petercpg] (MozTW.org)

Assignee

Comment 10

12 years ago

Verified on latest nightly.

Status: RESOLVED → VERIFIED

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Comment 11

12 years ago

Awesome. Thank you. I think this doesn't need a beta uplift and can ride the trains.

Status: VERIFIED → RESOLVED

Closed: 12 years ago → 12 years ago

Henri Sivonen (:hsivonen) (temporarily away from Bugzilla)

Reporter

Comment 12

11 years ago

Note for future Bugzilla archeologists: Today, I realized that while this patch successfully turned off the detector, it failed to change the fallback encoding, because the platform-specific intl.properties files were not updated. However, at this point, no action is needed, because bug 910192 fixed the fallback for real for Firefox 28 and it's too late to fix Firefox 27.

Categories

(Mozilla Localizations :: zh-TW / Chinese (Traditional), defect)

Tracking

(Not tracked)

People

(Reporter: hsivonen, Assigned: petercpg)
