Closed Bug 488433 Opened 15 years ago Closed 10 years ago

default charset localization is confusing

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED FIXED
mozilla28

People

(Reporter: Pike, Assigned: smontagu)

Details

(Whiteboard: [fixed by bug 910192])

The default charset setting is a tad confusing, we should clean that up.

For one, the entity is available both in global/intl.properties, and global-platform/intl.properties. Only the latter seems to be used, so we should remove the first, if we think this really needs to be platform specific. I'm not sure if that's true.

Looking at the source, Lithuanian uses windows-1257 on windows and ISO-8859-13 on the rest. Czech uses windows-1250 and ISO-8859-2. Simon, is that relevant?

Another comment we got is that it's surprising that the values here seem to depend on upper and lower case; we should add a comment to that effect.
This all dates back to before my time, but I did some code and bugzilla archæology. It was confusing at first, because the checkin comment for http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/xpfe/browser/resources/locale/en-US/unix/navigator.properties&root=/cvsroot rev 1.2 refers to bug 8124, which has nothing to do with this. I finally discovered via bug 97606 that this referred to http://bugscape.netscape.com/show_bug.cgi?id=8124, which was unavailable for comment at the time of writing, so all I have is the quotation in bug 97606 comment 13:

  "Currently, intl.charset.default is set to Shift_JIS for Win and mac, but for
  linux it should be set to EUC-JP."

I still don't understand why that is, and our current ja localization sets Shift_JIS on unix anyway. I wonder if intl.charset.default was once used for some other purpose which made this necessary at the time?

Would it cause unnecessary inconvenience for localizers if we did away with the platform specific properties?
Bug 723609 added a localization note regarding the case sensitivity of the value of intl.charset.default (and related properties).

I wonder if global-platform/intl.properties is necessary at all. It only contains intl.charset.default and intl.ellipsis.

From what I can tell, for intl.ellipsis only the Japanese locales have different values on different platforms: Mac keeps the default ellipsis and Win/Unix use 3 dots. ('pt-PT' is the only non-Japanese locale that uses 3 dots, but it uses them on all platforms. A handful of others use the \u2026 escape sequence instead of the literal ellipsis character.)

In addition to the two locales mentioned in comment 0 ('cs' and 'lt'), 'st' (Sesotho) uses ISO-8859-1 on Mac and ISO-8859-15 on Win/Unix.

Given how similar the charsets in question are, I can't imagine it would be much of a hassle to use the same charset on all platforms and get rid of global-platform/intl.properties altogether.

(After all, the ideal situation for the future is to get everyone using UTF-8 anyway, right? ^_^)
Hi folks,
regarding Lithuanian: I think it was set to use different character sets simply because it was possible. :) Windows didn't even support ISO-8859-13 properly up until some service pack of IE6 was released, which makes me wonder why my predecessors chose to use this charset by default, even if on non-windows platforms only.

I have absolutely no objections to using the same intl.charset.default on all platforms.
Hi,
generally I don't like the idea of having a common character set for all platforms.

Historically there used to be only windows-1250 on Windows and ISO-8859-2 on *nix, so using windows-1250 on Windows was the obvious choice. Moreover, it allows us to correctly render documents that don't declare a character set (there used to be many of them).

Using windows-1250 on *nix would be a somewhat controversial decision, because some people from the *nix community are strongly opposed to this "non-standard" encoding.
I think Henri has opinions on this.

I guess the default charset might come into play in two cases: loading files from disk, and loading data from the web without an explicit charset specified. Files from disk may be platform dependent, I guess, but I'm not sure how much I'd allow that to regress loading legacy content off the web.

Henri made the point in the past that the future we want is UTF-8 explicitly declared within files and HTTP headers, rather than the default charset changed to UTF-8, but I'll let him speak to that, in case that has changed.
(In reply to Pavel Franc - Mozilla.cz from comment #4)
> generally I don't like the idea of having a common character set for all
> platforms.
> 
> Historically there used to be only windows-1250 on Windows and ISO-8859-2 on
> *nix, so using windows-1250 on Windows was the obvious choice. Moreover, it
> allows us to correctly render documents that don't declare a character set
> (there used to be many of them).

There used to be, yes. But now, a document without a correctly declared character set is quite a rarity. Furthermore, I would assume that most of such documents that are accessible on-line are platform-independent, so it's hard for me to understand why a Linux user would want the same document to be represented in a different character set.

> Using windows-1250 on *nix would be a somewhat controversial decision,
> because some people from the *nix community are strongly opposed to this
> "non-standard" encoding.

People who are strongly opposed to defaulting to windows-something can easily change their default in the application preferences. I guess if you really want to follow standards to the letter, you should always default to ISO-8859-1 anyway. By the way, did you know that our ISO-8859-1 is actually windows-1252? How do you think those people from the *nix community feel about that? :)
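
To illustrate the ISO-8859-1 versus windows-1252 point, here is a minimal Python sketch. It uses Python's own "latin-1" and "cp1252" codecs as stand-ins and says nothing about how Gecko implements the aliasing; the sample bytes are chosen purely for illustration. The two encodings only disagree in the 0x80-0x9F range, where strict ISO-8859-1 has control characters and windows-1252 has printable punctuation.

```python
# Bytes 0x80-0x9F are the only range where the two encodings disagree.
sample = bytes([0x80, 0x93, 0x94])

# Strict ISO-8859-1: each byte maps to the C1 control character of the same value.
as_latin1 = sample.decode("latin-1")

# windows-1252: the same bytes are the euro sign and curly double quotes.
as_cp1252 = sample.decode("cp1252")

print([hex(ord(c)) for c in as_latin1])
print(as_cp1252)
```

Since legacy pages labelled ISO-8859-1 very often contain such bytes, decoding them as windows-1252 is the choice that shows readable text.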
> By the way, did you know that our ISO-8859-1 is
> actually windows-1252? 

I know that, but it is completely irrelevant to Czech users. We use windows-1250 in the Windows world and ISO-8859-2 in the *nix world, and these two sets differ in some important character positions.
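
For reference, a small Python sketch of which Czech letters sit at different byte positions in the two encodings. It relies on Python's "cp1250" and "iso8859_2" codecs; the letter list and the helper name differing_letters are made up for this illustration.

```python
# Accented letters of the Czech alphabet, lower case then upper case.
CZECH_LETTERS = "áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ"


def differing_letters(letters, enc_a="cp1250", enc_b="iso8859_2"):
    """Return the letters whose byte value differs between the two encodings."""
    return [ch for ch in letters if ch.encode(enc_a) != ch.encode(enc_b)]


# Print each differing letter with its byte in windows-1250 and in ISO-8859-2.
for ch in differing_letters(CZECH_LETTERS):
    print(ch, ch.encode("cp1250").hex(), ch.encode("iso8859_2").hex())
```

Only š, ť and ž (in both cases) differ; the other accented letters share the same byte in both encodings, which is why a wrong guess garbles just a few characters rather than the whole page.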

BTW, intl.charset.default is used in other places, not only in nsHTMLDocument (see http://mxr.mozilla.org/mozilla-central/search?string=intl.charset.default ). So I don't mind being innovative and using UTF-8 everywhere, but I would like to know whether this would harm our users.

And what about
#ifdef XP_WIN
   pref("intl.charset.default", "windows-1250");
#endif

in the Czech firefox-l10n.js? This would IMHO make me happy.
Firefox should always expose the same features and settings on all platforms if at all possible. Web content is no different whether you access it on a Windows or a Linux computer, so our settings for how we support it should not differ either.
(In reply to Axel Hecht [:Pike] from comment #5)
> I think Henri has opinions on this.

I indeed have opinions here. :-)

Since browsers need to have different fallback encodings for different locales for historical reasons, it is never right for a Web page to depend on the fallback encoding. For this reason, HTML pages must (as in required by the spec for the page to be valid) always declare the encoding in the HTTP headers, in a meta element or by using a BOM. Thus, correctly authored UTF-8 pages declare their encoding and don't rely on the fallback encoding provided by the browser. The fallback exists for legacy Web content--not to facilitate the authoring of new Web content that depends on the locale of the browser.
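
As a rough illustration of where the fallback sits, here is a heavily simplified Python sketch of the decision order: BOM first, then the HTTP Content-Type header, then a meta element, and only then the locale's fallback. The function name pick_encoding, the regular expressions and the 1024-byte prescan are assumptions made for this sketch; the real algorithm in the HTML spec has more steps and this is not Gecko's code.

```python
import re

# Byte order marks and the encodings they signal.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# Matches charset=... in a Content-Type value or inside a meta tag.
CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
META_RE = re.compile(r"<meta[^>]*charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def pick_encoding(body, content_type, locale_fallback):
    """Return the declared encoding, or locale_fallback if nothing is declared."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    if content_type:
        match = CHARSET_RE.search(content_type)
        if match:
            return match.group(1)
    # Prescan only the start of the document for a meta declaration.
    head = body[:1024].decode("ascii", errors="replace")
    match = META_RE.search(head)
    if match:
        return match.group(1)
    # Undeclared legacy content: the only case where the fallback matters.
    return locale_fallback


print(pick_encoding(b"<p>ahoj</p>", None, "windows-1250"))
```

A correctly authored page never reaches the last line, which is why the fallback can be tuned purely for legacy content.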

If a locale does not have a body of legacy Web content that doesn't declare its encoding, it makes sense to choose the fallback encoding so that it is maximally successful at decoding the pages that users of the localization are most likely to encounter when they venture outside the sphere of their own language on the Web. For example, if a locale doesn't need any specific fallback encoding for legacy reasons and the users are most likely to read English/Spanish/French/etc. content when they read foreign content, the fallback encoding should be set to windows-1252. However, for a localization for a Latin-script-using minority language in Russia it might make sense to set the fallback to windows-1251, assuming that the users are most likely to read Russian legacy pages when they read pages that are not in the language of the localization.

It follows from the above that setting the fallback encoding to UTF-8 was always a localization bug at the time the default was initially set in Firefox. However, making Firefox default to UTF-8 may have caused such a body of legacy content to accumulate that UTF-8 now needs to remain the fallback encoding for that locale.

I think we should have review procedures in place to make sure that localizers don't introduce more locales with the fallback encoding set to UTF-8 and don't introduce more locales that turn on chardet. I think we should also have localization note comments in the relevant file to this effect.

Note that the table in step #9 of http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding is based on data extracted from Firefox localizations, so it would be circular reasoning to appeal to that table as an argument for using UTF-8 as the fallback encoding for a locale without demonstrating that unlabeled UTF-8 legacy content exists in substantial amounts on Web sites targeting the language of that locale.

(In reply to Pavel Franc - Mozilla.cz from comment #7)
> And what about
> #ifdef XP_WIN
>    pref("intl.charset.default", "windows-1250");
> #endif

The fallback encoding exists to cater for legacy Web pages--not for local files. We should not vary the fallback encoding by operating system. Czech Web users see the same Web regardless of their OS, so the Czech Firefox localization should use the fallback most successful for legacy Czech Web content that fails to declare its encoding. Newly-authored Czech Web content should use UTF-8 and declare it.
This is now FIXED (since Firefox 28), because the fallback encoding is no longer localizable. Instead, Gecko itself contains a mapping table from locales to fallbacks so that the mapping for all locales is centralized and maintained by Gecko developers instead of being spread out across localization repos and maintained by localizers.

As far as I know, we've had no complaints from non-Windows users of the Czech localization about aligning the fallback on all platforms with the previous fallback on Windows. That is, it's now windows-1250 on all platforms for the Czech localization.
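
For readers who want a picture of what a centralized mapping amounts to, here is a minimal Python sketch of a locale-to-fallback lookup. It is not Gecko's actual table or code: only the 'cs' row is confirmed in this bug, the other rows and the windows-1252 default are illustrative assumptions, and the name fallback_for_locale is hypothetical.

```python
# Default used when a locale has no legacy-driven fallback (assumption).
DEFAULT_FALLBACK = "windows-1252"

# Language subtag -> fallback encoding.
# Only "cs" is confirmed in this bug; the other rows are illustrative.
FALLBACKS = {
    "cs": "windows-1250",
    "lt": "windows-1257",
    "ja": "Shift_JIS",
    "ru": "windows-1251",
}


def fallback_for_locale(locale):
    """Look up the fallback by language subtag, ignoring region and platform."""
    language = locale.split("-")[0].lower()
    return FALLBACKS.get(language, DEFAULT_FALLBACK)


print(fallback_for_locale("cs"))
print(fallback_for_locale("pt-PT"))
```

The lookup takes no platform argument, so the same locale gets the same fallback on Windows, Mac and Linux.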
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 910192]
Target Milestone: --- → mozilla28