Bug 910192 - Get rid of intl.charset.default as a localizable pref and deduce the fallback from the locale
Status: RESOLVED FIXED
Whiteboard: [qa-]
Product: Core
Classification: Components
Component: HTML: Parser
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: mozilla28
Assigned To: Henri Sivonen (:hsivonen)
Mentors:
Depends on: 918294 934415 1034960
Blocks: 934492 967981
Reported: 2013-08-28 05:15 PDT by Henri Sivonen (:hsivonen)
Modified: 2015-05-05 09:32 PDT
CC: 19 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Gecko patch (14.96 KB, patch)
2013-09-18 01:50 PDT, Henri Sivonen (:hsivonen)
no flags
Gecko patch, v2 (33.31 KB, patch)
2013-09-19 03:04 PDT, Henri Sivonen (:hsivonen)
bzbarsky: review+
Firefox patch (8.32 KB, patch)
2013-09-19 03:04 PDT, Henri Sivonen (:hsivonen)
no flags
Firefox patch, v2 (8.31 KB, patch)
2013-09-19 04:44 PDT, Henri Sivonen (:hsivonen)
gavin.sharp: review+

Description Henri Sivonen (:hsivonen) 2013-08-28 05:15:34 PDT
Leaving the management of the HTML and plain text fallback encoding to localizations isn't quite working: bug 910163, bug 910165, bug 910169, bug 910179, bug 910181, bug 910187, bug 844087, bug 844082, bug 522218, bug 844114.

Instead of expecting localizations to set intl.charset.default correctly, we should make the code that currently reads the pref read the language code of the localization instead, and use a lookup table of known non-windows-1252 locales to choose a fallback encoding.

The problem with only doing what's proposed in the previous paragraph is that it would remove the possibility of using a localization out of locale. For example, it would remove the possibility of configuring an en-US build to use Shift_JIS as the fallback encoding so that the build is suitable for use in Japan. To that end, the current UI for changing intl.charset.default should be replaced with UI for managing a new non-localizable preference for overriding the hard-coded lookup table. It should be possible to set this override only to encodings that also exist in the lookup table (plus windows-1252). The UI labels for these encodings could then be less technical: e.g. Western, Central European (Windows), Central European (ISO), Greek, Turkish, Cyrillic, Traditional Chinese, Simplified Chinese, Korean, Japanese, Thai, etc., since with the exception of Central European, each of these would mean a single legacy encoding.
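A minimal sketch of the lookup idea described above (illustrative only: the function name and the locale subset shown are assumptions, not the actual Gecko table):

```javascript
// Hypothetical sketch of deducing the fallback encoding from the
// localization's language code. Locales not in the table get the
// general default, windows-1252.
const LOCALE_FALLBACKS = new Map([
  ["ja", "Shift_JIS"],
  ["th", "windows-874"],
  ["tr", "windows-1254"],
  ["ru", "windows-1251"],
  ["ko", "EUC-KR"],
  ["zh-TW", "Big5"],
]);

function fallbackForLocale(locale) {
  return LOCALE_FALLBACKS.get(locale) || "windows-1252";
}
```

A non-localizable override pref, when set, would simply take precedence over this table.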
Comment 1 Henri Sivonen (:hsivonen) 2013-08-29 00:53:51 PDT
The override pref menu (in Preferences: Content: Font & Colors: Advanced: Character Encoding for Legacy Content) should probably have these items:
 * Default for Current Locale
 * Arabic
 * Baltic
 * Central European (ISO)
 * Central European (Microsoft)
 * Chinese, Simplified
 * Chinese, Traditional
 * Cyrillic
 * Hebrew
 * Japanese
 * Korean
 * Thai
 * Turkish
 * Other (incl. Western European)

Where "Default for Current Locale" means: Choose which other item on the list to actually use based on the currently active language pack.

The label for "Other" is a bit awkward, but if it were just "Other", I bet some people who should be choosing "Other" would choose one of the Central European options instead. In any case, this short list would be much less of a footgun than our current list. And of course, users who don't mismatch their UI locale and the content they browse never need to touch this setting.
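For illustration, each concrete menu item above would select a single legacy encoding, roughly along these lines (the encodings shown are assumptions based on common regional defaults, not a decided mapping; "Default for Current Locale" would instead resolve via the locale lookup table):

```javascript
// Hypothetical menu-item-to-encoding mapping; only Central European
// needs two variants, which is why it appears twice in the menu.
const MENU_TO_ENCODING = new Map([
  ["Arabic", "windows-1256"],
  ["Baltic", "windows-1257"],
  ["Central European (ISO)", "ISO-8859-2"],
  ["Central European (Microsoft)", "windows-1250"],
  ["Chinese, Simplified", "gbk"],
  ["Chinese, Traditional", "Big5"],
  ["Cyrillic", "windows-1251"],
  ["Hebrew", "windows-1255"],
  ["Japanese", "Shift_JIS"],
  ["Korean", "EUC-KR"],
  ["Thai", "windows-874"],
  ["Turkish", "windows-1254"],
  ["Other (incl. Western European)", "windows-1252"],
]);
```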

(And looks like the HTML spec doesn't special case Greek and we already use windows-1252 for the Greek locale!)
Comment 2 Masatoshi Kimura [:emk] 2013-08-29 02:45:50 PDT
Makes sense.
Comment 3 Axel Hecht [:Pike] 2013-08-29 10:44:19 PDT
I'd rather see us picking the first useful locale entry from the accept-lang list. Quite a few upcoming minority languages are in regions like Russia etc, and if they'd just add Russian to their accept lang, they'd get a good value for their region.

I think the UI language is a much bigger set to get right than the 2nd or 3rd value of accept-lang.
Comment 4 Henri Sivonen (:hsivonen) 2013-08-30 01:06:43 PDT
(In reply to Axel Hecht [:Pike] from comment #3)
> I'd rather see us picking the first useful locale entry from the accept-lang
> list. Quite a few upcoming minority languages are in regions like Russia
> etc, and if they'd just add Russian to their accept lang, they'd get a good
> value for their region.

1) This bug is not about changing the general approach of how we guess the fallback. This is about keeping the general approach that we currently use and that other browsers currently use (guessing the fallback from the UI locale) but moving where the mapping data resides, so that bugs in the mappings can be fixed in one place with the sort of review processes that are needed to land code in Core, instead of chasing bugs spread around localization repositories and explaining the same thing over and over again to different localizers. I think it's not reasonable to refrain from making this improvement to the implementation of our *current* general approach (guessing the fallback from the UI locale) on the grounds that some other, yet unproven, approach *might* be better.

2) The fallback that is needed depends on the legacy content being browsed rather than the user's preferences. Using the user's preferences (be it the choice of UI locale or the list of preferred languages) as a prediction about the content being browsed is fundamentally more of a stretch than using some data that's connected to what's actually being browsed. So I think making more guesses based on the list of preferred languages is missing the point. The piece of information that could be better locale-correlated with the content being browsed, and that is known before we start actually looking at the payload, is the TLD from the URL. Bug 910211 is about exploring that observation as a potential improvement. (But we should fix this bug first, so that we can watch the effect on telemetry on a step-by-step basis and know which steps are actual improvements.)

3) As long as the default list of preferred languages is under the control of localizations, tying the fallback encoding to the list of preferred languages would be a complete failure to solve the problem that this bug is about solving, which is taking control of the fallback encoding away from localizations (in response to the track record of bugs linked from comment 0).

4) Even if one believed that localizers are more likely to get the default list of preferred languages right than to get the fallback encoding right, I think it's not reasonable to expect localizations to consistently succeed at following a rule like including the region code for Russia or the language code for Russian in the list of preferred languages.

5) Since the list of preferred languages is user-configurable, and it's not obvious that the list of preferred languages should have side effects related to the fallback encoding, it seems like a very bad idea to me that if a user located outside the Cyrillic region and reading mostly non-Cyrillic content adds Russian to their list of preferred languages, the browser starts using windows-1251 as the fallback.

6) We don't actually have that many minority languages of Russia in our localization repertoire. We can cross that bridge when we get there. And in any case, even before your comment, I was on the case of making the spec better in terms of rules for Cyrillic languages, including ones that we don't yet have a localization for: https://www.w3.org/Bugs/Public/show_bug.cgi?id=23089 (Minority languages of Russia that use a non-Cyrillic script aren't the trickier case, if one assumes that locale names can't be guaranteed to include "-RU" but that the legacy content the users face is predominantly majority-language Cyrillic legacy content; but as noted, we can cross that bridge when we get there.)

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> (And looks like the HTML spec doesn't special case Greek and we already use
> windows-1252 for the Greek locale!)

https://www.w3.org/Bugs/Public/show_bug.cgi?id=23090
Comment 5 Henri Sivonen (:hsivonen) 2013-09-18 01:50:46 PDT
Created attachment 806506 [details] [diff] [review]
Gecko patch
Comment 6 Henri Sivonen (:hsivonen) 2013-09-19 03:04:10 PDT
Created attachment 807124 [details] [diff] [review]
Gecko patch, v2
Comment 7 Henri Sivonen (:hsivonen) 2013-09-19 03:04:49 PDT
Created attachment 807125 [details] [diff] [review]
Firefox patch
Comment 8 Henri Sivonen (:hsivonen) 2013-09-19 04:20:38 PDT
This patch changes the fallback encoding for the following existing Firefox locales:

Arabic and Farsi: Change from UTF-8 to windows-1256. I checked on the localization mailing list, and the reasons for using UTF-8 as the fallback were not good. The change aligns us with Chrome and IE, both of which have higher market share than Firefox in the region.

Vietnamese: Changes from windows-1252 to windows-1258 to align with Chrome and IE.

Macedonian, Serbian and Kazakh: Change from UTF-8 to windows-1251.

Tswana, Windows & Linux Southern Sotho, Northern Sotho and Armenian: Change from ISO-8859-15 (yes, the ISO-8859-1 modification that was introduced *after* UTF-8!) to windows-1252. (Mac Southern Sotho was already windows-1252!)

Zulu, Venda, Tsonga, Swati, South Ndebele, Georgian and Khmer: Change from UTF-8 to windows-1252.

Croatian: Changes from windows-1252 to windows-1250. (But Albanian stays at windows-1252. Hmm.)

Romanian: Changes from UTF-8 to windows-1252. (IE has windows-1250! Hmm.)

Latvian and Mac and Linux Lithuanian: Change from ISO-8859-13 to windows-1257. (Windows Lithuanian was already.)

Thai: Changes from windows-1252 to windows-874. (It's been trying to do that already but has failed due to per-platform overrides foiling the main value. No wonder Thai shows up in the charset menu stats.)


If we ever introduce localizations for the following languages, they *might* need special treatment, but it's not obvious if they do (research needed!):
Turkmen (windows-1250?)
Azeri (windows-1254?)
Dari (windows-1256?)
Urdu (windows-1256?)
Uyghur (windows-1256? or gbk?)
Comment 9 Henri Sivonen (:hsivonen) 2013-09-19 04:44:00 PDT
Created attachment 807155 [details] [diff] [review]
Firefox patch, v2
Comment 10 Henri Sivonen (:hsivonen) 2013-09-19 04:46:12 PDT
Remaining issues:
 * Figure out how to decouple Thunderbird's default encoding for outgoing mail from the Web fallback encoding.
 * Figure out how to test this on Android.
 * Figure out how to test this on B2G.
Comment 11 Henri Sivonen (:hsivonen) 2013-09-19 23:34:21 PDT
The Thunderbird situation was already okay. That leaves testing on Android and on B2G. I'm told I can't get suitable builds from the tryserver.
Comment 12 Henri Sivonen (:hsivonen) 2013-09-20 05:50:57 PDT
Since the reply I got in dev-platform indicates that this should work on Android and on B2G and setting up multi-locale Android and B2G builds seems like a bad use of time on balance compared to having to back out, let's move forward with this.
Comment 13 Boris Zbarsky [:bz] 2013-09-20 08:08:53 PDT
Comment on attachment 807124 [details] [diff] [review]
Gecko patch, v2

>+      mFallback.EqualsLiteral("UTF-8")) {

Why truncate mFallback in this case?

r=me modulo an answer to that.
Comment 14 Henri Sivonen (:hsivonen) 2013-09-23 01:46:06 PDT
(In reply to Boris Zbarsky [:bz] from comment #13)
> Comment on attachment 807124 [details] [diff] [review]
> Gecko patch, v2
> 
> >+      mFallback.EqualsLiteral("UTF-8")) {
> 
> Why truncate mFallback in this case?
> 
> r=me modulo an answer to that.

That's just to deal with inappropriate values given via about:config. This pref is supposed to take the label of a legacy encoding as its value, and UTF-8 isn't a legacy encoding. Having that check or not having it will probably not matter much either way. However, the check against UTF-16 is more important, since letting UTF-16 act as a fallback would lead to big trouble.
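The kind of guard being discussed might look like this sketch (the function name and return convention are assumptions, not the actual patch):

```javascript
// Sketch: reject non-legacy values that could be set via about:config.
// UTF-8 is not a legacy encoding, and letting a UTF-16 variant act as
// a fallback would be actively dangerous, so both are rejected.
function sanitizeFallback(label) {
  if (label === "UTF-8" || label.startsWith("UTF-16")) {
    return ""; // empty here means "use the locale-derived default"
  }
  return label;
}
```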
Comment 15 Boris Zbarsky [:bz] 2013-09-23 05:49:27 PDT
Right; my question was specifically about the UTF-8 case.
Comment 16 Henri Sivonen (:hsivonen) 2013-09-23 08:18:32 PDT
(In reply to Boris Zbarsky [:bz] from comment #15)
> Right; my question was specifically about the UTF-8 case.

As noted, it just deals with an inappropriate (not a legacy encoding) value that could be introduced via about:config. It probably doesn't matter that much whether such about:config inappropriateness is handled.
Comment 17 Henri Sivonen (:hsivonen) 2013-11-01 08:16:26 PDT
This patch can't break Firefox for Android, because Gecko on Android is stuck to en-US already: bug 933761.
Comment 18 Henri Sivonen (:hsivonen) 2013-11-01 08:49:58 PDT
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> This patch can't break Firefox for Android, because Gecko on Android is
> stuck to en-US already: bug 933761.

Same thing with B2G: bug 933785.

So *this* bug can land after review for the chrome patch, since Fennec and B2G are broken already and landing this can't break them further.
Comment 19 :Gavin Sharp [email: gavin@gavinsharp.com] 2013-11-01 17:15:43 PDT
Comment on attachment 807155 [details] [diff] [review]
Firefox patch, v2

Apologies for the ridiculous delay.
Comment 20 :Gavin Sharp [email: gavin@gavinsharp.com] 2013-11-01 17:17:16 PDT
Comment on attachment 807124 [details] [diff] [review]
Gecko patch, v2

>diff --git a/browser/components/migration/src/SafariProfileMigrator.js b/browser/components/migration/src/SafariProfileMigrator.js

>-        // Default charset migration
>-        this._set("WebKitDefaultTextEncodingName", "intl.charset.default",

Why remove this code, rather than adjusting it to use the new pref?

(seems like this should have been in the Firefox patch)
Comment 21 :Gavin Sharp [email: gavin@gavinsharp.com] 2013-11-01 17:20:29 PDT
Comment on attachment 807124 [details] [diff] [review]
Gecko patch, v2

>diff --git a/toolkit/components/search/nsSearchService.js b/toolkit/components/search/nsSearchService.js

>-  return getLocalizedPref("intl.charset.default", DEFAULT_QUERY_CHARSET);
>+  // Don't bother being fancy about what to return in the failure case.
>+  return "windows-1252";

I vaguely recall there being something peculiar about our definition of ISO-8859-1 vs. windows-1252, but I don't recall the details. Why did you pick this value?
Comment 22 Henri Sivonen (:hsivonen) 2013-11-04 03:12:44 PST
Thanks for the review!

(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #20)
> Comment on attachment 807124 [details] [diff] [review]
> Gecko patch, v2
> 
> >diff --git a/browser/components/migration/src/SafariProfileMigrator.js b/browser/components/migration/src/SafariProfileMigrator.js
> 
> >-        // Default charset migration
> >-        this._set("WebKitDefaultTextEncodingName", "intl.charset.default",
> 
> Why remove this code, rather than adjusting it to use the new pref?
> 
> (seems like this should have been in the Firefox patch)

The population that has explicitly adjusted this pref and would benefit from migration is probably very small. Developing code to distinguish that case would be a significant effort for little gain. Migration without distinguishing that case would mean that we'd hand this area of Web compat to Safari's control. To the extent Safari does what was determined to be the right thing here, there would be no effect either way. To the extent Safari doesn't do what was determined to be the right thing here, we'd let Safari make us do the wrong thing.

(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #21)
> Comment on attachment 807124 [details] [diff] [review]
> Gecko patch, v2
> 
> >diff --git a/toolkit/components/search/nsSearchService.js b/toolkit/components/search/nsSearchService.js
> 
> >-  return getLocalizedPref("intl.charset.default", DEFAULT_QUERY_CHARSET);
> >+  // Don't bother being fancy about what to return in the failure case.
> >+  return "windows-1252";
> 
> I vaguely recall there being something peculiar about our definition of
> ISO-8859-1 vs. windows-1252, but I don't recall the details. Why did you
> pick this value?

ISO-8859-1 is an alias for windows-1252. windows-1252 is the canonical name. I picked this value, because windows-1252 is the de facto legacy fallback and the case above where we reach this return statement already involves a failure on the earlier lines, so it doesn't make sense to be fancier than just to return the most general legacy fallback value.
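The alias relationship works roughly like this sketch (the labels below are a subset of the Encoding Standard's windows-1252 entry; the function name is assumed for illustration):

```javascript
// A few of the labels that the Encoding Standard maps to the canonical
// name windows-1252; ISO-8859-1 is one of them.
const WINDOWS_1252_LABELS = new Set([
  "windows-1252", "iso-8859-1", "iso8859-1", "latin1", "l1",
  "ascii", "us-ascii", "cp1252", "x-cp1252",
]);

function canonicalizeLabel(label) {
  const lower = label.trim().toLowerCase();
  return WINDOWS_1252_LABELS.has(lower) ? "windows-1252" : label;
}
```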
Comment 24 Carsten Book [:Tomcat] 2013-11-04 04:06:30 PST
Sorry, had to back out these changesets because of build bustages/failures on inbound, with errors like https://tbpl.mozilla.org/php/getParsedLog.php?id=30070947&tree=Mozilla-Inbound
Comment 25 Henri Sivonen (:hsivonen) 2013-11-04 05:54:31 PST
A change to #includes had gotten dropped on the floor when rebasing. Relanded:

https://hg.mozilla.org/integration/mozilla-inbound/rev/bbf4142fb81e
https://hg.mozilla.org/integration/mozilla-inbound/rev/d79a832a8cae
Comment 27 Axel Hecht [:Pike] 2013-11-05 07:28:43 PST
(In reply to Henri Sivonen (:hsivonen) from comment #17)
> This patch can't break Firefox for Android, because Gecko on Android is
> stuck to en-US already: bug 933761.

This is not correct; Gecko is indeed localized on Android, you just picked an example that isn't. I don't think you're breaking the multi-locale Android builds, though; getSelectedLocale('global') is made to work in the multi-locale builds for exactly this purpose.

One note: toggling matchOS on and off might not retrigger the cache even though it changes the value.

The Chinese code matching is still brittle regarding BCP 47. I wonder if we'll get to the point of just using ICU for things like this.
Comment 28 Henri Sivonen (:hsivonen) 2013-11-06 05:15:59 PST
(In reply to Axel Hecht [:Pike] from comment #27)
> This is not correct, gecko is indeed localized in Android, you just picked
> an example that isn't.

Before I landed this, it looked like intl.charset.default wasn't localized either.

> I don't think you're breaking the multi-locale
> android builds, though, getSelectedLocale('global') is made to work in the
> multi-locale builds for exactly this purpose.

Are you sure it works? It doesn't seem to work for me on Android.

Can you suggest a way to test that *something* in Gecko is localized when the Fennec UI language isn't en-US?

> The chinese code matching is still brittle regarding bcp47. I wonder if
> we'll get to the point to just use ICU for things like this.

Throughout the existence of Firefox, we've always used zh-TW for Traditional Chinese. How likely is it that we'd ship a localization for Traditional Chinese whose language code wasn't one of the codes this patch checks? Using ICU in case there are language tag components other than region or script seems like overkill. Could be handled as a follow-up, of course, if this is a Real Problem.
Comment 29 Gordon P. Hemsley [:GPHemsley] 2013-11-06 06:19:43 PST
(In reply to Henri Sivonen (:hsivonen) from comment #28)
> (In reply to Axel Hecht [:Pike] from comment #27)
> > The chinese code matching is still brittle regarding bcp47. I wonder if
> > we'll get to the point to just use ICU for things like this.
> 
> Throughout the existence of Firefox, we've always used zh-TW for Traditional
> Chinese. How likely it would be that we'd ship a localization for
> Traditional Chinese whose language code wasn't one of the codes this patch
> checks? Using ICU in case there are other language tag components that
> either region or scripts seems like an overkill. Could be handled as a
> follow-up, of course, if this is a Real Problem.

IIUC, this is mostly to handle legacy content. 'zh-TW', 'zh-HK', and 'zh-MO' seem to me, in this case, to be legacy locale codes (as opposed to full-on language tags). It might be worth it, though, to check if a tag *starts with* (rather than *equals*) 'zh-Hant' to defend against any language tags with both a script and a region subtag.

I would never be opposed to always dealing with language tags using the full BCP47 machinery, but I can see how it might be considered overkill in this case.
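The prefix check suggested above can be sketched like this (illustrative only; the function name is assumed, and this is deliberately simpler than full BCP 47 machinery):

```javascript
// Treat any tag starting with "zh-Hant" (script subtag, with or
// without a following region subtag) the same as the legacy
// zh-TW/zh-HK/zh-MO locale codes.
function isTraditionalChinese(tag) {
  return tag === "zh-TW" || tag === "zh-HK" || tag === "zh-MO" ||
         tag === "zh-Hant" || tag.startsWith("zh-Hant-");
}
```

An exact-equality check on "zh-Hant" alone would miss tags like "zh-Hant-TW", which is the brittleness being pointed out.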
Comment 30 Matt Ruffalo 2014-02-27 19:30:25 PST
I find it bizarre that UTF-8 was removed as a value for this setting. I '''want''' to assume a modern (i.e. UTF-8) encoding for all pages that I view. Any site that doesn't declare an encoding and fails to decode as UTF-8 is *broken*. I liked that Firefox 27 would allow me to see many occurrences of U+FFFD REPLACEMENT CHARACTER on such sites, to remind me to contact these sites' owners to update their encoding declarations.

I have used Firefox since it was called Phoenix (in 2002 or 2003 I think) and then Firebird, and I have never considered switching to a different browser until now.
Comment 31 Henri Sivonen (:hsivonen) 2014-02-27 23:34:59 PST
(In reply to Matt Ruffalo from comment #30)
> I find it bizarre that UTF-8 was removed as a value for this setting.

I did so to avoid users who go to about:config but don't have sufficient familiarity with the issues breaking their browsing experience, because absent sufficient familiarity, UTF-8 looks like an attractive value, even though it's never a sensible value, because UTF-8 isn't a legacy encoding. It's the only non-legacy encoding. Bug 967981 indicates that my expectation of misplaced attractiveness of UTF-8 as a value of this setting was not unfounded.

> I
> '''want''' to assume a modern (i.e. UTF-8) encoding for all pages that I
> view. Any site that doesn't declare an encoding and fails to decode as UTF-8
> is *broken*.

Any site that doesn't declare its encoding is broken. Sites that use UTF-8 and don't declare it are broken, too. Your methodology for detecting broken sites ignores that class of brokenness. Worse, if you create new pages under your configuration, you are at risk of contributing to the problem of unlabeled UTF-8, because you are less likely to notice that class of error.

> I liked that Firefox 27 would allow me to see many occurrences
> of U+FFFD REPLACEMENT CHARACTER on such sites,

That's a highly unusual thing for a user to want. Compare with https://xkcd.com/1172/ .

We generally try to give a non-broken experience to users and surface latent authoring problems to Web developers in the developer tools. For example, we don't make the HTML parser catastrophically fail when encountering HTML syntax errors, but we highlight (mid-level; neither low-level encoding errors nor high-level validity errors) parse errors in View Source. Failure of the top-level document to declare its encoding is reported to the console available in the developer tools. (Failure of framed documents to do so is not reported, because advertising iframes that don't declare their encoding and have no user-visible text, because they only include an image or plug-in, are so common that the error flood would mask the messages that would actually be actionable by a Web developer.)

> to remind me to contact these
> sites' owners to update their encoding declarations.

Thank you for doing that.

> I have used Firefox since it was called Phoenix (in 2002 or 2003 I think)
> and then Firebird, and I have never considered switching to a different
> browser until now.

I hope we don't lose you as a user.
Comment 32 Matt Ruffalo 2014-03-02 16:51:25 PST
(In reply to Henri Sivonen (:hsivonen) from comment #31)
Thank you for your reply, and for doing so more tactfully than I deserved from the tone of my message. I apparently don't take it very well when features I like are removed from software that I've been using for a while, and I apologize.

Of course you're right that sites are also broken if they don't specify an encoding at all, but UTF-8 is very nice in that encoding problems will generally not produce mojibake. I've been using Python 3 exclusively for quite some time, and really started to appreciate its behavior (at least on modern Linux systems) of "assume UTF-8 for text and require an explicit choice of encoding for anything else". I have Thunderbird configured as such, and I liked configuring Firefox this way too.

I would like to suggest allowing UTF-8 for the default legacy encoding through about:config but not through the GUI preferences dialog -- I don't think it's appropriate to second-guess about:config settings given the warning: "Changing these advanced settings can be harmful to the stability, security, and performance of this application. You should only continue if you are sure of what you are doing." If someone changes an encoding setting in about:config, starts to see a lot of REPLACEMENT CHARACTERs, and is unhappy about this, they have only themselves to blame :) (though hopefully this doesn't clog up any of the Mozilla support resources; I don't know whether this is or would be the case.)
Comment 33 Henri Sivonen (:hsivonen) 2014-03-03 02:46:12 PST
(In reply to Matt Ruffalo from comment #32)
> Of course you're right that sites are also broken if they don't specify an
> encoding at all, but UTF-8 is very nice in that encoding problems will
> generally not produce mojibake.

REPLACEMENT CHARACTERs and mojibake are both brokenness from the user perspective.

> I've been using Python 3 exclusively for
> quite some time, and really started to appreciate its behavior (at least on
> modern Linux systems) of "assume UTF-8 for text and require an explicit
> choice of encoding for anything else".

Python 3 is also backwards-incompatible in other ways. With the level of backwards-incompatibility that Python 3 has, it could have gone UTF-8-*only*.

In the case of the Web Platform, we go UTF-8-*only* for new stuff like AppCache manifests or JSON-in-XHR. We don't have that luxury in the case of HTML.

> I would like to suggest allowing UTF-8 for the default legacy encoding
> through about:config but not through the GUI preferences dialog -- I don't
> think it's appropriate to second-guess about:config settings given the
> warning: Changing these advanced settings can be harmful to the stability,
> security, and performance of this application. You should only continue if
> you are sure of what you are doing." If someone changes an encoding setting
> in about:config, starts to see a lot of REPLACEMENT CHARACTERs, and is
> unhappy about this, they have only themselves to blame :) (though hopefully
> this doesn't clog up any of the Mozilla support resources; I don't know
> whether this is or would be the case.)

The line of argument that once you go to about:config, you should be able to break anything is persuasive in theory. In practice, though, there are cases like Hacker News threads telling people to change the sort of encryption settings that Mozilla had previously deemed necessary to reset to defaults, because people had configured the browser to be less secure than the defaults without fully understanding the implications of changing those settings.

In this case specifically, I have the following concerns from less serious to more serious:

 * User setting the value to UTF-8, experiencing brokenness, failing to draw the connection and blaming Firefox.

 * User setting the value to UTF-8, experiencing brokenness and then using the Character Encoding menu more than is normal, thereby skewing telemetry data and making me draw the wrong conclusions about the success of Firefox's encoding-related behavior. (Some locales have relatively few users, so noise even from one user can sway the data notably. And yes, it's probably statistically unsound to reason from data that one user can sway, and, technically, I could build more elaborate checks to filter out "bad" configurations, but that would mean more work to support an unsupported configuration.)

 * User setting the value to UTF-8, authoring some Web pages in UTF-8, failing to declare UTF-8, not noticing this due to the setting thereby adding unlabeled UTF-8 data to the Web everyone else sees.
Comment 34 Jon 2014-03-28 17:28:00 PDT
(In reply to Henri Sivonen (:hsivonen) (Not reading bugmail or doing reviews until 2014-03-31) from comment #31)
> I did so to avoid users who go to about:config but don't have sufficient
> familiarity with the issues breaking their browsing experience, because
> absent sufficient familiarity, UTF-8 looks like an attractive value, even
> though it's never a sensible value, because UTF-8 isn't a legacy encoding.
> It's the only non-legacy encoding. Bug 967981 indicates that my expectation
> of misplaced attractiveness of UTF-8 as a value of this setting was not
> unfounded.
> 

But the setting isn't really about legacy encodings, because it controls what Firefox does when dealing with content that doesn't declare an encoding. The presumption here is that only content written in legacy encodings suffers this problem. The truth is that the same people who didn't know about sending character encoding headers in 1995 are still around, and now they create stuff in UTF-8 instead of Windows-1252.

I will also note that there is no way to set this header for plain text content when you do not control the webserver. HTML may have ways of declaring the encoding inside the document itself, but other human readable document formats do not. (FWIW Chrome seems to default to UTF-8 for plain text content)


Yes, it's a mess. Yes, content should be declaring its encodings. But that was also true years ago, and we all know people couldn't do it right back then. Which is why Firefox also has an option to assume pages really meant Windows-1252 if they say they are ISO-8859-1. So UTF-8 is a legitimate choice for a fallback today. In fact, as I mentioned, Chrome seems to use it as one.


So please reconsider the prohibition on using UTF-8 as the value for this preference. Or at least consider adding a new preference to control the default character encoding for content rendered as plain text since that content may have no way of doing the right thing.
Comment 35 Boris Zbarsky [:bz] 2014-03-28 20:16:58 PDT
> I will also note that there is no way to set this header for plain text content when you
> do not control the webserver

Hmm.  Does a UTF-8 BOM not work here?
Comment 36 Jon 2014-03-28 21:05:21 PDT
I did test this and the UTF-8 BOM does work, but the UTF-8 BOM is not necessary and even undesired in some circumstances. For example, putting the BOM in a shell script breaks it. Also, existing data doesn't have one, which brings me back to the real issue.
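For reference, BOM sniffing amounts to a three-byte check at the start of the stream; a sketch:

```javascript
// Detect the UTF-8 byte order mark (EF BB BF) at the start of a byte
// sequence. Its presence lets a text/plain file decode as UTF-8 even
// without a charset parameter on the Content-Type header.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
         bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}
```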

The setting was about controlling behavior with legacy content and whether or not something is a legacy encoding has nothing to do with that. Despite the fact that it is a modern encoding, UTF-8 is older than Netscape and it is naive to think that there is no legacy content in UTF-8. As I joked before about people in 1995 being bad at declaring encodings, well those people could have been writing documents in UTF-8.


While the selection of UTF-8 may not be the best choice of a default fallback and it may even be chosen for the wrong reasons, it is a useful and reasonable option for those who actually do have to regularly deal with legacy content in a UTF-8 encoding.
Comment 37 Henri Sivonen (:hsivonen) 2014-03-31 23:58:34 PDT
(In reply to Jon from comment #34)
> But the setting isn't really about legacy encodings, because it controls
> what Firefox does when dealing with content that doesn't declare an
> encoding.

If you are creating new content like that, you are contributing to the problem. If you are browsing content like that, I suggest complaining to the author or using the Character Encoding menu. Even if UTF-8 happened to be the most successful fallback *for you*, letting anyone choose it would create a situation when people who choose UTF-8 as the fallback end up authoring more unlabeled UTF-8 Web content instead of noticing that the content is lacking an encoding label (or BOM).

> The presumption here is that only content written in legacy
> encodings suffers this problem. The truth is that the same people who didn't
> know about sending character encoding headers in 1995 are still around and
> now they create stuff in UTF-8 instead of Windows-1252.

UX-wise, it's completely unreasonable for a Web author to expect the user to go tweak this pref in order to be able to read the author's mispublished content. Suppose one author who is unable/unwilling to publish properly uses Windows-1252 and another uses UTF-8. Do you really think it's reasonable to expect the user to go change a setting in a subwindow of a panel of the pref window?

That is, it's even less reasonable to expect users to go change this pref in response to a *particular site* than to expect them to use the Character Encoding menu.

This pref is for dealing with a broader-than-one-site context: e.g. running en-US builds in Japan and, therefore, having a need to fall back to Shift_JIS instead of windows-1252 on a more general basis.

> I will also note that there is no way to set this header for plain text
> content when you do not control the webserver.

Publishing on the Web without control of the server is a really bad idea but, noted, a reality for many authors. Those authors can either use the UTF-8 BOM in text/plain or prefix the content with <meta charset=utf-8><plaintext> and serve it as text/html instead of text/plain. (Or, preferably, get a proper hosting setup.) In any case, publishing undeclared UTF-8 text/plain and expecting users to go change this pref to accommodate it is completely unreasonable, since it degrades compatibility with unlabeled content that assumes the browser-default behavior.
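The two inline workarounds described above can be sketched as follows. This is an illustrative sketch only (the text and filenames are made up); it shows the byte layout each workaround produces, not an endorsed publishing recipe.

```python
# Sketch of the two inline ways to declare UTF-8 for authors who
# cannot set an HTTP charset parameter. Sample text is illustrative.

text = "caf\u00e9 \u2013 unlabeled UTF-8 content"

# Option 1: text/plain prefixed with a UTF-8 BOM (the inline declaration).
bom_payload = "\ufeff".encode("utf-8") + text.encode("utf-8")

# Option 2: serve as text/html; the <meta> declares the encoding and
# <plaintext> switches the parser into plain-text mode for the rest
# of the resource.
html_payload = ("<meta charset=utf-8><plaintext>" + text).encode("utf-8")

print(bom_payload[:3])   # the three UTF-8 BOM bytes: EF BB BF
print(html_payload[:31]) # the inline declaration comes first
```

Note that option 2 changes the Content-Type the resource must be served with, which matters for anything a client downloads and executes rather than reads.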

In general, it's unreasonable to author Web content in a way that requires the user to change browser settings in order to be able to read the content.

> (FWIW Chrome seems to default to UTF-8 for plain text content)

This does not appear to be the case with the default settings. Demo: http://hsivonen.com/test/moz/unlabeled.text . Tested with Chrome 33 on Linux.

(Generally, when making an allegation like that, it's polite to provide a demo page to save others the time to write a demo and go check if the allegation is true.)

(In reply to Jon from comment #36)
> but the UTF-8 BOM is not necessary

It is if you can't add ;charset=utf-8 on the HTTP layer!
Comment 38 Jon 2014-04-01 08:26:13 PDT
As far as HTML content goes, I don't disagree with you at all. HTML has an inline method of setting the character encoding and authors should be using it. The problem is really that plain text content has no real equivalent that simply works. That is part of the reason that I acknowledged that a separate preference for plain text content may be the right solution to the problem.

(In reply to Henri Sivonen (:hsivonen) from comment #37)
> > I will also note that there is no way to set this header for plain text
> > content when you do not control the webserver.
> 
> Publishing on the Web without control of the server is a really bad idea
> but, noted, a reality for many authors. Those authors can either use the
> UTF-8 BOM in text/plain or to prefix the content with <meta
> charset=utf-8><plaintext> and serve as text/html instead of text/plain. 

As I said, a BOM is not possible in file formats like shell scripts where the first characters are already required to be #!. Placing a BOM into them or serving them as text/html would break the ability to download them and run them. BOMs are also not actually recommended as a normal part of a UTF-8 document, so many programs do not add them. As far as UTF-8 goes, BOMs are the oddball rare exception, not the rule.
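The shebang conflict can be demonstrated directly: the kernel treats a file as a script only when its very first two bytes are "#!", and a UTF-8 BOM occupies the first three bytes. A minimal sketch (the script contents are illustrative):

```python
# Why a BOM breaks a shell script: the shebang magic "#!" must sit at
# byte offset 0. A UTF-8 BOM (EF BB BF) pushes it to offset 3.

script = "#!/bin/sh\necho hello\n"

plain = script.encode("utf-8")
with_bom = "\ufeff".encode("utf-8") + plain

print(plain[:2] == b"#!")     # True: a valid shebang
print(with_bom[:2] == b"#!")  # False: the kernel would not exec this
```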

The only other option is a change to the webserver's configuration; at least Apache will allow you to append ";charset=utf8" to anything showing up as text/plain. But the reality is that many people use shared hosting where changes of this sort are out of their control.
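Where the shared host does allow per-directory overrides, the Apache configuration alluded to above can go in an .htaccess file. A sketch using standard Apache directives (the extension mappings are illustrative):

```apache
# Label responses Apache serves as text/plain or text/html with UTF-8.
AddDefaultCharset utf-8

# Or scope the charset to particular extensions only:
AddCharset utf-8 .txt .text
```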

> Do you really think it's reasonable to expect the user to go change a
> setting in a subwindow of a panel of the pref window?

No, I do not. But this was a permitted user override until recently, when you removed it due to some localizations setting it wrong. The HTML5 document you selected your locale-based fallbacks from quite explicitly mentions that while UTF-8 would not be the correct default fallback, it is a perfectly cromulent option in some environments.

http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding

So I agree that those localizations were wrong to set the default fallback to UTF-8; they should be fixed. But forbidding setting the default to UTF-8 at all is just punishing innocent bystanders for the sins of a few localization maintainers and bad webmasters.

> 
> > (FWIW Chrome seems to default to UTF-8 for plain text content)
> 
> This does not appear to be the case with the default settings. Demo:
> http://hsivonen.com/test/moz/unlabeled.text . Tested with Chrome 33 on Linux.
> 


Chrome 33 on OS X 10.9.2 renders your page fine. I'm not aware of any way to change this default (I don't normally use Chrome), but I hit the "Reset settings" button before viewing just in case.

http://i.imgur.com/ECLZtGt.png

>
> (Generally, when making an allegation like that, it's polite to provide a
> demo page so save others the time to write a demo and go check if the
> allegation is true.)
> 

I'm sorry, I do not have access to public resources to serve demo pages. My personal webserver is configured "correctly" and is thus unsuited to the task. Note that the only way to configure it "correctly" was to tell it that all files sent as text/plain should be forced to charset=utf8.
Comment 39 Boris Zbarsky [:bz] 2014-04-01 08:41:16 PDT
> Chrome 33 on OSX 10.9.2 renders your page fine.

Chrome 33 (the same exact version as yours) does not for me on OS X 10.8.5.  

In your View menu in Chrome, under Encoding, is it perhaps set to "Auto Detect"?  If I do that, I get the behavior you describe and the "Reset Settings" button does NOT undo that configuration change.
Comment 40 Jon 2014-04-01 12:14:19 PDT
Yes, I guess that explains that. I pretty much only have Chrome because it's faster to open it up and copy a URL from Firefox than to fiddle with adblock/noscript/etc. when a site that I want to use for a moment doesn't work right.
Comment 41 Jon 2014-04-01 18:00:30 PDT
While playing with as many options as I can find in an attempt to circumvent the explicit ban on the use of UTF-8 as the default character encoding, I have discovered that turning on View -> Character Encoding -> Auto-Detect -> Japanese causes it to detect UTF-8 text and render it properly without needing a BOM in the file.

I have found no particularly harmful side effects so far. The Russian and Ukrainian options do not have this super power.
Comment 42 Henri Sivonen (:hsivonen) 2014-04-03 05:54:55 PDT
(In reply to Jon from comment #38)
> As I said, BOM are not possible in file formats like shell scripts where the
> first characters are already required to be #!. Placing a BOM into them or
> serving them as text/html would break the ability to download them and run
> them.

There probably is a good reason, but it's rather weird to first argue that the hosting setup is technically challenged enough not to allow charset parameter configuration and then turn around and argue that your use case is so technical that you are actually serving shell scripts. That is, one would expect someone who is technical enough to host shell scripts not to stay with a hosting solution that is non-technical enough not to allow proper configuration.

> As far as UTF-8 goes, BOM are
> the oddball rare exception, not the rule.

Yes, but it *is* the *inline* way to declare UTF-8 as the encoding of a text/plain resource.

> But the reality is that many people use shared hosting where
> changes of this sort are out of their control.

If you use shared Apache hosting without .htaccess enabled, you are going to have a bad time. Nothing new there.
 
> > Do you really think it's reasonable to expect the user to go change a
> > setting in a subwindow of a panel of the pref window?
> 
> No, I do not. But this was a permitted user override until recently when you
> removed it due to some localizations setting it wrong.

That the old UI allowed the selection of all encodings that Gecko supported for Web content was always a terrible idea. (The old menu included catastrophically inappropriate values like UTF-16!) The problem of localizations setting the pref value wrong was pretty widespread: so much so that it was futile to try to talk each localizer into fixing their stuff, and the only practical fix was hoisting this stuff into code managed by core Gecko developers.

> The HTML 5 document
> you selected your locale based fallbacks from quite explicitly mentions that
> while UTF-8 would not be the correct default fallback, it is a perfectly
> cromulent option in some environments.
> 
> http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding

The relevant text is: "In controlled environments or in environments where the encoding of documents can be prescribed (for example, for user agents intended for dedicated use in new networks), the comprehensive UTF-8 encoding is suggested."

The Web isn't a controlled environment of that kind, so that sentence doesn't apply to implementations designed for browsing the Web.

> So I agree that those localizations where wrong to set the default fallback
> to UTF-8, they should be fixed. But forbidding setting the default to UTF-8
> at all is just punishing innocent bystanders for the sins of a few
> localization maintainers and bad webmasters.

See above. Also worth noting that adding UTF-8 to the setting would interfere with bug 910211.

From my perspective, you are asking to cater for (a different kind of) bad webmasters.

> My personal webserver is configured "correctly" and is thus unsuited to the
> task.

Mine, too, which is why I had to tweak my server config to map .text to text/plain without a charset parameter.
Comment 43 Oh 2014-04-29 12:50:09 PDT
I have a slightly different use case for the browser: viewing local docs (in both legacy and non-legacy encodings).
Now I must adapt to Fx28+ and insert a BOM/<meta> into each UTF-8 file, or else Firefox should accept UTF-8 as a fallback charset, ugly or not.
Comment 44 Masatoshi Kimura [:emk] 2014-04-30 19:34:53 PDT
Autodetecting UTF-8 only for local files was proposed in bug 815551 but rejected :(
Comment 45 Paul Rubin 2014-05-26 17:20:15 PDT
I have to say I'm pretty accustomed to the experience of looking at a web page, seeing garbled characters, and mucking with the legacy encoding preference until the page is readable.  I found this bug specifically because UTF-8 was removed from the choices.  So I'm trying to read http://www.volokh.com/posts/chain_1240849478.shtml, which is a page from a venerable and popular web site, and characters are messed up because it's written in UTF-8 without declaring that.  The rationale that UTF-8 shouldn't be supported because it's not a legacy character set (i.e. won't occur in legacy pages) is observably invalid.  Yes, that page is broken, but a big part of the task of a mainstream web browser is to deal with broken pages in sensible ways.  Otherwise we'd ditch HTML completely, use XML, and reject any page that didn't conform to some DTD.

I'd also like to support the idea that the about:config settings are not part of the user preferences, but rather, as the name implies, they are part of the browser configuration.  It's generally acceptable that configuration options can sometimes conflict with each other and can break things.  Having a way to detect conflicts or restore defaults is fine but the about:config users should be presumed to know what they are doing.  As a user I'm disappointed by the choices imposed by the UX designers (e.g. they removed the user preference for turning off Javascript) but had felt that about:config at least provided a fallback route where I could reassert my own choices in a less convenient way.  Now apparently UX design is reaching into about:config as well, so I'm feeling like I'm seeing an adversarial relationship between the designers and the users.  This isn't good.  UX design should make the recommended and common settings easy to reach (by surfacing them to preference buttons) but knowledgeable users should still be able to make less-common choices if they want to.
Comment 46 Henri Sivonen (:hsivonen) 2014-06-10 02:28:34 PDT
(In reply to Paul Rubin from comment #45)
> I have to say I'm pretty accustomed to the experience of looking at a web
> page, seeing garbled characters, and mucking with the legacy encoding
> preference until the page is readable.

The pref is *not* for dealing with an individual page. The pref is for dealing with many, many pages at once.

When you need to change the encoding for a single page, the UI you are supposed to use is the Character Encoding menu, which is a submenu of the View menu (press alt-v in the en-US localization on Windows or Linux to access the menu) and, alternatively, available as a toolbar menu button behind the "Customize" option in the hamburger menu.

> UX design should make the recommended and common settings easy to 
> reach (by surfacing them to preference buttons) but knowledgeable 
> users should still be able to make less-common choices if they 
> want to.

This argument isn't persuasive when you are misusing a feature (fallback encoding pref) for a use case for which we already have more easily reachable UI (the Character Encoding menu available via the menubar and via customizing the Australis toolbar).
Comment 47 Angry Bird 2014-07-05 06:32:40 PDT
I am working with multilingual content and want my fallback encoding (at least for plain text files) to be UTF-8. Please put UTF-8 back into the list of encodings to fall back to. Going into the menu and choosing UTF-8 manually every time I reload or open a file in FF is not an option, and neither are BOMs and other workarounds.

I am reverting back to previous version of FF and will not upgrade until I see this fixed.
Comment 48 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2014-07-05 09:24:56 PDT
(In reply to Angry Bird from comment #47)
> I am reverting back to previous version of FF and will not upgrade until I
> see this fixed.

Please file a new bug report referencing *this* bug.
Comment 49 Angry Bird 2014-07-06 04:20:40 PDT
Created Bug 1034960 - Restore menu option to fallback to UTF-8 (or other) encoding in case it is not specified in the content

Note You need to log in before you can comment on or make changes to this bug.