Closed Bug 967981 Opened 6 years ago Closed 6 years ago

Provide some way for addons/prefs to control the default fallback character set

Categories

(Core :: DOM: HTML Parser, defect)

x86
Windows 7
defect
Not set

RESOLVED WORKSFORME

People

(Reporter: mikeperry, Unassigned, NeedInfo)

Details

(Whiteboard: [tor])

In Bug 910192, the pref 'intl.charset.default' was removed. In Tor Browser, we give users the option to spoof the English locale to reduce fingerprinting, and would like to set this pref to UTF-8 across all locales for users who wish to spoof an English locale. We either need this pref, or some other way to specify a default fallback character set.
Firefox-Help Documentation is definitely the wrong component...
Component: Help Documentation → HTML: Parser
Depends on: 910192
Product: Firefox → Core
Version: unspecified → Trunk
You can use a new unlocalizable pref "intl.charset.fallback.override" instead.
Whiteboard: [tor]
Mike, doesn't "intl.charset.fallback.override" work?
Flags: needinfo?(mikeperry)
(In reply to Mike Perry from comment #0)
> In Tor Browser,
> we give users the option to spoof the English locale to reduce
> fingerprinting, and would like to set this pref to UTF-8 across all locales
> for users who wish to spoof an English locale.

The patch that removed the old pref added a new one: intl.charset.fallback.override . To get the same behavior as in the en-US locale regardless of the actual locale, you need to set the pref to "windows-1252". (The way to reach this configuration from the UI is to select "Other (incl. Western European)" in Preferences: Content: Advanced: Fallback Character Encoding.)

Setting the pref to UTF-8 would be contrary to your goal, since no locale uses UTF-8 as the fallback (because UTF-8 is not a legacy encoding but the only non-legacy one), so having UTF-8 as the fallback would make the user more fingerprintable by having a weird configuration. Since I predicted that people who aren't appropriately familiar with the purpose of the setting might try to set it to UTF-8 via about:config, the code explicitly ignores the pref value if it designates UTF-8.
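
To illustrate the behavior described above, here is a minimal Python sketch of the decision, not the actual Gecko code. The function name `resolve_fallback`, the small `LOCALE_FALLBACKS` table, and the windows-1252 default for unlisted locales are assumptions made for illustration; the real implementation resolves encoding labels properly, whereas this sketch only recognizes two spellings of UTF-8.

```python
# Hypothetical, heavily simplified locale-to-fallback table (illustration only).
LOCALE_FALLBACKS = {
    "en-US": "windows-1252",
    "ja": "Shift_JIS",
    "ru": "windows-1251",
}


def resolve_fallback(override_pref: str, locale: str) -> str:
    """Pick the fallback encoding for pages that declare no encoding.

    override_pref stands in for the value of intl.charset.fallback.override.
    An empty value, or a value that designates UTF-8, is ignored and the
    locale-based default is used instead.
    """
    label = override_pref.strip()
    if label and label.lower() not in ("utf-8", "utf8"):
        return label
    # Assumption: unlisted locales fall back to windows-1252 in this sketch.
    return LOCALE_FALLBACKS.get(locale, "windows-1252")
```

Under these assumptions, `resolve_fallback("windows-1252", "ja")` yields the en-US behavior in a Japanese locale, while `resolve_fallback("UTF-8", "ja")` is ignored and returns the locale default.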

Note that there are two other things that depend on localization and affect the behavior of the HTML parser. (Maybe you have these covered, but I'm mentioning these just in case.) To get the en-US behavior regardless of locale, you should also set the pref intl.charset.detector to the empty string and de-localize the string IsIndexPromptWithSpace in HtmlForm.properties.
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Hi-

I find it perplexing that I can't set UTF-8 as my default encoding. It is certainly not a legacy encoding, but this setting helps me identify legacy pages that need to specify their encoding. I *want* to see many occurrences of U+FFFD REPLACEMENT CHARACTER when a page doesn't specify an encoding and sends non-ASCII bytes.

I had previously set my fallback encoding to UTF-8, but since this is broken in Firefox 28, I suppose I'll have to continue using Firefox 27 for the foreseeable future.
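
The effect described in the comment above can be reproduced outside the browser. The following Python sketch decodes the same legacy-encoded bytes once as UTF-8 and once as windows-1252; the sample string is an arbitrary example, and the browser's decoder may differ in detail from Python's.

```python
# Bytes as a legacy page without an encoding declaration might send them.
legacy_bytes = "café".encode("windows-1252")

# Decoding as UTF-8 turns the invalid byte into U+FFFD REPLACEMENT CHARACTER,
# which makes the missing declaration easy to spot.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# Decoding with the legacy fallback silently produces the intended text.
as_windows_1252 = legacy_bytes.decode("windows-1252")

print(repr(as_utf8))
print(repr(as_windows_1252))
```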
(In reply to Matt Ruffalo from comment #5)
> Hi-
> 
> I find it perplexing that I can't set UTF-8 as my default encoding. It is
> certainly not a legacy encoding, but this setting helps me identify legacy
> pages that need to specify their encoding. I *want* to see many occurrences
> of U+FFFD REPLACEMENT CHARACTER when a page doesn't specify an encoding and
> sends non-ASCII bytes.

Seconded, based on <https://unix.stackexchange.com/q/308469/3645>.
(In reply to Victor Engmark from comment #6)
> (In reply to Matt Ruffalo from comment #5)
> > Hi-
> > 
> > I find it perplexing that I can't set UTF-8 as my default encoding. It is
> > certainly not a legacy encoding, but this setting helps me identify legacy
> > pages that need to specify their encoding. I *want* to see many occurrences
> > of U+FFFD REPLACEMENT CHARACTER when a page doesn't specify an encoding and
> > sends non-ASCII bytes.
> 
> Seconded, based on <https://unix.stackexchange.com/q/308469/3645>.

For that use case, changing the fallback is the wrong answer. The right answer is detecting UTF-8 for file: URLs. (I'm working on relevant infrastructure changes.) Failing that, in the meantime, a hidden pref to change the fallback for file: URLs only would make more sense than letting a file:-motivated change leak to the Web.
(In reply to Henri Sivonen (:hsivonen) from comment #7)
> (In reply to Victor Engmark from comment #6)
> > Seconded, based on <https://unix.stackexchange.com/q/308469/3645>.
> 
> For that use case, changing the fallback is the wrong answer. The right
> answer is detecting UTF-8 for file: URLs. (I'm working on relevant
> infrastructure changes.)

Sounds like a good idea. But why single out file: URLs? Surely such a change should apply to *any* request which does not receive encoding information at all.

> Failing that, in the mean time a hidden pref to
> change the fallback for file: URLs only would make more sense than letting a
> file:-motivated change leak to the Web.

Again, it would be good to have this option, but why single out file: URLs?
(In reply to Victor Engmark from comment #8)
> But why single out file: URLs?

Local files are available in their entirety at the start of the parse.
(In reply to Henri Sivonen (:hsivonen) from comment #9)
> Local files are available in their entirety at the start of the parse.

Is it obviously true when we try to avoid main thread synchronous disk I/O and when we even try to race network and disk cache?
(In reply to Masatoshi Kimura [:emk] from comment #10)
> (In reply to Henri Sivonen (:hsivonen) from comment #9)
> > Local files are available in their entirety at the start of the parse.
> 
> Is it obviously true when we try to avoid main thread synchronous disk I/O

I mean it should be safe enough to assume that file: URLs point to finite data streams whose end is reachable fast (within the bounds of disk I/O speed). Therefore, it's reasonable to assume that all the bytes can be read before displaying anything, without potentially stalling forever while trying to do so. (There are probably "fun" mounted-as-file-system counterexamples...)

I'm not suggesting doing synchronous disk I/O.
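
As a rough illustration of why having all the bytes matters, here is a minimal Python sketch of whole-stream UTF-8 detection. It is not the planned Gecko implementation: the function name `pick_encoding` and the default fallback are assumptions, and the sketch says nothing about how the bytes are read (asynchronously or otherwise).

```python
def pick_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Return "UTF-8" if the complete byte stream is valid UTF-8.

    Otherwise return the legacy fallback. The check is only conclusive
    because the whole stream is available: a prefix of a network response
    could be pure ASCII while later bytes are not valid UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "UTF-8"


utf8_page = "<p>naïve café</p>".encode("utf-8")
legacy_page = "<p>naïve café</p>".encode("windows-1252")

print(pick_encoding(utf8_page))
print(pick_encoding(legacy_page))
```

Note that pure ASCII input also passes the check, which is harmless here because ASCII bytes decode identically in UTF-8 and in windows-1252.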

> and when we even try to race network and disk cache?

I think file: URL special-casing shouldn't be applied to data coming out of the disk cache.