Open Bug 1485258 Opened 6 years ago Updated 10 months ago

When privacy.spoof_english is true, don't reveal locale by charset fallback

Categories

(Core :: DOM: HTML Parser, enhancement, P3)

People

(Reporter: arthur, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tor 20025][fingerprinting][fp-triaged])

In Tor Browser, we want to make sure that the locale is not revealed to content when "privacy.spoof_english" is enabled. But in Firefox, the fallback character encoding can depend on the user's locale. See https://dxr.mozilla.org/mozilla-esr60/source/dom/encoding/FallbackEncoding.h#34. So we'd like to make sure the fallback behavior is the same, regardless of locale, when "privacy.spoof_english" is enabled.
We should also examine the behavior of the following two prefs, that are set by torbutton:
pref("intl.accept_charsets", "iso-8859-1,*,utf-8");
pref("intl.charsetmenu.browser.cache", "UTF-8");
(In reply to Arthur Edelstein (Tor Browser dev) [:arthuredelstein] from comment #1)

> We should also examine the behavior of the following two prefs, that are set
> by torbutton:
> pref("intl.accept_charsets", "iso-8859-1,*,utf-8");
> pref("intl.charsetmenu.browser.cache", "UTF-8");

I can't find either of these on searchfox. I'm pretty sure I've personally removed the latter one.
Priority: -- → P3
(In reply to Henri Sivonen (:hsivonen) from comment #2)

> I can't find either of these on searchfox. I'm pretty sure I've personally removed the latter one.

Neither is found in the ESR-60 source either: https://dxr.mozilla.org/mozilla-esr60/source/
Assignee: nobody → xeonchen
Whiteboard: [tor] → [tor][fingerprinting][fp-triaged]

This appears to be controlled by https://searchfox.org/mozilla-central/source/dom/encoding/FallbackEncoding.h and seems fairly straightforward to avoid doing any locale-based decisions if spoof_english is true. The hard part is probably writing the test.

Assignee: xeonchen → nobody

(In reply to Tom Ritter [:tjr] from comment #4)

> This appears to be controlled by https://searchfox.org/mozilla-central/source/dom/encoding/FallbackEncoding.h and seems fairly straightforward to avoid doing any locale-based decisions if spoof_english is true.

More specifically, if spoof_english is set, setting mFallback to WINDOWS_1252_ENCODING should take precedence over this block:
https://searchfox.org/mozilla-central/source/dom/encoding/FallbackEncoding.cpp#69-79
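
The precedence described above, and the kind of test it would need, can be sketched as a small Python model. This is not Gecko code: `pick_fallback`, `distinct_fallbacks`, and the two-entry locale table are hypothetical names and data chosen for illustration (the two mappings are the examples mentioned elsewhere in this bug), and the real locale table in FallbackEncoding.cpp is larger.

```python
WINDOWS_1252 = "windows-1252"

# Illustrative subset only; assumed here for the sketch, not Gecko's table.
ILLUSTRATIVE_LOCALE_FALLBACKS = {"ja": "Shift_JIS", "zh-TW": "Big5"}


def pick_fallback(ui_locale: str, spoof_english: bool) -> str:
    """Model of the fallback choice with the spoofing check taking precedence."""
    if spoof_english:
        # Checked before any locale-based decision is made.
        return WINDOWS_1252
    return ILLUSTRATIVE_LOCALE_FALLBACKS.get(ui_locale, WINDOWS_1252)


def distinct_fallbacks(locales, spoof_english: bool) -> set:
    """Collect every fallback that content could observe across locales."""
    return {pick_fallback(locale, spoof_english) for locale in locales}
```

A test along these lines would assert that `distinct_fallbacks(...)` contains exactly one value when spoofing is on, for every UI locale the build ships.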

Tor Ticket: https://trac.torproject.org/projects/tor/ticket/20025

When a document, or its server, fails to declare a charset, the page falls back to the following setting: General>Language and Appearance>Fonts and Colors>Advanced>Text Encoding for Legacy Content. The default is to fall back to the "current locale" based on your "app language" (the pref value is blank). For non-en-US app languages, this can create entropy depending on the language. You can test this on https://hsivonen.com/test/moz/check-charset.htm. For example, Japanese will reveal Shift_JIS, Traditional Chinese will reveal Big5, etc.
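
To make the leak concrete, here is a minimal Python sketch of how the same undeclared bytes decode differently under different fallbacks, which is what lets content infer the fallback. It is an assumption-laden illustration, not how the check-charset page is implemented: `render`, `infer_fallback`, `CANDIDATES`, and `PROBE` are hypothetical names, and Python's codecs stand in for the browser's decoders.

```python
# Candidate fallback encodings named in this bug, as Python codec names.
CANDIDATES = ["shift_jis", "big5", "cp1252"]


def render(raw: bytes, fallback: str) -> str:
    """Decode undeclared bytes the way a given fallback encoding would."""
    return raw.decode(fallback, errors="replace")


def infer_fallback(raw: bytes, observed: str):
    """Return the first candidate whose decoding matches the rendered text."""
    for name in CANDIDATES:
        if render(raw, name) == observed:
            return name
    return None


# Probe bytes: Japanese text encoded as Shift_JIS, served with no charset.
PROBE = "テスト".encode("shift_jis")
```

A page that serves `PROBE` without a charset and reads the resulting text back can compare it against each candidate decoding, which is the kind of signal this bug is about.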

Solution: set intl.charset.fallback.override = windows-1252 when privacy.spoof_english == 2, and reset it when privacy.spoof_english !== 2

(In reply to Simon Mainey from comment #6)

> Solution: set intl.charset.fallback.override = windows-1252 when privacy.spoof_english == 2, and reset it when privacy.spoof_english !== 2

A less brittle solution would be to check privacy.spoof_english where intl.charset.fallback.override is checked and make privacy.spoof_english take precedence.
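
The difference between the two approaches can be sketched in Python with a plain dict standing in for the pref store. Everything here is a hypothetical model: `sync_override`, `fallback_without_check`, `fallback_with_check`, and the locale table are illustrative names and data, not Gecko APIs, and the scenario in the usage note is only one reading of why syncing the override pref might be considered brittle.

```python
WINDOWS_1252 = "windows-1252"
SPOOF_ENABLED = 2  # the value this bug uses for "spoofing on"
OVERRIDE = "intl.charset.fallback.override"
SPOOF = "privacy.spoof_english"

# Illustrative subset only; assumed for the sketch.
LOCALE_FALLBACKS = {"ja": "Shift_JIS", "zh-TW": "Big5"}


def sync_override(prefs: dict) -> None:
    """Proposal from comment 6: mirror spoof_english into the override pref."""
    if prefs.get(SPOOF) == SPOOF_ENABLED:
        prefs[OVERRIDE] = WINDOWS_1252
    else:
        prefs.pop(OVERRIDE, None)  # reset; also drops any user-chosen value


def fallback_without_check(prefs: dict, ui_locale: str) -> str:
    """Reader that only knows about the override pref."""
    return prefs.get(OVERRIDE) or LOCALE_FALLBACKS.get(ui_locale, WINDOWS_1252)


def fallback_with_check(prefs: dict, ui_locale: str) -> str:
    """Reader where spoof_english takes precedence over the override pref."""
    if prefs.get(SPOOF) == SPOOF_ENABLED:
        return WINDOWS_1252
    return fallback_without_check(prefs, ui_locale)
```

In this model, if the override pref is changed after the sync has run while spoofing is still on, `fallback_without_check` follows the changed value, whereas `fallback_with_check` keeps returning windows-1252.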

Whiteboard: [tor][fingerprinting][fp-triaged] → [tor 20025][fingerprinting][fp-triaged]

(In reply to Henri Sivonen (:hsivonen) (not reading bugmail until 2020-08-03) from comment #7)

> A less brittle solution would be to check privacy.spoof_english where intl.charset.fallback.override is checked and make privacy.spoof_english take precedence.

Any chance of nudging this, now that intl.charset.fallback.override has been deprecated (Bug 1603712)?

AFAICT bug 1603712 made this bug moot except for display of non-ASCII file paths in FTP directory listings. I'm unsure if it's possible for Web sites to scrape those. Without testing, it seems to me that the ftp scheme should at least block scraping from an http or https page.

AFAICT, the issue will go away entirely with bug 1647898.

FTP is disabled in Bug 1691890 - can we confirm the leak is resolved now?

^ Or I guess it's not really resolved, as users could still toggle the pref. Never mind. Guess we'll wait for the FTP code to get ripped out :)

Severity: normal → S3

FallbackEncoding.h doesn't exist anymore, but I am not totally sure if that means that the characterSet is never based on the language now. I tried tracing Document::SetDocumentCharacterSet and didn't immediately see anything related to the language, but I was by no means exhaustive.

Something else I noticed is that Document::RecomputeLanguageFromCharset calls RecomputeLanguageFromCharset, which seems to use the LocaleLanguage, but that might be unproblematic and unrelated to this bug anyway.

Henri, would you mind updating us with the current status when you are back?

Flags: needinfo?(hsivonen)

For text/html and text/plain, the UI locale is no longer used as an input to determining the character encoding. When the character encoding is not declared, the content of the text/html or text/plain stream and, possibly, the top-level domain are used for guessing the encoding, so neither depends on user-side configuration.

I'm not aware of UI locale-dependent encoding determination other than bug 1824325.

(In reply to Tom Schuster (MoCo) from comment #13)

> Something else I noticed is that Document::RecomputeLanguageFromCharset calls RecomputeLanguageFromCharset, which seems to use the LocaleLanguage, but that might be unproblematic and unrelated to this bug anyway.

That's indeed unrelated but still problematic.

What that does is guess the language (group) of the page for font selection purposes in the absence of explicit lang="..." tagging. For example, if the encoding is Shift_JIS, you get Japanese glyph forms for ideographs whose preferred glyph details vary by locale. The problem here is that if the encoding isn't locale-affiliated (e.g. UTF-8 isn't locale-affiliated), the UI language participates in font selection (e.g. if the UI language is Japanese, you get Japanese glyph forms). (There are probably non-CJK examples where the difference affects glyph metrics and, therefore, line breaks.)

So preventing a UI locale leak on that point is relevant, but out of scope for this bug.
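
The font-selection behavior described in this comment can be summarized as a rough Python sketch. It is a simplified model rather than the actual Gecko logic: `font_language` and the `ENCODING_LANGUAGE` table are hypothetical, and the Big5 entry is an assumption added alongside the Shift_JIS example given above.

```python
# Illustrative encoding-to-language affiliations; assumed for the sketch.
ENCODING_LANGUAGE = {"Shift_JIS": "ja", "Big5": "zh-TW"}


def font_language(lang_attr, encoding: str, ui_language: str) -> str:
    """Pick the language used for font selection, in the order described above."""
    if lang_attr:
        # Explicit lang="..." tagging wins.
        return lang_attr
    affiliated = ENCODING_LANGUAGE.get(encoding)
    if affiliated:
        # Locale-affiliated encoding implies a language.
        return affiliated
    # Encoding such as UTF-8 says nothing, so the UI language participates.
    return ui_language
```

In this model the last branch is the leak: for an untagged UTF-8 page, the result varies with `ui_language`, and that can in turn change glyph forms and metrics.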

Flags: needinfo?(hsivonen)