Closed Bug 1054739 Opened 10 years ago Closed 9 years ago

Reduce HTTP Accept-Language Entropy

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: karl156, Assigned: awake)

References

(Blocks 2 open bugs)

Details

(Keywords: dev-doc-complete, privacy, site-compat, Whiteboard: [fingerprinting])

Attachments

(1 file, 2 obsolete files)

When switching language via UI the Accept-Language Header is altered.

For example the default German Accept-Language String is this:
"Accept-Language: de,en-US;q=0.7,en;q=0.3"
When playing with it in the UI, even after resetting, I get this:
"Accept-Language: de,en-us;q=0.7,en;q=0.3"
Note the switch from uppercase to lowercase country-code.

To reset this I had to reset "intl.accept_languages" in about:config

This unnecessarily increases fingerprintability.

Steps to reproduce:
1. Note original Accept-Language (uppercase country-code)
2. Go to Preferences - Languages
3. Hit "up" and "down"
4. See altered Accept-Language (lowercase country-code)
Keywords: privacy
Whiteboard: [fingerprinting]
Version: 31 Branch → Trunk
I'd like to give this bug a shot. Could somebody assign it to me?
Assignee: nobody → bugzillafx
The Accept-Language header is created from the intl.accept_languages string in about:config. As such, the capitalization of that string is reflected in the request header.

Changing the preferred languages in the UI always sets intl.accept_languages to lowercase.

Also, different localizations define intl.accept_languages differently: some have uppercase regions, while others do not. See https://mxr.mozilla.org/l10n-central/search?string=intl.accept_languages&find=intl.properties. A quick check of Internet Explorer and Chrome show that they use uppercase regions.

Possible solutions:
1. Make intl.accept_languages consistent among all locales. Modify the behavior of setting the preferred languages in the UI.
2. Change the header to be consistent prior to making the request, so that the case does not matter.

The first solution looks to be the better, but longer, approach.

Does anyone have any thoughts?
This is a preliminary patch that fixes the problem of inconsistent capitalization of the Accept-Language header for locales with uppercase regions.

Locales with a default lowercase language code were not affected since preference changes through the UI automatically sets the new accepted languages to lowercase. Their default intl.accept_languages string will need to be modified to match the new capitalization. That would mean patching roughly half or more of the 104 locales in l10n-central.
I think normalizing the string automatically in the netwerk code actually setting the HTTP header would be largely preferable.
This patch normalizes the Accept-Language header by making the language code lowercase and the region code uppercase (everything after the hyphen).
Attachment #8526803 - Attachment is obsolete: true
Attachment #8527305 - Flags: review?(dao)
I think the patch works slightly different to "navigator.language" and "navigator.languages":

intl.accept_languages;DE-DE, EN-US, EN
navigator.language  - DE-DE
navigator.languages - DE-DE,EN-US,EN

intl.accept_languages;de-de, en-us, en
navigator.language  - de-DE
navigator.languages - de-DE,en-US,en

These tests are somewhat artificial, but as you can see the language token remains untouched while the county token is converted to uppercase.

This is not a suggestion to change the patch. I just wanted to bring it to your attention.
Attachment #8527305 - Flags: review?(dao) → review?(mcmanus)
Thanks for bringing that up. The navigator.languages code makes uppercase subtags after the first that are two letters long. All other tags are left unmodified.

With the latest patch, the formatting of the language tags for special cases is different in navigator.languages and in the Accept-Language request header, as pointed out. If the preferred languages were set from the UI, the result should be the same. (The list of available languages for selection in the UI is from https://mxr.mozilla.org/mozilla-central/source/intl/locale/language.properties and all the subtags there are two letters.)

https://tools.ietf.org/html/rfc5646 describes conventions for formatting the language tags:

>   o  [ISO639-1] recommends that language codes be written in lowercase
>      ('mn' Mongolian).
> 
>   o  [ISO15924] recommends that script codes use lowercase with the
>      initial letter capitalized ('Cyrl' Cyrillic).
> 
>   o  [ISO3166-1] recommends that country codes be capitalized ('MN'
>      Mongolia).
> 
>   An implementation can reproduce this format without accessing the
>   registry as follows.  All subtags, including extension and private
>   use subtags, use lowercase letters with two exceptions: two-letter
>   and four-letter subtags that neither appear at the start of the tag
>   nor occur after singletons.  Such two-letter subtags are all
>   uppercase (as in the tags "en-CA-x-ca" or "sgn-BE-FR") and four-
>   letter subtags are titlecase (as in the tag "az-Latn-x-latn").

Perhaps this formatting convention can be used in navigator.languages and the Accept-Language request header, or when setting the intl.accept_languages string. Either one of these would be an improvement.
Comment on attachment 8527305 [details] [diff] [review]
Normalize language codes in the netwerk code

Review of attachment 8527305 [details] [diff] [review]:
-----------------------------------------------------------------

This changes the content of an accept header - so its really gerv's module that needs to ok it. I'm fine with the c++ changes in it, thanks for letting me see it.

::: netwerk/protocol/http/nsHttpHandler.cpp
@@ +1527,5 @@
> +    {
> +        if (*code == '-') {
> +            is_region = true;
> +        } else {
> +            if (is_region)

brace the if/else bits please
Attachment #8527305 - Flags: review?(mcmanus) → review?(gerv)
Comment on attachment 8527305 [details] [diff] [review]
Normalize language codes in the netwerk code

Review of attachment 8527305 [details] [diff] [review]:
-----------------------------------------------------------------

This Wikipedia article:
http://www.wikiwand.com/en/IETF_language_tag
explains all the different parts of the language code which may be present, and the appropriate case for each.

If we have to cope with arbitrary language codes, then this code will need enhancing somewhat. It doesn't deal with e.g. "nan-Hant-TW" correctly.

If we only have to cope with codes which are defined locales, that is a smaller set. But AIUI, people can enter their own codes in the UI?

Gerv
Attachment #8527305 - Flags: review?(gerv) → review-
(In reply to Gervase Markham [:gerv] from comment #9)
> If we only have to cope with codes which are defined locales, that is a
> smaller set. But AIUI, people can enter their own codes in the UI?

No, they can selected languages from a list we provide.
Comment on attachment 8527305 [details] [diff] [review]
Normalize language codes in the netwerk code

Review of attachment 8527305 [details] [diff] [review]:
-----------------------------------------------------------------

OK. If we are sure all the possible inputs to this function are of the simple form that it copes with, then it's OK with me to canonicalise Accept-Language to xx-YY.

Gerv
Attachment #8527305 - Flags: review- → review+
Added braces.
Renamed function from NormalizeLanguageCode to NormalizeLanguageTag.
Attachment #8527305 - Attachment is obsolete: true
Attachment #8528297 - Flags: review?(gerv)
Comment on attachment 8528297 [details] [diff] [review]
Normalize language tags when setting the Accept-Language header. v2

Review of attachment 8528297 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not the right person to review this from a code perspective. It's fine otherwise.

Gerv
Attachment #8528297 - Flags: review?(gerv)
Attachment #8528297 - Flags: review?(mcmanus)
Comment on attachment 8528297 [details] [diff] [review]
Normalize language tags when setting the Accept-Language header. v2

Review of attachment 8528297 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8528297 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/a426024d5d15
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
This is an Internationalization change that I and others should have been made aware of and consulted on.

Here is a list of sample language tags in their canonical form to be used for testing:

https://github.com/GPHemsley/BCP47/blob/master/test_tags.txt
Blocks: bcp47
Can we back this out? The code logic violates the standard, and doesn't match our code behavior, notably, if a sr would set their default to sr-Latn, sr, en for example, the upper-casing of the thing after the - doesn't work.
dao - based on 17-19 do you think this should come out of 36 and 37?
Flags: needinfo?(dao)
My understanding is that comment 18 is referring to hypothetical cases that were already discussed earlier in this bug. It's probably a good idea to take care of them to be future-proof, but unless some locales are affected by this today (e.g. can sr-Latn be selected from the UI?) I don't see the immediate need to back this out.
Flags: needinfo?(dao)
I've created a patch to address the canonicalization of the language tag in the HTTP request header. Should it be posted here or in a separate bug?
(In reply to awake [:awake] from comment #22)
> I've created a patch to address the canonicalization of the language tag in
> the HTTP request header. Should it be posted here or in a separate bug?

New bug, please. Thanks!
(In reply to Dão Gottwald [:dao] from comment #21)
> My understanding is that comment 18 is referring to hypothetical cases that
> were already discussed earlier in this bug. It's probably a good idea to
> take care of them to be future-proof, but unless some locales are affected
> by this today (e.g. can sr-Latn be selected from the UI?) I don't see the
> immediate need to back this out.

Locales set their default Accept-Language header explicitly, outside of the UI. In addition, anyone can set it via about:config, which is also outside of the UI.

This is Bad Code™. We already have enough of this faulty algorithm in our codebase. New code should not be duplicating it, especially when there is already a standardized algorithm available to do these sorts of things.

This should be backed out so it doesn't accidentally get left in the codebase forever.
Blocks: 1108183
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: FIXED → ---
(In reply to Gordon P. Hemsley [:GPHemsley] from comment #25)
> (In reply to Dão Gottwald [:dao] from comment #21)
> > My understanding is that comment 18 is referring to hypothetical cases that
> > were already discussed earlier in this bug. It's probably a good idea to
> > take care of them to be future-proof, but unless some locales are affected
> > by this today (e.g. can sr-Latn be selected from the UI?) I don't see the
> > immediate need to back this out.
> 
> Locales set their default Accept-Language header explicitly, outside of the
> UI.

Sure, what I said and the question I asked don't contradict this.

> In addition, anyone can set it via about:config, which is also outside
> of the UI.

Things can break when messing with about:config. about:config isn't for end users, so no reason to set off the alarm.

> This should be backed out so it doesn't accidentally get left in the
> codebase forever.

We file bugs on things so we don't forget about them.
No longer blocks: 1108183
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Depends on: 1108183
Resolution: --- → FIXED
Here's a way to 100% reliably break standards without interacting with prefs or even UI:

Start Firefox OS 2.2, Simulator will work, and select either version of Serbian. The first value in accept-languages will either be sr-Latn or sr-Cyrl, which should be normalized exactly like this.
(In reply to Dão Gottwald [:dao] from comment #26)
> (In reply to Gordon P. Hemsley [:GPHemsley] from comment #25)
> > (In reply to Dão Gottwald [:dao] from comment #21)
> > > My understanding is that comment 18 is referring to hypothetical cases that
> > > were already discussed earlier in this bug. It's probably a good idea to
> > > take care of them to be future-proof, but unless some locales are affected
> > > by this today (e.g. can sr-Latn be selected from the UI?) I don't see the
> > > immediate need to back this out.
> > 
> > Locales set their default Accept-Language header explicitly, outside of the
> > UI.
> 
> Sure, what I said and the question I asked don't contradict this.

My point is that it's not hypothetical. Locales can already do this now, and I expect some already have. And when they do, it shows up in the UI.

> > This should be backed out so it doesn't accidentally get left in the
> > codebase forever.
> 
> We file bugs on things so we don't forget about them.

Right, because it's not like there are any bugs on file from 14 years ago.

This is not an urgent issue. There is no reason to commit broken code to fix it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I agree that if this code is known-broken, it should be backed out. The aim of reducing entropy in Accept-Language is not worth us sending badly-case-folded language tags. I rescind my approval of this patch.

Gerv
To my knowledge, this still has not been backed out.
Flags: needinfo?(philringnalda)
Flags: needinfo?(dao)
Your knowledge of that appears to me to be correct, though it seems a lot easier to just look at https://hg.mozilla.org/integration/fx-team/log/default/netwerk/protocol/http/nsHttpHandler.cpp to see whether or not there's a push which backed it out that wasn't mentioned in the bug, rather than asking me.
Flags: needinfo?(philringnalda)
(In reply to Phil Ringnalda (:philor) from comment #31)
> Your knowledge of that appears to me to be correct, though it seems a lot
> easier to just look at
> https://hg.mozilla.org/integration/fx-team/log/default/netwerk/protocol/http/
> nsHttpHandler.cpp to see whether or not there's a push which backed it out
> that wasn't mentioned in the bug, rather than asking me.

The needinfo wasn't about whether it was backed out; it was more a request to back it out, since you pushed it to m-c.
Oh, you *did* need info that I have!

"Would it be appropriate for either the volunteer sheriff who happened to be the one to merge properly reviewed code which was not failing any tests to mozilla-central or for one of the paid sheriffs to get in the middle of a situation where a patch which wanders across the weird ownership boundary between necko and content-headers has one owner deferring his backout decision to someone else, and the other owner saying it should be backed out, but not actually doing it?"

"No, that's not the sort of backouts that sheriffs do. We back out patches which fail tests. Adjudicating the boundary between ownerships is not at all our department."

Either mcmanus or dao can decide that it should be backed out, and either one is perfectly capable of doing the backout themselves. Or gerv can decide that it's entirely his decision and back it out himself, or if he feels he needs help backing it out, he can ask either whoever has the nick ending in |sheriffduty or any developer for assistance.
Given the amount of time that has passed, it would be better if the patch author (Dao) backed it out as they would know best how to fix up any conflicts. However, if that is not done, I will back it out.

Gerv
I'm not the patch author.

A patch is waiting in bug 1108183, pending a response from Pike who raised his concern here but is now unresponsive there; apparently the concern isn't so critical after all.
Flags: needinfo?(dao)
And I'm resolving this bug again to reflect its actual state (landed, not backed out). Please stop messing with the bug status as a means to force your way.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.