Closed Bug 1108183 Opened 10 years ago Closed 9 years ago

Regularize case of Accept-Language HTTP header

Categories

(Core :: Networking: HTTP, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: awake, Assigned: awake)

References

Details

Attachments

(1 file)

Subtags of language tags in the Accept-Language HTTP header request do not match the case of subtags defined in the Language Subtag Registry (https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).

The mismatch occurs when the user sets a language tag that is not of the form xx-yy in the 'intl.accept_languages' about:config option, for example, sr-Latn.

To reproduce:
1. In about:config, modify intl.accept_languages to sr-Latn.
2. Check the Accept-Language request header when accessing any site.

See original discussion at https://bugzilla.mozilla.org/show_bug.cgi?id=1054739.
Assignee: nobody → bugzillafx
Status: NEW → ASSIGNED
Attachment #8532763 - Flags: review?(mcmanus)
I don't know how or if it's possible to reuse the js function here.

The C++ function only regularizes letter cases and does not substitute deprecated subtags or sort extension tags, unlike the js function. The letter cases should be identical.

The substitutions or reordering of subtags when canonicalizing language tags are only required where the user has set their own language tags in the about:config option. In these cases, it should be safe to assume that the provided input is valid. Thus, that additional functionality should not be necessary here.
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to delegate this review
Attachment #8532763 - Flags: review?(mcmanus) → review?(gandalf)
Depends on: 1054739
Blocks: 1054739
No longer depends on: 1054739
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

This code doesn't deal well with ext-lang subtabs from http://tools.ietf.org/html/bcp47#section-2.1.

That it doesn't uppercase 3-digit region codes obviously doesn't matter much ;-).

I haven't found anything utterly constructive in ICU, fwiw, it would be nice if we could pick up code from there to help the general case, and just do something "pretty good" as fallback as long as we don't have ICU across all platforms.
As I mentioned in bug 1054739 comment 17, a smattering of language tags in their canonical form are available for testing here:

https://github.com/GPHemsley/BCP47/blob/master/test_tags.txt
(In reply to Axel Hecht [:Pike] from comment #5)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> This code doesn't deal well with ext-lang subtabs from
> http://tools.ietf.org/html/bcp47#section-2.1.
> 
> That it doesn't uppercase 3-digit region codes obviously doesn't matter much
> ;-).
> 
> I haven't found anything utterly constructive in ICU, fwiw, it would be nice
> if we could pick up code from there to help the general case, and just do
> something "pretty good" as fallback as long as we don't have ICU across all
> platforms.

When ICU is available, I think uloc_canonicalize() could provide what we want.
(In reply to Jonathan Kew (:jfkthame) from comment #7)
> 
> When ICU is available, I think uloc_canonicalize() could provide what we
> want.

Might return '_' seperated instead of '-', and it might have @something stuff. IIRC, the result matches the Unicode locale identifiers, which is a superset of BCP 47 (more data, and _ instead of - is OK)
(In reply to Axel Hecht [:Pike] from comment #8)
> (In reply to Jonathan Kew (:jfkthame) from comment #7)
> > 
> > When ICU is available, I think uloc_canonicalize() could provide what we
> > want.
> 
> Might return '_' seperated instead of '-', and it might have @something
> stuff.

OK, so we can use uloc_getBaseName (to drop any @keyword stuff). And then do s/_/-/ on the result if necessary.
This sounds slightly overengineered; what's the point if we need a fallback for when ICU isn't available anyway?
(In reply to Axel Hecht [:Pike] from comment #5)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> This code doesn't deal well with ext-lang subtabs from
> http://tools.ietf.org/html/bcp47#section-2.1.
> 
> That it doesn't uppercase 3-digit region codes obviously doesn't matter much
> ;-).
> 
> I haven't found anything utterly constructive in ICU, fwiw, it would be nice
> if we could pick up code from there to help the general case, and just do
> something "pretty good" as fallback as long as we don't have ICU across all
> platforms.

Forgive me as I my understanding of localization is not great. I've been learning more about it since working on this issue. Could you elaborate on which ext-lang subtags are not dealt with correctly?

The algorithm used is from http://tools.ietf.org/html/bcp47#section-2.1.1.
> An implementation can reproduce this format without accessing the
> registry as follows.  All subtags, including extension and private
> use subtags, use lowercase letters with two exceptions: two-letter
> and four-letter subtags that neither appear at the start of the tag
> nor occur after singletons.  Such two-letter subtags are all
> uppercase (as in the tags "en-CA-x-ca" or "sgn-BE-FR") and four-
> letter subtags are titlecase (as in the tag "az-Latn-x-latn").
Flags: needinfo?(l10n)
You can create codes like zh-nan-Hant, but I think I misread the code there.

One thing that rereading seems to not work is that x-I-can-do-what-I-want is ok, privateuse and grandfathered productions from the grammar in http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what should stay lowercase. Upper-case I would get lowercased, too.

Should we add most of the example codes in http://tools.ietf.org/html/bcp47#appendix-A to the test?
Flags: needinfo?(l10n)
(In reply to Axel Hecht [:Pike] from comment #12)
> One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> ok, privateuse and grandfathered productions from the grammar in
> http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> should stay lowercase. Upper-case I would get lowercased, too.

What of this stuff is really relevant here? We're not dealing with arbitrary input and thus don't need to support the whole grammar. We only need to care about language tags that locales might want to set by default and that users can select from the UI.
Flags: needinfo?(l10n)
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

removing review request until we get a consensus on where we're going with this.
Attachment #8532763 - Flags: review?(gandalf)
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

I'm interpreting the silence as "don't care"; gandalf, can you please review this?
Attachment #8532763 - Flags: review?(gandalf)
(In reply to Dão Gottwald [:dao] from comment #13)
> (In reply to Axel Hecht [:Pike] from comment #12)
> > One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> > ok, privateuse and grandfathered productions from the grammar in
> > http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> > should stay lowercase. Upper-case I would get lowercased, too.
> 
> What of this stuff is really relevant here? We're not dealing with arbitrary
> input and thus don't need to support the whole grammar. We only need to care
> about language tags that locales might want to set by default and that users
> can select from the UI.

That's practically wrong. Our metrics data shows that people put random stuff in those prefs.

Note, bullying people around isn't going to get you speedy responses, I just didn't feel like going into another round of you just ignoring the real world out there on the web.
Flags: needinfo?(l10n)
People messing with hidden prefs (which we explicitly warn about) isn't exactly "the real world out there on the web". So the question is how far we want to go with supporting such edge cases. Would the code become much more complicated? (I really have no idea.) Can users of other browsers put random stuff in the Accept-Language header?
Looks like we're not getting anywhere here. I would suggest that we take the proposed patch, assuming it's correct for what it intends to do, and file the case of users putting random stuff in the hidden pref as a new bug, knowing that it might end up as WONTFIX depending on how much we care about that case.
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

I feel really bad about this bug. I struggle to understand anymore if bug 1054739 + the patch that I'm looking at, together with the benefit of bug 1054739 are enough to out-weight the regressions that bug 1054739 introduced.

On top of that, I'm worried about subtle ICU vs. non-ICU differences that will bite us back.

I've tried to review this patch multiple times, but I start to feel that while I can technically review the code, the review here is more of a sr= nature and I'm not the right person to make the call.

Should I still review it?
(In reply to Zibi Braniecki [:gandalf] from comment #19)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> Review of attachment 8532763 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I've tried to review this patch multiple times, but I start to feel that
> while I can technically review the code, the review here is more of a sr=
> nature and I'm not the right person to make the call.
> 
> Should I still review it?

FWIW, I think this code seems reasonable. Regularizing the case of Accept-Language is a Good Thing™, IMO.
(In reply to Axel Hecht [:Pike] from comment #12)
> One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> ok, privateuse and grandfathered productions from the grammar in
> http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> should stay lowercase. Upper-case I would get lowercased, too.

Oh, to follow up on this point. I think I agree with Axel: Rather than doing nothing, the seenSingleton code should be lowercasing all the way. And grandfathered tags should mostly be OK, I think. Most come after singletons, and the rest follow capitalization rules for their length. (Which is not to say that they shouldn't be hard-coded instead; the BCP implies that they should be.) It might also be worth noting that there has been discussion recently about deprecating all remaining grandfathered tags in favor of standard forms.
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

back to necko peer/owner; gandalf doesn't feel comfortable reviewing this
Attachment #8532763 - Flags: review?(gandalf) → review?(mcmanus)
As this is more of a content issue, I'm going to ask gerv to find the right person to ok it.
Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

normally I don't adjudicate the contents of non-protocol elements..
Attachment #8532763 - Flags: review?(mcmanus) → review?(gerv)
Attachment #8532763 - Flags: review?(gerv) → review+
https://hg.mozilla.org/mozilla-central/rev/a653443ce794
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: