1108183 - Regularize case of Accept-Language HTTP header

Assignee

Description

•

10 years ago

Subtags of language tags in the Accept-Language HTTP header request do not match the case of subtags defined in the Language Subtag Registry (https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).

The mismatch occurs when the user sets a language tag that is not of the form xx-yy in the 'intl.accept_languages' about:config option, for example, sr-Latn.

To reproduce:
1. In about:config, modify intl.accept_languages to sr-Latn.
2. Check the Accept-Language request header when accessing any site.

See original discussion at https://bugzilla.mozilla.org/show_bug.cgi?id=1054739.

awake [:awake]

Assignee

Comment 1

•

10 years ago

Attached patch Regularize case of subtags — Details — Splinter Review

Assignee: nobody → bugzillafx

Status: NEW → ASSIGNED

Attachment #8532763 - Flags: review?(mcmanus)

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 2

•

10 years ago

Can we make sure that it matches https://mxr.mozilla.org/mozilla-central/source/js/src/builtin/Intl.js#302 (or reuse it if possible?)

awake [:awake]

Assignee

Comment 3

•

10 years ago

I don't know how or if it's possible to reuse the js function here.

The C++ function only regularizes letter cases and does not substitute deprecated subtags or sort extension tags, unlike the js function. The letter cases should be identical.

The substitutions or reordering of subtags when canonicalizing language tags are only required where the user has set their own language tags in the about:config option. In these cases, it should be safe to assume that the provided input is valid. Thus, that additional functionality should not be necessary here.

Patrick McManus [:mcmanus]

Comment 4

•

10 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

I'm going to delegate this review

Attachment #8532763 - Flags: review?(mcmanus) → review?(gandalf)

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

Depends on: 1054739

Dão Gottwald [:dao]

Updated

•

10 years ago

Blocks: 1054739

No longer depends on: 1054739

Axel Hecht [:Pike]

Comment 5

•

10 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

This code doesn't deal well with ext-lang subtabs from http://tools.ietf.org/html/bcp47#section-2.1.

That it doesn't uppercase 3-digit region codes obviously doesn't matter much ;-).

I haven't found anything utterly constructive in ICU, fwiw, it would be nice if we could pick up code from there to help the general case, and just do something "pretty good" as fallback as long as we don't have ICU across all platforms.

Gordon P. Hemsley [:GPHemsley]

Comment 6

•

10 years ago

As I mentioned in bug 1054739 comment 17, a smattering of language tags in their canonical form are available for testing here:

https://github.com/GPHemsley/BCP47/blob/master/test_tags.txt

Jonathan Kew [:jfkthame]

Comment 7

•

10 years ago

(In reply to Axel Hecht [:Pike] from comment #5)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> This code doesn't deal well with ext-lang subtabs from
> http://tools.ietf.org/html/bcp47#section-2.1.
> 
> That it doesn't uppercase 3-digit region codes obviously doesn't matter much
> ;-).
> 
> I haven't found anything utterly constructive in ICU, fwiw, it would be nice
> if we could pick up code from there to help the general case, and just do
> something "pretty good" as fallback as long as we don't have ICU across all
> platforms.

When ICU is available, I think uloc_canonicalize() could provide what we want.

Axel Hecht [:Pike]

Comment 8

•

10 years ago

(In reply to Jonathan Kew (:jfkthame) from comment #7)
> 
> When ICU is available, I think uloc_canonicalize() could provide what we
> want.

Might return '_' seperated instead of '-', and it might have @something stuff. IIRC, the result matches the Unicode locale identifiers, which is a superset of BCP 47 (more data, and _ instead of - is OK)

Jonathan Kew [:jfkthame]

Comment 9

•

10 years ago

(In reply to Axel Hecht [:Pike] from comment #8)
> (In reply to Jonathan Kew (:jfkthame) from comment #7)
> > 
> > When ICU is available, I think uloc_canonicalize() could provide what we
> > want.
> 
> Might return '_' seperated instead of '-', and it might have @something
> stuff.

OK, so we can use uloc_getBaseName (to drop any @keyword stuff). And then do s/_/-/ on the result if necessary.

Dão Gottwald [:dao]

Comment 10

•

10 years ago

This sounds slightly overengineered; what's the point if we need a fallback for when ICU isn't available anyway?

awake [:awake]

Assignee

Comment 11

•

10 years ago

(In reply to Axel Hecht [:Pike] from comment #5)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> This code doesn't deal well with ext-lang subtabs from
> http://tools.ietf.org/html/bcp47#section-2.1.
> 
> That it doesn't uppercase 3-digit region codes obviously doesn't matter much
> ;-).
> 
> I haven't found anything utterly constructive in ICU, fwiw, it would be nice
> if we could pick up code from there to help the general case, and just do
> something "pretty good" as fallback as long as we don't have ICU across all
> platforms.

Forgive me as I my understanding of localization is not great. I've been learning more about it since working on this issue. Could you elaborate on which ext-lang subtags are not dealt with correctly?

The algorithm used is from http://tools.ietf.org/html/bcp47#section-2.1.1.
> An implementation can reproduce this format without accessing the
> registry as follows.  All subtags, including extension and private
> use subtags, use lowercase letters with two exceptions: two-letter
> and four-letter subtags that neither appear at the start of the tag
> nor occur after singletons.  Such two-letter subtags are all
> uppercase (as in the tags "en-CA-x-ca" or "sgn-BE-FR") and four-
> letter subtags are titlecase (as in the tag "az-Latn-x-latn").

Dão Gottwald [:dao]

Updated

•

10 years ago

Flags: needinfo?(l10n)

Axel Hecht [:Pike]

Comment 12

•

10 years ago

You can create codes like zh-nan-Hant, but I think I misread the code there.

One thing that rereading seems to not work is that x-I-can-do-what-I-want is ok, privateuse and grandfathered productions from the grammar in http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what should stay lowercase. Upper-case I would get lowercased, too.

Should we add most of the example codes in http://tools.ietf.org/html/bcp47#appendix-A to the test?

Flags: needinfo?(l10n)

Dão Gottwald [:dao]

Comment 13

•

10 years ago

(In reply to Axel Hecht [:Pike] from comment #12)
> One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> ok, privateuse and grandfathered productions from the grammar in
> http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> should stay lowercase. Upper-case I would get lowercased, too.

What of this stuff is really relevant here? We're not dealing with arbitrary input and thus don't need to support the whole grammar. We only need to care about language tags that locales might want to set by default and that users can select from the UI.

Dão Gottwald [:dao]

Updated

•

10 years ago

Flags: needinfo?(l10n)

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 14

•

10 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

removing review request until we get a consensus on where we're going with this.

Attachment #8532763 - Flags: review?(gandalf)

Dão Gottwald [:dao]

Comment 15

•

9 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

I'm interpreting the silence as "don't care"; gandalf, can you please review this?

Attachment #8532763 - Flags: review?(gandalf)

Axel Hecht [:Pike]

Comment 16

•

9 years ago

(In reply to Dão Gottwald [:dao] from comment #13)
> (In reply to Axel Hecht [:Pike] from comment #12)
> > One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> > ok, privateuse and grandfathered productions from the grammar in
> > http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> > should stay lowercase. Upper-case I would get lowercased, too.
> 
> What of this stuff is really relevant here? We're not dealing with arbitrary
> input and thus don't need to support the whole grammar. We only need to care
> about language tags that locales might want to set by default and that users
> can select from the UI.

That's practically wrong. Our metrics data shows that people put random stuff in those prefs.

Note, bullying people around isn't going to get you speedy responses, I just didn't feel like going into another round of you just ignoring the real world out there on the web.

Flags: needinfo?(l10n)

Dão Gottwald [:dao]

Comment 17

•

9 years ago

People messing with hidden prefs (which we explicitly warn about) isn't exactly "the real world out there on the web". So the question is how far we want to go with supporting such edge cases. Would the code become much more complicated? (I really have no idea.) Can users of other browsers put random stuff in the Accept-Language header?

Dão Gottwald [:dao]

Comment 18

•

9 years ago

Looks like we're not getting anywhere here. I would suggest that we take the proposed patch, assuming it's correct for what it intends to do, and file the case of users putting random stuff in the hidden pref as a new bug, knowing that it might end up as WONTFIX depending on how much we care about that case.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 19

•

9 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

I feel really bad about this bug. I struggle to understand anymore if bug 1054739 + the patch that I'm looking at, together with the benefit of bug 1054739 are enough to out-weight the regressions that bug 1054739 introduced.

On top of that, I'm worried about subtle ICU vs. non-ICU differences that will bite us back.

I've tried to review this patch multiple times, but I start to feel that while I can technically review the code, the review here is more of a sr= nature and I'm not the right person to make the call.

Should I still review it?

Gordon P. Hemsley [:GPHemsley]

Comment 20

•

9 years ago

(In reply to Zibi Braniecki [:gandalf] from comment #19)
> Comment on attachment 8532763 [details] [diff] [review]
> Regularize case of subtags
> 
> Review of attachment 8532763 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I've tried to review this patch multiple times, but I start to feel that
> while I can technically review the code, the review here is more of a sr=
> nature and I'm not the right person to make the call.
> 
> Should I still review it?

FWIW, I think this code seems reasonable. Regularizing the case of Accept-Language is a Good Thing™, IMO.

Gordon P. Hemsley [:GPHemsley]

Comment 21

•

9 years ago

(In reply to Axel Hecht [:Pike] from comment #12)
> One thing that rereading seems to not work is that x-I-can-do-what-I-want is
> ok, privateuse and grandfathered productions from the grammar in
> http://tools.ietf.org/html/bcp47#section-2.1. And AFAICT, the want and what
> should stay lowercase. Upper-case I would get lowercased, too.

Oh, to follow up on this point. I think I agree with Axel: Rather than doing nothing, the seenSingleton code should be lowercasing all the way. And grandfathered tags should mostly be OK, I think. Most come after singletons, and the rest follow capitalization rules for their length. (Which is not to say that they shouldn't be hard-coded instead; the BCP implies that they should be.) It might also be worth noting that there has been discussion recently about deprecating all remaining grandfathered tags in favor of standard forms.

Dão Gottwald [:dao]

Comment 22

•

9 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

back to necko peer/owner; gandalf doesn't feel comfortable reviewing this

Attachment #8532763 - Flags: review?(gandalf) → review?(mcmanus)

Patrick McManus [:mcmanus]

Comment 23

•

9 years ago

As this is more of a content issue, I'm going to ask gerv to find the right person to ok it.

Patrick McManus [:mcmanus]

Comment 24

•

9 years ago

Comment on attachment 8532763 [details] [diff] [review]
Regularize case of subtags

Review of attachment 8532763 [details] [diff] [review]:
-----------------------------------------------------------------

normally I don't adjudicate the contents of non-protocol elements..

Attachment #8532763 - Flags: review?(mcmanus) → review?(gerv)

Gervase Markham [:gerv]

Updated

•

9 years ago

Attachment #8532763 - Flags: review?(gerv) → review+

Dão Gottwald [:dao]

Comment 25

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=efc206efbee4

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 26

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/a653443ce794

Keywords: checkin-needed

Phil Ringnalda (:philor)

Comment 27

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/a653443ce794

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox39: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla39