Open Bug 1648889 Opened 4 years ago Updated 6 months ago

isLabelSafe (punycode heuristic) should allow Limited_Use scripts

Categories (Core :: Networking: DNS, enhancement, P3)

People (Reporter: manishearth, Unassigned, NeedInfo)

Details (Whiteboard: [necko-triaged])

Attachments (2 files)

Currently, Javanese domains get punycoded. There is at least one Indonesian registrar (Pandi) which allows Javanese domain names (e.g. http://ꦗꦒꦢ꧀ꦗꦮ.id). They have been asking Unicode people to see what needs to change so that things no longer get punycoded.
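For readers unfamiliar with what "punycoded" means in practice, here is a minimal sketch using Python's built-in `punycode` codec. The helper names `to_ace_label` and `from_ace_label` are made up for illustration, and the codec only implements the RFC 3492 encoding; it does none of the IDNA mapping or validation a browser performs, so this is not Firefox's actual code path.

```python
def to_ace_label(label: str) -> str:
    """Return the ASCII-compatible ("xn--") form of a single DNS label."""
    if label.isascii():
        return label  # nothing to encode
    # The built-in "punycode" codec does the RFC 3492 encoding only.
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace_label(ace: str) -> str:
    """Inverse of to_ace_label for labels carrying the xn-- prefix."""
    if ace.lower().startswith("xn--"):
        return ace[4:].encode("ascii").decode("punycode")
    return ace


label = "ꦗꦒꦢ꧀ꦗꦮ"  # the Javanese label from the example URL above
ace = to_ace_label(label)
print(ace)                  # the ASCII form shown when the label is punycoded
print(from_ace_label(ace))  # round-trips back to the Javanese label
```

The ASCII form is what ends up in the URL bar today, which is unreadable for the people the domain is meant for.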

Our heuristics for this are in isLabelSafe(). It contains multiple heuristics:

It's the last one that hits us. Unfortunately, UTS 39 classifies IdentifierType=Limited_Use characters as IdentifierStatus=Restricted. This property comes from this table. These are all scripts in modern use that are just not used as much, which does not seem like a useful heuristic for figuring out if something should be punycoded.

Can we switch the heuristic to use Identifier_Type={Inclusion, Recommended, Limited_Use} instead? We'd have to tweak the table generation code to work off of IdentifierType.txt instead of IdentifierStatus.txt.
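A rough sketch of what working off IdentifierType.txt could look like, in Python rather than the real table generation code. The function names are hypothetical, the sample data is a small excerpt in the file's format (the `E000..E001` line is invented purely to show a multi-valued entry), and the rule "allowed only if every listed type is allowed" is an assumption about how multi-valued entries should be treated.

```python
# Sample lines in the format of IdentifierType.txt. The Javanese lines match
# the ranges discussed in this bug; the E000..E001 line is an invented,
# hypothetical multi-valued entry used only for illustration.
SAMPLE = """\
0061..007A    ; Recommended            # LATIN SMALL LETTER A..Z
A980..A9C0    ; Limited_Use            # JAVANESE SIGN PANYANGGA..JAVANESE PANGKON
A9D0..A9D9    ; Limited_Use            # JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE
E000..E001    ; Limited_Use Technical  # hypothetical multi-valued entry
"""

STATUS_QUO = frozenset({"Recommended", "Inclusion"})
PROPOSED = STATUS_QUO | {"Limited_Use"}


def parse_identifier_types(text):
    """Yield (first, last, types) for each data line, skipping comments."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        rng, types = (part.strip() for part in line.split(";", 1))
        first, _, last = rng.partition("..")
        yield int(first, 16), int(last or first, 16), frozenset(types.split())


def allowed_ranges(text, allowed_types):
    # Assumption: a code point is allowed only if *all* its types are allowed.
    return [(lo, hi) for lo, hi, types in parse_identifier_types(text)
            if types <= allowed_types]


def is_allowed(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


print(is_allowed(0xA997, allowed_ranges(SAMPLE, STATUS_QUO)))  # Javanese today
print(is_allowed(0xA997, allowed_ranges(SAMPLE, PROPOSED)))    # Javanese under the proposal
```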

It's unclear to me why Limited_Use scripts are not Recommended according to UTS 39. That specification is more a set of guidelines than hard rules, and it's acceptable to tweak the rules the way I'm proposing here, but I plan to ask about this upstream as well. Either way, we should probably fix this in Firefox, perhaps coordinating with Chrome.

I think Anne and Henri might have some opinions about this.

Flags: needinfo?(hsivonen)
Flags: needinfo?(annevk)

I think we should figure out a solution to this. It's going to take more than one Bugzilla comment though.

My most immediate concern is that the proposed change would also allow Cherokee. Cherokee is inspired by the Latin script, which results in confusability issues. I'm not suggesting that Cherokee shouldn't be allowed: These issues aren't worse than the issues with Cyrillic and Greek scripts, but the effects with Cherokee will require more testing than the effects with Javanese.

These are all scripts in modern use that are just not used as much, which does not seem like a useful heuristic for figuring out if something should be punycoded.

This indeed is a weird classification.

Flags: needinfo?(hsivonen)

Did you ask on the Unicode mailing list?

Flags: needinfo?(annevk)
See Also: → 1332714

Did you ask on the Unicode mailing list?

Yes, I did, on the internal list (do you think I should make the email be external instead?). I'm hoping to write a proposal asking it be changed soon.

UTS 39 is a set of recommendations. The subcategorization is provided so that individual use cases can tailor it to their needs. Unicode need not change this for us to make changes in Firefox (and other browsers); equally, if UTS 39 does change this, that does not imply that all consumers need to follow. I'm hoping to propose that it be moved to the Allowed category with some caveats mentioned below it.

The main thing brought up on the list was that this increases attack surface. It does not seem great to me to divide scripts this way, however.

My most immediate concern is that the proposed change would also allow Cherokee. Cherokee is inspired by the Latin script, which results in confusability issues. I'm not suggesting that Cherokee shouldn't be allowed: These issues aren't worse than the issues with Cyrillic and Greek scripts, but the effects with Cherokee will require more testing than the effects with Javanese.

Agreed. But note that we currently allow роре.com (Cyrillic). Chrome has smarter heuristics here: it specifically punycodes Russian on .com.

This sounds like a job for the mixed script heuristic. Rust has a smarter version of the mixed script heuristic that looks for cases where the only characters from a script used are confusables, which might be useful too.
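To make the "only confusables" idea concrete, here is a deliberately simplified Python sketch. It is not the Rust or Firefox implementation: the `LATIN_LOOKALIKES` table is a tiny hand-picked subset of Cyrillic letters rather than the real confusables data, the function names are hypothetical, and all non-ASCII characters are treated as one group instead of being split by script.

```python
# Tiny hand-picked subset of Cyrillic letters that resemble Latin ones.
# A real implementation would use the full Unicode confusables data.
LATIN_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def latin_skeleton(label):
    """Replace each known lookalike with the Latin letter it resembles."""
    return "".join(LATIN_LOOKALIKES.get(ch, ch) for ch in label)


def only_confusables_used(label):
    """True if every non-ASCII character in the label is a Latin lookalike."""
    non_ascii = [ch for ch in label if not ch.isascii()]
    return bool(non_ascii) and all(ch in LATIN_LOOKALIKES for ch in non_ascii)


spoof = "\u0440\u043e\u0440\u0435"                 # Cyrillic label that reads as "pope"
genuine = "\u043f\u0440\u0438\u0432\u0435\u0442"   # an ordinary Russian word
print(only_confusables_used(spoof), latin_skeleton(spoof))
print(only_confusables_used(genuine))
```

A label like the first one never uses a character that proves it is really Cyrillic, which is the signal this kind of heuristic keys on; the second contains letters with no Latin lookalike and so passes.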

Note that almost all Cherokee syllables are confusable with capital Latin letters, which are uncommon in domains anyway. That might not be relevant though.

If we are going to make our own decisions I think we should have some kind of specification around that to refer back to when we need to make further changes. That will also be useful for review and in case others want to follow our display recommendations. (And could maybe eventually be used to refine https://url.spec.whatwg.org/#url-rendering-i18n.)

Chrome has a design document, we can write a similar one: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome

Note that Chrome's heuristics are more involved than ours. A spec would be nice.

(In reply to Manish Goregaokar [:manishearth] from comment #7)

Chrome has a design document, we can write a similar one: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome

Note that Chrome's heuristics are more involved than ours. A spec would be nice.

Probably worth talking to :stpeter if we're going to flip any switches in this area. AIUI there's... reluctance... to start using the Alexa-top-N-based normalization stuff that's part of Chrome's design. Edge does something else (or at least used to), and I'm unsure what Safari does, so it's probably worth involving more folks if there is to be a wider conversation around this (again).

Yeah, to be clear I'm not proposing we adopt Chrome's heuristics, I'm proposing that we stop punycoding a certain category of scripts. Chrome doesn't do this either, though I have filed a bug asking for this to be changed there as well (https://bugs.chromium.org/p/chromium/issues/detail?id=1099976)

(In reply to Manish Goregaokar [:manishearth] from comment #0)

Hello Manish, my name is Chika and I am from PANDI's team. First of all, thank you for your discussion of the Javanese script and possible solutions. I would like to know if there is something PANDI can provide to change the table's status from "Limited Use" to "Recommended"?

Hi, thanks for commenting! This page is for tracking changes to Firefox; a decision to move Limited Use to Recommended cannot be made here. What can be done here is that Firefox can decide to stop mangling Javanese and Balinese in URLs, and your comment is helpful in that context. I intend to submit a proposal to Unicode to move Limited Use to Recommended shortly; I can email it to you and mention Pandi's name when that occurs.

(In reply to Manish Goregaokar [:manishearth] from comment #11)

Hi, thanks for commenting! This page is for tracking changes to Firefox; a decision to move Limited Use to Recommended cannot be made here. What can be done here is that Firefox can decide to stop mangling Javanese and Balinese in URLs, and your comment is helpful in that context. I intend to submit a proposal to Unicode to move Limited Use to Recommended shortly; I can email it to you and mention Pandi's name when that occurs.

Based on the previous comment: should the Unicode proposal prove difficult, browsers can "whitelist" Javanese code points as valid in URL names regardless of the Unicode table classification, since it is a recommendation and not obligatory. Is that correct?

Currently, domain names in Javanese are a bit of a novelty. However, a small yet robust and growing community of Javanese script users exists today. One of the stumbling blocks they face is software and other digital interfaces that do not support Javanese adequately, which makes it difficult to show new users the applicability of the script. At the very least, the ability for the script to be displayed on screen is very much appreciated, and PANDI can use this to show other bodies that errors in digital implementation are a bit complicated and involve outside bodies, but are entirely fixable and not something where "oh, it's the system, we can't fix it, we just have to let it be."

Based on the previous comment: should the Unicode proposal prove difficult, browsers can "whitelist" Javanese code points as valid in URL names regardless of the Unicode table classification, since it is a recommendation and not obligatory. Is that correct?

Correct! This is what I'm proposing here: we tweak the Firefox implementation of the Unicode recommendations to allow for Limited_Use scripts (indeed, the spec specifically calls out this case as something one may wish to do).

It's great to hear of the growing digital presence of Carakan, and yes, we should work to remove digital barriers like this.

I've submitted a proposal to Unicode. Either way, we should make a decision for what we want in IDN.

Severity: -- → N/A
Type: defect → enhancement
Priority: -- → P3
Whiteboard: [necko-triaged]

This was discussed in today's Unicode meeting, based on the report (item D4) from the properties and algorithms subcommittee.

The consensus was to not make the change to Limited_Use in the specification since it increases the default attack surface. However it was recognized that Limited_Use is underdefined and it's not fair to put living scripts in "script purgatory" without any way out. I and some others will be working on clearer criteria for a script to be removed from this list. From informal discussions I had later it seems pretty clear that at least Javanese, Balinese, Canadian Aboriginal Syllabics (an official writing system of Nunavut!), and perhaps N'ko would not satisfy any reasonable criteria for being kept in Limited_Use.

I still feel that we should be defaulting to not punycoding Limited_Use scripts here. While the Unicode committee is concerned about all consumers of this specification and is being conservative, I think we can opt to be a bit more bold here.

However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

(In reply to Manish Goregaokar [:manishearth] from comment #15)

I still feel that we should be defaulting to not punycoding Limited_Use scripts here. While the Unicode committee is concerned about all consumers of this specification and is being conservative, I think we can opt to be a bit more bold here.

However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

I'm not sure who makes a call on this (not me!), but seems like it'd be helpful to have a decision in this bug to make it clear whether patches to change either/some/both of these would be accepted. Anne/Valentin/Jonathan ?

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(jfkthame)
Flags: needinfo?(annevk)

(In reply to Manish Goregaokar [:manishearth] from comment #15)

However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

Seems reasonable to me.

AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first. However, users can be assumed to have seen the font for their own language already. Furthermore, it seems that this is less of a problem than the situation that already exists on Windows in Nirmala UI between Tamil and Malayalam or in pretty much any font between Latin and Cyrillic.

Agreed, I think it would be reasonable to allow these. (I'd be much more hesitant to simply switch the treatment of Limited_Use without individually reviewing each of the things it includes.)

Flags: needinfo?(jfkthame)

AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first.

I believe we also have a mixed script heuristic so we're probably fine unless there's a Balinese domain name that completely clashes with Javanese.

I might write a patch for this. I think it's fair to be cautious about doing it wholesale.

Ugh, annoyingly the patch is not that simple, because some code points can be multiple categories at once, and the categorization gives Limited_Use priority over the others, so to tailor it I need to derive the other properties. This is mostly a matter of weeding out non-XID and non-NFC code points.
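As a rough illustration of the "weed out non-XID and non-NFC code points" step, here is a Python sketch. It is not the actual table generation code: the function names are hypothetical, and it relies on whatever Unicode version Python's `unicodedata` module bundles. `str.isidentifier()` is used as a stand-in for XID_Continue, which is close to but not guaranteed identical to the property UTS 39 uses.

```python
import unicodedata


def is_xid_continue(ch):
    # str.isidentifier() applies XID_Start to the first character and
    # XID_Continue to the rest, so prefixing "a" tests XID_Continue alone.
    return ("a" + ch).isidentifier()


def is_nfc_stable(ch):
    # A code point that NFC normalization rewrites is not in NFC form.
    return unicodedata.normalize("NFC", ch) == ch


def identifier_candidates(first, last):
    """Code points in [first, last] that are XID_Continue and NFC-stable."""
    return [cp for cp in range(first, last + 1)
            if is_xid_continue(chr(cp)) and is_nfc_stable(chr(cp))]


# Walk the Javanese block: letters, signs, and digits survive, while
# punctuation and unassigned code points drop out.
javanese = identifier_candidates(0xA980, 0xA9DF)
print(f"{len(javanese)} candidates, first U+{javanese[0]:04X}, last U+{javanese[-1]:04X}")
```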

Assignee: nobody → manishearth
Status: NEW → ASSIGNED

FWIW, the ranges that get switched to Allowed are:

07C0..07E7    ; Limited_Use                    # 5.0   [40] NKO DIGIT ZERO..NKO LETTER NYA WOLOSO
07EB..07F5    ; Limited_Use                    # 5.0   [11] NKO COMBINING SHORT HIGH TONE..NKO LOW TONE APOSTROPHE
07FD          ; Limited_Use                    # 11.0       NKO DANTAYALAN
1401..166C    ; Limited_Use                    # 3.0  [620] CANADIAN SYLLABICS E..CANADIAN SYLLABICS CARRIER TTSA
166F..1676    ; Limited_Use                    # 3.0    [8] CANADIAN SYLLABICS QAI..CANADIAN SYLLABICS NNGAA
1677..167F    ; Limited_Use                    # 5.2    [9] CANADIAN SYLLABICS WOODS-CREE THWEE..CANADIAN SYLLABICS BLACKFOOT W
18B0..18F5    ; Limited_Use                    # 5.2   [70] CANADIAN SYLLABICS OY..CANADIAN SYLLABICS CARRIER DENTAL S
1B00..1B4B    ; Limited_Use                    # 5.0   [76] BALINESE SIGN ULU RICEM..BALINESE LETTER ASYURA SASAK
1B50..1B59    ; Limited_Use                    # 5.0   [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1B6B..1B73    ; Limited_Use                    # 5.0    [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
A980..A9C0    ; Limited_Use                    # 5.2   [65] JAVANESE SIGN PANYANGGA..JAVANESE PANGKON
A9D0..A9D9    ; Limited_Use                    # 5.2   [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE

The only ones of these that are a bit questionable are the Balinese musical symbols, and IMO that's because they should be categorized as Limited_Use+Technical.

Submitted some feedback on that
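For reference, the ranges listed above can be expressed as a small sorted lookup table. This Python sketch only illustrates the range check; the name `is_exempt` is hypothetical, and the real implementation folds this data into the existing generated byte array rather than using a separate table.

```python
from bisect import bisect_right

# Inclusive (first, last) code point ranges copied from the list above,
# kept sorted by starting code point.
EXEMPT_RANGES = [
    (0x07C0, 0x07E7), (0x07EB, 0x07F5), (0x07FD, 0x07FD),  # N'Ko
    (0x1401, 0x166C), (0x166F, 0x1676), (0x1677, 0x167F),  # Canadian Syllabics
    (0x18B0, 0x18F5),                                      # Canadian Syllabics Extended
    (0x1B00, 0x1B4B), (0x1B50, 0x1B59), (0x1B6B, 0x1B73),  # Balinese
    (0xA980, 0xA9C0), (0xA9D0, 0xA9D9),                    # Javanese
]
_STARTS = [first for first, _ in EXEMPT_RANGES]


def is_exempt(cp):
    """True if the code point falls in one of the ranges above."""
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= EXEMPT_RANGES[i][1]


total = sum(last - first + 1 for first, last in EXEMPT_RANGES)
print(total)  # sum of the bracketed counts in the list above
```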

So, back to comment 6, I continue to think we do need some kind of specification around what we are doing here. This also warrants following https://wiki.mozilla.org/ExposureGuidelines to give people a heads up and ensure we're not missing anything. Mike Conca might be able to identify additional prerequisites.

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(mconca)
Flags: needinfo?(annevk)

(In reply to Manish Goregaokar [:manishearth] from comment #19)

AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first.

I believe we also have a mixed script heuristic so we're probably fine unless there's a Balinese domain name that completely clashes with Javanese.

For clarity, I meant the case where a DNS label in its entirety consists of confusables as with your роре.com example. I don't know the languages enough to say if sensible words can be constructed with confusables only. (Again, superficially, I believe this to be less of a problem between Javanese and Balinese than between Tamil and Malayalam or between Latin and Cyrillic.)

(In reply to Anne (:annevk) from comment #25)

So, back to comment 6, I continue to think we do need some kind of specification around what we are doing here. This also warrants following https://wiki.mozilla.org/ExposureGuidelines to give people a heads up and ensure we're not missing anything. Mike Conca might be able to identify additional prerequisites.

As it seems most everyone is in favor of this change, we should definitely send out an Intent-to-Ship email immediately (Manish). Are there existing web-platform-tests around this feature? If so, we should make sure we add/extend those tests to cover this. And I agree with Anne, it would be best if we can point to a specification effort related to this. Finally, can we put this change behind a pref? That gives Mozilla an easy way to immediately revert this change in the field should anything unpleasant arise (increasing our risk tolerance for shipping).

Flags: needinfo?(mconca)

There are no web-platform tests since this cannot be tested and is not a web standard. Chrome has different heuristics here.

Given that Chrome's heuristics are different (and we have reasons for not choosing the exact same heuristics), at best what I can do is create a wiki page with the heuristics we use. There is a specification effort on the Unicode side that we will likely move towards; see https://www.unicode.org/L2/L2020/20172.htm#164-A45.

The current implementation hardcodes this into the existing large byte array. I feel like adding a second array so that it can be pref-switched would have a major impact on binary bloat.

(In reply to Manish Goregaokar [:manishearth] from comment #28)

The current implementation hardcodes this into the existing large byte array. I feel like adding a second array so that it can be pref-switched would have a major impact on binary bloat.

Yes, I'd agree that having a second version of the array just to support a pref here would be undesirable.

If we did see "something unpleasant" and wanted to revert the behavior in the field, presumably we could do that by pushing a change to the network.IDN.extra_blocked_chars pref. (That might not be a solution for users -- if any -- who use a custom setting of that pref, overriding what we ship. But I expect that's rare enough for us to ignore.)
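A sketch of how such a revert could work in principle, in Python rather than the real C++ code. The assumed semantics are that the pref holds a string of characters and that any label containing one of them is displayed as punycode; the function and parameter names here are hypothetical, and the allowed check is reduced to a single Javanese range for brevity.

```python
def should_punycode(label, is_allowed, extra_blocked_chars=""):
    """Decide whether a label is shown in its ASCII (xn--) form.

    is_allowed: callable taking a code point, standing in for the table lookup.
    extra_blocked_chars: stand-in for the pref value (assumed semantics).
    """
    blocked = set(extra_blocked_chars)
    for ch in label:
        if ch in blocked:
            return True   # the pref overrides the built-in table
        if not ch.isascii() and not is_allowed(ord(ch)):
            return True   # character is not allowed by the table
    return False


def javanese_allowed(cp):
    # Simplified stand-in for the tailored table: one Javanese range only.
    return 0xA980 <= cp <= 0xA9C0


label = "ꦗꦒꦢ꧀ꦗꦮ"
# Reverting "in the field" would mean shipping a pref value that lists the
# newly allowed characters again.
revert_pref = "".join(chr(cp) for cp in range(0xA980, 0xA9C1))

print(should_punycode(label, javanese_allowed))               # table alone
print(should_punycode(label, javanese_allowed, revert_pref))  # with the revert pushed
```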

Any updates? Are folks comfortable with shipping this?

I think so, modulo an intent to ship and wiki page documenting our behavior in some detail. I guess the other question is if there's someone available to deal with fallout, if any, but maybe pushing a change to the pref is trivial enough?

The bug assignee didn't login in Bugzilla in the last 7 months.
:dragana, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: manishearth → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(dd.mozilla)
Flags: needinfo?(dd.mozilla)

Bump... this seems to have fallen through the cracks. Shall we move forward?

Flags: needinfo?(manishearth)
Flags: needinfo?(jfkthame)

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit BugBot documentation.

Flags: needinfo?(manishearth)