1648889 - isLabelSafe (punycode heuristic) should allow Limited_Use scripts

Reporter

Description

5 years ago

Currently, Javanese domains get punycoded. There is at least one Indonesian registrar (Pandi) which allows Javanese domain names (e.g. http://ꦗꦒꦢ꧀ꦗꦮ.id). They have been asking Unicode people to see what needs to change so that things no longer get punycoded.

Our heuristics for this are in isLabelSafe(). It contains multiple heuristics:

1. Checking a manual blocklist of unsafe characters
2. Using a rough version of UTS 39's mixed script detection that does not apply to script extensions, only scripts. This is mostly fine because so far characters with a distinct script extension property are not ones we should worry about in URLs.
3. UTS 39's mixed-number detection
4. UTS 39's identifier status

It's the last one that hits us. Unfortunately, UTS 39 classifies IdentifierType=Limited_Use characters as IdentifierStatus=Restricted. This property comes from this table. These are all scripts in modern use that are just not used as much, which does not seem like a useful heuristic for figuring out if something should be punycoded.

Can we switch the heuristic to use Identifier_Type={Inclusion, Recommended, Limited_Use} instead? We'd have to tweak the table generation code to work off of IdentifierType.txt instead of IdentifierStatus.txt.
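
A rough sketch of what the proposed check amounts to, for illustration only. This is Python rather than the C++ in isLabelSafe(), and `SAMPLE_TYPES`, `ALLOWED_TYPES` and `label_passes_identifier_check` are hypothetical names: the real table would be generated from IdentifierType.txt, not written by hand.

```python
# Hypothetical sketch: a label passes only if every code point has an
# Identifier_Type inside the allowed set. SAMPLE_TYPES is a tiny hand-written
# stand-in for a table generated from IdentifierType.txt.
ALLOWED_TYPES = {"Inclusion", "Recommended", "Limited_Use"}

SAMPLE_TYPES = {
    0x0061: {"Recommended"},  # Latin small letter a
    0xA997: {"Limited_Use"},  # a Javanese letter from the example domain
    0xA992: {"Limited_Use"},  # another Javanese letter from the example domain
    0x16A0: {"Exclusion"},    # a Runic letter, used here as a rejected sample
}


def label_passes_identifier_check(label, allowed=ALLOWED_TYPES, table=SAMPLE_TYPES):
    """Return True if every character's Identifier_Type values are all allowed."""
    for ch in label:
        types = table.get(ord(ch), set())
        # Unknown code points, or any type outside the allowed set, fail.
        if not types or not types <= allowed:
            return False
    return True
```

With `allowed={"Inclusion", "Recommended"}` the same function reproduces today's behaviour for the Javanese sample, which is why the only change needed in this sketch is the contents of the allowed set.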

It's unclear to me why Limited_Use scripts are not Recommended according to UTS 39. That specification is more a set of guidelines than hard rules, and it's acceptable to tweak the rules the way I'm proposing here, but I plan to ask about this upstream as well. Either way, we should probably fix this in Firefox, perhaps coordinating with Chrome.

Manish Goregaokar [:manishearth]

Reporter

Comment 1

5 years ago

Filed a chromium bug for this as well: https://bugs.chromium.org/p/chromium/issues/detail?id=1099976

Valentin Gosu [:valentin] (he/him)

Comment 2

5 years ago

I think Anne and Henri might have some opinions about this.

Flags: needinfo?(hsivonen)

Flags: needinfo?(annevk)

Henri Sivonen (:hsivonen)

Comment 3

5 years ago

I think we should figure out a solution to this. It's going to take more than one Bugzilla comment though.

My most immediate concern is that the proposed change would also allow Cherokee. Cherokee is inspired by the Latin script, which results in confusability issues. I'm not suggesting that Cherokee shouldn't be allowed: These issues aren't worse than the issues with Cyrillic and Greek scripts, but the effects with Cherokee will require more testing than the effects with Javanese.

> These are all scripts in modern use that are just not used as much, which does not seem like a useful heuristic for figuring out if something should be punycoded.

This indeed is a weird classification.

Flags: needinfo?(hsivonen)

Anne (:annevk)

Comment 4

5 years ago

Did you ask on the Unicode mailing list?

Flags: needinfo?(annevk)

Comment 5

5 years ago

> Did you ask on the Unicode mailing list?

Yes, I did, on the internal list (do you think I should make the email external instead?). I'm hoping to write a proposal soon asking that it be changed.

UTS 39 is a set of recommendations. The subcategorization is provided for a reason: so that individual use cases can tailor it to their needs. Unicode need not change this for us to make changes in Firefox (and other browsers); conversely, if UTS 39 does change this, it does not imply that all consumers need to follow. I'm hoping to propose that it be moved to the Allowed category, with some caveats mentioned below it.

The main thing brought up on the list was that this increases attack surface. It does not seem great to me to divide scripts this way, however.

> My most immediate concern is that the proposed change would also allow Cherokee. Cherokee is inspired by the Latin script, which results in confusability issues. I'm not suggesting that Cherokee shouldn't be allowed: These issues aren't worse than the issues with Cyrillic and Greek scripts, but the effects with Cherokee will require more testing than the effects with Javanese.

Agreed. But note that we currently allow роре.com (Cyrillic). Chrome has smarter heuristics here, it specifically punycodes Russian on .com.

This sounds like a job for the mixed script heuristic. Rust has a smarter version of the mixed script heuristic that looks for cases where the only characters from a script used are confusables, which might be useful too.
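
To make the idea concrete, here is a minimal sketch of a "script used only via confusables" check. It is not the Rust implementation: `SAMPLE_CONFUSABLES` is a hand-picked handful of Cyrillic letters, `sample_script` is a hypothetical range-based lookup that only knows two scripts, and a real version would use the Script property and the UTS 39 confusables data.

```python
# Hand-picked sample of Cyrillic letters that look like Latin ones
# (а е о р с); a real implementation would use the UTS 39 confusables data.
SAMPLE_CONFUSABLES = {"\u0430", "\u0435", "\u043e", "\u0440", "\u0441"}


def sample_script(ch):
    """Very rough script lookup covering only the blocks used in this sketch."""
    cp = ord(ch)
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if cp < 0x80 and ch.isalpha():
        return "Latin"
    return "Other"


def scripts_used_only_via_confusables(text):
    """Return the scripts for which every character in `text` is a known confusable."""
    by_script = {}
    for ch in text:
        by_script.setdefault(sample_script(ch), []).append(ch)
    return {
        script
        for script, chars in by_script.items()
        if script != "Other" and all(c in SAMPLE_CONFUSABLES for c in chars)
    }
```

Under these assumptions the label from the роре.com example is flagged, because every Cyrillic letter in it is in the confusable sample, while an ordinary Cyrillic word that contains at least one unambiguous letter is not.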

Note that almost all Cherokee syllables are confusable with capital Latin letters, which are uncommon in domains anyway. That might not be relevant though.

Anne (:annevk)

Comment 6

5 years ago

If we are going to make our own decisions I think we should have some kind of specification around that to refer back to when we need to make further changes. That will also be useful for review and in case others want to follow our display recommendations. (And could maybe eventually be used to refine https://url.spec.whatwg.org/#url-rendering-i18n.)

Manish Goregaokar [:manishearth]

Reporter

Comment 7

5 years ago

Chrome has a design document, we can write a similar one: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome

Note that Chrome's heuristics are more involved than ours. A spec would be nice.

:Gijs (he/him)

Comment 8

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #7)

> Chrome has a design document, we can write a similar one: http://dev.chromium.org/developers/design-documents/idn-in-google-chrome

> Note that Chrome's heuristics are more involved than ours. A spec would be nice.

Probably worth talking to :stpeter if we're going to flip any switches in this area. AIUI there's... reluctance... to start using the Alexa-top-N-based normalization stuff that's part of Chrome's design. Edge does something else (or at least used to), and I'm unsure what Safari does, so it's probably worth involving more folks if there is to be a wider conversation around this (again).

Manish Goregaokar [:manishearth]

Reporter

Comment 9

5 years ago

Yeah, to be clear I'm not proposing we adopt Chrome's heuristics, I'm proposing that we stop punycoding a certain category of scripts. Chrome doesn't do this either, though I have filed a bug asking for this to be changed there as well (https://bugs.chromium.org/p/chromium/issues/detail?id=1099976)

Chika Hs

Comment 10

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #0)

> Currently, Javanese domains get punycoded. There is at least one Indonesian registrar (Pandi) which allows Javanese domain names (e.g. http://ꦗꦒꦢ꧀ꦗꦮ.id). They have been asking Unicode people to see what needs to change so that things no longer get punycoded.

> Our heuristics for this are in isLabelSafe(). It contains multiple heuristics:

> 1. Checking a manual blocklist of unsafe characters
> 2. Using a rough version of UTS 39's mixed script detection that does not apply to script extensions, only scripts. This is mostly fine because so far characters with a distinct script extension property are not ones we should worry about in URLs.
> 3. UTS 39's mixed-number detection
> 4. UTS 39's identifier status

> It's the last one that hits us. Unfortunately, UTS 39 classifies IdentifierType=Limited_Use characters as IdentifierStatus=Restricted. This property comes from this table. These are all scripts in modern use that are just not used as much, which does not seem like a useful heuristic for figuring out if something should be punycoded.

> Can we switch the heuristic to use Identifier_Type={Inclusion, Recommended, Limited_Use} instead? We'd have to tweak the table generation code to work off of IdentifierType.txt instead of IdentifierStatus.txt.

> It's unclear to me why Limited_Use scripts are not Recommended according to UTS 39. That specification is more a set of guidelines than hard rules, and it's acceptable to tweak the rules the way I'm proposing here, but I plan to ask about this upstream as well. Either way, we should probably fix this in Firefox, perhaps coordinating with Chrome.

Hello Manish, my name is Chika and I am from PANDI's team. First of all, thank you for the discussion of the Javanese script and possible solutions. I would like to know if there is something PANDI can provide to help change the table's status from "Limited use" to "Recommended"?

Manish Goregaokar [:manishearth]

Reporter

Comment 11

5 years ago

Hi, thanks for commenting! This page is for tracking changes to Firefox, a decision to move Limited Use to Recommended cannot be made here. What can be done here is that Firefox can decide to stop mangling Javanese and Balinese in URLs, and your comment is helpful in that context. I intend to submit a proposal to Unicode to move Limited Use to Recommended shortly, I can email it to you and mention Pandi's name when that occurs.

Chika Hs

Comment 12

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #11)

> Hi, thanks for commenting! This page is for tracking changes to Firefox, a decision to move Limited Use to Recommended cannot be made here. What can be done here is that Firefox can decide to stop mangling Javanese and Balinese in URLs, and your comment is helpful in that context. I intend to submit a proposal to Unicode to move Limited Use to Recommended shortly, I can email it to you and mention Pandi's name when that occurs.

Based on the previous comment: should the Unicode proposal prove to be difficult, browsers can "whitelist" Javanese code points as valid in URL names regardless of the Unicode table classification, since it is a recommendation and not obligatory. Is that correct?

Currently, domain names in Javanese are a bit of a novelty. However, a small yet robust and growing community of Javanese script users is visible today. One of the stumbling blocks they face is software and other digital interfaces that do not support Javanese adequately, which makes it difficult to show new users the applicability of this script. At the very least, the ability for the script to be displayed on screen would be very much appreciated, and PANDI can use this to show other bodies that errors in digital implementations are a bit complicated and involve outside bodies, but are entirely fixable and not something where "oh, it's the system, we can't fix it, we just have to let it be."

Manish Goregaokar [:manishearth]

Reporter

Comment 13

5 years ago

> Based on the previous comment: should the Unicode proposal prove to be difficult, browsers can "whitelist" Javanese code points as valid in URL names regardless of the Unicode table classification, since it is a recommendation and not obligatory. Is that correct?

Correct! This is what I'm proposing here: we tweak the Firefox implementation of the Unicode recommendations to allow Limited_Use scripts (indeed, the spec specifically calls out this case as something one may wish to do).

It's great to hear of the growing digital presence of Carakan, and yes, we should work to remove digital barriers like this.

Manish Goregaokar [:manishearth]

Reporter

Comment 14

5 years ago

I've submitted a proposal to Unicode. Either way, we should make a decision for what we want in IDN.

Dragana Damjanovic [:dragana]

Updated

5 years ago

Severity: -- → N/A

Type: defect → enhancement

Priority: -- → P3

Whiteboard: [necko-triaged]

Manish Goregaokar [:manishearth]

Reporter

Comment 15

5 years ago

This was discussed in today's Unicode meeting, based on the report (item D4) from the properties and algorithms subcommittee.

The consensus was to not make the change to Limited_Use in the specification since it increases the default attack surface. However it was recognized that Limited_Use is underdefined and it's not fair to put living scripts in "script purgatory" without any way out. I and some others will be working on clearer criteria for a script to be removed from this list. From informal discussions I had later it seems pretty clear that at least Javanese, Balinese, Canadian Aboriginal Syllabics (an official writing system of Nunavut!), and perhaps N'ko would not satisfy any reasonable criteria for being kept in Limited_Use.

I still feel that we should be defaulting to not punycoding Limited_Use scripts here. While the Unicode committee is concerned about all consumers of this specification and is being conservative, I think we can opt to be a bit more bold here.

However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

:Gijs (he/him)

Comment 16

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #15)

> I still feel that we should be defaulting to not punycoding Limited_Use scripts here. While the Unicode committee is concerned about all consumers of this specification and is being conservative, I think we can opt to be a bit more bold here.

> However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

I'm not sure who makes a call on this (not me!), but seems like it'd be helpful to have a decision in this bug to make it clear whether patches to change either/some/both of these would be accepted. Anne/Valentin/Jonathan ?

Flags: needinfo?(valentin.gosu)

Flags: needinfo?(jfkthame)

Flags: needinfo?(annevk)

Henri Sivonen (:hsivonen)

Comment 17

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #15)

> However, failing this, I do think that we should at least be exempting Javanese, Balinese, Canadian Aboriginal Syllabics, and N'ko, all of which have a decent online presence.

Seems reasonable to me.

AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first. However, users can be assumed to have seen the font for their own language already. Furthermore, it seems that this is less of a problem than the situation that already exists on Windows in Nirmala UI between Tamil and Malayalam or in pretty much any font between Latin and Cyrillic.

Jonathan Kew [:jfkthame]

Comment 18

5 years ago

Agreed, I think it would be reasonable to allow these. (I'd be much more hesitant to simply switch the treatment of Limited_Use without individually reviewing each of the things it includes.)

Flags: needinfo?(jfkthame)

Manish Goregaokar [:manishearth]

Reporter

Comment 19

5 years ago

> AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first.

I believe we also have a mixed script heuristic so we're probably fine unless there's a Balinese domain name that completely clashes with Javanese.

I might write a patch for this. I think it's fair to be cautious about doing it wholesale.

Manish Goregaokar [:manishearth]

Reporter

Comment 20

5 years ago

Ugh, annoyingly the patch is not that simple, because some code points can be multiple categories at once, and the categorization gives Limited_Use priority over the others, so to tailor it I need to derive the other properties. This is mostly a matter of weeding out non-XID and non-NFC code points.
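
As a rough illustration of that weeding-out step (a Python sketch, not the table generation code in the patch), each candidate code point can be tested for XID_Continue and for being unchanged by NFC. The helper names are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def is_xid_continue(ch):
    # str.isidentifier() follows XID_Start/XID_Continue, so prefixing a
    # known XID_Start character tests XID_Continue for `ch` alone.
    return ("a" + ch).isidentifier()


def is_nfc_stable(ch):
    # A code point is kept only if NFC normalization leaves it unchanged.
    return unicodedata.normalize("NFC", ch) == ch


def keep_for_tailoring(ch):
    """Hypothetical filter: keep code points that are XID_Continue and NFC-stable."""
    return is_xid_continue(ch) and is_nfc_stable(ch)
```

For example, a Javanese letter such as U+A997 passes both tests, Javanese punctuation such as U+A9C1 fails the XID test, and U+212B ANGSTROM SIGN fails the NFC test because it normalizes to U+00C5.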

Manish Goregaokar [:manishearth]

Reporter

Comment 21

5 years ago

Attached file: Bug 1648889 - Part 1: Tailor IdentifierStatus to include Javanese, Balinese, N'Ko, and Syllabics

Phabricator Automation

Updated

5 years ago

Assignee: nobody → manishearth

Status: NEW → ASSIGNED

Manish Goregaokar [:manishearth]

Reporter

Comment 22

5 years ago

Attached file: Bug 1648889 - Part 2: Update unicode property data

Manish Goregaokar [:manishearth]

Reporter

Comment 23

5 years ago

FWIW, the ranges that get switched to Allowed are:

07C0..07E7    ; Limited_Use                    # 5.0   [40] NKO DIGIT ZERO..NKO LETTER NYA WOLOSO
07EB..07F5    ; Limited_Use                    # 5.0   [11] NKO COMBINING SHORT HIGH TONE..NKO LOW TONE APOSTROPHE
07FD          ; Limited_Use                    # 11.0       NKO DANTAYALAN
1401..166C    ; Limited_Use                    # 3.0  [620] CANADIAN SYLLABICS E..CANADIAN SYLLABICS CARRIER TTSA
166F..1676    ; Limited_Use                    # 3.0    [8] CANADIAN SYLLABICS QAI..CANADIAN SYLLABICS NNGAA
1677..167F    ; Limited_Use                    # 5.2    [9] CANADIAN SYLLABICS WOODS-CREE THWEE..CANADIAN SYLLABICS BLACKFOOT W
18B0..18F5    ; Limited_Use                    # 5.2   [70] CANADIAN SYLLABICS OY..CANADIAN SYLLABICS CARRIER DENTAL S
1B00..1B4B    ; Limited_Use                    # 5.0   [76] BALINESE SIGN ULU RICEM..BALINESE LETTER ASYURA SASAK
1B50..1B59    ; Limited_Use                    # 5.0   [10] BALINESE DIGIT ZERO..BALINESE DIGIT NINE
1B6B..1B73    ; Limited_Use                    # 5.0    [9] BALINESE MUSICAL SYMBOL COMBINING TEGEH..BALINESE MUSICAL SYMBOL COMBINING GONG
A980..A9C0    ; Limited_Use                    # 5.2   [65] JAVANESE SIGN PANYANGGA..JAVANESE PANGKON
A9D0..A9D9    ; Limited_Use                    # 5.2   [10] JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE

The only ones of these that are a bit questionable are the Balinese musical symbols, and IMO that's because they should be categorized as Limited_Use+Technical.
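
For anyone who wants to check a label against the ranges listed in this comment, here is a small Python sketch. `RANGES_TEXT` copies the ranges from the listing (names omitted); `parse_ranges` and `in_ranges` are hypothetical helpers and are not part of the patch.

```python
# Ranges copied from the listing in this comment (character names omitted).
RANGES_TEXT = """\
07C0..07E7 ; Limited_Use
07EB..07F5 ; Limited_Use
07FD       ; Limited_Use
1401..166C ; Limited_Use
166F..1676 ; Limited_Use
1677..167F ; Limited_Use
18B0..18F5 ; Limited_Use
1B00..1B4B ; Limited_Use
1B50..1B59 ; Limited_Use
1B6B..1B73 ; Limited_Use
A980..A9C0 ; Limited_Use
A9D0..A9D9 ; Limited_Use
"""


def parse_ranges(text):
    """Parse 'XXXX..YYYY ; Type # comment' lines into (start, end) tuples."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        field = line.split(";", 1)[0].strip()
        start, _, end = field.partition("..")
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


def in_ranges(ch, ranges):
    """Return True if the character's code point falls in any listed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


TAILORED_RANGES = parse_ranges(RANGES_TEXT)
```

Every code point of the example label from comment 0 (U+A997 U+A992 U+A9A2 U+A9C0 U+A997 U+A9AE) falls inside `TAILORED_RANGES`, and the ranges cover 929 code points in total, matching the sum of the bracketed counts in the listing.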

Manish Goregaokar [:manishearth]

Reporter

Comment 24

5 years ago

Submitted some feedback on that

Anne (:annevk)

Comment 25

5 years ago

So, back to comment 6, I continue to think we do need some kind of specification around what we are doing here. This also warrants following https://wiki.mozilla.org/ExposureGuidelines to give people a heads up and ensure we're not missing anything. Mike Conca might be able to identify additional prerequisites.

Flags: needinfo?(valentin.gosu)

Flags: needinfo?(mconca)

Flags: needinfo?(annevk)

Henri Sivonen (:hsivonen)

Comment 26

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #19)

>> AFAICT, some characters in Javanese and Balinese could be confusables with each other in e.g. Noto if you see certain characters for the first time without having seen the fonts for both first.

> I believe we also have a mixed script heuristic so we're probably fine unless there's a Balinese domain name that completely clashes with Javanese.

For clarity, I meant the case where a DNS label in its entirety consists of confusables as with your роре.com example. I don't know the languages enough to say if sensible words can be constructed with confusables only. (Again, superficially, I believe this to be less of a problem between Javanese and Balinese than between Tamil and Malayalam or between Latin and Cyrillic.)

Mike Conca [:mconca]

Comment 27

5 years ago

(In reply to Anne (:annevk) from comment #25)

> So, back to comment 6, I continue to think we do need some kind of specification around what we are doing here. This also warrants following https://wiki.mozilla.org/ExposureGuidelines to give people a heads up and ensure we're not missing anything. Mike Conca might be able to identify additional prerequisites.

As it seems most everyone is in favor of this change, we should definitely send out an Intent-to-Ship email immediately (Manish). Are there existing web-platform-tests around this feature? If so, we should make sure we add/extend those tests to cover this. And I agree with Anne, it would be best if we can point to a specification effort related to this. Finally, can we put this change behind a pref? That gives Mozilla an easy way to immediately revert this change in the field should anything unpleasant arise (increasing our risk tolerance for shipping).

Flags: needinfo?(mconca)

Manish Goregaokar [:manishearth]

Reporter

Comment 28

5 years ago

There are no web-platform tests since this cannot be tested and is not a web standard. Chrome has different heuristics here.

Given that Chrome's heuristics are different (and we have reasons for not choosing the exact same heuristics), at best what I can do is create a wiki page with the heuristics we use. There is a specification effort on the Unicode side that will likely move towards this; see https://www.unicode.org/L2/L2020/20172.htm#164-A45.

The current implementation hardcodes this into the existing large byte array. I feel that adding a second array so that it can be pref-switched would add significant binary bloat.

Jonathan Kew [:jfkthame]

Comment 29

5 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #28)

> The current implementation hardcodes this into the existing large byte array, I feel like adding a second array so that it can be pref-switched would be a major impact on binary bloat.

Yes, I'd agree that having a second version of the array just to support a pref here would be undesirable.

If we did see "something unpleasant" and wanted to revert the behavior in the field, presumably we could do that by pushing a change to the network.IDN.extra_blocked_chars pref. (That might not be a solution for users -- if any -- who use a custom setting of that pref, overriding what we ship. But I expect that's rare enough for us to ignore.)

Manish Goregaokar [:manishearth]

Reporter

Comment 30

5 years ago

Any updates? Are folks comfortable with shipping this?

Anne (:annevk)

Comment 31

5 years ago

I think so, modulo an intent to ship and wiki page documenting our behavior in some detail. I guess the other question is if there's someone available to deal with fallout, if any, but maybe pushing a change to the pref is trivial enough?

BugBot [:suhaib / :marco/ :calixte]

Comment 32

3 years ago

The bug assignee didn't login in Bugzilla in the last 7 months.
:dragana, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: manishearth → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(dd.mozilla)

Dragana Damjanovic [:dragana]

Updated

3 years ago

Flags: needinfo?(dd.mozilla)

Randell Jesup [:jesup] (needinfo me)

Comment 33

2 years ago

Bump... this seems to have fallen through the cracks. Shall we move forward?

Flags: needinfo?(manishearth)

Flags: needinfo?(jfkthame)

BugBot [:suhaib / :marco/ :calixte]

Comment 34

2 years ago

Clear a needinfo that is pending on an inactive user.

Inactive users most likely will not respond; if the missing information is essential and cannot be collected another way, the bug maybe should be closed as INCOMPLETE.

For more information, please visit BugBot documentation.

Flags: needinfo?(manishearth)

Bug 1648889 - Part 1: Tailor IdentifierStatus to include Javanese, Balinese, N'Ko, and Syllabics; 5 years ago; Manish Goregaokar [:manishearth]; 47 bytes, text/x-phabricator-request
Bug 1648889 - Part 2: Update unicode property data; 5 years ago; Manish Goregaokar [:manishearth]; 47 bytes, text/x-phabricator-request