Hebrew point characters are not allowed in IDN domains

RESOLVED WONTFIX

Status

Core
Internationalization
RESOLVED WONTFIX
3 years ago
2 years ago

People

(Reporter: Matt Giuca, Unassigned)

Tracking

Trunk
Details

(Reporter)

Description

3 years ago
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2421.0 Safari/537.36

Steps to reproduce:

OS: Linux (Ubuntu 14.04)

1. Navigate to "קוֹם.com" (or "xn--hdb9cza1b.com").

(Note: I have chosen "קוֹם" because it is pronounced "com" in Hebrew and is proposed as a gTLD; see http://icannwiki.com/.%D7%A7%D7%95%D6%B9%D7%9D.)


Actual results:

Address bar shows "xn--hdb9cza1b.com".


Expected results:

Address bar shows "קוֹם.com".

Note that the domain xn--hdb9cza1b.com *is* the real domain being used on the network (the IDN encoded domain), but Firefox generally shows the decoded (Unicode) domain label if it is a valid character sequence in some script.
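The relationship between the two forms can be checked with a short Python sketch. It decodes the `xn--` label from this report with the standard library's `punycode` codec. The helper name `to_display_form` is made up for illustration, and the sketch deliberately skips every safety check a browser applies before showing the Unicode form.

```python
ACE_PREFIX = "xn--"

def to_display_form(domain: str) -> str:
    """Decode each xn-- label of a domain to Unicode (no safety checks at all)."""
    labels = []
    for label in domain.split("."):
        if label.lower().startswith(ACE_PREFIX):
            # Drop the prefix, then Punycode-decode the remainder of the label.
            label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)

decoded = to_display_form("xn--hdb9cza1b.com")
first_label = decoded.split(".")[0]
print([f"U+{ord(ch):04X}" for ch in first_label])
# ['U+05E7', 'U+05D5', 'U+05B9', 'U+05DD']
```

Labels without the `xn--` prefix (such as `com`) pass through unchanged.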

"קוֹם" consists of 3 Hebrew letters and 1 Hebrew combining mark (U+05B9 HEBREW POINT HOLAM), yet Firefox seems to be rejecting the domain label on account of the combining mark. I have tried with other scripts and they seem to work fine with combining marks (e.g. "www.සිකුරාදා.com" in Sinhala). The only characters I've seen erroneously omitted (on both Chromium and Firefox) are the Unicode characters starting with "HEBREW POINT" (U+0591 to U+05C7).
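For anyone who wants to confirm the makeup of the label, a minimal sketch using Python's `unicodedata` module lists each code point with its general category and name. The label is written with escapes so the code point order is unambiguous; the helper name `describe` is just for illustration.

```python
import unicodedata

# The label from this report: QOF, VAV, POINT HOLAM, FINAL MEM.
label = "\u05e7\u05d5\u05b9\u05dd"

def describe(text):
    """Return (code point, general category, character name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]

for row in describe(label):
    print(*row)

# Nonspacing marks have general category "Mn"; letters here are "Lo".
marks = [ch for ch in label if unicodedata.category(ch) == "Mn"]
print(len(marks), "combining mark(s)")
```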

Cross-reporting from Chromium where we have the same bug; see http://crbug.com/496473. I'm not familiar with the Firefox code, but in Chromium, we use ICU and get the permitted characters for a given script by calling ulocdata_getExemplarSet with ULOCDATA_ES_STANDARD. That does *not* return the Hebrew point characters when called with the "he" locale.
(Reporter)

Comment 1

3 years ago
Please ignore the User Agent; I reported a Firefox bug using Chrome.

Updated

3 years ago
Component: Untriaged → Internationalization
OS: Unspecified → All
Product: Firefox → Core
Hardware: Unspecified → All
Version: 38 Branch → Trunk
(In reply to Matt Giuca from comment #0)
> Cross-reporting from Chromium where we have the same bug; see
> http://crbug.com/496473. I'm not familiar with the Firefox code, but in
> Chromium, we use ICU and get the permitted characters for a given script by
> calling ulocdata_getExemplarSet with ULOCDATA_ES_STANDARD. That does *not*
> return the Hebrew point characters when called with the "he" locale.

In Firefox we use the "recommended" characters from http://www.unicode.org/Public/security/latest/xidmodifications.txt (plus other considerations, as outlined in https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm).

Currently the Hebrew point characters are listed as "restricted: limited-use" in xidmodifications.txt (with the odd exception of U+05B4 HEBREW POINT HIRIQ which I talked about in bug 858417 comment 5).
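As a rough illustration of that kind of lookup, here is a Python sketch. It assumes the data file uses semicolon-separated rows of the form `code point or range ; status ; type` with `#` comments. The two sample rows are hypothetical, modeled on the description in this comment rather than copied from the real file, and the function names are made up. This is not Firefox's actual implementation.

```python
# Hypothetical sample rows (NOT copied from xidmodifications.txt).
SAMPLE = """
# code point(s) ; status ; type
05B4 ; allowed ; recommended      # HEBREW POINT HIRIQ
05B9 ; restricted ; limited-use   # HEBREW POINT HOLAM
"""

def parse_rows(text):
    """Yield (first, last, status, type) for every data row in the text."""
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 3:
            continue
        first, _, last = fields[0].partition("..")  # single value or A..B range
        yield int(first, 16), int(last or first, 16), fields[1], fields[2]

def lookup(rows, ch):
    """Return (status, type) for a character, or None if no row covers it."""
    cp = ord(ch)
    for first, last, status, kind in rows:
        if first <= cp <= last:
            return status, kind
    return None

rows = list(parse_rows(SAMPLE))
print(lookup(rows, "\u05b9"))
print(lookup(rows, "\u05b4"))
```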

As you say, they don't appear in the CLDR Exemplar set for Hebrew either, though some of them do appear in the set for Yiddish. (Apparently, for CLDR those are the only languages that use Hebrew script!) I don't understand this, because they are an integral part of the writing system even if they are omitted in many contexts.

All that said, I think we should consider very carefully whether Hebrew point characters in URLs are a good idea, for a number of reasons: most people aren't used to typing them, and they can be very small, just one or two pixels in some cases, which could easily lead to confusion. One could say that it's up to the registrars to bundle domains which differ only in point characters, but in that case who needs the pointed versions anyway?

If I can find a channel to do so, I will recommend to ICANN that it would be much better to add .קום without the Hebrew point, rather than .קוֹם.
I would agree with Simon's analysis. We are doing what the Unicode Consortium recommends to avoid spoofing, and the spoofing potential of Hebrew pointing certainly seems fairly high to me.

"If I can find a channel to do so, I will recommend to ICANN that it would be much better to add .קום without the Hebrew point"

The trouble with that is that applicants apply for a particular string. I don't know if they are allowed to change the string mid-application. I suspect this applicant may be about to get a nasty shock...

Gerv
We don't plan to change to allow Hebrew pointing, unless the Unicode Consortium changes xidmodifications.txt.

Gerv
Status: UNCONFIRMED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 5

2 years ago
Thanks for the investigation.

Also: it looks like the wiki I linked to in the report is either out of date, or wrong. According to the IANA database, VeriSign owns .xn--9dbq2a [1] (which is .קום, without the point character), but .xn--hdb9cza1b (which is .קוֹם, with the point character) is not registered. I'm still a bit confused, because a Google search for "xn--hdb9cza1b" turns up a bunch of results, so it seems VeriSign applied for this string at some point but perhaps managed to change it.

So it looks like we don't have a problem displaying this particular gTLD.

[1] http://www.iana.org/domains/root/db/xn--9dbq2a.html
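The two encoded forms mentioned above can be reproduced with a small Python sketch: remove the nonspacing mark from the pointed label, then Punycode-encode both versions. The helper names `strip_marks` and `to_ace` are made up for illustration, and `to_ace` is only meant for labels that contain non-ASCII characters.

```python
import unicodedata

def strip_marks(label: str) -> str:
    """Drop nonspacing marks (general category Mn), e.g. Hebrew points."""
    return "".join(ch for ch in label if unicodedata.category(ch) != "Mn")

def to_ace(label: str) -> str:
    """Punycode-encode a non-ASCII label and add the xn-- prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")

pointed = "\u05e7\u05d5\u05b9\u05dd"  # with U+05B9 HEBREW POINT HOLAM
unpointed = strip_marks(pointed)      # same letters, point removed

print(to_ace(pointed))    # xn--hdb9cza1b
print(to_ace(unpointed))  # xn--9dbq2a
```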
(In reply to Matt Giuca from comment #5)
> Thanks for the investigation.
> 
> Also: it looks like the wiki I linked to in the report is either out of
> date, or wrong. 

Yes. I emailed Verisign and they confirmed that they were able to change the requested string with ICANN to the version without the holem (or HOLAM, if you prefer).

So our conclusion stands.

Gerv