Closed Bug 854041 Opened 11 years ago Closed 11 years ago

IDN: Summary of broken domains due to new algorithm

Categories

(Core :: Networking, defect)

x86
macOS
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: mwobensmith, Unassigned)

References

Details

Attachments

(9 files)

Using the Verisign IDN list from here:

https://bugzilla.mozilla.org/show_bug.cgi?id=840036#c24

I created some JS that parses it, canonicalizes each domain (via the a.href property) and then - using a Punycode JS library - converts all rejected IDNs to Unicode for clearer inspection.

Of almost 900,000 domains, we're rejecting around 20,000. See attachment for this list.

Some are obvious; others may not be. We should review this list for breakage where we might be in error.
This is awesome work. 20,000 is a long list to review; is there any way we can write code to classify them according to the reason for rejection? I suspect that might require basically reimplementing our current algorithm in JS, with associated need for big blocks of Unicode data...

I will scan down it looking for things which stick out, and I hope Simon will be able to as well.

Gerv
I created this and the following attachments with a customized Firefox build that writes to stdout each label and the reason it gets displayed as punycode. Caveat: this only displays the first non-valid character; there may be others later in the label that aren't flagged.

The most common issue (10766 labels) is mixed Simplified and Traditional Chinese characters. I'm not competent to assess these.
9874 labels have characters that aren't "recommended" in xidmodifications.txt. About half of these are U+30FB KATAKANA MIDDLE DOT. Some of the others are obvious attempts at spoofing, e.g. bloɡspot.com with U+0261 LATIN SMALL LETTER SCRIPT G, and then there's all sorts of freaky stuff like ʎʞɐǝɹɟ.com and other things that I don't know what to make of.
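A minimal sketch of this "not recommended" check, assuming a hand-picked sample of restricted code points (a real implementation would parse the full xidmodifications.txt data file; the two entries below are just the examples mentioned above):

```javascript
// Flag labels containing characters from a small illustrative sample of
// code points that xidmodifications.txt marks as restricted.
const RESTRICTED_SAMPLE = new Set([
  0x0261, // LATIN SMALL LETTER SCRIPT G (spoofs 'g')
  0x30FB, // KATAKANA MIDDLE DOT
]);

// Return the first restricted character in the label, or null.
function firstRestricted(label) {
  for (const ch of label) {
    if (RESTRICTED_SAMPLE.has(ch.codePointAt(0))) return ch;
  }
  return null;
}

console.log(firstRestricted('blo\u0261spot')); // ɡ
console.log(firstRestricted('blogspot'));      // null
```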
Attached file IDN with mixed scripts
The other categories are much smaller than the first two. 154 labels have mixed scripts. These mostly look like attempts at spoofing (lots of variants of amazon and google etc.) or evading filters. A few look as if they might be (or have been) serious sites, like http://πr2.com, which doesn't actually exist.
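A rough mixed-script detector along these lines can be written with Unicode script property escapes (available in modern JS engines). Only a few scripts are listed here for illustration; Firefox's real check covers all scripts and has allowances for legitimate combinations such as Han with Hiragana/Katakana:

```javascript
// Report which scripts (from a small sample list) appear in a label.
// More than one usually means a mixed-script label like the πr2 example.
const SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Thai', 'Arabic'];

function scriptsIn(label) {
  return SCRIPTS.filter(s =>
    new RegExp(`\\p{Script=${s}}`, 'u').test(label));
}

console.log(scriptsIn('\u03C0r2'));  // [ 'Latin', 'Greek' ] (πr2)
console.log(scriptsIn('google'));    // [ 'Latin' ]
```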
29 labels display as punycode because of characters that weren't defined in the version of Unicode that IDNA2003 used. This is bug 853226.
100 labels don't conform to the bidi restrictions in RFC 3454. Bug 733350 may be confusing the results here.
28 labels have sequences of repeated non-spacing marks. They all seem to be in Thai, which is a little worrying. I don't know if sequences of repeated non-spacing marks are normal and correct in Thai, but if they are, this criterion probably needs modification.
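The repeated-non-spacing-mark check itself is essentially a one-liner: \p{Mn} matches any non-spacing mark, and a backreference catches an immediate repeat. A sketch (the Thai example uses U+0E39, which is a non-spacing vowel mark):

```javascript
// True if the label contains the same non-spacing mark twice in a row.
const hasRepeatedMark = label => /(\p{Mn})\1/u.test(label);

console.log(hasRepeatedMark('\u0E23\u0E39\u0E39')); // true (SARA UU twice)
console.log(hasRepeatedMark('\u0E23\u0E39'));       // false
```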
And finally, just one label has characters from different numbering systems: in this case http://٩۹.com, with U+0669 ARABIC-INDIC DIGIT NINE and U+06F9 EXTENDED ARABIC-INDIC DIGIT NINE.
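The mixed-numbering-system check can be sketched by mapping each digit to the zero of its block; two different zero points in one label means mixed systems. The block list below is a small sample (a real check would cover every \p{Nd} range):

```javascript
// Zero code points of a few decimal-digit blocks; each block's digits
// occupy zero..zero+9.
const DIGIT_ZEROS = [
  0x0030, // ASCII
  0x0660, // ARABIC-INDIC
  0x06F0, // EXTENDED ARABIC-INDIC
  0x0966, // DEVANAGARI
];

// Count how many distinct numbering systems appear in the label.
function numberingSystems(label) {
  const systems = new Set();
  for (const ch of label) {
    const cp = ch.codePointAt(0);
    for (const zero of DIGIT_ZEROS) {
      if (cp >= zero && cp <= zero + 9) systems.add(zero);
    }
  }
  return systems.size;
}

console.log(numberingSystems('\u0669\u06F9') > 1); // true: ٩ and ۹ mix
console.log(numberingSystems('99') > 1);           // false
```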
CCing 2 members of the Thai l10n team. Patipat and Pittaya: can you please look at the domain names in attachment 728791 [details] and tell us if they are valid Thai words or constructions, or rubbish, or spoofing attempts, or something else?

I guess we need to fix bug 733350 before we can assess the Bidi rules category properly.

The Unicode-unassigned lot will be fixed when we upgrade to IDNA2008 (bug 479520).

Simon: how does our unconditional character blacklist interact with these categories? I'd expect KATAKANA MIDDLE DOT to be on our blacklist because it might be used to spoof ".". But maybe it's not. We should see what other browsers do with this character.

I wonder if it's worth having this debugging code always present and pref-controlled?

Gerv
The Simplified/Traditional mixing ban is optional in TR39:
http://www.unicode.org/reports/tr39/

I will email Mark Davis to ask him how much of a risk he thinks it is to allow this.

Gerv
This test helps us to understand where we might be failing to display as Unicode when we should be doing so. And Matt's other testing - by generating domain names which should fail - has helped us test for bugs where we are displaying as Unicode when we shouldn't. 

So, looking at these results, it seems to me that this code is working pretty well. There are a couple of bugs that it would be good to fix, but they only affect a small number of domain names. And given that (with the exception of TLDs removed from the whitelist at the same time - i.e. .com, .net and .name) we are only expanding the set of sites displayed as Unicode, we can't break existing sites.

Given that, I think that we should uplift this code (and any patches for the outstanding issues) to Aurora. dveditz: what do you think?

Gerv
dveditz: not sure if you are still around - and Aurora uplifts to beta on April 1st :-( Have we missed the boat here?

I emailed Mark Davis. On Simplified vs. Traditional, he said:

"The test can only be applied if the characters are meant to be chinese.​ So "写真だけの結婚式​" is Japanese, and shouldn't be tested."

I asked if there was a programmatic way of telling, with only access to the label. He said:

"Not unless the domain is restricted to only allow Chinese names. I think, for example, that .CN doesn't allow arbitrary CJK characters, just the ones in Chinese."

He also said:

"B. The test for S vs T needs to be not whether the character has a T or S variant​, but whether the character is an S or T variant. In any event, we need to be much clearer in that section exactly how to use Unihan."

-- So I think we should remove the S vs T test for now, until it's more clear exactly what, if anything, we can do.

On KATAKANA MIDDLE DOT, he said:

"This appears to be a production problem. The list marked as:

xxxx         ; allowed ; inclusion

should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'. Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, they do not reflect them. The missing characters are:

U+0027 ( ' ) APOSTROPHE
U+003A ( : ) COLON

U+058A ( ֊ ) ARMENIAN HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2027 ( ‧ ) HYPHENATION POINT
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others allowed. (They are subject to confusability tests, also, but that's a different story.)

So I'll put in a document to the UTC for correcting the characters. My recommendation would be to allow at least the KATAKANA MIDDLE DOT​ in your code in the meantime, documenting the situation.​"
-- Of the characters Mark lists, only one other appears among the 9874 labels, and that is U+2010 ( ‐ ) HYPHEN, which has one entry, because someone has registered "‐.com". (I suspect some of the remaining ones in Mark's list are on our character blacklist anyway...) And we certainly shouldn't allow U+2010, because it spoofs a protocol character. So I suggest we permit KATAKANA MIDDLE DOT for the moment.

Gerv
Depends on: 857481
Depends on: 857490
(In reply to Gervase Markham [:gerv] from comment #12)
> -- So I think we should remove the S vs T test for now, until it's more
> clear exactly what, if anything, we can do.

Bug 857481

> So I suggest we permit KATAKANA MIDDLE DOT for the moment.

Bug 857490
Depends on: 858417
We're down to ~5k rejections now. Obviously we've eliminated the mixed S/T Chinese plus the Katakana middle dot issues. See output.
CCing three more members of the Thai l10n team. Can you please look at the domain names in attachment 728791 [details] and tell us if they are valid Thai words or constructions, or rubbish, or spoofing attempts, or something else?

Thanks,

Gerv
Comment on attachment 728791 [details]
IDNs with repeated non-spacing marks

ुद्ो््ब (U+094D ्) is not a Thai word. Are there any other details you want me to provide?
Hi, Gervase. All the domains in the list at https://bug854041.bugzilla.mozilla.org/attachment.cgi?id=728791 contain incorrect character placement: in every case the "First incompatible character" is a duplicate of the previous one. For example, the first entry:

การฟื้นฟูแบตเตอรี่่ : there are two U+0E48 characters at the end.

And the second:

ฝากรููป : there are also two U+0E39 characters at the end.

The duplicated character is always a vowel (an upper or lower vowel that a user can mistakenly type twice). Also, none of the characters in "ुद्ो््ब" are Thai, as Wichai said.
Siriwat: thank you, that is extremely helpful. So we will consider that none of the domains listed as blocked in attachment 728791 [details] are errors.

It seems to me like bug 733350, and upgrading to IDNA2008 (bug 479520) are the only outstanding issues, and neither is a blocker.

We also need a bug to track any changes that the Unicode Consortium makes to the algorithm or input data prompted by our feedback, e.g. on KATAKANA MIDDLE DOT.

Gerv
I am facing a problem with an IDN domain in Firefox:

วิธีลดความอ้วนเร่งด่วน.com (xn--42cg1bcqic4dtpuebe8fvd2at2t0fco.com)

Please suggest what I need to do to be able to use this domain.

Atsadawat
Atsadawat, you are seeing bug 892370, which was just fixed today. Until the fix appears in a release you can use the workaround described there, i.e. set network.IDN_show_punycode to true in about:config.
We are now using the new algorithm. smontagu: does this bug serve any further purpose?

Gerv
It was always kind of a research/meta bug, so there's nothing directly to fix here anyway. OTOH, do we want to close it while it still has open dependencies?
I think that dependency is bogus.

Gerv
Status: NEW → RESOLVED
Closed: 11 years ago
No longer depends on: 858417
Resolution: --- → FIXED