Closed Bug 854041 Opened 11 years ago Closed 11 years ago

IDN: Summary of broken domains due to new algorithm

Categories

(Core :: Networking, defect)

x86
macOS
defect
Not set
normal

RESOLVED FIXED

People

(Reporter: mwobensmith, Unassigned)

References

Details

Attachments

(9 files)

Using the Verisign IDN list from here:

https://bugzilla.mozilla.org/show_bug.cgi?id=840036#c24

I created some JS that parses it, canonicalizes each domain (via the a.href property) and then - using a Punycode JS library - converts all rejected IDNs to Unicode for clearer inspection.

Of almost 900,000 domains, we're rejecting around 20,000. See attachment for this list.

Some are obvious; others may not be. We should review this list for breakage where we might be in error.
This is awesome work. 20,000 is a long list to review; is there any way we can write code to classify them according to the reason for rejection? I suspect that might require basically reimplementing our current algorithm in JS, with associated need for big blocks of Unicode data...

I will scan down it looking for things which stick out, and I hope Simon will be able to as well.

Gerv
I created this and the following attachments with a customized Firefox build that writes to stdout each label and the reason it gets displayed as punycode. Caveat: this only displays the first non-valid character; there may be others later in the label that aren't flagged.

The most common issue (10766 labels) is mixed Simplified and Traditional Chinese characters. I'm not competent to assess these.
9874 labels have characters that aren't "recommended" in xidmodifications.txt. About half of these are U+30FB KATAKANA MIDDLE DOT. Some of the others are obvious attempts at spoofing, e.g. bloɡspot.com with U+0261 LATIN SMALL LETTER SCRIPT G, and then there's all sorts of freaky stuff like ʎʞɐǝɹɟ.com and other things that I don't know what to make of.
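A minimal sketch of this "not recommended" check, assuming a hand-picked sample of restricted code points (a real implementation would parse the full xidmodifications.txt data file; the two entries below are just the examples mentioned above):

```javascript
// Flag labels containing characters from a small illustrative sample of
// code points that xidmodifications.txt marks as restricted.
const RESTRICTED_SAMPLE = new Set([
  0x0261, // LATIN SMALL LETTER SCRIPT G (spoofs 'g')
  0x30FB, // KATAKANA MIDDLE DOT
]);

// Return the first restricted character in the label, or null.
function firstRestricted(label) {
  for (const ch of label) {
    if (RESTRICTED_SAMPLE.has(ch.codePointAt(0))) return ch;
  }
  return null;
}

console.log(firstRestricted('blo\u0261spot')); // ɡ
console.log(firstRestricted('blogspot'));      // null
```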
Attached file IDN with mixed scripts
The other categories are much smaller than the first two. 154 labels have mixed scripts. These mostly look like attempts at spoofing (lots of variants of amazon and google etc.) or evading filters. A few look as if they might be (or have been) serious sites, like http://πr2.com, which doesn't actually exist.
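A rough mixed-script detector along these lines can be written with Unicode script property escapes (available in modern JS engines). Only a few scripts are listed here for illustration; Firefox's real check covers all scripts and has allowances for legitimate combinations such as Han with Hiragana/Katakana:

```javascript
// Report which scripts (from a small sample list) appear in a label.
// More than one usually means a mixed-script label like the πr2 example.
const SCRIPTS = ['Latin', 'Greek', 'Cyrillic', 'Han', 'Thai', 'Arabic'];

function scriptsIn(label) {
  return SCRIPTS.filter(s =>
    new RegExp(`\\p{Script=${s}}`, 'u').test(label));
}

console.log(scriptsIn('\u03C0r2'));  // [ 'Latin', 'Greek' ] (πr2)
console.log(scriptsIn('google'));    // [ 'Latin' ]
```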
29 labels display as punycode because of characters that weren't defined in the version of Unicode that IDNA2003 used. This is bug 853226.
100 labels don't conform to the bidi restrictions in RFC 3454. Bug 733350 may be confusing the results here.
28 labels have sequences of repeated non-spacing marks. They all seem to be in Thai, which is a little worrying. I don't know if sequences of repeated non-spacing marks are normal and correct in Thai, but if they are, this criterion probably needs modification.
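The repeated-non-spacing-mark check itself is essentially a one-liner: \p{Mn} matches any non-spacing mark, and a backreference catches an immediate repeat. A sketch (the Thai example uses U+0E39, which is a non-spacing vowel mark):

```javascript
// True if the label contains the same non-spacing mark twice in a row.
const hasRepeatedMark = label => /(\p{Mn})\1/u.test(label);

console.log(hasRepeatedMark('\u0E23\u0E39\u0E39')); // true (SARA UU twice)
console.log(hasRepeatedMark('\u0E23\u0E39'));       // false
```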
And finally, just one label has characters from different numbering systems: in this case http://٩۹.com, with U+0669 ARABIC-INDIC DIGIT NINE and U+06F9 EXTENDED ARABIC-INDIC DIGIT NINE.
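The mixed-numbering-system check can be sketched by mapping each digit to the zero of its block; two different zero points in one label means mixed systems. The block list below is a small sample (a real check would cover every \p{Nd} range):

```javascript
// Zero code points of a few decimal-digit blocks; each block's digits
// occupy zero..zero+9.
const DIGIT_ZEROS = [
  0x0030, // ASCII
  0x0660, // ARABIC-INDIC
  0x06F0, // EXTENDED ARABIC-INDIC
  0x0966, // DEVANAGARI
];

// Count how many distinct numbering systems appear in the label.
function numberingSystems(label) {
  const systems = new Set();
  for (const ch of label) {
    const cp = ch.codePointAt(0);
    for (const zero of DIGIT_ZEROS) {
      if (cp >= zero && cp <= zero + 9) systems.add(zero);
    }
  }
  return systems.size;
}

console.log(numberingSystems('\u0669\u06F9') > 1); // true: ٩ and ۹ mix
console.log(numberingSystems('99') > 1);           // false
```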
CCing 2 members of the Thai l10n team. Patipat and Pittaya: can you please look at the domain names in attachment 728791 [details] and tell us if they are valid Thai words or constructions, or rubbish, or spoofing attempts, or something else?

I guess we need to fix bug 733350 before we can assess the Bidi rules category properly.

The Unicode-unassigned lot will be fixed when we upgrade to IDNA2008 (bug 479520).

Simon: how does our unconditional character blacklist interact with these categories? I'd expect KATAKANA MIDDLE DOT to be on our blacklist because it might be used to spoof ".". But maybe it's not. We should see what other browsers do with this character.

I wonder if it's worth having this debugging code always present and pref-controlled?

Gerv
The Simplified/Traditional mixing ban is optional in TR39:
http://www.unicode.org/reports/tr39/

I will email Mark Davis to ask him how much of a risk he thinks it is to allow this.

Gerv
This test helps us to understand where we might be failing to display as Unicode when we should be doing so. And Matt's other testing - by generating domain names which should fail - has helped us test for bugs where we are displaying as Unicode when we shouldn't. 

So, looking at these results, it seems to me that this code is working pretty well. There are a couple of bugs that it would be good to fix, but they only affect a small number of domain names. And given that (with the exception of TLDs removed from the whitelist at the same time - i.e. .com, .net and .name) we are only expanding the set of sites displayed as Unicode, we can't break existing sites.

Given that, I think that we should uplift this code (and any patches for the outstanding issues) to Aurora. dveditz: what do you think?

Gerv
dveditz: not sure if you are still around - and Aurora uplifts to beta on April 1st :-( Have we missed the boat here?

I emailed Mark Davis. On Simplified vs. Traditional, he said:

"The test can only be applied if the characters are meant to be chinese.​ So "写真だけの結婚式​" is Japanese, and shouldn't be tested."

I asked if there was a programmatic way of telling, with only access to the label. He said:

"Not unless the domain is restricted to only allow Chinese names. I think, for example, that .CN doesn't allow arbitrary CJK characters, just the ones in Chinese."

He also said:

"B. The test for S vs T needs to be not whether the character has a T or S variant​, but whether the character is an S or T variant. In any event, we need to be much clearer in that section exactly how to use Unihan."

-- So I think we should remove the S vs T test for now, until it's more clear exactly what, if anything, we can do.

On KATAKANA MIDDLE DOT, he said:

"This appears to be a production problem. The list marked as:

xxxx         ; allowed ; inclusion

should be those characters that are in http://www.unicode.org/reports/tr39/#Identifier_Modification_Key under 'inclusion'. Those characters should match the characters in http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers. However, they do not reflect them. The missing characters are:

U+0027 ( ' ) APOSTROPHE
U+003A ( : ) COLON

U+058A ( ֊ ) ARMENIAN HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2027 ( ‧ ) HYPHENATION POINT
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT

Of course, for the purpose of IDNA, the ASCII characters are determined by the base spec, but the others allowed. (They are subject to confusability tests, also, but that's a different story.)

So I'll put in a document to the UTC for correcting the characters. My recommendation would be to allow at least the KATAKANA MIDDLE DOT​ in your code in the meantime, documenting the situation.​"
-- Of the characters Mark lists, only one other appears among the 9874 labels, and that is U+2010 ( ‐ ) HYPHEN, which has one entry, because someone has registered "‐.com". (I suspect some of the remaining ones in Mark's list are on our character blacklist anyway...) And we certainly shouldn't allow U+2010, because it spoofs a protocol character. So I suggest we permit KATAKANA MIDDLE DOT for the moment.

Gerv
Depends on: 857481
Depends on: 857490
(In reply to Gervase Markham [:gerv] from comment #12)
> -- So I think we should remove the S vs T test for now, until it's more
> clear exactly what, if anything, we can do.

Bug 857481

> So I suggest we permit KATAKANA MIDDLE DOT for the moment.

Bug 857490
Depends on: 858417
We're down to ~5k rejections now. Obviously we've eliminated the mixed S/T Chinese plus the Katakana middle dot issues. See output.
CCing three more members of the Thai l10n team. Can you please look at the domain names in attachment 728791 [details] and tell us if they are valid Thai words or constructions, or rubbish, or spoofing attempts, or something else?

Thanks,

Gerv
Comment on attachment 728791 [details]
IDNs with repeated non-spacing marks

ुद्ो््ब (U+094D ्) is not a Thai word. Are there any other details you want me to provide?
Hi, Gervase. All the domains in the list at https://bug854041.bugzilla.mozilla.org/attachment.cgi?id=728791 contain incorrect character placement: in every case the "First incompatible character" is a duplicate of the previous one. For example, the first entry:

การฟื้นฟูแบตเตอรี่่ : there are two U+0E48 characters at the end.

And the second:

ฝากรููป : there are also two U+0E39 characters at the end.

The duplicated character is always a vowel (an upper or lower vowel that a user can mistakenly type twice). Also, none of the characters in "ुद्ो््ब" are Thai, as Wichai said.
Siriwat: thank you, that is extremely helpful. So we will consider that none of the domains listed as blocked in attachment 728791 [details] are errors.

It seems to me like bug 733350, and upgrading to IDNA2008 (bug 479520) are the only outstanding issues, and neither is a blocker.

We also need a bug to track any changes that the Unicode Consortium makes to the algorithm or input data prompted by our feedback, e.g. on KATAKANA MIDDLE DOT.

Gerv
I am facing a problem with an IDN domain in Firefox:

วิธีลดความอ้วนเร่งด่วน.com (xn--42cg1bcqic4dtpuebe8fvd2at2t0fco.com)

Please suggest what I need to do to be able to use this domain.

Atsadawat
Atsadawat, you are seeing bug 892370, which was just fixed today. Until the fix appears in a release you can use the workaround described there, i.e. set network.IDN_show_punycode to true in about:config.
We are now using the new algorithm. smontagu: does this bug serve any further purpose?

Gerv
It was always kind of a research/meta bug, so there's nothing directly to fix here anyway. OTOH, do we want to close it while it still has open dependencies?
I think that dependency is bogus.

Gerv
Status: NEW → RESOLVED
Closed: 11 years ago
No longer depends on: 858417
Resolution: --- → FIXED