Implement new IDN Unicode display algorithm

RESOLVED FIXED in mozilla22

Status

()

RESOLVED FIXED
7 years ago
2 years ago

People

(Reporter: gerv, Assigned: smontagu)

Tracking

({qawanted})

unspecified
mozilla22
qawanted
Points:
---
Dependency tree / graph
Bug Flags:
sec-review +
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sec-assigned:dveditz], URL)

Attachments

(5 attachments, 5 obsolete attachments)

This bug tracks the implementation of the new IDN Unicode display algorithm (which decides when we show Unicode and when we show Punycode).

https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm

I have attempted to specify what is to be done in sufficient detail on that wiki page, but whoever implements this should feel free to seek clarification.

Given the impending launch of a number of additional top-level domains, and the increased use of IDN domain names on the web, we should implement this with all speed.

Gerv
smontagu: are you able to put this bug into your or your group's planning processes?

Gerv
Jet: can you and Simon work on prioritizing this? I'd argue that extending the conditions under which we display IDN domains as Unicode is important for our acceptance in non-Latin locales.

Gerv
smontagu will be in the USA from 3/10 to 4/1. We'll add this to the IDN work list while he's here.
(Assignee)

Comment 4

7 years ago
In https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm, it's not clear to me whether displaying Unicode or Punycode is an all or nothing thing for all the labels in a given domain name (as it is currently with the whitelist), or whether its going to be possible for a single domain name to be displayed partly in Unicode and partly in Punycode according to whether each label passes or fails the checks in the algorithm.
Hi Simon. The answer is that the algorithm is applied on a per-label basis. Labels which pass are displayed as Unicode, labels which do not pass are displayed as Punycode.

Does that present problems, or will that work fine?

Feel free to update the document in the place you think this information is best clarified.

Gerv
(Assignee)

Comment 6

7 years ago
(In reply to Gervase Markham [:gerv] from comment #5)
> Hi Simon. The answer is that the algorithm is applied on a per-label basis.
> Labels which pass are displayed as Unicode, labels which do not pass are
> displayed as Punycode.
> 
> Does that present problems, or will that work fine?

I don't think it's a problem in principle, but it makes implementation a bit trickier, because the current code assumes in a number of places that domain names are either non-ASCII or Punycode, but never both.
Component: Internationalization → ImageLib
QA Contact: i18n → imagelib
(Assignee)

Updated

7 years ago
Component: ImageLib → Internationalization
QA Contact: imagelib → i18n
(Assignee)

Updated

7 years ago
Assignee: nobody → smontagu
(Assignee)

Comment 7

7 years ago
>Display as Punycode labels which use more than one numbering system (we would need >a list of numbering systems in Unicode) 

I'm not sure what this means. What is the definition of "numbering system" for the purposes of this restriction? There are Unicode classes for "Decimal_Number", "Letter_Number" and "Other_Number", and some Unicode characters are used as both letters and as part of numbering systems (think Roman numerals).

However, almost all Unicode characters in the Number classes are defined as belonging to particular scripts, except for ASCII digits which are Common, so I'm not sure what cases this is supposed to cover which are not already covered anyway.
(Assignee)

Comment 8

7 years ago
After writing the above I saw the text in http://www.unicode.org/reports/tr39/ which gives examples:

>Check for mixing numbers from different systems, such as U+0660 ( ٠ ) ARABIC-INDIC
> DIGIT ZERO with U+06F0 ( ۰ ) EXTENDED ARABIC-INDIC DIGIT ZERO, or U+09EA ( ৪ )
> BENGALI DIGIT FOUR with U+0038 ( 8 ) DIGIT EIGHT.

I'm still not happy with this formulation. These could be used for phishing without any mixing of number systems, e.g. http://super৪.com.

That particular example could be excluded by changing the last in the list of permitted combinations to "Common + Inherited + Latin + any single other script except Cyrillic, Greek, or Cherokee, excluding characters with the Decimal_Number property", but that would still leave open the reverse case (i.e. substituting ASCII EIGHT for BENGALI FOUR), and we can't just forbid ASCII digits with a non-Latin script since there are some scripts that don't have their own digits and use only ASCII digits.
smontagu: the "any single script + Latin" thing can be used for phishing in the way you describe if any character in that script looks like a Latin character - whether it's a number or not. We exclude Russian, Greek and Cherokee because they are the ones with the most such characters, leading to a combinatorial explosion of phishing possibilities, but the door is not slammed shut by any means. We are resigned to the fact that any heuristic of this sort will have edge cases, and we rely on the registries to help us deal with them.

I hope that any registry which allows BENGALI FOUR will have rules about making sure it can't be used to phish DIGIT EIGHT...

Your proposed fix basically disallows anyone's digits and Latin together, which seems like a too-strong restriction to me.

Gerv
(Assignee)

Comment 10

7 years ago
So is there a particular reason why you want to allow the "any single script + Latin" combination; i.e. to use UTR 36's "Moderately Restrictive" profile instead of "Highly Restrictive"? In terms of the whole domain name, "Highly Restrictive" would certainly be too restrictive, but if we're thinking in terms of individual labels, it seems to be not unreasonable.
Assignee: smontagu → ashuk
Component: Internationalization → Java APIs for DOM
QA Contact: i18n → dom-apis
(Assignee)

Updated

7 years ago
Assignee: ashuk → smontagu
Component: Java APIs for DOM → Internationalization
QA Contact: dom-apis → i18n
Simon: my understanding is that there are plenty of real-world uses where people combine their local language with ASCII (or even Latin) - in company names and brand names. This is particularly true because historically, the DNS has been effectively LDH-only. 

One example I seem to remember being given was something like http://lg-한국.com/ (lg-korea.com) - LG is "LG" everywhere, even in places where Latin is not used. Although that domain name doesn't resolve.

If we went to Highly Restrictive, people would still be able to combine their local language with Arabic numerals (0-9), right?

Gerv
(Assignee)

Comment 12

7 years ago
(In reply to Gervase Markham [:gerv] from comment #11)
> If we went to Highly Restrictive, people would still be able to combine
> their local language with Arabic numerals (0-9), right?

Yes. All of the ASCII range except for A-Z and a-z are defined as Common script.

> One example I seem to remember being given was something like
> http://lg-한국.com/ (lg-korea.com) - LG is "LG" everywhere, even in places
> where Latin is not used. Although that domain name doesn't resolve.

As it happens, this particular example would be fine with Highly Restrictive anyway, since it's Latin + Hangul which is a permitted combination. Conversely, some other variants on the same theme are forbidden anyway for other reasons, like http://lg-Ελλάς.com/ (Latin + Greek) or http://lg-مصر.com (mixed direction label). The question is whether in the remaining cases where it's not already forbidden or already permitted, such as http://lg-ไทย.com (Thai), people can live with separating the Latin script and local script into separate labels.
How much of what you are doing depends on this decision? Is it just a few lines of code, or does it affect the architecture?

I could certainly go through the pros and cons of it with the IDN special interest group, but I don't want to hold you up.

Gerv
(Assignee)

Comment 14

7 years ago
It's just a few lines of code. It also wouldn't be particularly hard to include both code paths and enable Highly Restrictive or Moderately Restrictive conditionally (per domain, or by pref, or something -- I'm not saying that is necessarily something we want to do, but it's not a problem if we do).
Simon: if it seems easier, make it a (global) pref. We can then have more time to work outwhat the correct default value should be. (I'm not seeing much of a risk that there will be a mass revolt of websites telling people to change it.)

Gerv
(Assignee)

Comment 16

7 years ago
Posted patch w-i-p patch (obsolete) — Splinter Review
There are tryserver builds with this patch at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smontagu@mozilla.com-b7a042eb454e/ if anybody wants to experiment. There are a whole lot of unexpected fails and passes in the tryserver results, some of which may be because the expected results of the tests need to be updated to new behaviour -- I haven't had time yet to analyze the results.

The patch implements the script restrictions from https://wiki.mozilla.org/IDN_Display_Algorithm#Algorithm. By default it uses the Highly Restrictive profile, but can be changed to use Moderately Restrictive by setting network.IDN.highly_restrictive_profile to false in about:config.

It also implements "Display as Punycode labels which have sequences of the same nonspacing mark".

It doesn't implement "Display as Punycode labels which use more than one numbering system" for reasons given above: I'm not sure that it adds much, and it's not currently sufficiently clearly defined to implement.

It also doesn't implement "Display as Punycode labels which contain both simplified-only and traditional-only Chinese characters". This will require adding a new API to nsUnicodeProperties or somewhere, but it should be straightforward enough to do that.
A few thoughts:

- I think the preference should be "network.IDN.restriction_profile", and the value "high".

> MOZ_SCRIPT_LATIN_JPAN_COMBO 

Any reason this lost the A?

A suggestion: would the legalScriptCombo() function be better implemented as a table-driven FSM, returning the new state, or 0 if the state would be disallowed?

RelevantScripts = [MOZ_SCRIPT_LATIN, MOZ_SCRIPT_HAN, ..., MOZ_SCRIPT_LATIN_CHINESE_COMBO, ...]
lastScriptIdx = findIdx(lastScript)
thisScriptIdx = findIdx(thisScript)

/* thisScript: LATIN  HAN  ... */
table = [
/* lastScript */
/* LATIN */   [LATIN, HAN, ... ],
/* HAN */     [HAN,   HAN, ... ],
...
];

if (lastScriptIdx && thisScriptIdx) {
  return table[thisScriptIdx][lastScriptIdx];
}

That kind of thing? Just a thought...

Gerv
Simon: it appears that there is an updated copy of UTR 39, which now contains the detection algorithm we are using (section 5). Mark Davis asks if it answers our questions, as outlined on the wiki page. I note that it seems to increase the complexity of the algorithm, as a character can now have multiple scripts.
http://www.unicode.org/reports/tr39/tr39-5.html

I would guess that the data which tells us about the multiple-scripts of a character is something new, and not in our platform. Do we plan to ship that? How much bloat would that be? This seems like it would require an algorithm rewrite - is that correct?

Gerv
(Assignee)

Comment 19

7 years ago
The multiple-scripts data is in http://unicode.org/Public/6.1.0/ucd/ScriptExtensions.txt, and is indeed new in Unicode 6.1. The bloat isn't huge: there are only 372 codepoints defined in the file, with between 2 and 6 scripts each. I suggest we open a separate bug to add this data to UnicodePropertyData.cpp (it may well be useful in the font selection code that already uses script codes too) and add support for it in IDN display as a follow-up.

www.unicode.org/reports/tr39/tr39-5.html#Restriction_Level_Detection has another very significant change: it introduces a distinction between "Recommended", "Aspirational" and "Limited" scripts (defined in http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts and following), and while characters from Limited scripts are not permitted in the Moderately Restrictive profile, characters from Aspirational scripts are permitted. This distinction seems rather arbitrary to me.
Simon: your first para sounds like a plan.

How hard would it be to implement the stuff in your second para? I think there is value in sticking to their algorithm, for consistency.

Gerv
(Assignee)

Updated

7 years ago
Depends on: 738101
(Assignee)

Comment 21

7 years ago
Most of the data we need for the second para is in http://www.unicode.org/Public/security/beta/xidmodifications.txt, and that should be a whole lot easier to parse than ScriptExtensions.txt

There doesn't seem to be any data file that includes the list of Aspirational scripts, so I guess we will have to hard-code that for now (it's only five scripts currently).

Opened bug 737931 on Script_Extensions and bug 738101 on Identitifier Modifications.
(Assignee)

Comment 22

7 years ago
Posted patch Patch part 1 (obsolete) — Splinter Review
This includes:

A pref network.IDN.restriction_profile that can be set to "high" or "moderate", and also to "ASCII" to turn off this feature completely and restore the status quo.

Testing for character not in the Identifier Profile as defined in http://www.unicode.org/Public/security/latest/xidmodifications.txt. N.B. This is NOT http://www.unicode.org/Public/security/beta/xidmodifications.txt as referred to in earlier comments. I don't think the file is yet mature enough for use. I submitted some comments on it myself and I heard from the Unicode folks that they will be posting a corrected version soon. I'll update to the newer version in a later iteration.

Checking for mixed scripts as defined in http://www.unicode.org/reports/tr39/tr39-5.html#Mixed_Script_Detection, except for the "Aspirational Scripts" part which only makes sense after updating to the newer version of xidmodifications.txt

Checking for mixed numbering systems as defined in http://www.unicode.org/reports/tr39/tr39-5.html#Mixed_Number

Checking for repeated sequences of the same non-spacing mark.

Checking for both simplified-only and traditional-only Chinese characters in the same label.
(Assignee)

Comment 23

7 years ago
Posted patch Patch part 2 (obsolete) — Splinter Review
I separated out the change to use an FSM in isLegalScriptCombo into a separate patch because it seems a bit clumsy and I'm not sure that I totally get it.
(Assignee)

Comment 24

7 years ago
Posted patch Changes to existing tests (obsolete) — Splinter Review
Some existing tests depend on non-whitespaced TLDs not displaying IDNs as Unicode. This patch sets network.IDN.restriction_profile to "ASCII" for those tests in order to retain that behaviour.
(Assignee)

Comment 25

7 years ago
Posted patch Tests for this bug (obsolete) — Splinter Review
(Assignee)

Comment 26

7 years ago
I'm not sure who should review this. Does it also need Security Review?
I'm not sure this comes to the level where it needs security review (there is no risk of computer compromise, and domain obfuscation is only one component of data loss via phishing) but we could ask curtisk.

I don't know about the most appropriate normal reviewer.

Gerv
I am going to mark this sec-review-needed & privacy-review-needed so we can triage this and make a definitive decision as a group as to whether or not either of these activities are warranted.
Keywords: privacy-review-needed, sec-review-needed
assigning to dveditz and removing privacy flag per his recommendation
Keywords: privacy-review-needed
Whiteboard: [secr:dveditz]
Whiteboard: [secr:dveditz] → [sec-assigned:dveditz]
dveditz: ping? Do you know when you might get to security review for this?

Gerv
Flags: sec-review?(dveditz)
(Pinged dveditz and curtisk by email today.)

Gerv
And again today.

Gerv
Depends on: 843689
Blocks: 843689
No longer depends on: 843689
Depends on: 843739
No longer blocks: 843689
smontagu: can you please update your patch to:

- fix any bitrot
- add a bool pref, network.IDN.use_whitelist, defaulting to "true", which bypasses the 
  whitelist when set to false
- include the removal of .com, .net and .name from the whitelist in all.js (bug 843739)

? I would like to get this checked in in the next few days and then after it's baked for a short time we can consider an uplift to Aurora.

Gerv
Also, can you confirm that we do restrict the scripts permitted to those on the "Recommended" and "Aspirational" lists? I think you said in the security review that we did, but I just wanted to be sure.

If we do, we can remove special processing for Cherokee, because it is in neither list.

Gerv
Also, it looks like there might be some useful test data attached to bug 316727, which I just discovered. It seems like an earlier attempt at doing this.

Gerv
(Assignee)

Comment 36

6 years ago
I just pushed a try build with the changes suggested in comment 33: https://tbpl.mozilla.org/?tree=Try&rev=dd961ea5f775

I don't have time to look at comments 34 and 35 today, will come back to them.
(Assignee)

Comment 37

6 years ago
The try build mentioned in comment 36 had errors. A later successful try build is at https://tbpl.mozilla.org/?tree=Try&rev=59a29fb39801, and builds with this changeset can be downloaded from ftp://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smontagu@mozilla.com-59a29fb39801/ 

(In reply to Gervase Markham [:gerv] from comment #34)
> Also, can you confirm that we do restrict the scripts permitted to those on
> the "Recommended" and "Aspirational" lists? I think you said in the security
> review that we did, but I just wanted to be sure.

More accurately, we restrict the codepoints permitted to those with "allowed" in http://www.unicode.org/Public/security/latest/xidmodifications.txt. In the current version of Unicode, this only includes "Recommended" scripts. I'll add special-casing for "Aspirational" scripts, which seems to be a change in UTR39 since I originally wrote the patch.
 
> If we do, we can remove special processing for Cherokee, because it is in
> neither list.

As of now There is no special processing for Cherokee.
(Assignee)

Comment 38

6 years ago
Posted patch Updated patchSplinter Review
Attachment #606806 - Attachment is obsolete: true
Attachment #610298 - Attachment is obsolete: true
Attachment #610299 - Attachment is obsolete: true
Attachment #718598 - Flags: review?(honzab.moz)
(Assignee)

Comment 39

6 years ago
Attachment #718601 - Flags: review?(honzab.moz)
(Assignee)

Comment 40

6 years ago
Attachment #610301 - Attachment is obsolete: true
Attachment #718602 - Flags: review?(honzab.moz)
(Assignee)

Comment 41

6 years ago
Attachment #610302 - Attachment is obsolete: true
Attachment #718605 - Flags: review?(honzab.moz)
Comment on attachment 718601 [details] [diff] [review]
Add pref to enable/disable the IDN whitelist

Review of attachment 718601 [details] [diff] [review]:
-----------------------------------------------------------------

r=honzab
Attachment #718601 - Flags: review?(honzab.moz) → review+
Could I get a little context for the new 'convertAllLabels' argument?
(Assignee)

Comment 44

6 years ago
See comment 4 through comment 6. Before this patch, any IRI was either all displayed in Unicode or all displayed in punycode, depending on whether the TLD was in the white list and the IRI passed stringprep validation. With the patch, we check each label individually, and display in Unicode only those labels that pass the new display algorithm.
Comment on attachment 718598 [details] [diff] [review]
Updated patch

Review of attachment 718598 [details] [diff] [review]:
-----------------------------------------------------------------

OK, took a while since I didn't know this code that well.  Also, I may not be the right to review this code.

Technically the code seems to be correct, and as far as I understand and know the algorithm, it does what it has to.

The nsIDNService::isLabelSafe is a piece of work, really!

r=honzab

Comments are mostly nits.

::: netwerk/dns/nsIDNService.cpp
@@ +114,5 @@
> +    if (NS_FAILED(prefBranch->GetCharPref(NS_NET_PREF_IDNRESTRICTION,
> +                                          getter_Copies(profile)))) {
> +      profile.Truncate();
> +    }
> +    if (profile.Equals("moderate")) {

using NS_LITERAL_CSTRING is a bit faster (no rt strlen calculation).

@@ +354,5 @@
>      } else {
>        *_isASCII = true;
>      }
>    } else {
> +    // First convert any ACE labels to UTF8

Can we add more detailed comment why?

@@ +359,5 @@
> +    nsAutoCString temp;
> +    if (isACE) {
> +      ACEtoUTF8(input, temp, false, true);
> +    } else {
> +      temp = input;

is this sharing buffers?

@@ +735,5 @@
> +  if (mRestrictionProfile == eASCIIOnlyProfile) {
> +    return false;
> +  }
> +
> +  nsAString::const_iterator start, end;

Rename start to caret or current or so, it's confusing in the code bellow.

@@ +744,5 @@
> +  uint32_t previousChar = 0;
> +  uint32_t savedNumberingSystem = 0;
> +  HanVariantType savedHanVariant = HVT_NotHan;
> +
> +  mSavedScriptIndex = -1;

According the implementation you may just pass ref to some local var to illegalScriptCombo method, instead of holding this as a member.

@@ +763,5 @@
> +          (script == MOZ_SCRIPT_CANADIAN_ABORIGINAL ||
> +           script == MOZ_SCRIPT_MIAO ||
> +           script == MOZ_SCRIPT_MONGOLIAN ||
> +           script == MOZ_SCRIPT_TIFINAGH ||
> +           script == MOZ_SCRIPT_YI))) {

Should have prefs instead of hard coding?

::: netwerk/dns/nsIDNService.h
@@ +45,5 @@
>                        bool allowUnassigned);
>    nsresult decodeACE(const nsACString& in, nsACString& out,
> +                     bool allowUnassigned, bool convertAllLabels);
> +  nsresult SelectiveUTF8toACE(const nsACString& in, nsACString& out);
> +  nsresult SelectiveACEtoUTF8(const nsACString&in, nsACString& out);

nsACString&in
can you keep the arg names in sync with Convert* methods?  Makes the code more readable.

@@ +52,2 @@
>    nsresult ACEtoUTF8(const nsACString& in, nsACString& out,
> +                     bool allowUnassigned, bool convertAllLabels);

I've spend the longest time by figuring out where all convertAllLabels arg flows with what values.  Docs needed!

For me it means:
* false only when a domain we want to display is not on whitelist and is non-ascii/punicoded (needs conversion) ; false value makes the conversion routines leave intact: 
- domain portions that *are unsafe* to convert when doing ACE->UTF8
- domain portions that *are safe* to convert when doing UTF8->ACE
* true otherwise, what means to convert all or nothing

fix me if I'm wrong

@@ +54,4 @@
>    bool isInWhitelist(const nsACString &host);
>    void prefsChanged(nsIPrefBranch *prefBranch, const PRUnichar *pref);
> +  bool isLabelSafe(const nsAString &label);
> +  bool illegalScriptCombo(int32_t script);

Definitely (and just these two) need more comments.  Can be added after landed, tho, since this is in rush.
Attachment #718598 - Flags: review?(honzab.moz) → review+
Comment on attachment 718602 [details] [diff] [review]
Changes to existing tests

Review of attachment 718602 [details] [diff] [review]:
-----------------------------------------------------------------

I've not checked what other test would need this, tho.
Attachment #718602 - Flags: review?(honzab.moz) → review+
Comment on attachment 718605 [details] [diff] [review]
Tests for this bug

Review of attachment 718605 [details] [diff] [review]:
-----------------------------------------------------------------

Good work.

r=honzab

::: netwerk/test/unit/test_bug722299.js
@@ +1,3 @@
> +// Test algorithm for unicode display of IDNA URL (bug 722299)
> +const testcases = [
> +    //  Original             Punycode or         Expected result by profile

Expected *UTF8* by profile ?

@@ +4,5 @@
> +    //    URL              normalized form      ASCII-Only, High, Moderate
> +    //
> +    // Latin script
> +    ["cuillère", "xn--cuillre-6xa",                  false, true,  true],
> +    // repeated non-spacing marks

I think adding a blank line like bellow can make this even more readable.

// comment
["idn", "puny",      results],

// comment
["idn", "puny",      results],

// comment
["idn", "puny",      results],

@@ +37,5 @@
> +    ["beethovenドイツ語ではルートヴィヒ",
> +                 "xn--beethoven-8d4h2cy4aka26b5bl3npswf7fo191r",
> +                                                     false, true,  true],
> +    // Han with common
> +    ["中国〆",   "xn--v6j902gn7f",                   false, true,  true],

What is the common character here?
Could we have the both common first/common last char variants? (actually for all tests to check the table jumping is correct as much as possible)

@@ +72,5 @@
> +    // Hangul and katakana
> +    ["한글ハングル",
> +                 "xn--qck1c2d4a9266lkmzb",           false, false, false]
> +];
> +

Can we have a test for UCS4 surrogates (preferably also broken 1-part only at the end) ?
Attachment #718605 - Flags: review?(honzab.moz) → review+
smontagu: can we get this checked in (with review fixes) today?

Gerv
Backed out for mochitest failures on B2G.
https://hg.mozilla.org/integration/mozilla-inbound/rev/72c49203f293

https://tbpl.mozilla.org/php/getParsedLog.php?id=20200673&tree=Mozilla-Inbound

15:57:48     INFO -  975 ERROR TEST-UNEXPECTED-PASS | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong sender - http://sub1.exämple.test should equal http://sub1.exämple.test
15:57:48     INFO -  976 ERROR TEST-UNEXPECTED-PASS | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong sender - http://sub1.exämple.test should not equal http://sub1.xn--exmple-cua.test
15:57:48     INFO -  979 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong response for IDN - got idn-nowhitelist-response wrong-initial-domain(sub1.exämple.test), expected idn-nowhitelist-response
15:57:48     INFO -  981 ERROR TEST-UNEXPECTED-PASS | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong sender - http://sub1.exämple.test should equal http://sub1.exämple.test
15:57:48     INFO -  982 ERROR TEST-UNEXPECTED-PASS | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong sender - http://sub1.exämple.test should not equal http://sub1.xn--exmple-cua.test
15:57:48     INFO -  985 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/dom-level0/test_setting_document.domain_idn.html | wrong response for punycode - got punycode-nowhitelist-response wrong-initial-domain(sub1.exämple.test), expected punycode-nowhitelist-response
Looks like you're being bitten by bug 414090, specifically that various places origins/URLs are exposed in the DOM are affected by IDN-display decisions.  The comments in the test are reasonably extensive as to what was going on, what should go on, and why various things were tested as they were.  Find a few new (different) representative URLs to put in the tests, and you should be good to go again.  If anything's unclear, feel free to poke me with questions -- I did write the test, even if it was, um, five years ago.  :-)
(Assignee)

Comment 52

6 years ago
Attachment 718602 [details] [diff] includes a workaround for exactly that problem, which works on other platforms and even on B2G for the other affected tests. I don't see why it doesn't help with this particular test. :S
(Assignee)

Comment 53

6 years ago
Repushed with changes to the URLs as Waldo suggested. I think this is in any case a better long term solution than using the network.IDN.restriction_profile pref, which we probably don't want to keep around for ever.

https://hg.mozilla.org/integration/mozilla-inbound/rev/84b9049008e9
https://hg.mozilla.org/integration/mozilla-inbound/rev/d475e6d54520
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd84967d0389
https://hg.mozilla.org/integration/mozilla-inbound/rev/eea81f144307

Comment 55

6 years ago
make.py[2]: Leaving directory 'c:\t1\hg\objdir-sm\mozilla\netwerk\cookie'
make.py[2]: Entering directory 'c:\t1\hg\objdir-sm\mozilla\netwerk\dns'
nsIDNService.cpp
c:\t1\hg\comm-central\mozilla\config\rules.mk:1055:0$ c:/t1/hg/objdir-sm/mozilla/_virtualenv/Scripts
/python.exe -O c:/t1/hg/comm-central/mozilla/build/cl.py cl -FonsIDNService.obj -c -D_HAS_EXCEPTIONS
=0 -I../../dist/stl_wrappers  -DMOZILLA_INTERNAL_API -D_IMPL_NS_COM -DEXPORT_XPT_API -DEXPORT_XPTC_A
PI -D_IMPL_NS_GFX -D_IMPL_NS_WIDGET -DIMPL_XREAPI -DIMPL_NS_NET -DIMPL_THEBES  -DMOZ_SUITE=1 -DNO_NS
PR_10_SUPPORT -DIMPL_NS_NET -Ic:/t1/hg/comm-central/mozilla/netwerk/dns/../base/src -I.  -Ic:/t1/hg/
comm-central/mozilla/netwerk/dns -I. -I../../dist/include  -Ic:/t1/hg/objdir-sm/mozilla/dist/include
/nspr -Ic:/t1/hg/objdir-sm/mozilla/dist/include/nss        -wd4099 -TP -nologo -W3 -Gy -Fdgenerated.
pdb -wd4251 -wd4345 -wd4351 -wd4482 -wd4800 -wd4819 -we4553 -GR-  -DNDEBUG -DTRIMMED -Zi -UDEBUG -DN
DEBUG -O1 -Oy  -MD            -FI ../../dist/include/mozilla-config.h -DMOZILLA_CLIENT  c:/t1/hg/com
m-central/mozilla/netwerk/dns/nsIDNService.cpp
nsIDNService.cpp
c:\t1\hg\objdir-sm\mozilla\dist\include\harfbuzz\hb-common.h(56) : error C2371: 'int8_t' : redefinit
ion; different basic types
        c:\t1\hg\objdir-sm\mozilla\dist\include\mozilla/MSStdInt.h(82) : see declaration of 'int8_t'

Comment 56

6 years ago
https://hg.mozilla.org/mozilla-central/rev/84b9049008e9#l2.11
    2.11  #include "nsUnicharUtils.h"
    2.12 +#include "nsUnicodeProperties.h"
    2.13 +#include "nsUnicodeScriptCodes.h"
    2.14 +#include "harfbuzz/hb.h"
I'm using VS2008.

Ms2ger says:
That someone's using harfbuzz includes outside the five dirs that define HB_DONT_DEFINE_STDINT

Comment 57

6 years ago
This allows my build to continue.
Attachment #720285 - Flags: review?(honzab.moz)
Attachment #720285 - Flags: review?(honzab.moz) → review+
Depends on: 847078
There was a request via email to have some QA eyes on this. I'm setting Matt Wobensmith as QA Contact since he is our security QA lead. I know he is at CanSecWest today but I think he'll be back in the office on Monday. I can help assist if this can't wait until then, pending guidance from Matt.
Keywords: qawanted, verifyme
QA Contact: mwobensmith
Flags: sec-review?(dveditz) → sec-review+
Depends on: 853226
Depends on: 853231
Depends on: 853237
(Assignee)

Updated

6 years ago
Depends on: 857481
(Assignee)

Updated

6 years ago
Depends on: 857490
(Assignee)

Updated

6 years ago
Depends on: 858417
Comment on attachment 718598 [details] [diff] [review]
Updated patch

[Approval Request Comment]

This uplift request is for all patches on this bug, the patch on bug 843739 (which removes .com, .net and .name from the whitelist) and the follow-up fixes from bug 857481 and bug 857490 which were the result of the QA pass.

(smontagu: please add to that list of bugs if I've missed any.)

If you need a rollup patch prepared, perhaps smontagu would be kind enough to do that.

Bug caused by (feature/regressing bug #): This is part regression fix, part feature. In bug 770877, we added .com, .net and .name to the IDN TLD whitelist. However, we subsequently found that, although Verisign's current policies are OK, there were some historical registrations in .com which were spoofy and these have not been revoked. To fix that without regressing legitimate .com domains, we finished off a patch to switch from a whitelist to an algorithm - see https://wiki.mozilla.org/IDN_Display_Algorithm - and married that with a patch to remove .com, .net and .name so those TLDs rely solely on the algorithm.

User impact if declined: possible spoofing security risk. Also, ESR will not have new algorithmic code, which will mean IDNs in all non-whitelisted TLDs will not work. The list of TLDs will soon be greatly expanded and IDN is starting to become widely used. Also, we want to deprecate and eventually remove the whitelist; if ESR is still relying on it, we can't do that totally for a much longer period than would otherwise be the case.

Testing completed (on m-c, etc.): Yes. Extensive testing has been done by the awesome Matt Wobensmith over in bug 854041, and a couple of fixes have been made as a result (see bug numbers above). There are a small number of corner cases outstanding but they are not ship-stoppers. The code is currently on Aurora.

Risk to taking this patch (and alternatives if risky): An alternative would be to take _only_ the patch to bug 843739. That would mean all .com, .net and .name IDNs would all go back to "not displaying correctly" for an entire ESR cycle, and IDN would not display correctly on ESR in any of the non-whitelisted TLDs or new TLDs.

String or IDL/UUID changes made by this patch: None.

Please let me know if there are further questions.

Gerv
Attachment #718598 - Flags: approval-mozilla-beta?
Comment on attachment 718598 [details] [diff] [review]
Updated patch

Discussed/Clarified offline with Gerv on this.Given the risk of uplifting such a huge change in second half of the cycle, we feel this should ride the trains instead . In addition the user impact/risk around ESR is not applicable here given our next ESR is ESR24.
Attachment #718598 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

Comment 63

6 years ago
What on earth does "ride the trains" mean?
3ric: Firefox development is organized into "release trains", which pass from nightly through Aurora and Beta to Release. "Riding the trains" means letting the patch follow the normal path rather than forward-porting it.

The code here is due for release tomorrow in Firefox 22, after which I hope you will find that the problematic URLs you have been discussing no longer show as Unicode in released Firefox.

Gerv
mass remove verifyme requests greater than 4 months old
Keywords: verifyme

Comment 66

5 years ago
FWIW, I filed https://code.google.com/p/chromium/issues/detail?id=336973 about implementing this algorithm in Chrome.  Patches welcome in case any Mozilla folks want to write a Chrome patch and/or contribute to more cross-browser behavioral alignment :)
You need to log in before you can comment on or make changes to this bug.