Closed Bug 286534 Opened 19 years ago Closed 19 years ago

Implement IDN punycode display by .tld

Categories

(Core :: Networking, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED

People

(Reporter: gerv, Assigned: jshin1987)

References

Details

Attachments

(1 file, 2 obsolete files)

We need to replace the network.IDN_show_punycode pref with something more
fine-grained, based on the TLD of the domain. This allows us to enable/disable
correct display of IDNs in individual TLDs based on whether or not they have
good homograph-protection policies in place.

I suggest a pref branch, so:

network.IDN.disable.com
network.IDN.disable.net
...

Then, we can assess a particular URL with a single pref lookup. (The only way
this might one day run into trouble is if we get non-ASCII characters in TLDs
themselves. How likely is that?)

Gerv
Actually, it's quite likely in the long run. But then, you cound probably do
something like network.IDN.disable.xn--asdfghjk ... does this mean that pref
names are ASCII-only?
We have too many hostname based lookups. In theory, nsIPermissionManager could
do this as well, but those setting aren't this easy to update from xpi's
*** Bug 290275 has been marked as a duplicate of this bug. ***
The following was taken from bug 290275 (assuming this bug is not only about
blacklisting), 

All IDN urls are currently displayed as Punycode by default. Whitelist of safe
TLDs like Opera is needed to resolve this IDN problem more smartly.
http://www.opera.com/windows/changelogs/800b2/

I suggest new pref like this:
pref("network.IDN_TLDs_whitelist", "no, jp, de, se, kr, tw, cn, at, dk, ch, li");

If IDN url matches whitelist, the url should be displayed as Unicode whether
"network.IDN_show_punycode" pref is true or false.
Opera 8.0 beta 3 allows the user to switch from a whitelist to a blacklist:

http://www.opera.com/windows/changelogs/800b3/

How about using the same format in Mozilla? E.g.:

user_pref("network.IDN_TLD", ":no:jp:de:se:kr:tw:cn:at:dk:ch:li:museum:hu:");
user_pref("network.IDN_TLD", "~:com:");

This allows users of both Mozilla and Opera to copy and paste their preferences
from one product to the other.
> This allows users of both Mozilla and Opera to copy and paste their preferences
> from one product to the other.

And who apart from you is ever going to do that? :-)

This bug is currently being worked on by Neil and Lorenzo, with me providing
input. The current design calls for a pref tree (given that this sort of thing
is what they were invented for), allowing a zero-parsing, single-pref-lookup
solution:

network.idn.whitelist.com
network.idn.whitelist.org
...

This also allows users and extensions to easily set or unset a single TLD
without worrying about the others.

Gerv
> And who apart from you is ever going to do that? :-)

:-)

> network.idn.whitelist.com

Is this a boolean pref? (I assume network.idn.blacklist.* is not planned?)
I think that the idea of comment 5 is better than the idea of comment 6.
Because the idea is more simple and better for performance.
And the idea of comment 5 doesn't need to parse the value of pref too.
Erik: yes, they are boolean prefs. We could have a whitelist or a blacklist;
having both doesn't make much sense. I argued for a blacklist, but I think it's
more important that we do something which is consistent with the stance of the
other browser manufacturers.

Masayuki: both ideas require a single pref lookup; Mozilla has support for
things called "pref trees" which allow you to get a handle on e.g.
network.idn.whitelist and then look up relative prefs. The "all in one string"
idea requires extra parsing. I agree the performance difference probably isn't
big - the key requirements are ease of updating and maintenance.

If it's all one string, how do you suggest the Foundation publish updated
whitelists to people who have modified their whitelist? In the pref branch plan,
it's easy - the new prefs.js is shipped, and everything automatically works.

Gerv
Nominating for 1.8b3. We need this for Firefox 1.1.

Gerv
Blocks: 290872
Flags: blocking1.8b3?
Flags: blocking1.8b3?
Flags: blocking1.8b3+
Flags: blocking-aviary1.1+
Gerv:

Who fix this bug?
If nobody can create the patch, I try it.
Masayuki: thank you for your kind offer. Two guys are supposed to be working on
it; I'll give them a ping and make sure they assign this bug to themselves.

Gerv
For a possible implementation, see bug 279099 comment 230. I'm working on this,
but I probably won't have time to do much until next week. If someone else
thinks they can have a patch ready before then, feel free to step up and do it.
Status: NEW → ASSIGNED
> ------- Additional Comment #13 From Lorenzo Colitti  2005-04-19 16:21 PDT 
[reply] -------
>
>For a possible implementation, see bug 279099 comment 230. I'm working on this,
>but I probably won't have time to do much until next week. If someone else
>thinks they can have a patch ready before then, feel free to step up and do it.


Lorenzo,  did you get a chance to work on this...

jshin,   darin and I thought you would be a good person to dedicate some time to
 this bug so it doesn't fall thought the cracks.   can you help out with coming
up with a fix or talking a look at the proposals made by lorenzo or others?

thanks
Assignee: darin → jshin1987
Status: ASSIGNED → NEW
Ok. I'll try to drive this. Lorenzo, have you had a chance to work on this
since? I'll begin my work later today.
Attached patch 1st shot (obsolete) — Splinter Review
I introduced 'network.IDN.safe.TLD' (the name is tentative). For TLDs with
'network.IDN.safe.TLD set to true, we show IDNs in UTF-8 rather than in
punycode. If pref. is not defined for a TLD or set to false, we use punycode. 
These preference entries are only referred to when network.IDN_show_punycode'
is 'false'. If it's set to true, we always use punycode for all TLDs regardless
of the values of 'network.IDN.safe.TLD'. 

I'm not sure if I need	caching and/or adding observer(s). 

darin, what would you say?
Status: NEW → ASSIGNED
If we decide to support both whitelist and blacklist (not at the same time),
'network.IDN.show_punycode' (with an appropriate name change) can be used to
indicate which to use, blacklist or whitelist. We can do as follows:

if ((whitelist && not explicitly safe) || (blacklist && explicitly set unsafe))
  use punycode
else
  use UTF-8

Needless to say, we can only support blacklist if it's decided that it's better. 
any comment, good or bad? 
Jungshik, it's great that you're doing this work. I won't attempt to answer
all of your questions, but Gerv said that he would like to adopt the same
stance as other browser vendors, and Opera has chosen the whitelist approach
(by default), so Mozilla should too.

I don't think we can allow the user to switch to a blacklist since Gerv would
like it to be possible to update the whitelist from the mozilla.org side.
E.g. additional pref entries for whitelisted domains.

Gerv, is Mozilla going to adopt Opera's whitelist?
We're planning to adopt Opera's whitelist unless either someone comes to me with
a good reason why we should remove one of their entries, or I have time to do
some in-depth research myself. 

So yes, we do not want to allow switching between whitelist and blacklist. This
is a security feature, and should have no UI. 

I would prefer the pref stub to be "network.IDN.whitelist.". Other than that,
this patch looks pretty good in terms of function.

Gerv
Jungshik: sorry for not replying earlier, but I wasn't CC'd on this bug.

The patch looks like what I wanted to do, except it's better written. :) I have
a couple of questions though:

1. You use PromiseFlatCString(host).get() to convert the hostname to ASCII. Is
this safe? And if any Unicode TLDs are ever assigned, it will need to be changed.

2. Why not add another pref observer and put the code that loads the pref into
PrefsChanged()? If I understand the code correctly (but maybe I don't) that
would be simple and would give you the benefit of being able to react to pref
changes.
(In reply to comment #21)
> 1. You use PromiseFlatCString(host).get() to convert the hostname to ASCII. Is
> this safe? And if any Unicode TLDs are ever assigned, it will need to be changed.

That function does not convert to ASCII... it essentially just allows you to
call .get().
I think we should consider using nsIPermissionManager for hosts the user wants
to explicitly allow or explicitly block.  The preferences systems is a bad
choice for storing this information since it is affected by changes to the
default prefs.  

IMO, we should use a non-pref backend for storing the information we download. 
Perhaps we could store a file in the profile, and fetch it periodically (using a
conditional HTTP query).  We could look at extending the permission manager to
aggregate multiple data files, or we could layer something on top of the
permission manager.
> The preferences systems is a bad choice for storing this information since it is 
> affected by changes to the default prefs.  

IMO, that's what makes it a good choice. If the Turks and Caicos Islands come
out with appropriate policies on IDN, the next update we ship needs to turn on
IDN for that TLD for everyone who has not explicitly disabled it in the past.
Changing the default prefs achieves exactly that, doesn't it?

Gerv
(In reply to comment #23)
> I think we should consider using nsIPermissionManager for hosts the user wants
> to explicitly allow or explicitly block.  

'Per host' en/disabling is beyond the scope of this bug even if it's desirable. 

 
Jungshik: any chance of an updated patch with the tweak I requested? I'll then
review it and we can chase SR.

Gerv
I changed the name of the pref. branch to network.IDN.whitelist as suggested by
gerv.
Attachment #184849 - Flags: review?(gerv)
jshin:

Your patch is not valid file. Please check it.
Attached patch patch update Splinter Review
ooops. thanks for pointing that out. this is 'the' real thing.
Attachment #182326 - Attachment is obsolete: true
Attachment #184849 - Attachment is obsolete: true
Attachment #184858 - Flags: review?(gerv)
Attachment #184849 - Flags: review?(gerv)
Why is only de, jp and kr included in the patch ? What about no, se, tw, cn, at,
dk, ch, and li (rest of the Opera domains) ?
Comment on attachment 184858 [details] [diff] [review]
patch update 

r=gerv. 

I note that this patch doesn't include all the domains. That's fine - we can do
a separate patch adding (or removing) domains later. This is certainly not a
reason to hold this up.

Gerv
Attachment #184858 - Flags: superreview?(darin)
Attachment #184858 - Flags: review?(gerv)
Attachment #184858 - Flags: review+
Comment on attachment 184858 [details] [diff] [review]
patch update 

>Index: netwerk/base/src/nsStandardURL.cpp

>+
>+    nsCOMPtr<nsIPrefService> prefs( do_GetService(NS_PREFSERVICE_CONTRACTID) );
>+    if (prefs) {
>+        nsCOMPtr<nsIPrefBranch> branch;
>+        if (NS_SUCCEEDED(prefs->GetBranch( NS_NET_PREF_IDNWHITELIST,
>+                                           getter_AddRefs(branch) )))
>+            NS_ADDREF(gIdnWhitelistPrefBranch = branch);
>     }

hmm... can't you just QI prefBranch to nsIPrefService instead of calling
GetService again?


>     NS_IF_RELEASE(gIDN);
>     NS_IF_RELEASE(gCharsetMgr);
>+    NS_IF_RELEASE(gIdnWhitelistPrefBranch);

I think it'd be good to stick with "IDN" instead of using "Idn" here.


>+    if (gShowPunycode || !IsInWhitelist(PromiseFlatCString(host).get())) {

Can you make it so that IsInWhitelist does not need a null
terminated host string?  How about using the string API 
all the way down into IsInWhitelist?


>     static nsIIDNService               *gIDN;
>+    static nsIPrefBranch               *gIdnWhitelistPrefBranch;
>     static nsICharsetConverterManager  *gCharsetMgr;
>     static PRBool                       gInitialized;
>     static PRBool                       gEscapeUTF8;
>     static PRBool                       gAlwaysEncodeInUTF8;
>     static PRBool                       gShowPunycode;

nit: please list these variables in the same order in the cpp file
as they are listed here.


general comment: so, we anticipate using the update service to add or
remove TLDs periodically?  and just bail on trying to deal with the
problems associated with user-set preferences matching a changed
default pref value, thus resulting in the user-set value being cleared?
Attachment #184858 - Flags: superreview?(darin) → superreview+
(I appear to remember having this conversation before, but I may need
enlightening again.) Regarding Darin's general comment: have I misunderstood how
default prefs work? Surely user-set prefs override default prefs? That's why
they are called "default"?

Gerv
Comment on attachment 184858 [details] [diff] [review]
patch update 

asking for a.
I made sure this works well. We for sure want this feature in 1.1

Darin's comments were addressed in my tree. As for IsInWhitelist, I tried a few
alternatives and ended up doing this. At one point, I need a null-terminated
string because GetBoolPref needs 'char *'. Because |host| is usually less than
64 bytes, this seems to be a good way to avoid the memory allocation.

    if (gIDNWhitelistPrefBranch && 
	(pos = nsCAutoString(host).RFind(".")) != kNotFound &&
	NS_SUCCEEDED(gIDNWhitelistPrefBranch->
		     GetBoolPref(nsCAutoString(Substring(host, pos + 1)).get(),
				 &safe)))
	return safe;
Attachment #184858 - Flags: approval-aviary1.1a2?
Attachment #184858 - Flags: approval-aviary1.1a2? → approval-aviary1.1a2+
fix checked in. I added no, se, tw, cn, at,dk, ch, and li to the pref. file. 
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Jungshik, any reason we're still blocking hebrew and arabic tld(s)?
It's not about scripts/languages but about TLDs. As Gerv wrote in comment #31,
we have to deal with that in a separate 'bug'. Anyway, basically what we did is
the same as what Opera did: (
http://www.opera.com/windows/changelogs/800b2/)

# TLDs are considered safe if they have implemented anti-homographic character
policies or otherwise limited the available set of characters to prevent spoofing.
# Current whitelist contains the following top-level domains: no, jp, de, se,
kr, tw, cn, at, dk, ch, and li
> what we did is the same as what Opera did:
> (http://www.opera.com/windows/changelogs/800b2/)
> 
> Current whitelist contains the following top-level domains:
> no, jp, de, se, kr, tw, cn, at, dk, ch, and li

The cited document is incorrect. Starting several months ago, the whitelist
in the Opera distribution package has been:

IDNA White List=:no:jp:de:se:kr:tw:cn:at:dk:ch:li:museum:hu:




> > what we did is the same as what Opera did:
> > (http://www.opera.com/windows/changelogs/800b2/)

From the changelog in http://www.opera.com/windows/changelogs/800b3/ :

Improvement to IDNA  handling.
    * Script detection now allows more IDNs to be displayed in Unicode. Only a
single script and specific combinations are allowed in each label (separated by
"." or "-").
    * Added support for switching from IDN whitelist to blacklist by adding a
'~' as the first character in the string, that is, "~:com:tw:"
    * Added .hu and .museum to list of trusted domains.
Please don't make any changes to the list until I've had a chance to review the
current one. If you think a TLD has the appropriate policies in place and should
be added, please send me an email with URLs proving that this is the case.

Gerv
OK. I've reviewed the existing list, and also .museum and .hu. Here's the result:
http://www.mozilla.org/projects/security/tld-idn-policy-list.html

I've filed bug 299927 on getting those last two added to the list.

Gerv
You need to log in before you can comment on or make changes to this bug.