Closed Bug 279099 (punycode) Opened 19 years ago Closed 16 years ago

Protect against homograph attacks (spoofing using punycode IDNs)

Categories

(Core :: Networking, defect, P3)

Tracking

RESOLVED FIXED
mozilla1.8beta3

People

(Reporter: ericj, Assigned: gerv)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sg:spoof])

Attachments

(9 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040913 Firefox/0.10
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040913 Firefox/0.10

firefox (and other unnamed browsers) incorrectly handle punycode-encoded domain
names.   This allows attackers (namely phishers) to spoof urls of just about any
domain name, including ssl certs. 

Proof of concept url:

http://www.shmoo.com/testing_punycode/

The links are directed at "http://www.pаypal.com/", which the punycode
handlers render as "www.xn--pypal-4ve.com"

The domain was just registered, so the root servers may not have gotten it yet.
 Point your dns servers at '216.254.21.212' if you have problems. 

Here's what I think the bug is:

1.  firefox (and mozilla) should warn the user if punycode is in use at all
2.  You should consider validating the ssl cert with the non-decoded version of
the website

Just in case it's not clear, an attack case could be an ebayer/phisher who
includes links to paypal in their auction. When the auction ends, the buyer
clicks on the paypal link (which is a punycode/proxy to the real paypal), and
proceeds to steal all of their private green bits. 

I have not done any platform testing, or tested any other versions of
mozilla/firefox/etc.  I assume this bug is cross-platform.

The proof of concept urls are hosted on a personal server, and as such, I'd like
to have a chance to bring them down before this bug becomes public.  Please
email me at ericj@shmoo.com before marking this bug public.  

/me goes and reads up on the mozilla bounty program. 

Reproducible: Always
This bug impacts many other browsers, and I'm working on notifying them
right now.

Based on the critical nature of this bug, I believe it's best to:

1.  not notify the public until all vendors have been notified & have a
chance to release updates
2.  set a fixed date on which this vulnerability will become public (so
no one company releases details before others have a chance to release updates).

That date will be 2/5/05, unless folks convince me to delay this action.

Thanks,
Ericj
206.321.3411
Attached file more examples
from a spreadfirefox.com blog I found out this morning about
http://www.retrosynth.com/misc/phishing.html which plays with the same idea:
  www.xn--amazn-mye.com
  www.xn--micrsoft-qbh.com
  www.xn--papal-fze.com

These three were registered to Jesse C Lee (Witchita, KS) on Jan 8, 2005. The
retrosynth page was last updated (created?) Jan 16, 2005, presumably by the
site owner Cary Roberts in Mountain View, CA. What's the connection? What's the
connection between retrosynth and the spreadfirefox blogger? This may already
be widely known.
Darin: any ideas?
Assignee: firefox → darin
Status: UNCONFIRMED → NEW
Component: General → Networking
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → benc
Whiteboard: [sg:fix]
Version: unspecified → Trunk
Opera has responded:

Date: Thu, 20 Jan 2005 18:06:30 +0100
From: bug-161715-s10@bugs.opera.com
To: ericj@shmoo.com
Subject: Your bug report


Hello Eric,

What you illustrate is an inherent problem with IDNA and the international
Unicode character set. On many systems success may depend on which fonts and
languages the user has installed (and what is included in the default installation).

There was a discussion about a similar issue in our forums a couple of days ago:
<URL:
http://groups.google.com/groups?threadm=tmgou051aaovjqh2isd5shkcel8rp4j96q%404ax.com
>

Unfortunately, I do not believe your suggestion of warning the user about IDNA
encoded names in the name of secure servers is practicable. It might look
that way when you are dealing with spoofsites such as your example, but it would
be maddening for Chinese and Japanese websurfers, in fact it would also
irritate many European (e.g. French, German and Scandinavian) surfers who are
using languages with characters that will generate punycode servernames.

The problem about spoofing websites using IDNA is IMO best solved by the
domainname registrars, by limiting on their side the character-combinations they
want to accept in a domainname. AFAIK such limitations are implemented in (e.g.)
the Norwegian zone, but Verisign has not yet implemented something
similar, which is understandable given the worldwide use of .com domains.

Please note that Wand or cookies will not be tricked by this kind of servername.

--
Sincerely,
Yngve N. Pettersen

********************************************************************
Senior Developer                             Email: yngve@opera.com
Opera Software ASA                   http://www.opera.com/
Phone:  +47 24 16 42 60              Fax:    +47 24 16 40 01
********************************************************************
Component: Networking → Bookmarks
Product: Core → Firefox
Version: Trunk → 1.0 Branch
We should consider adding opera to the CC list on this bug:

bug-161715-s10@bugs.opera.com

Cheers, 
Eric
It turns out this attack was talked about several years ago; it was called the
'homograph attack'

http://www.cs.technion.ac.il/~gabr/papers/homograph.html

The problem today is that several browsers support this right out of the box. 
This introduces a huge security risk for users. 

Filtering at the registrar level is possible, but VERY hard.  They should not
allow mixed-byte or multi-language encodings, and should consider blacklisting
some of the chars from the punycode encode process. 

However, as a user of firefox, I see no method for me to disable punycode support. 

This is not just a browser bug - - it's a standards bug.  But early adoption
means that firefox & CO needs to deal with it at some level (even if it means
disabling puny support, or ssl + puny support). 

I don't know what the right answer is.  I'm just saying: "TODAY THIS IS A HUGE
PROBLEM FOR FIREFOX SECURITY". 
That opera address is not registered in bugzilla and can't be CC'd, but we have
contacts at Opera and will work through those.
Summary: CRITICAL SECURITY VULN: punycode allows attackers to spoof urls/ssl certs → punycode allows attackers to spoof urls/ssl certs
After talking about this bug with a few other security folks, I have some ideas
I'd like to share. 

1.  Different validation of ssl certs.  Currently, the browser encodes the
unicode into punycode, loads the website, and validates that the puny encoded
domain matches the ssl cert.  I think this is a problem.  The browser should
validate the  cert name with the raw unicode text (you can generate ssl certs
with unicode CNs - - I tested this). 

2.  Filtering should happen at both the browser level and the registrar level. 
Example filtering should include:   

   A. not mixing double-byte & single byte punycode wrapped domain names.  This
makes it much harder to spoof domain names, as most other codepages don't have
standard latin in them.
   B. Validation of codepage.  Ensure that all chars in a domain are part of ONE
codepage set, not mixed.
   C. Don't allow bad unicode chars (see MS Press "Writing Secure Code, 2nd
Edition", page 379) such as non-shortest encodings of UTF8->punycode (see the
sketch after this list).
   D. Block some 'non-alpha' chars in other code pages.  An example is Unicode
05B4, which looks like the latin period '.'
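
A quick aside on item C: modern strict UTF-8 decoders already reject
non-shortest-form sequences, so one hedged first line of defense is simply to
decode any byte input strictly before it reaches the punycode conversion. A
minimal Python sketch:

# Item C: a strict UTF-8 decode rejects non-shortest-form ("overlong")
# byte sequences before they can reach the punycode encoder.
overlong_slash = b"\xc0\xaf"       # overlong 2-byte encoding of '/'
try:
    overlong_slash.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)        # strict decoders refuse overlong forms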

IDN Filtering is a complex subject, and is highly prone to errors.  

3.   Display a country flag next to the addressbar/domain name.  Display icon or
something showing the current language the domain is in. 

4.  Must have feature:  Disable/enable IDN in all mozilla products. 

Anyway, I hope some of these ideas get some traction or result in some better
solution...

If I can assist in any way (testing, providing evil ssl certs, whatever) please
let me know. 

Cheers, 
Eric
>3.   Display a country flag next to the addressbar/domain name.  Display icon or
>something showing the current language the domain is in. 

hm? how would mozilla know the language?

> 4.  Must have feature:  Disable/enable IDN in all mozilla products. 

network.enableIDN, this already exists
Both omniweb & konqueror correctly discover that the ssl cert isn't valid for
that domain.  That means they are not checking the puny encoded version of the
domain with the CN, they are checking the UTF-8 version of the domain with the
CN of the cert.  

This is what I expect firefox to do.  I'll attach a screenshot shortly. 

Also note that they display the alternate script with a different font - - -
making it (more) clear that something phishy is going on. 
If the behavior you describe means that IDN sites simply can't use SSL, then,
sure, it would fix this bug, but that would be a pretty serious bug in itself. 
If it doesn't mean that they can't use SSL, then it doesn't help this bug at all.
The two attachments demonstrate the 'other' behavior that browsers have when
validating CNs with IDN sites.

Omniweb & Konqueror validate the UTF8 domain with the CN
firefox/mozilla, safari, any gecko-powered browser validate the puny encoded
domain with the CN

At this point, I'm not sure which one is correct; but there should be a correct
method for using ssl with IDN.  Perhaps this is because the existing RFCs don't
really talk about ssl + IDN.
Flags: blocking-aviary1.0.1?
This bug really has two parts:
-- should we be expecting the domain names in SSL certs to be punycode-encoded,
or raw Unicode?
-- how do we deal with homograph attacks using punycode-encoded domain names?

The first question should be quite easy to resolve, and if necessary, fix. I've
filed it as bug 280839. Let's focus this bug on discussion of the second point,
which will be much harder to address.
OS: Windows 2000 → All
Summary: punycode allows attackers to spoof urls/ssl certs → Protect against homograph attacks (spoofing using punycode IDNs)
Alias: punycode
*** Bug 281381 has been marked as a duplicate of this bug. ***
Group: security
*** Bug 281428 has been marked as a duplicate of this bug. ***
See http://www.unicode.org/Public/4.0-Update/Scripts-4.0.0.txt for a Unicode
code-point to script mapping table. 

Now consider the following algorithm as a first hack:

We first divide the different Unicode script families into "potentially
confusable" equivalence sets: for example, LATIN, CYRILLIC and GREEK are
potentially confusable, as they each contain characters with lowercase glyphs
that look like 'c' or 'a'. However, LATIN and ARABIC do not contain any similar
characters, so they are not "potentially confusable". We put this information in
a (suitably compressed) look-up table. This now leads naturally to a simple
algorithm for spotting "stranger" characters in the context of another
"potentially confusable" script (ie different script, but same script
equivalence set). 

Note that there are still more things to look out for:
* we should canonicalise the string with NAMEPREP first, since we can't rely on
the registrar to do so
* font variant characters
* double-width and half-width characters
* expansion of ligatures, roman numerals etc.

Even then, some tricky but potentially dangerous cases are still left out, such
as the fact that the ANGSTROM SIGN is in the LATIN script family, even though it
is visually indistinguishable from LATIN CAPITAL LETTER A WITH RING ABOVE. This
makes it very difficult to put a solution in place without creating a false
sense of security.

On the other hand, the Unicode .pdf charts _do_ appear to contain a detailed
cross reference of visually confusable characters, as do the charts in the
Unicode book. However, I cannot find this information anywhere online. With the
scripts information, and the cross-reference information, we could probably
construct a serious character-level "confusion table" which would very
effectively catch spoofing attacks. Does anyone have any good contacts in the
Unicode Consortium who could release this information to us in machine-readable
form? (For example, letting us know the decrypt password for the existing
character chart .pdfs would enable us to extract this information; the original
data would be even better).
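
To make the idea concrete, here is a minimal Python sketch of the
equivalence-set check. The ranges and the confusable set below are a tiny
illustrative subset, not the real Scripts-4.0.0.txt data:

# Tiny illustrative subset of the Unicode script assignments.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "LATIN"),      # Basic Latin through Latin Extended-B
    (0x0370, 0x03FF, "GREEK"),
    (0x0400, 0x04FF, "CYRILLIC"),
    (0x0600, 0x06FF, "ARABIC"),
]
CONFUSABLE_SETS = [{"LATIN", "GREEK", "CYRILLIC"}]   # mutually spoofable

def script_of(ch):
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None                     # digits, hyphen, unmapped ranges

def has_confusable_mix(label):
    # Flag a label that mixes two scripts from one confusable set.
    scripts = {s for s in map(script_of, label) if s is not None}
    return any(len(scripts & cset) >= 2 for cset in CONFUSABLE_SETS)

print(has_confusable_mix("p\u0430ypal"))   # True: Latin + CYRILLIC SMALL A
print(has_confusable_mix("paypal"))        # False: pure Latin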
A spoofed domain name doesn't have to mix character sets. As an extreme example,
you could use simply letters from the MATHEMATICAL SANS-SERIF SMALL series.

Also, it's probably going to become quite common to mix sets, e.g. with mixed
English-Japanese site names.
(In reply to comment #18)
> * font variant characters
> * double-width and half-width characters
> * expansion of ligatures, roman numerals etc.

Aren't these all taken care of by NFKC normalization (which we already do before
display)?
OK, after some more grovelling around in the Unicode mailing list archive, I've
found the following file: http://www.unicode.org/Public/UNIDATA/NamesList.txt

This has the cross-reference data in it, giving both exact and approximate
visual similarities between the characters, and also code-point equivalents for
ligatures etc.  Together with the script-family data, this is probably a good
starting point for an anti-spoof algorithm.
After reading TR#15, yes, NFKC normalization won't hurt at all: we should do it
as a first step, before anything else. Indeed, we should do a full NAMEPREP.

A question; DNS is case-insensitive, but sometimes visual collisions may be
case-sensitive. For example, Greek capital Alpha collides with Latin capital A,
but not for the lowercase versions. NAMEPREP implies NFKC normalization and the
use of STRINGPREP tables B.1 (deletion of silly characters) and B.2 (case
folding; RFC 3454 implies folding to lowercase). Should we look for collisions
in either upper or lowercase, or is it safe to restrict to lowercase only?
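
For what it's worth, those two steps are easy to experiment with in Python.
This is only a rough stand-in for full NAMEPREP (which also applies the
STRINGPREP mapping and prohibition tables), but it shows why folding to
lowercase removes the Alpha/A collision:

import unicodedata

def nameprep_lite(label):
    # Rough stand-in for NAMEPREP: NFKC-normalize, then case-fold.
    return unicodedata.normalize("NFKC", label).casefold()

# GREEK CAPITAL LETTER ALPHA collides visually with Latin 'A', but after
# case folding it becomes lowercase alpha, which no longer looks like 'a':
print(nameprep_lite("\u0391BC"))   # 'αbc'
print(nameprep_lite("ABC"))        # 'abc'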
> 4.  Must have feature:  Disable/enable IDN in all mozilla products. 
network.enableIDN, this already exists

but does not appear to be working. I set that in prefs.js
user_pref("network.enableIDN", false);
restarted firefox, went to http://www.shmoo.com/idn/, clicked on the
URL and got 'meeow'.

about:config name/value shows: network.enableIDN false
This issue is being intensely discussed in the CAcert newsgroup. There *may* be
some useful insight there.

Subject:   Bug in Mozilla based browsers could cause big security problems...
Newsgroup: gmane.comp.security.cacert
Thread:    news://news.gmane.org:119/4207362A.2020208@cacert.org
-> core:networking
Component: Bookmarks → Networking
Product: Firefox → Core
Target Milestone: --- → mozilla1.8beta
Version: 1.0 Branch → Trunk
> At this point, I'm not sure which one is correct; but there should be a correct
> method for using ssl with IDN.  Perhaps this is because the existing RFCs don't
> really talk about ssl + IDN.

I think it makes more sense to compare the punycode value of the hostname to the
cert since that is the value of the hostname used with DNS to resolve the IP
address.  It seems like a bug to me that KHTML and Opera do otherwise.

As with many of the older internet specifications (DNS, HTTP, Cookies, etc.),
IDN names are intended to be converted to punycode before being used.  So, it is
an odd choice to treat certs as somehow different.
> but does not appear to be working.

See bug 261934.  The bug was fixed recently on the trunk.  The patch applies
cleanly on the 1.7 branch.
Status: NEW → ASSIGNED
*** Bug 281439 has been marked as a duplicate of this bug. ***
(In reply to comment #21)
> OK, after some more grovelling around in the Unicode mailing list archive, I've
> found the following file: http://www.unicode.org/Public/UNIDATA/NamesList.txt
> 
> This has the cross-reference data in it, giving both exact and approximate
> visual similarities between the characters, and also code-point equivalents for
> ligatures etc.  Together with the script-family data, this is probably a good
> starting point for an anti-spoof algorithm.

An algorithm which looks purely at specific character pairs will remain a point
of weakness. If a flaw leaves the user with no other protection, then each flaw,
big or small, will be announced with all the gravitas of a full security
vulnerability. The spreadfirefox people don't need this. 

Detection of potential problems needs to operate on several levels, and I think
we need a top down approach, with warnings on by default and user configurable,
so that the browser is safe `out of the box'. 

For example the warning could be displayed 
 1. the first time a new codeset is encountered in a URL
 2. the first time a particular pair of codesets are used together in a URL.
The user may disable this warning for future encounters with that character set
or combination of character sets, or may leave the warning enabled but create an
exception for that particular site.

This would catch almost all of the problem without getting into the detail of
similar appearing characters. Below this would be the more detailed algorithm
for flagging potentially ambiguous constructions. However with such broad
general protections in place, this could now be implemented on a per
codeset-pair basis.
With respect, confirmation alerts do not make you "safe out of the box";
they merely make you *annoyingly* unsafe, since people don't read them. If
mostly-reliable homograph attack detection turns out to be at all practical, 
I suggest a Thunderbird-style banner along the top of the page: 
"&brandShortName; thinks this site is a fraud. (Tell Me More) (Not a Fraud)" 
Disable form controls + applets + plug-ins unless "Not a Fraud" is clicked.
RFC 3490 section 10 (http://www.apps.ietf.org/rfc/rfc3490.html#sec-10)
apparently outlines some high-level suggestions for dealing with this problem.
I heard about this (after hearing the initial warning in 2000) and have followed
several sets of directions to disable it, from going to about:config and turning
the network.enableIDN off to going to the compreg.dat and editing out the lines
mentioning IDN (which was then overwritten by firefox), but have been unable to
turn it off. I have restarted (all copies) of my Firefox 1.0 browser so the
settings should have taken effect. I use Suse 9.1 and cannot tell the difference
between the two urls unless I watch the status bar while its loading, meaning I
would have to go out of my way to verify authenticity of some sites.... Is there
a way to turn this off as the other ways I have been told of dont seem to be
working? Spoofstick plugin doesnt help either. Any help would be appreciated.
> the network.enableIDN off to going to the compreg.dat and editing out the lines
> mentioning IDN (which was then overwritten by firefox), but have been unable 

The preference is indeed broken.  See bug 261934, which has the fix for the
preference.

You should be able to get around this problem by editing compreg.dat as
suggested, just make sure that you edit the compreg.dat that lives in your
Firefox profile directory.  Keep in mind that Firefox re-generates compreg.dat
whenever a new extension is installed, so this is not a great solution.
Here are some potentially interesting references on this issue:
*  "The Homograph Attack", Communications of the ACM, 45(2):128, February 2002
http://www.cs.technion.ac.il/~gabr/papers/homograph.html
* Method for detecting a homographic attack in a webpage by means of language
identification and comparison http://www.priorartdatabase.com/IPCOM/000010253/
*  Draft Unicode Technical Report #36, Security Considerations for the
Implementation of Unicode and Related Technology
http://www.unicode.org/reports/tr36/tr36-1.html
* IDN Language Table Registry http://www.iana.org/assignments/idn/ 
* IANA registered language table list:
http://www.iana.org/assignments/idn/registered.htm

Regarding the last link: note how the registered tables for Greek, Hebrew and
Arabic do not include any Latin letters. On the other hand, the tables for
Japanese, Thai and Korean _do_, but these scripts are sufficiently unlike Latin
script that no confusion is likely to occur between their native characters and
the Latin characters. As yet, there is no registered table for Cyrillic, but I
doubt that it would need Latin characters in it.

There is also quite a lot of activity on the Unicode mailing list about this
topic. http://www.unicode.org/consortium/distlist.html
Here are some more useful references:

* ICANN Briefing Paper on IDN Permissible Code Point Problems
http://www.icann.org/committees/idn/idn-codepoint-paper.htm
* ICANN Input to the IETF on Permissible Code Point Problems
http://www.icann.org/committees/idn/idn-codepoint-input.htm
*** Bug 281496 has been marked as a duplicate of this bug. ***
This is a proposed "blacklist" of valid Unicode character ranges which are
unlikely to ever be used in any valid domain name in any language. The names of

ranges are those given by the Unicode Consortium. Note that this blacklist will

not _of itself_ eliminate the homograph problem, but it will substantially
reduce the number of possible characters avaliable for homograph spoofing. At
the moment, I make no proposal as to how the blacklist should be used; it's
just
a collection of character ranges containing characters that make no sense being

included in any domain name, in any language. 

I would appreciate any comments regarding ranges that should be added to or
taken out of this list.

Not that the above assumes that NAMEPREP has been applied first to normalise
the string prior to scanning for blacklisted characters.
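
For anyone who wants to experiment, here is a sketch of how such a blacklist
could be applied. The ranges below are placeholders standing in for the real
list in the attachment:

from bisect import bisect_right

# Placeholder ranges; the real list is the attachment above.
BLACKLIST = [
    (0x2100, 0x214F),    # Letterlike Symbols (includes ANGSTROM SIGN)
    (0x2460, 0x24FF),    # Enclosed Alphanumerics
    (0x1D400, 0x1D7FF),  # Mathematical Alphanumeric Symbols
]
STARTS = [lo for lo, _ in BLACKLIST]

def is_blacklisted(ch):
    # Binary-search the sorted range list for ch's code point.
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= BLACKLIST[i][1]

# MATHEMATICAL SANS-SERIF SMALL A, from the all-maths spoof mentioned earlier:
print(is_blacklisted("\U0001D5BA"))   # True
print(is_blacklisted("a"))            # False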
*** Bug 281507 has been marked as a duplicate of this bug. ***
A programmatic analysis of homographs from the Unicode data shows 2661 possible
cross-script one-way clashes.

Applying the blacklist in the attachment (but without the ideographic
description characters, which will probably be needed) reduces this to only 462
cross-script one-way clashes. 

More analysis to follow.
only for reference: Secunia Advisory SA14163 (Mozilla / Firefox / Camino IDN
Spoofing Security Issue)

http://secunia.com/advisories/14163/
http://secunia.com/multiple_browsers_idn_spoofing_test/
I think I fixed it on my Mac. I took a reference from
http://users.tns.net/~skingery/weblog/2005/02/permanent-fix-for-shmoo-group-exploit.html

They refer to a file called compreg.dat, but they locate it in the user profile
data. I located one in my main install. 

/Applications/Mozilla.app/Contents/MacOS/components

That is where I changed it. I even reset the about:config back to default. It
seems to work.

You change the one line of text in compreg.dat from ...

(Scroll down to the [CONTRACTIDS] section ...)

@mozilla.org/network/idn-service;1,{62b778a6-bce3-456b-8c31-2865fbb68c91}

Change the 1 to a 0 so the line reads:

@mozilla.org/network/idn-service;0,{62b778a6-bce3-456b-8c31-2865fbb68c91}

This really worked under Mozilla for Mac. The "paypal" spoof no longer works in
my Mozilla browser.
(In reply to comment #42)
It is easier to update to current branch builds (or trunk if you want) and use
the pref. See bug 281506 comment 1.
There are also ASCII characters that look very similar with some fonts: l
(lowercase L), 1 (digit), I (uppercase i).
FYI, Unicode.org has a proposed draft tech report:

 Proposed Draft Unicode Technical Report #36 (1.0 version dated 2004-10-12) 
 Security Considerations for the Implementation of Unicode and Related Technology

which includes a section on Visual Spoofing:
  http://www.unicode.org/reports/tr36/#visual_spoofing

which lists 2 recommendations:

(1) Cross-Script Spoofing: the user should be alerted to these cases by
displaying mixed scripts with some special formatting to alert the user to the
situation. For example, a different color and special boundary marks, are used
in Example 2c. A tool-tip can be displayed when the user moves the mouse over
the address to display more information about the situation.

(2) Inadequate Rendering Support: Browsers and similar programs should follow
the Unicode Standard guidelines to avoid spoofing problems. There is a technical
note, UTN #2: Rendering Combining Marks (http://www.unicode.org/notes/tn2/),
which provides information as to how this can be implemented even in the absence
of font support.
*** Bug 281474 has been marked as a duplicate of this bug. ***
Here's a workaround for linux, I'm sure there's something similar in other 
os's, but I don't have access to them to look.  This does disable all idn 
service lookups as far as I can tell.  This should help with the security 
issue at the moment until a more feasible solution can be found. 
 
 open a terminal and type... 
 $ cd ~/.mozilla/firefox/ 
  
 in that folder will be another folder where the name will depend on your
profile name; if you used the default, the folder will be
 foobar.default
  
 change to the *.default folder and type... 
 $ vim (or vi, kvim, gvim, scite, etc) compreg.dat 
  
 now use vi's search function by typing.... 
 /idn-service;1 
  
 You will find two locations that match it, highlight the 1 with the cursor, 
and use the 'r' key to replace the 1 with a 0.  
 Do this for both locations, then go back to www.shmoo.com/idn and test, and
it won't allow you to navigate to the page.  I've tried testing it on a few 
fake sites and it doesn't allow navigation to them. 
After more analysis of the Unicode cross-reference tables, I can see that an
attempt to enumerate 100% of all possible homograph sets is probably not
feasible without massive effort (although making equivalence classes from the
crossrefs has found a great many). However, it has given me a lot more insight
into the problem.

Homographs are generally unpopular within a single writing system. On the other
hand, many simple symbols have been either re-used or re-invented in many
alphabets. So the secret of homograph spoofing is mixing languages and/or symbol
sets. 

This proposal suggests a method for detecting language mixing.

However, there is not a 1:1 correspondence between writing systems and code
ranges. Some writing systems are split across a number of code ranges; others
use characters from other writing systems -- for example, both Cyrillic and
Japanese use the ASCII numerals. Nor is there a 1:1 mapping between writing
systems and languages; for example, Japanese uses four distinct writing systems.

However, we _should_ be able to map from _sets_ of code point ranges, some
per-character attributes, and one small set of special case characters, to the
plausibility of a DNS label.

So how about the following algorithm for a single label in a domain name:
1. Run the string through NAMEPREP.
2. If there are leading combining characters, reject as malformed.
3. Assign each character to a character range, according to the official Unicode
code point ranges; except that: characters 0123456789 and HYPHEN are special,
and go in a special range of their own.
4. If there are any characters from "blacklisted" code point ranges, reject the
string as suspicious. A blacklist is a powerful way of limiting spoofers' options. 
5. If there are any other Unicode punctuation characters apart from HYPHEN,
reject as suspicious.
6. If there are any Unicode whitespace characters, reject as suspicious.
7. Now look at the set of character ranges used; are they compatible with a
single writing system/language set? This would consist either of one range and
optional ASCII digits + HYPHEN, or any of a number of hard-coded sets dealing
with cases such as Japanese and Chinese. If the set of ranges is not compatible
with a single script, reject the string as suspicious.
8. If all the tests above pass, return OK. (A rough sketch of steps 1-8 follows below.)

This would certainly raise the bar for spoofers to jump over quite
substantially, and would not be very code intensive; the script-lookup code is
tiny, and the number of special cases rather small, even when considering
obscure languages. 

If this looks plausible, we can then use the test homographs I've discovered,
and the existing spoofing examples, to test the effectiveness of such an algorithm.

There are still other issues to look at, even if this is a possible solution:
* Forwards-compatibility and future Unicode allocation policy
* RFC compliance
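
Here is a rough, runnable Python sketch of steps 1-8. The range tables,
blacklist and compatible-set list are deliberately abbreviated placeholders:

import unicodedata

RANGES = {
    "LATIN": (0x0061, 0x024F),
    "CYRILLIC": (0x0400, 0x04FF),
    "HIRAGANA": (0x3040, 0x309F),
    "KATAKANA": (0x30A0, 0x30FF),
    "CJK": (0x4E00, 0x9FFF),
}
SPECIAL = set("0123456789-")                    # step 3's special range
BLACKLIST = [(0x2100, 0x214F)]                  # abbreviated
COMPATIBLE = [{"LATIN"}, {"CYRILLIC"},          # step 7's hard-coded sets
              {"HIRAGANA", "KATAKANA", "CJK", "LATIN"}]   # Japanese

def check_label(label):
    label = unicodedata.normalize("NFKC", label).casefold()   # ~step 1
    if label and unicodedata.combining(label[0]):             # step 2
        return "malformed: leading combining character"
    used = set()
    for ch in label:
        if ch in SPECIAL:                                     # step 3
            continue
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in BLACKLIST):       # step 4
            return "suspicious: blacklisted character"
        cat = unicodedata.category(ch)
        if cat.startswith("P"):                               # step 5
            return "suspicious: punctuation"
        if cat.startswith("Z"):                               # step 6
            return "suspicious: whitespace"
        for name, (lo, hi) in RANGES.items():
            if lo <= cp <= hi:
                used.add(name)
                break
    if used and not any(used <= ok for ok in COMPATIBLE):     # step 7
        return "suspicious: mixed scripts %s" % sorted(used)
    return "OK"                                               # step 8

print(check_label("p\u0430ypal"))   # flagged: Latin + Cyrillic mix
print(check_label("bücher"))        # OK: single (Latin) writing system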
*** Bug 281578 has been marked as a duplicate of this bug. ***
I have attached a small Python program that analyzes internationalized domain
names for cross-script homograph attacks, and tries to detect various possible
kinds of spoofing.
This is a slightly more paranoid version of the previous IDN checker. Doubtless
it still has many deficiencies. However, I would be interested in any comments
about how it might be improved, or interesting counterexamples that it
currently cannot detect.
Attachment #173811 - Attachment is obsolete: true
Blocks: 281496
After yet more thinking, I can identify three different categories of potential
problem:

1. Homographs that cross language/writing system boundaries. We should be able
to catch all or nearly all of these programmatically, eliminating a vast class of
attacks.

2. Homographs within a single language/writing system. Unfortunately, the
writing system most affected by this is the Latin writing system. For example,
consider the problem of detecting spoofing by using the many variants of the
letter 'i'.

3. Confusion generated by _semantic_ duplicates. This is a major problem for the
CJK writing system family.

TLDs which follow the single-language and label-filtration practices recommended
by IANA will be more resistant to problem 2: for example, readers of domains
within a Turkish-language domain will be sensitive to the difference between
dotted and dotless-i. 

This still fails to address problem 2 within multiple-script-system gTLDs, where
readers will not in general be sensitive to homographs outside their own
language subset of their local writing system. For example, French readers who
will easily notice cedillas on 'c's will probably not notice the i-variants
which are outside the scope of their own language. (English readers, with no
native accented letters, are of course worst off of all).

How can problems 2 and 3 be solved?

The current favourite solution is "bundling", where a single domain registration
also registers all possible variants. However, this can only occur at the domain
registrar end, or, at extra cost, by having domain registrants register all the
possible variants of their domains. 

There are several technical problems with this: 
* how do you know that you have _exhaustively_ enumerated all possible variants?
A spoofer only needs one missed variant, and they've won.
* backwards compatibility with existing registrations
* infrastructure problems; some names may potentially have thousands or even
millions of variants to be bundled  

And there are a number of business problems with this:
* the registrar incurs extra costs without extra revenue, providing a
disincentive to do it at all
* generating bundles means fewer possible strings available to sell
* potential legal liability issues; have they missed a homograph? Once they've
started along this route, should they include entries in the bundle to resolve
'0'-'o', 'l'-'I' and other possible near-misses? What about characters which are
homographs in one font, but not in another, or in one casing, but not in
another? Should they generate possible simple typos in the bundle? And so on.

Ways forward:
* we should consider a programmatic approach for problem 1, which will nip in
the bud a huge number of potential attacks in non-Latin writing systems
* problem 2 needs more thought, and it is also probably the most pressing
problem, given that almost all existing domains are registered within the Latin
writing system 
* problem 3 is a matter for oriental-language experts, and I believe this is
being discussed in the IDN community.
Regarding problem 2, homographs within the Latin writing system, here is a
heuristic that will probably catch a great many current spoofing attempts:

(This is a first hack at the logic, so please excuse any clunkiness).

* we first assume that NAMEPREP, and the checks for cross-script spoofing
(problem 1) have been applied, and passed.

* next, look at the length of the TLD name. If it is two characters long, I
believe that IANA policy requires it to be a country-code TLD (ccTLD). In this
case, assume that the ccTLD owner knows what they are doing with regard to
language and character set filtration, and return OK.

* if the TLD name is not a ccTLD, it's a gTLD. In this case, we are rather less
confident in the registrar doing the right thing. Now look at the whole FQDN. 

* Since most legacy domain names are all-ASCII, "ASCIIfy" the whole FQDN. This
means, for each character in the FQDN, changing it to the corresponding
unaccented ASCII character (a transform which is easily programmatically
computed from Unicode character names).  

* Is the ASCIIfied FQDN identical to the FQDN? Then each of its labels is
either ASCII, or belongs to just one non-Latin language (see the check for
Problem 1). In any case, if the ASCIIfied FQDN is identical to the FQDN, return OK. 

* Now look up the "ASCIIfied" FQDN. If this name lookup returns a value, then
the FQDN is _possibly_ spoofed. If the ASCIIfied name lookup fails, we return OK.

Note that this does not catch spoofing attempts for one non-ASCII Latin IDN
domain against another non-ASCII Latin IDN domain; however, since most legacy
domains, including most current high-value targets, are pre-IDN names, this will
go a long way to ameliorating problem 2.

Downside: an extra name lookup for non-ASCII Latin-script IDNs in gTLDs. Name
lookup caching will, of course, greatly reduce this overhead.

Another downside is that it means that if the domain owner registered both the
accented and non-accented variants (very common in Europe!) their accented
version will raise an alarm each time.
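
A hedged Python sketch of the lookup heuristic, with accent-stripping
approximated by NFKD decomposition plus dropping combining marks (and note the
caveats in the following comments; a positive result here is only a hint, not
proof of spoofing):

import socket
import unicodedata

def asciify(fqdn):
    # Approximate the "ASCIIfy" transform: NFKD-decompose, then drop the
    # combining marks, leaving the unaccented base characters.
    decomposed = unicodedata.normalize("NFKD", fqdn)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def possibly_spoofed(fqdn):
    # True if the accent-stripped twin of fqdn also resolves in DNS.
    stripped = asciify(fqdn)
    if stripped == fqdn:
        return False                 # already plain: return OK
    try:
        socket.gethostbyname(stripped)
        return True                  # the plain twin exists: possible spoof
    except (socket.gaierror, UnicodeError):
        return False                 # lookup failed: return OK

# e.g. possibly_spoofed("www.bücher.ch") asks DNS whether www.bucher.ch exists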
Ian, I was going to snap back with "yes, but if they resolve to the same IP
address, we know they're the same!", but then I thought about round-robin DNS,
HTTP redirect, Akamai, anycast, and so on... 

And then I thought about automated WHOIS queries, but only as a joke.

Back to the drawing board with that last bit, then.
Hmm. I'm having quite a hard time working out what might work in general.

* Reverse DNS won't work for virtual hosted domains
* Looking at the MX records won't work; spoofable, not always present
* Looking at SPF records won't work; spoofable, not always present (SPF works
because although the _records_ might be copyable to another domain, the machines
pointed to by those addresses remain under the control of the entity that
controls the real domain)
* Looking at the NS record chain won't work because both domains might be farmed
out to the same outsourced DNS provider
* Even matching A records won't work if you are truly paranoid, since the two
domains, one valid, one not, might both be hosted on the same outsourced virtual
host provider's machine
* And note that if you use an anonymizing DNS registrar who won't publish your
details on WHOIS, even that opens you up to your spoofer registering with the
same registrar

Any ideas for a _reliable_ way of mapping domain names to controlling entities?


Nor, for that matter, after a bit of experimenting, can we say "let's just fetch
and compare their PKI certificates", if present. Most commercial sites won't
allow TLS/SSL connections on most URL paths.

Nor would it be useful to check "do they serve the same content?", even if this
check was cheap to implement. The front page of a spoofed site might well be
identical to that of the real site, with the scam being located on an inside
page. Indeed, spoofed sites would be very likely to copy the exact content of
their target's entry page, to make them more likely to resist human scrutiny.

My current thinking is on the lines of heuristics again: although any one of the
DNS-based tests in the previous comment is possible for a spoofer to get around,
what is the likelihood of them being able to accomplish several of these attacks
at once? Also, most high-profile targets tend not to outsource their web-serving
to shared virtual servers, although of course, we are then back to Akamai, which
some of them _do_ use.
just keep in mind that if you do any UI fixes we'll need to do them for camino
too, so the more you can keep in the backend the better. We'd also like this on
the 1.7 branch so camino 0.8.x can take advantage.
OK, back to the case analysis. 

There are several layers to be considered here:
* The name level, where we consider a name as an identifier, a static entity
in a vacuum
* The DNS lookup level, where names _dynamically_ map to resources such as IP
addresses, MX records and so on
* The protocol level, where DNS names are only part of the overall name; for
example, one HTTP URL might redirect to another, perhaps with a different
version of the DNS name in. Similarly, hosting companies bind content to URLs at
this level.
* The authority level; who _controls_ the given resource. This is _not_
necessarily the entity that hosts it. This appears to be crucial. PKI
certificates supposedly bind this level to the DNS name level.

Ideally, we want a strong link through the entire chain, or at least a strongly
plausible set of links which make spoofing very difficult to do without forging
many different links, and thus more likely to be detected by one of the entities
involved in making the chain.
How about displaying "strange" domain names differently? By "strange", I mean
ones that use characters outside the "normal" set for the user's language
(which we know, because they are using a particular localized version, right?).
This different display could be something as simple as placing a '!' before
the offending characters or a certain amount of space. If we want to go even
further, we could use a bold font or a larger font or both.

IDNs are intended to allow people all over the world to use their own languages
in domain names. They are not intended to allow domain owners to register
"cute" misspellings (or spoofed names). So it's OK to penalize these "strange"
domain names by displaying them differently in the URL and status bars (and
an even more conspicuous display in security dialogs such as certs).
For the tactic of displaying characters outside the user's expected language in
a distinctive way, the following expired Internet-Draft might be useful:
http://www.alvestrand.no/ietf/lang-chars.txt

This seems to be mostly useful for Latin-based languages, and some
Cyrillic-based languages.

However, to quote from the Internet-Draft:

    There are a lot of languages in the world. Estimates vary between
    500 and 6000, with some eternal conflicts about the difference
    between a language and a dialect guaranteeing that any list
    claiming to be authoritative will be the source of endless debate.

    Many of these languages have a writing system. Some have several.
    These are also likely to have changed over time, with the meaning
    of character symbols changing, the shape of the characters
    changing, or completely new characters being added, or old ones
    removed from the set. This means that even within a single
    language, a list of characters is likely to be controversial.

    These problems have made several experts in the field of languages
    and characters refuse to even consider the idea of working out
    such a list.

For other languages, we will probably have to use the Unicode code point ranges.

So, now we have three proposals for anti-spoofing techniques, which are each
potentially complementary:
* detecting broken Unicode and cross-writing-system mixes in IDN labels
* attempting to detect possible spoofs by doing a DNS lookup on an
accent-stripped version of the IDN, and checking if the two resources are the
same if both lookups succeed
* displaying characters outside the user locale in a distinctive way, or
otherwise providing a warning that they are being used

The first approach is linguistic; the second requires lookups; the third is
GUI-based.

Are there any other techniques which can be added to these?
Thinking about high-level requirements:

-We must not place excessive burden on non-latin scripts users
-We must detect spoofed domains 
-We must try to not create situations where users just 'click yes' (This is what
most users do today when getting ssl warnings). 
-We must make the IDN spoofing solution accessible (for example, only shading
background colors wouldn't meet this requirement)


Some ideas for meeting those requirements:

1.  Assuming we can come up with a clear 'name' for each script (German,
Russian, Latin, etc), we can create a whitelist of scripts which users can
manage.   If a domain gets requested that's not in the user's 'whitelist', they
get prompted/warned.  Note that all that is really required then to spoof the
domain is "Latin+ Cyrillic", but it's a step in the right direction.  

2.  Heuristic matching of IDNs - - create a database of commonly 'forged' chars
between scripts.  For example:  Cyrillic-lowcase-a looks just like
ASCII/latin-lowercase-a.  Mark these chars as 'suspect'.  If a domain has either
a very high ratio of suspect chars, or a very low ratio of suspect chars, warn
the user.   This gets rather nasty with Traditional Chinese/Japanese.   Even if
the 'suspect chars' db won't work (due to the effort required to populate it),
you can do a similar thing by matching ratios of codepages; if it's 90% latin &
10% Cyrillic, it's likely we are being spoofed (see the sketch after this list).

3.  You can detect the script/codepage the target HTML webpage is in, in order to
see if it matches the script(s) the domain uses.

4.  For SSL-related sites, the browser could display the punycode version of the
IDN next to the lock icon (today it displays the UTF-8).
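
A minimal sketch of the ratio idea from item 2. The two-script mapping and the
20% threshold are arbitrary illustrations:

def script_of(ch):
    # Minimal two-script mapping, just for this demo.
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F:
        return "LATIN"
    if 0x0400 <= cp <= 0x04FF:
        return "CYRILLIC"
    return None

def script_ratios(label):
    # Share of each script among the label's mapped characters.
    counts = {}
    for ch in label:
        s = script_of(ch)
        if s:
            counts[s] = counts.get(s, 0) + 1
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}

def looks_spoofy(label, threshold=0.2):
    # Suspicious if a second script appears but stays below the threshold.
    ratios = script_ratios(label)
    return len(ratios) > 1 and min(ratios.values()) < threshold

print(looks_spoofy("p\u0430ypal"))   # True: Cyrillic share is only ~0.17
print(looks_spoofy("paypal"))        # False: single script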
Yes, lists of characters used in languages are controversial, but we do not
have to use authoritative lists or even fixed ones. We should come up with an
API that can hide the strategy in the implementation. The lists of chars or
char ranges (if any) can change over time. For Japanese, for example, we might
even consider testing whether the Unicode Han character is in one of their
standard sets (i.e. JIS X 0208, JIS X 0212, etc). A Chinese character that
falls outside the primary JIS set can be flagged with a different display.

Note that this only affects the *display* of the domain name, not its actual
lookup. So it doesn't have to be perfect or exhaustive.
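
One cheap way to approximate that JIS X 0208 test in Python is to lean on a
codec whose repertoire is (close to) that set. This is a rough sketch; JIS X
0212 would need a wider codec such as iso2022_jp_2:

def in_primary_jis(ch):
    # The iso2022_jp codec covers JIS X 0201/0208, so a character that
    # fails to encode falls outside the primary JIS repertoire.
    try:
        ch.encode("iso2022_jp")
        return True
    except UnicodeEncodeError:
        return False

print(in_primary_jis("\u6f22"))   # True: common Han character (漢)
print(in_primary_jis("\u9fa5"))   # False: Han character outside JIS X 0208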
We shouldn't put the punycode next to the lock icon in Firefox, the name must be
readable. Yes, it would have helped in this paypal spoof case to be able to say
"hey, random garbage, must not be the real paypal." But for users in regions
where IDN is heavily used it isn't going to help anyone to have a large
percentage of the legit sites display unreadable random garbage. It won't help
the users tell whether they're on the right site, and it just makes the browser
look broken.
Regarding a big homograph pair table: yes, that was my first idea, and I even
attempted to compile one. After generating a _long_ list programmatically, I
realised that it didn't cover many that I could see by eye from the code tables.

Given that there are 12,886 alphabetic or symbol characters in Unicode 3.2 even
discounting CJK, Hangul and so on (which would give a grand total of 95,156 if
included), you would have to inspect n(n+1)/2 - n = 83,018,055 possible
character pairs. Including CJK etc, that would give a ludicrous 4,527,284,590
possible pairs to inspect. Eliminating great swaths of pairs by character set
alone does not work; some of the trickiest homographs are between semantically
unrelated characters in quite unrelated character sets. Remember that in the
long term, we have to consider not only the IDN-to-ASCII spoofing problem, but
IDN-to-IDN spoofing.

On the other hand, restricting each label to a single writing system, (or family
of mutually compatible writing systems, in a few special cases) works rather
well at eliminating the possible spoofing pairs, reducing the total by many
orders of magnitude. Enforcing the limit is also firmly within the spirit of
IANA's recommendations, which recommend that each label come from a single
well-defined language.

Blacklisting symbol and other specialized character sets makes the total smaller
still. Now, with all the characters in a single label limited to a single
writing-system-group (to coin a term), we need only to consider homographs
_within a particular writing-system-group_, or the possibility of constructing a
whole-label homograph. Now, whole-label homographs are possible in theory if an
entire label consists entirely of homographs from the same script-pair (consider
faking ayayay.com using Cyrillic characters), but in practice the statistical
properties of most languages are such as to make these unlikely (consider
finding a script with homographs for each of 'g', 'o', 'l' and 'e' (google), or
'a', 'm', 'z', 'o', 'n' (amazon), 'm', 'i', 'c', 'r', 'o', 's', 'f', 't'
(microsoft)). 

Once we've done this, we can then worry about confusable characters within the
writing system of a particular label, and in particular, collisions with ASCII
domain names. Only when we've done this, should we add extra code to deal with
specific dangerous character pairs we already know about.
FWIW, I created a simple Firefox extension that effectively kills IDN support:
http://friedfish.homeip.net/extensions/no-idn.xpi
(In reply to comment #62)
> 3.  You can detect the script/codepage the target HTML webpage is in, in order to
> see if it matches the script(s) the domain uses.

HTML doesn't normally have any script labelling, but the "codepage" (i.e.
charset) is sometimes indicated. When the charset is a universal one like
UTF-8, it is harder to tell what language the document is written in.

If ICANN and the like do not already have this recommendation, perhaps we
could have them add that security-conscious sites should label their HTML
documents with the language so that browsers can check that the domain name
used to get to their site does not contain characters normally found outside
their language.
(In reply to comment #65)
> On the other hand, restricting each label to a single writing system, (or family
> of mutually compatible writing systems, in a few special cases) works rather
> well at eliminating the possible spoofing pairs, reducing the total by many
> orders of magnitude. Enforcing the limit is also firmly within the spirit of
> IANA's recommendations, which recommend that each label come from a single
> well-defined language.

But some domains do have a few characters that are of a different writing system
than the rest of the domain. For example,

In literal form:
http://www.färgbolaget.nu
http://www.bücher.de
http://www.brændendekærlighed.com
http://www.räksmörgås.se
http://www.färjestadsbk.net
http://www.mäkitorppa.com
http://www.ma&#776;kitorppa.com

In escaped form:
http://www.f&#x00E4;rgbolaget.nu
http://www.b&#x00FC;cher.de
http://www.br&#x00E6;ndendek&#x00E6;rlighed.com
http://www.r&#x00E4;ksm&#x00F6;rg&#x00E5;s.se
http://www.f&#x00E4;rjestadsbk.net
http://www.m&#x00E4;kitorppa.com
http://www.m&#x0061;&#x0308;kitorppa.com
*** Bug 281674 has been marked as a duplicate of this bug. ***
> But some domains do have a few characters that are of a different writing system
> than the rest of the domain. For example,

All of your examples use LATIN characters; there are no different writing
systems in sight. See the IANA language tables referenced in comment 35.
How about showing something like this in the status bar when the user moves
the mouse over a suspicious link:

  http://www.payp!a!l.com/

If the user doesn't have the sidebar turned on and clicks on the link, then
we can show a similar string in the location bar. However, cut/copy/paste of
the location bar should not include the '!' characters.

If the user doesn't have the location bar turned on, then they won't see the
lock icon either, which means that they don't know or don't care about
security anyway. Educating the user about security is also important.
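
A rough sketch of that display transform. LATIN_OK is a stand-in for whatever
per-user "expected characters" set we settle on, and the marked string is for
display only, per the cut/copy/paste caveat above:

LATIN_OK = set("abcdefghijklmnopqrstuvwxyz0123456789-./:")

def mark_suspicious(url, ok=LATIN_OK):
    # Display-only transform: wrap each unexpected character in '!'.
    return "".join(ch if ch in ok else "!%s!" % ch for ch in url.lower())

# The Cyrillic 'а' in the paypal spoof gets flagged:
print(mark_suspicious("http://www.p\u0430ypal.com/"))
# -> http://www.p!а!ypal.com/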
(In reply to comment #71)
> If the user doesn't have the sidebar turned on and clicks on the link, then
> we can show a similar string in the location bar.

I meant to say status bar, *not* sidebar. Oops...
(In reply to comment #71)
> How about showing something like this in the status bar when the user moves
> the mouse over a suspicious link:
> 
>   http://www.payp!a!l.com/
>
> If the user doesn't have the [statusbar] turned on and clicks on the link,
> then we can show a similar string in the location bar.

I think it would be just as effective and more user friendly if suspicious
characters were rendered in a different font and/or colour, much like Konqueror
does as shown in comment 12.  Adding superfluous characters like '!' would just
be confusing.
(In reply to comment #73)
> I think it would be just as effective and more user friendly if suspicious
> characters were rendered in a different font and/or colour, much like Konqueror
> does as shown in comment 12.  Adding superfluous characters like '!' would just
> be confusing.

I had a look at comment 12, but the font does not look that different and the
color doesn't either. I still like the '!' chars.

Maybe we should go even further and refuse to load a document from a domain
name containing characters outside the user's language. After all, IE doesn't
even support IDNs out-of-the-box. In countries where IDN is used a lot, we
could support IDNs using characters from the user's language only. Or they
could add languages that they need.

Or we could load any document, but only after the user has selected an item
buried deep inside lots of disclaimers.

If some domain registrars are not following the IDN guidelines, then the
browser may be the last line of defense. We could send a strong message to
these registrars by making it more difficult for users to reach the Web
sites with names that they neglected to filter.
(In reply to comment #73)
> I think it would be just as effective and more user friendly if suspicious
> characters were rendered in a different font and/or colour, much like Konqueror
> does as shown in comment 12. 

Can't just have the characters appear in different colors because of users who
are  color blind. Very hard for some or most to tell the difference in colors
without starring at it...and who really analyzes the address bar and link anyways?

I think there should be a small dialog box (i know, i know we all hate them)
like Thunderbird implemented for suspected phishing sites...also provide a link
in the dialog box to explain what exactly they got the warning for.
(In reply to comment #53)
> Regarding problem 2, homographs within the Latin writing system, here is a
> heuristic that will probably catch a great many current spoofing attempts:
> 
> * next, look at the length of the TLD name. If it is two characters long, I
> believe that IANA policy requires it to be a country-code TLD (ccTLD). In this
> case, assume that the ccTLD owner knows what they are doing with regard to
> language and character set filtration, and return OK.

This needs exceptions though, as the .cc ending, for example, is effectively
used like a gTLD.
Also, I don't think ccTLDs are safe. http://www.amazon.de/ for example, the
German local variant of Amazon, could probably be spoofed. (Although admittedly,
this particular URL doesn't contain any characters that could really be spoofed
in German.)
(In reply to comment #60)
> How about displaying "strange" domain names differently? By "strange", I mean
> ones that use characters outside the "normal" set for the user's language
> (which we know, because they are using a particular localized version, right?).

Not right. I might be an exception, but I always try to get the English versions
of programs, even though my native language is German. Given that the internet
is mostly English, I feel more comfortable in an English environment. That
doesn't mean, however, that I don't want to access http://www.öbb.at/ (the IDN
URL of the homepage of the Austrian railway) once in a while, without getting
warnings.

Also, there aren't localizations for all languages out there.
I've been doing some more exploring, and I've found this interesting triple.

http://www.bücher.ch/ redirects you to a German online bookstore
http://www.bucher.ch/ takes you to the web site of Bucher Biotec AG

Both entirely valid domain names, registered by different entities. And before
you think "ah, yes, but in German ü is another way of saying ue" consider that:

http://www.buecher.ch/ takes you to a Swiss online bookstore
A question: what do people consider as confusable characters within the Latin
writing system?

Let's try an easy example first: does the Latin Extended-A s-cedilla in
"micro&#351;oft" jump out at the viewer if they are not looking carefully?

Less noticeable, how about the Latin Extended-A dotless 'i' in "mıcrosoft"?

Or the Latin-1 accented 'i' in "mìcrosoft"? (This last being the nastiest, as it
is both the least visible, and also in the base Latin-1 set).
(In reply to comment #77)
> Not right. I might be an exception, but I always try to get the English versions
> of programs, even though my native language is German. Given that the internet
> is mostly English, I feel more comfortable in an English environment. That
> doesn't mean, however, that I don't want to access http://www.öbb.at/ (the IDN
> URL of the homepage of the Austrian railway) once in a while, without getting
> warnings.

OK, so how about checking against a set of languages for certain TLDs? For
ccTLDs that mostly use one language, we just check against that language.
Countries that normally use more than one language would be checked against
each language, but *not* the union of the languages.

There is an Internet Draft that discusses a "one domain label, one language"
rule:

http://www.ietf.org/internet-drafts/draft-klensin-reg-guidelines-05.txt

However, we can set stricter rules for ourselves if we think that that might
protect the user from phishers. For example, we might have a "one FQDN, one
language" rule.

> Also, there aren't localizations for all languages out there.

True. Those users are out of luck. But we can use other sources for the
language(s), such as ccTLDs, as described above. We could also look at the
set of languages preferred by the user. In Firefox, you can find these
under General > Languages in the preferences/options.
See the list of attachments for some code I've written to enforce "one language,
one label" -- this maps characters to code-point ranges, and sets of code-point
ranges to writing systems, and thence to languages. This would at a stroke catch
all of the Cyrillic/Latin alphabet exploits that are the subject of the recent
announcements, and a lot of potential future nastiness between other pairs of
script systems.

However, this does not quite slam the door shut, as there is still room for
exploits within a single writing system; notably, this is worst within the Latin
writing system with its many local variants. However, we may be able to do
something there based on the user locale.

As a first question:
* can anyone think of any reason, ever, for allowing multiple writing systems in
a single label, other than in the special exceptions given for Chinese, Japanese
and Korean? Can anyone point to an existing legitimate domain name which breaks
this rule?

IMO the proper fix for the ssl case (https://paypal.com) is to remove the
UserTrust network certificate from the store. Obviously they are not doing their
job and therefore they shouldn't be trusted.
(In reply to comment #81)
> this maps characters to code-point ranges, and sets of code-point
> ranges to writing systems, and thence to languages.

Instead of using code-point ranges and writing systems, it might be better
to use a set of characters for each language, as is done in the IANA registry.
This would be more in the spirit of IANA.
Whilst this is a very complex issue, we seem to be moving towards a rough
consensus about what needs to be done...

How about doing this multi-layered set of fixes:

* firstly, we make sure users can easily turn off IDNA entirely

* secondly, we ENFORCE "one language's writing-system-set, one label", and
blacklist all symbols, dead-language scripts, and other exotica, roughly in the
way I've coded in my example programs. This kills all cross-script exploits
stone dead. This can be done at the name-normalization level, or even better at
the DNS-lookup level, so that looking up these bogus names simply gives an error:
we can give a distinct error code for this, if needed. Note that we can't
enforce "one language, one FQDN", consider "www.<something in Thai>.com" as an
example.

* thirdly, we display characters in domain names which are outside of the user's
acceptable set, as defined in (for Firefox) Options > General > Languages, in
some distinctive way: for example, by adding a question mark, so that, for an
English user, "http://www.mìcrosoft.com/" would be displayed
"http://www.mì?crosoft.com/" (see the sketch after this list). Whilst this is
'soft-security', not many people will mistake "mì?crosoft" for "microsoft". Note
that by doing a simple text-substitution on name display, this can quite easily
work in every part of the GUI, without code being needed to do exotic font or
style changes. Note that this involves more paranoid character set selection
than simply referencing code pages; however, the Alvestrand Internet-Draft cited
above seems to be a good reference for Latin languages, and these are the ones
currently with the greatest risk exposure.
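A minimal sketch of that substitution, assuming the acceptable set has already
been derived from the user's language preferences (the set below is only an
en-US-flavoured stand-in):

ACCEPTABLE = set("abcdefghijklmnopqrstuvwxyz0123456789-.")

def flag_unfamiliar(host):
    # "www.mìcrosoft.com" -> "www.mì?crosoft.com"
    return "".join(c if c in ACCEPTABLE else c + "?" for c in host)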

We could also use a page-top-banner like the one used for popup blocking in
Firefox, or spam-blocking in Thunderbird, to warn the user "This page is from a
web address which contains unfamiliar symbols outside your preferred
language(s): you might want to check if it is genuine" ... can anyone think of
better wording for the banner?

And that's about as far as we can go, for now. Unfortunately, the "spotting the
correct version by lookup" technique is dead in the water for the time being,
due to existing allocations by registrars, and major technical problems in the
concept itself. There are also all sorts of horrible semantic problems lurking
with Chinese names, and the Chinese Unicode and IDN community are well aware of
this, and trying to find a solution through using IDN bundles. However, I think
this is out of the scope of the immediate problem, which needs an immediate fix.
(In reply to comment #76)
> Also, I don't think ccTLDs are safe. http://www.amazon.de/ for example, the
> German local variant of Amazon, could probably be spoofed.

.de domains only allow a very specific set of characters. (bottom of
http://www.denic.de/de/richtlinien.html)
(In reply to comment #85)
> .de domains only allow a very specific set of characters. (bottom of
> http://www.denic.de/de/richtlinien.html)

Have they tried to register this set with IANA?
(In reply to comment #84)
> make sure users can easily turn off IDNA entirely

And the default should probably be IDNA turned on.

> ENFORCE "one language's writing-system-set, one label", and
> blacklist all symbols, dead-language scripts, and other exotica

Why bother with blacklists when we can use whitelists a la IANA?

> This can be done at the name-normalization level, or even better at
> the DNS-lookup level

I'm not sure about doing it at the DNS level. If a user clicks on a link with
suspicious characters in the domain name, we should probably give some kind
of warning. But what about <img src="http://www.foo.com/image.gif">?

> Note that we can't
> enforce "one language, one FQDN", consider "www.<something in Thai>.com" as an
> example.

Yes, we can. Take a look at the Thai IDN registration at IANA. (I think they
may have made a mistake by including Latin capital letters, though.)
> Why bother with blacklists when we can use whitelists a la IANA?

Please read my previous postings here, and the numerous links to background
papers provided. This is a _hard_ problem, and it's clear the IDNA/registrar
community has not been thinking about this hard enough, or we would not have got
into this state in the first place. 

If we want to do "hard" blocking using whitelists, there are a vast number of
whitelists to draw up, and that will take lots and lots of time, and would
postpone a fix almost indefinitely until a whitelist had been drawn up for every
conceivable language.

The IANA whitelists only cover a tiny number of languages, and even then are
registrar-dependent (two different registrars could select different character
sets for the same language, for example). In addition, the IANA whitelists are
dependent on the language used _for the given label_, which is not 1:1
obtainable from the name of the TLD: for example, see the .info registration for
the de: language. Neither is it available from the DNS.   

In any case, it won't work. Consider the dotless-i and i-acute examples I cited
before. Both entirely valid strings in European languages that would pass a
whitelist.

In the long run, both blacklists and carefully-selected whitelists would be a
good idea; but for the moment, blacklists and "soft" whitelists as proposed
above give the greatest security gain for the least investment in time and
effort. I think that some of the examples given above show that this is not a
problem which is amenable to a perfect fix, given that human beings have chosen
to use languages which contain easily-confusable characters in the same writing
system. Nor is it possible to anticipate all possible attacks; during my research
over just a couple of days, I've found a number of possible new attacks (all of
which are dealt with in the latest version of my proposal, by the way). If I can
do that in a couple of days, I'm sure there are many more left. However, we can
make life many, many, orders of magnitude more difficult for spoofers by doing
some relatively simple things, including defeating all of the existing known
attacks.

Sometimes "good" is better than "best", if "best" means waiting a long time
first with the security hole still open. Remember that after we've rolled out a
"good" solution in the very near future, we can always work on making it
tighter, and aiming towards perfection in the long run.
(In reply to comment #88)
> > Why bother with blacklists when we can use whitelists a la IANA?
> 
> Please read my previous postings here, and the numerous links to background
> papers provided.

Um, let's take an example. The spoofed paypal.com one. The gTLD doesn't give
us any languages. Suppose the user uses US English, and hasn't changed
Firefox's General > Languages setting. So the only language we can check
against is en-US. It only contains Latin small letters, digits and a few
others.

So with my proposal, the status bar would show payp!a!l and if the user
clicked it anyway, your top banner would appear with a warning and my '!'
characters would appear in the location bar.

Do you see any blacklists in this picture? (I may have missed references to
blacklists in the RFCs and official guidelines. If so, please let me know
the specific location, chapter and verse.)

> it's clear the IDNA/registrar
> community has not been thinking about this hard enough, or we would not have got
> into this state in the first place.

The IDNA community *has* thought about it. Look at all their RFCs, guidelines
and IANA registrations.

The problem is the registrars. They don't seem to care. The Secunia test case
shouldn't even have been possible to register.

And the other problem is our browser, of course. IDN was checked into the
tree without whitelist checking.

> If we want to do "hard" blocking using whitelists, there are a vast number of
> whitelists to draw up, and that will take lots and lots of time, and would
> postpone a fix almost indefinitely until a whitelist had been drawn up for every
> concievable language.

Nope. We could put the whitelists on the mozilla.org site *too*, and the fixed
version of Firefox could download the most up-to-date versions. The initial
whitelists would come with the product.

> The IANA whitelists only cover a tiny number of languages, and even then are
> registrar-dependent (two different registrars could select different character
> sets for the same language, for example). In addition, the IANA whitelists are
> dependent on the language used _for the given label_, which is not 1:1
> obtainable from the name of the TLD: for example, see the .info registration for
> the de: language. Neither is it available from the DNS.

My wording may have been too vague, but I didn't mean to say that we would
use the IANA registrations themselves. We will have to come up with our own.
I used words like "a la IANA" and "spirit".

> In any case, it won't work. Consider the dotless-i and i-acute examples I cited
> before. Both entirely valid strings in European languages that would pass a
> whitelist.

Dotless-i and i-acute do not occur in *all* European languages.

> Sometimes "good" is better than "best", if "best" means waiting a long time
> first with the security hole still open. Remember that after we've rolled out a
> "good" solution in the very near future, we can always work on making it
> tighter, and aiming towards perfection in the long run.

We are in violent agreement here. :-)
I think we are settling down to a relatively small set of practical methods to
prevent spoofing; perhaps we could start to write some code soon?

Here is another justification for blacklists.

Just to give you nightmares, here is another scenario. As you know, NAMEPREP
will map quite a lot of things to the lowercase ASCII letters. Now suppose that
someone has coded up a spoofed address in just these characters, Punycoded it,
and slipped it past a dumb registrar, perhaps using an automated domain
transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
both of which will map to the ASCII string foo.com after being un-Punycoded and
(for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
of xn-<something> can now pass out links to their site, and "foo.com" will be
displayed in the browser bar in 100% correct ASCII. Disaster!

So, what we need to do here is to apply some of the blacklist logic prior to
using NAMEPREP, as well as applying blacklist/whitelist logic after. Belt and
braces.

So, the methods seem to be boiling down to:
* Character range blacklists, both before and after NAMEPREP, for totally
unreasonable characters, like Linear B, surrogates, control-image graphics and
so on and so forth.
* Enforce the prevention of script-family mixing in labels, except as permitted
in the CJK languages
* Make the data tables for these script-family lists auto-updatable, so we can
fix it if we get it wrong, or if the Unicode standard changes?
* Per-language strict whitelists for user-specified "accept" languages, to be
worked out on a language-by-language basis and auto-updatable from the Mozilla
site, which are used to add warning characters to displayed text in the GUI.

If this overall approach is OK by people, I can start generating data tables and
Python proof-of-concept code ASAP.

Or -- if this is not a good idea -- please explain why, and propose something
else that's better!
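To make the "belt and braces" point concrete, here is a Python sketch using the
stdlib's RFC 3454 tables; NFKC stands in for full NAMEPREP here, and the choice
of tables is illustrative only:

import stringprep
import unicodedata

def passes_blacklists(label):
    # Belt and braces: run the character blacklist over the raw label
    # and again after normalization (NFKC standing in for NAMEPREP).
    for text in (label, unicodedata.normalize("NFKC", label)):
        for ch in text:
            if (stringprep.in_table_c3(ch)          # private use
                    or stringprep.in_table_c4(ch)   # non-character code points
                    or stringprep.in_table_c5(ch)   # surrogate codes
                    or stringprep.in_table_c6(ch)): # inappropriate for plain text
                return False
    return True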
As per an earlier comment requesting chapter and verse on blacklists:

Section 5 of the NAMEPREP RFC, RFC 3491, says:

>5. Prohibited Output
>
>   This profile specifies prohibiting using the following tables from
>   [STRINGPREP]:
>
>   Table C.1.2
>   Table C.2.2
>   Table C.3
>   Table C.4
>   Table C.5
>   Table C.6
>   Table C.7
>   Table C.8
>   Table C.9

and these tables are defined in RFC 3454.

Also, this ICANN document:

Internationalized Domain Names (IDN) Committee Input to the IETF on Permissible
Code Point Problems http://www.icann.org/committees/idn/idn-codepoint-input.htm, 
which recommends a whitelist-based scheme, but then goes on to specify a
_blacklist_ of what should not be included in any of the whitelists, as follows:

> ...at least the following sets of characters not be included, pending further
> analysis:
>
>    * line and symbol-drawing characters,
>    * symbols and icons that are neither alphabetic nor ideographic language 
> characters, such as typographical and pictographic dingbats,
>    * punctuation characters, and
>    * spacing characters.


Also, I seem to remember some RFC language somewhere saying that application
writers can apply their own extra constraints to IDNA interpretation... I'm
still looking for that.

By the way, note that one way of interpreting the RFC is that these forbidden
outputs are to be removed on a per-character basis: that would be a big mistake,
as it would allow the domain www.micro<forbiddencharacter>soft.com to be
registered, and then NAMEPREP will remove the character to generate a spoofed
name...
(In reply to comment #86)
> (In reply to comment #85)
> > .de domains only allow a very specific set of characters. (bottom of
> > http://www.denic.de/de/richtlinien.html)
> 
> Have they tried to register this set with IANA?

Interestingly, no they haven't registered it. Unlike the rather small list
registered for de: by .info, this one contains a vast number of characters,
including the eminently spoof-worthy accented and dotless 'i' variants, and
characters like LATIN SMALL LIGATURE OE, LATIN SMALL LETTER T WITH CEDILLA, and
LATIN SMALL LETTER KRA.

Again, this raises the issue of how we should compile whitelists: just because
this is the official .de registrar list for .de, does not make it a good list
for spoof detection. On the other hand, a "soft" whitelist which does not
include KRA, for example, will clearly flag that character as unusual in a
domain name, but not prevent users going to a page containing it.
(In reply to comment #90)
> Just to give you nightmares, here is another scenario. As you know, NAMEPREP
> will map quite a lot of things to the lowercase ASCII letters. Now suppose that
> someone has coded up a spoofed address in just these characters, Punycoded it,
> and slipped it past a dumb registrar, perhaps using an automated domain
> transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> both of which will map to the ASCII string foo.com after being un-Punycoded and
> (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> of xn-<something> can now pass out links to their site, and "foo.com" will be
> displayed in the browser bar in 100% correct ASCII. Disaster!

I just tried moving the mouse over a link to www.xn--amazn-mye.com and Firefox
showed the same string in the status bar. It did not un-Punycode it.

Am I misunderstanding your example?
*** Bug 281831 has been marked as a duplicate of this bug. ***
(In reply to comment #90)
> transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> both of which will map to the ASCII string foo.com after being un-Punycoded and
> (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> of xn-<something> can now pass out links to their site, and "foo.com" will be
> displayed in the browser bar in 100% correct ASCII. Disaster!

What we probably should do in that case is actually connect to foo.com.  That
doesn't seem to be the case now, but we also don't currently display foo.com in
the URL bar.
(In reply to comment #93)
> (In reply to comment #90)
> > Just to give you nightmares, here is another scenario. As you know, NAMEPREP
> > will map quite a lot of things to the lowercase ASCII letters. Now suppose that
> > someone has coded up a spoofed address in just these characters, Punycoded it,
> > and slipped it past a dumb registrar, perhaps using an automated domain
> > transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> > both of which will map to the ASCII string foo.com after being un-Punycoded and
> > (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> > of xn-<something> can now pass out links to their site, and "foo.com" will be
> > displayed in the browser bar in 100% correct ASCII. Disaster!
> 
> I just tried moving the mouse over a link to www.xn--amazn-mye.com and Firefox
> showed the same string in the status bar. It did not un-Punycode it.
> 
> Am I misunderstanding your example?

Yes you are. You are using 0x43e, CYRILLIC SMALL LETTER O, which NAMEPREP
normalizes to itself, not to LATIN SMALL LETTER O. By the way, I note that
someone has already registered that URL. I'll see if I can manufacture an example.
(In reply to comment #95)
> What we probably should do in that case is actually connect to foo.com.  That
> doesn't seem to be the case now, but we also don't currently display foo.com in
> the URL bar.

*If* we decide to decode a Punycoded domain name in a link clicked by the
user, then we should also check whether it is converted back to the original
when we run it through nameprep and punycode. If not, we should either warn
the user or pass the original to DNS or both, since it is malformed.
(In reply to comment #97)
> (In reply to comment #95)
> > What we probably should do in that case is actually connect to foo.com.  That
> > doesn't seem to be the case now, but we also don't currently display foo.com in
> > the URL bar.
> 
> *If* we decide to decode a Punycoded domain name in a link clicked by the
> user, then we should also check whether it is converted back to the original
> when we run it through nameprep and punycode. If not, we should either warn
> the user or pass the original to DNS or both, since it is malformed.


Yes! That would work. Similarly if the user were to type in a Punycoded domain
name in the browser bar.
Also, if the Punycoded name does not convert back to itself, then the original
(malformed) Punycode should be displayed in the status bar when the user mouses
over it.
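A Python sketch of that round-trip test, using the stdlib "idna" codec as a
stand-in for our own ToUnicode/ToASCII:

def display_form(ace_label):
    # Decode Punycode for display only if re-encoding reproduces exactly
    # the same ACE string; otherwise keep showing the raw (malformed) form.
    try:
        unicode_label = ace_label.encode("ascii").decode("idna")
        if unicode_label.encode("idna").decode("ascii") == ace_label.lower():
            return unicode_label
    except UnicodeError:
        pass
    return ace_label

So display_form("xn--pypal-4ve") yields the Unicode form, while a label that
does not round-trip stays in raw Punycode.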
Coming back to the black/white list issue, it seems that our disagreement was
due to some confusion over nameprepping. Mozilla already runs the string
through nameprep, I believe (mozilla/netwerk/dns/src).

I was talking about what to do *after* nameprepping. I believe we only need
whitelisting after nameprepping.
"Must have feature:  Disable/enable IDN in all mozilla products."

Disabling IDN should not just be a feature, but, since IDN is not in wide use,
should simply be disabled as a default in all Mozilla products and released
ASAP, possibly with a hidden pref to turn it on.
Tying up loose ends: What to do about <img src=...>. Should we just silently
load the image even if the domain name contains characters outside the user's
or ccTLD's language(s)?

Of course, if it's <img src="https://..."> then we need to check the cert.
(In reply to comment #101)
> "Must have feature:  Disable/enable IDN in all mozilla products."
> 
> Disabling IDN should not just be a feature, but, since IDN is not in wide use,
> should simply be disabled as a default in all Mozilla products and released
> ASAP, possibly with a hidden pref to turn it on.

I agree. Disabling IDN by default is the right thing to do _until_ there is a
verified working fix for the bulk of spoofing scenarios. It's quick, simple,
and will sort the PR problem in a snap, whilst allowing users who want IDN to
re-enable it by using a preference.

Given the range of problems here, it looks like we could be talking for at least
a couple of weeks before we reach consensus on a detailed set of fixes for all
of the more general problems, implement them, make test cases, and validate
them. We can't wait that long.

Perhaps this bug should be divided into a set of smaller bugs:
* disable IDN by default, and provide GUI for a user pref to turn it on, with an
appropriate warning.
* roll this out to existing Mozilla suite and Firebird users by automatic update.
This deals with the immediate pressing problem of poor security and hence bad PR.

* deal with cross-script IDN spoofing, blacklist "must not happen" characters in
IDNs, more paranoid Unicode string syntax checking (eg. no leading combining marks)
* use locale-based whitelists for displaying IDNs to prevent same-script IDN
visual spoofing
* deal properly with literal Punycode in domain names (ie. test for
NAMEPREP/Punycode round-tripping, then treat as if entered as Unicode)
* put all this in Gecko 1.8 / Firefox 1.1
This generates good PR by being the first with a proper fix for spoofing.

Now that we have four quite discrete sub-tasks, we can start solving them
one-by-one, without any inter-problem interactions, instead of all together at
once, which creates an apparently much larger problem.

(In reply to comment #102)
> Tying up loose ends: What to do about <img src=...>. Should we just silently
> load the image even if the domain name contains characters outside the user's
> or ccTLD's language(s)?
> 
> Of course, if it's <img src="https://..."> then we need to check the cert.

I think so. After all, we are implicitly trusting the page content generator at
this point, who is at liberty to point their image sources wherever they like.
On the other hand, when we go "View Image", or otherwise inspect the image URL,
it should go through the same URL display code that _will_ flag non-locale
characters, just as if they were in a page link or a user-entered link.

By the way, I have now generated language whitelists for all the languages in
the Internet-Draft, namely Afrikaans, Albanian, Basque, Breton, Bulgarian,
Byelorussian, Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto,
Estonian, Faeroese, Finnish, French, Frisian, Gaelic, Galician, German,
Greenlandic, Hungarian, Icelandic, Irish, Italian, Latin, Latvian, Lithuanian,
Macedonian, Maltese, Norwegian, Polish, Portuguese, Rhaeto-Romance, Romanian,
Russian, Sami, Serbian, Slovak, Slovenian, Sorbian, Spanish, Swedish, Turkish,
Ukrainian, and Welsh.

*** Bug 281863 has been marked as a duplicate of this bug. ***
(In reply to comment #103)
> * deal with cross-script IDN spoofing, blacklist "must not happen" characters in
> IDNs, more paranoid Unicode string syntax checking (eg. no leading combining
> marks)

If you're talking about adding more steps to the DNS label conversion, I
don't think you can do this and still claim conformance to the nameprep and
punycode RFCs.

This is an interoperability issue: The server sends us an HTML document, the
user clicks on a link, we convert it from HTML to Unicode and then pass it
through nameprep and punycode before sending it to the DNS server. The HTML
document author assumes that we will adhere to the HTML, Unicode, Nameprep
and Punycode specs. If we don't, we have an interop problem.

> * use locale-based whitelists for displaying IDNs to prevent same-script IDN
> visual spoofing

No, my language-based whitelist proposal is intended to alert the user to
potential spoofing across scripts as well as within a single script. For
example, the Cyrillic small letter 'a' is outside the en-US language *and*
outside the Latin script. The i-acute is also outside en-US but inside Latin.
This is a series of character repertoires for various languages.

Source: from expired Internet-Draft "Characters and character sets for various
languages", by Harald Tveit Alvestrand, draft-alvestrand-lang-char-03.txt
The characters given here for a language include the base set, the Required
characters, and the Important characters.
See the Internet-Draft for the definitions of these terms. 
One correction has been made: the entry for German contained a control
character. This has been removed.
This is a draft document generated from another draft; there is absolutely no
guarantee of correctness or fitness for any use; this information is provided
for research and entertainment purposes only.
(In reply to comment #106)
> (In reply to comment #103)
> > * deal with cross-script IDN spoofing, blacklist "must not happen" characters in
> > IDNs, more paranoid Unicode string syntax checking (eg. no leading combining
> > marks)
> 
> If you're talking about adding more steps to the DNS label conversion, I
> don't think you can do this and still claim conformance to the nameprep and
> punycode RFCs.
> 
> This is an interoperability issue: The server sends us an HTML document, the
> user clicks on a link, we convert it from HTML to Unicode and then pass it
> through nameprep and punycode before sending it to the DNS server. The HTML
> document author assumes that we will adhere to the HTML, Unicode, Nameprep
> and Punycode specs. If we don't, we have an interop problem.
> 

One person's interoperability problem is another person's security precaution.

We are not the document author's agent; we are the _user_'s agent. Many document
authors would like us to see their pop-up ads. Our users generally don't. We
explicitly choose not to interoperate with standards-based ECMAScript behaviour
by default in this case. Similarly, we should refuse by default to interoperate
with URLs which contain domain names with no plausible origin in any human
language, or syntactic brokenness in the structure of their Unicode character
stream, in the spirit of IANA's recommendations, and in spite of some
registrars' willingness to register essentially any pattern of bits registrants
are willing to pay for.

Remember, IE has 100% non-interoperability with IDN, and it looks like we will
probably have to go back to that too in the short run. What I'm proposing will
enable all conceivable reasonable IDN labels to work, and auto-reject the
three-dollar labels, _without_ the user needing to keep glancing back at the URL
bar every click to see if they are about to fall down a hole.

Can you suggest a plausible reason for mixing writing systems in the same DNS
label (ie not in the same _name_, just a single dot-delimited segment), other
than in the "safe" combinations such as hiragana-katakana-kanji-latin? Or indeed
to use dingbats, character graphics, musical notes, or cuneiform characters in a
domain name? 

Remember, if this turns out to be a long-term problem, we can always re-allow
this behaviour at some time in the future: indeed, this is just the sort of
behaviour to hide under a hidden pref or two with names like:

* dns.unicode.blacklist-bad-codepoints
* dns.unicode.prevent-script-mixing
Indeed, just to amplify my previous comments a bit more, this is _not_ a 100%
solvable problem, given the current state of IDNA, and the apparent lack of
coordination in the standards and domain registration communities to do anything
about it.

However, codepoint blacklisting and preventing script-mixing probably catch >
90% of all possible problems with zero cost in user attention, and intelligent
locale-based display of URLs will catch perhaps a bit more than another 90%, to
get a > 99% coverage of possible spoofing. That's about the best we can do at
the moment, without the cooperation of registrars, or the creation of new protocols.

Notice that even with both proposals in effect, spelling a domain name with an
i-acute instead of an 'i' will get past the browsers of anyone who has their
browser set to read any language containing that character (for example, any of
Danish, Faeroese, Icelandic, Greenlandic, Irish, Welsh, Dutch, Catalan, Spanish,
Galician, Portuguese, Italian, Hungarian, Slovak, or Czech). Perhaps an
even-longer term solution is a special font for the URL bar with exaggerated
accents and clearly different letter-forms for near-homographs within the same
script family?
Added Thai, Arabic, Hebrew and Greek to the above document, based on IANA
registry data for these languages given by the .pl registrar.
Attachment #173999 - Attachment is obsolete: true
It seems to me that the real problem is not IDN, but that it is not obvious when
you are going to a new site.  To fix that problem, I would recommend that next
to the lock icon, a history icon should be placed.  If the site is not in the
history of the browser, then a 'new' icon could be displayed.  If the site is in
the 'history', than a history icon could be displayed.  And, if the user has
manually approved the site, then a 'trusted' icon could be displayed.  Another
possibility is to display an icon in the location bar, and allow option
background and forground colors on the location bar.  This way, it is obvious
that you are going to a new site, and if you thought that you were going to an
old site, you should think twice.   
As before, but now with lowercase letters only, as NAMEPREP will ensure we
don't have to worry about capital letters.
Attachment #174006 - Attachment is obsolete: true
(In reply to comment #108)
> (1) dns.unicode.blacklist-bad-codepoints
> (2) dns.unicode.prevent-script-mixing

Let me add another to this list:

(3) dns.unicode.whitelist-good-lang-chars

I guess I thought that (3) would be a more fine-grained solution than (1)
and (2), and would make the first two unnecessary. But perhaps people
want to implement (1) and (2) to reject those domain names, while (3) allows
them but displays them differently to alert the user.

The link click actions could be enumerated along a different axis:

(a) Silently refuse to perform the DNS lookup (and subsequent connection)
(b) Refuse DNS and connection, but warn the user
(c) Refuse DNS and connection, but warn user and allow user to connect
(d) Do DNS and connection, but indicate suspicious domain in location bar
(e) Do DNS and connection and don't indicate suspicious domain anywhere

So what is your proposal? Is it the following:

(1), (2) -> (b)
and
(3) -> (d)

Or maybe you had other actions in mind?
After some more analysis, it appears that attacks between Latin scripts, and
attacks from Cyrillic to Latin or vice-versa are the main threats, with other
threats having lesser numbers of homographs for attackers to play with, the risk
exposure rising exponentially with the number of attackers.

Just to get some order-of-magnitude insight, I looked at the number of
English and Russian words which consist _entirely_ of letters which are
homograms between the Latin script and the subset of Cyrillic script used in my
Russian dictionary. These are "acxeoyp", and their Cyrillic counterparts.

In each case, I get roughly the same figure, of around 0.00075 (0.075 %) of the
vocabulary being made up of such words; that's 24 out of 31801 words for
Russian, and 35 out of 47158 words for English. 

Unfortunately, when you consider Cyrillic languages other than Russian, there
are more spoofable characters, notably "ijs", which expands the English
spoofable range by more than a factor of 4 to 183 out of 47158 words, 0.004
(0.4%) of the vocabulary.

By the way, compare this with the situation where we allow script mixing, where
_every_ word containing _any_ of the homographs is a threat: that is to say,
99.7% of all words. Clearly allowing script mixing is a bad thing.
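For anyone who wants to reproduce these figures, a rough Python sketch of the
experiment (the word-list file name is hypothetical; any one-word-per-line
dictionary will do):

HOMOGRAMS = set("acxeoyp")               # Russian/Latin homogram letters
HOMOGRAMS_EXT = HOMOGRAMS | set("ijs")   # adding other Cyrillic languages

def spoofable_fraction(words, repertoire):
    hits = [w for w in words if w and set(w) <= repertoire]
    return len(hits) / float(len(words))

words = [line.strip().lower() for line in open("words.txt")]
print(spoofable_fraction(words, HOMOGRAMS))      # ~0.00075 for my English list
print(spoofable_fraction(words, HOMOGRAMS_EXT))  # ~0.004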

So, given that Netcraft assumes that there are approximately 60,000,000 domain
names registered, and assuming that the word statistics are similar to those of
my English dictionary, a 0.075% share would represent roughly 45,000 spoofable
domains in the ASCII namespace, if we allow Cyrillic labels to spoof them.
Similarly, a 0.4% share would represent 240,000 spoofable labels.

On the other hand, the total number of Cyrillic labels to date is presumably
rather small, and a 0.075% share of that rather smaller.

This leads me to make the following proposal (please bear with me, this is only
the first hack at the logic):

We should consider special processing for Cyrillic labels in names if they
consist _entirely_ of homographs for Latin letters, if they are in a
"non-Cyrillic context". 

A "non-Cyrillic context" is a name with no other _unambiguous_ Cyrillic labels,
which is not in a TLD where Cyrillic characters might be expected to have
priority, such as a TLD for a country where Cyrillic script is the norm.

There are two reasonable courses of action:
* either reject them as a probable spoofing attempt, or
* rewrite them to the equivalent Latin alphabet label, so that "what you see is
what you get"

Similarly, when we _are_ in a "Cyrillic context", we should consider similar
treatment for Latin labels consisting _entirely_ of homographs for Cyrillic
characters.

This can easily be extended to other languages, such as Greek and Coptic.

It has the following advantages:
* it accepts the "grandfathering in" of the existing Latin namespace
* it does not prevent the use of _unambiguous_ labels from any writing system in
any TLD
* it does not disadvantage any user of a ccTLD for a country with a different
native script, instead giving that script priority locally, as users might expect
* it allows the use of any script in any TLD, when "full IDN" is available: the
use of a TLD in a given script will automatically signal which script is to be
given priority in name interpretation
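A minimal Python sketch of the core test, with an illustrative subset of the
homograph map:

# Cyrillic letters that are visually (near-)identical to Latin ones.
CYR_TO_LAT = {"\u0430": "a", "\u0441": "c", "\u0445": "x", "\u0435": "e",
              "\u043e": "o", "\u0443": "y", "\u0440": "p"}

def all_latin_homographs(label):
    # True if every character of this Cyrillic label spoofs a Latin one.
    return bool(label) and all(ch in CYR_TO_LAT for ch in label)

def latin_equivalent(label):
    # The "what you see is what you get" rewrite option.
    return "".join(CYR_TO_LAT[ch] for ch in label)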
Can we fix this issue by changing the UI? For example, display the raw DNS URL,
but offer a tooltip that shows the IDN on mouse-over?

I do not like the idea of a whitelist / blacklist. (To solve the problem
without one, we can disable the feature, which is not nice, but it solves the
problem.) I do want to show people the IDN. But that does not mean we need to
display it exactly where we display the URL. Can we display it by a different
means? I once thought about putting the raw DNS at the end of the URL, but that
won't work, since a long URL can push it out of the box. Using another color to
display the IDN will be meaningful for people who know what it means, but it
won't be a good solution for the general public. Adding another text field
below the URL on the fly to display the IDN would be good, but the UI would be
strange. How about an IDN icon next to the URL bar, so that when people click
it, it shows the IDN to the user, but we always display the raw DNS in the URL
bar? People can still type the IDN into the URL bar, but then it will be
converted to the raw DNS and displayed there. If people click the IDN icon,
then a tooltip below the URL bar shows the IDN.
> how about an IDN icon next to the URL bar, so that when people click it, it
> shows the IDN to the user, but we always display the raw DNS in the URL bar?
or an option in the context menu...

The other thing we may want to do, instead of using the blacklist to block the
use of IDN, is to use it to show a warning dialog box when the user first
enters that domain. This can be done by an observer which watches for changes
to the URL box: whenever the URL changes, scan the URL bar, and if it is a new
domain (different from the previous one) and any character in the URL falls
into that blacklist, show a warning dialog box to the user.
(In reply to comment #114)
> * it does not prevent the use of _unambiguous_ labels from any writing system in
> any TLD
> * it allows the use of any script in any TLD, when "full IDN" is available: the
> use of a TLD in a given script will automatically signal which script is to be
> given priority in name interpretation

This would not be in the spirit of the ICANN guidelines:

http://www.icann.org/general/idn-guidelines-20jun03.htm

I will now try a different stance. Bear with me:

Since IDN is not widely used yet, this might be the time to decide that we
simply will not accept non-ASCII characters in the US TLDs (i.e. .com, .org,
etc). I mean, what business do these Cyrillic registrants have in .com,
anyway? Why can't they just stay where they belong, in *.ru and the like?

If we can get Microsoft to agree with this stance, and they decide to reject
non-ASCII characters in the US TLDs too (when they get around to supporting
IDN), then the world's dominant browsers (i.e. IE and Firefox) will
effectively be enforcing the *correct* IDN rules.

The browsers are the last line of defense. We are the user's agent. We will
protect them from lazy/negligent registrars.

--------------

On the other hand, if MSIE and Firefox *do* allow non-ASCII characters in US
TLDs, then the flood-gates are open. We are making ourselves vulnerable. We
are asking for trouble. Kinda like ActiveX in IE and *.exe in Outlook.
.com, .org, .net.... are *not* US TLDs.
Cyrillic registrants can have them if they want, if their site is commercial, an
organisation, linked with network activity....
(In reply to comment #118)
> .com, .org, .net.... are *not* US TLDs
> Cyrillic registrants can have them if they want, if their site is commercial, an
> organisation, linked with network activity....

Couldn't agree more, the gTLDs are absolutely not US-specific in any way. 
And even if they were, don't forget the millions of US citizens who speak
languages other than English and have a legitimate need for IDNs.
(In reply to comment #119)
> Couldn't agree more, the gTLDs are absolutely not US-specific in any way. 
> And even if they were, don't forget the millions of US citizens who speak
> languages other than English and have a legitimate need for IDNs.

Continuing to try this other hat on:

What characters are permitted in a person's name on a US driver's license?
What characters are permitted in the name of a US corporation? Are there
legitimate reasons to keep those rules in place? Do the non-English speakers
in the US transcribe their personal/company names into English letters in
some contexts?

Should the ASCII-only rule only apply to *.us, leaving *.com open to the
world (and the spoofers)?

Should we go for a complex system where domain labels consisting entirely
of homographs are caught?

What happened to KISS (Keep It Simple, ...)?
(In reply to comment #109)
> Perhaps an
> even-longer term solution is a special font for the URL bar with exaggerated
> accents and clearly different letter-forms for near-homographs within the same
> script family?

I think we need to do something like this. Even in English, there are
characters that look very similar in some fonts (e.g. letter l and digit 1
but not capital letter I in domain names, since it's capital).

One possible concern might be the expanded width of the URI in the location
bar. This might be too complicated, but maybe we can use a different font
for the domain name part of the URI, and keep the compact sans-serif in the
rest.
Hardware: PC → All
(In reply to comment #120)
> (In reply to comment #119)
> > Couldn't agree more, the gTLDs are absolutely not US-specific in any way. 
> > And even if they were, don't forget the millions of US citizens who speak
> > languages other than English and have a legitimate need for IDNs.
> 
> Continuing to try this other hat on:
> 
> What characters are permitted in a person's name on a US driver's license?
> What characters are permitted in the name of a US corporation? Are there
> legitimate reasons to keep those rules in place? Do the non-English speakers
> in the US transcribe their personal/company names into English letters in
> some contexts?
> 
> Should the ASCII-only rule only apply to *.us, leaving *.com open to the
> world (and the spoofers)?
> 
> Should we go for a complex system where domain labels consisting entirely
> of homographs are caught?
> 
> What happened to KISS (Keep It Simple, ...)?


The answer to that question is: yes, as simple as possible, but _no simpler_.
Unfortunately, this is not a simple problem, unless you can either detect or
visually eliminate homographs. 

Earlier comments referenced the ICANN rules. Registrars and registries are
simply ignoring the ICANN rules; that's one of the principal reasons why we are
in this mess. Unfortunately, we have no power to force them to do so. That's why I
think my proposal is actually more "in the spirit" of the ICANN recommendation
than the current status quo, where we have rules, but nobody follows them. Note
that in the presence of an ICANN-compliant setup, my suggested strategy is
completely invisible, and equivalent to the identity function.

And yes, I'm working on just such a "complex system". It's table driven, and
will probably end up as one page of code, plus language tables many of which are
already known. I've now tabulated the current "accidental spoofing" rates
between the major Latin-script languages, and between Cyrillic and these
languages, and I will continue to work on this proposal to take this new data
into account. 
(In reply to comment #122)
> Unfortunately, this is not a simple problem

Well, that depends on the chosen policy. If Mozilla decides to disable IDN
by default for now, that is very simple. Another simple solution is to stick
to ASCII characters in *.com (using a white list, naturally :-)

> And yes, I'm working on just such a "complex system".

I think it's great that you're doing all this work! Have you heard from any
of the module owners at mozilla.org regarding the type of patch that they
would like to consider? Darin, any thoughts?
(In reply to comment #123)
> (In reply to comment #122)
> > Unfortunately, this is not a simple problem
> 
> Well, that depends on the chosen policy. If Mozilla decides to disable IDN
> by default for now, that is very simple. Another simple solution is to stick
> to ASCII characters in *.com (using a white list, naturally :-)
> 
> > And yes, I'm working on just such a "complex system".
> 
> I think it's great that you're doing all this work! Have you heard from any
> of the module owners at mozilla.org regarding the type of patch that they
> would like to consider? Darin, any thoughts?


I agree, pushing out a fix that will turn off IDN by default is the best
short-term option, unless a properly tested and audited fix can be deployed
first. Then we can set IDN to be on by default in the next release (Firefox 1.1?).

Please note that I have very little desire to code C++ at the moment -- however,
I can provide "executable pseudocode" in the form of Python, as that is ideal as
a rapid development and testing platform, and the test vectors for the Python
testbed can be used for any eventual patch. One of the advantages of a
table-driven approach is that it minimizes code size, and allows just this sort
of development route.
FWIW, not necessarily an endorsement, but Mozilla does have some related code:

mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp
mozilla/intl/uconv/public/nsICharRepresentable.h

There are some tables of characters used in languages at fontconfig.org:

fc-lang/*.orth

We have to be careful with tables we get from elsewhere. They may include too
many characters. We need to be conservative in our solution to the spoofing
problem.
(In reply to comment #117)
> Since IDN is not widely used yet, this might be the time to decide that we
> simply will not accept non-ASCII characters in the US TLDs...
> 
> If we can get Microsoft to agree with this stance, and they decide to reject
> non-ASCII characters in the US TLDs too (when they get around to supporting
> IDN), then the world's dominant browsers (i.e. IE and Firefox) will
> effectively be enforcing the *correct* IDN rules.

That sounds very much like trying to use a dominant position to enforce and
create a defacto proprietary standard, which is just a bad idea.


(In reply to comment #122)
> Earlier comments referenced the ICANN rules. Registrars and registries are
> simply ignoring the ICANN rules; that's one of the principal reasons why we
> are in this mess. Unfortunately, we have no power to force them to do so.

It's a shame that most registrars don't follow the rules; however, that doesn't
apply to all of them. I'm fairly certain, given the relatively strict guidelines
enforced by the auDA for the registration of .au domains which are followed by
all accredited .au registrars, that such spoofed variants simply wouldn't be
allowed.  Thus, I believe, any request for a paypal.com.au variant, for example,
would be rejected by the registrar which, IMHO, is the right level to handle
this problem.  Though, I reluctantly agree, given the situation with other TLDs,
that handling this at the UA level may just be something we have to accept.
Re: de facto standards, I was playing devil's advocate. Just ignore those
comments.

Tonal's doing some great work here, ensuring that Mozilla leads the way.
This is a table of some homograms for ASCII lowercase characters, with
confusion distances. Note that this list is neither definitive nor exhaustive,
and only covers the Cyrillic, Greek and Coptic, Latin-1 Supplement, Latin
Extended-A and Latin Extended-B, and is only provisional even within these
tables. The main purpose of this table is to aid research into homograph
spoofing, and to inspire other developers to inspect the Unicode code charts or
Unicode rendering in other fonts and to extend this table.

Key to confusion distances:
0 => visually identical
1 => almost identical
2 => easily confusable at small font sizes
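To show the intended shape of the data, a tiny Python sketch with illustrative
entries:

# (ascii_char, lookalike, confusion_distance) triples.
HOMOGRAMS = [
    ("a", "\u0430", 0),   # CYRILLIC SMALL LETTER A
    ("o", "\u043e", 0),   # CYRILLIC SMALL LETTER O
    ("o", "\u03bf", 0),   # GREEK SMALL LETTER OMICRON
    ("y", "\u0443", 1),   # CYRILLIC SMALL LETTER U
    ("i", "\u0131", 1),   # LATIN SMALL LETTER DOTLESS I
]

def lookalikes(ascii_char, max_distance=1):
    return [ch for a, ch, d in HOMOGRAMS
            if a == ascii_char and d <= max_distance]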
(In reply to comment #90)

> Just to give you nightmares, here is another scenario. As you know,
> NAMEPREP will map quite a lot of things to the lowercase ASCII
> letters. Now suppose that someone has coded up a spoofed address in
> just these characters, Punycoded it, and slipped it past a dumb
> registrar, perhaps using an automated domain transfer. Now, we have
> two different DNS names, foo.com and xn-<something>.com, both of
> which will map to the ASCII string foo.com after being un-Punycoded

This can't happen.  See RFC3490, specifically steps 6 and 7 of
ToUnicode.  The algorithm for decoding a punycode-encoded domain name
checks that it encodes to the same punycode as you started with, and
leaves it unchanged (ie as punycode) if it doesn't.

It would probably be good to verify that Mozilla implements ToUnicode
correctly, though.
(In reply to comment #128)
> Created an attachment (id=174139) [edit]
> Experimental table of some homograms, with confusion distances

I suspect GREEK SMALL LETTER GAMMA is confusable with y at small font sizes (in
some fonts).
(In reply to comment #18)
> On the other hand, the Unicode .pdf charts _do_ appear to contain a detailed
> cross reference of visually confusable characters, as do the charts in the
> Unicode book.

Under "Cross References" "Explicit Inequality" the Unicode book says:

"The two characters are not identical, although the glyphs that depict them
are identical or very close."

However, they do not seem to include cross references for *all* of the
spoofs. For example, Cyrillic small 'a' does not have a cross ref.

Maybe they would update the Unicode charts if you send them your info?
You say your table is only provisional at this point, so maybe you would
want to wait until it's more or less "ready".
Another solution to this problem is to pop up a dialog the first time the
user clicks on a link containing a domain name with characters normally
found outside the user's language. (We could still check in the homogram
table solution too.)

The dialog I have in mind would explain the issue and then allow the user
to specify that the browser should allow certain other languages, chosen
from a list that we can generate based on the characters found in the
domain name.

The user could even view the entire list of languages and select some
from there, or just tell the browser to allow any language.

The reason I'm mentioning this white list approach again is because I feel
that the homogram approach is essentially a black list approach, and we
cannot deploy a black list approach until it is complete, whereas we can
start using white lists right away, even before they are complete. Over
time, we can expand the white lists with characters that we deem "safe".

But that's just me. I have no idea how others feel about this...
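A Python sketch of how the dialog's language list could be generated; the
per-language character sets are tiny illustrative stand-ins for real
whitelists:

LANG_WHITELISTS = {
    "de": set("äöüß"),
    "fr": set("àâæçéèêëîïôùûüÿœ"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

def candidate_languages(domain):
    # Offer every language whose whitelist covers all of the domain's
    # non-ASCII characters.
    extra = {ch for ch in domain.lower() if ord(ch) > 0x7f}
    return [lang for lang, chars in LANG_WHITELISTS.items() if extra <= chars]

For "www.öbb.at" this suggests only "de"; for the spoofed paypal.com it
suggests only "ru", which should already make the user suspicious.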
(In reply to comment #132)
> ...
> 
> The user could even view the entire list of languages and select some
> from there, or just tell the browser to allow any language.
> 
> The reason I'm mentioning this white list approach again is because I feel
> that the homogram approach is essentially a black list approach, and we
> cannot deploy a black list approach until it is complete, whereas we can
> start using white lists right away, even before they are complete. Over
> time, we can expand the white lists with characters that we deem "safe".
> 
> But that's just me. I have no idea how others feel about this...

Erik, I see the blacklist and whitelist approaches as complementary, not
competitive. Neither is 100% guaranteed to work. For example: consider a user
who can read both Russian and English, and thus has chosen to accept URLs
containing domain names in either script. 

Unfortunately, this also means that they will not be alerted if a domain name
contains a label that is 100% Cyrillic characters, but exactly spoofs a
Latin-script name, as this:
* is a 100% conformant IDN which follows the current IANA "one label, one
language" policy precisely
* follows the "no-script mixing" principle suggested earlier
* follows the principle of no graphical characters suggested earlier
* is visually indistinguishable _by design_ from the Latin equivalent, and
cannot be distinguished even by a bilingual Russian/English reader

Note that the reverse would also apply for a Latin spoof of a Cyrillic-script
word. (Consider Latin versions of the Russian words орех, ореха, рас, раса, расе,
рос, роса, росе, сер, сера, серо, серое, ссора, ссоре, ссору, сух, сухо, сухое,
ура, уха, ухо, уху, хаосе, хор).

This is where we introduce the idea of "script preference" for top-level
domains. Supposedly, registries should filter each label they issue with a
character set from a single specified language, and they should register their
character sets and languages with IANA. However, in the absence of compliance
with these rules, we can help them along a bit in cases where there is ambiguity. 

_This_ is where homograph tables, and the principle of assigning language/script
family precedence to TLDs, are useful; we can close the door on all or nearly
all of the possible spoofing options that remain. (Notice that we've already
squeezed down the possible cases to a very small portion of the namespace, less
than 0.1%, by applying the no-script-mixing and no-graphic-characters rules).

If a URI with a Cyrillic domain name label made up entirely of Latin homographs
is in a domain which has "Latin-script precedence",  we should treat it as
potentially spoofed. Similarly vice-versa. It is then a matter for browser
design policy whether the domain name containing this label should be:
* treated as malformed, and lookups return an error
* generate a warning to the user, and prompt them as to whether they are really sure
* simply provide an on-screen warning banner
* or even attempt to guess the "correct" domain name (DON'T DO THIS LAST ONE!
IT'S ONLY AN EXAMPLE!)

Even this algorithm is not perfect. But it can be very good.
Note that the homograph tables don't need to be 100% perfect to reduce the last
remaining options by many orders of magnitude. If there are (say) 4 distinct
characters in a name, all of whose characters are spoofable, forbidding
script-mixing requires every one of them to be spoofed. Now, only about 0.1% of
the name repertoire will have all-spoofable characters (based on experiments
with wordlists -- this is conservative, because many words are short), and a
homograph table that only has 90% coverage will reduce the number of spoofing
possibilities by a factor of 10**4 = 10,000. So, at the end of this process, we
might expect 0.00001% of domains to be spoofable.

According to Netcraft, there are currently roughly 27,000,000 domains active.
So, with a 0.00001% failure rate, we can expect roughly 2.7 domains to fall
through the cracks and remain spoofable. Increasing the accuracy of the
homograph table to 95%, if you believe this analysis, leaves an expected count
of effectively zero sites left spoofable.
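Spelling the arithmetic out as a quick Python check:

all_spoofable = 0.001        # ~0.1% of names use only spoofable characters
miss_per_char = 0.10         # homograph table with 90% coverage
chars = 4                    # distinct spoofable characters per name
residual = all_spoofable * miss_per_char ** chars
print(residual)              # 1e-07, i.e. 0.00001%
print(residual * 27000000)   # ~2.7 domains expected to slip through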

So, what are the tables we need to implement this? 
1. A table of _assigned_ codepoint ranges containing characters that will not be
used in any language
2. A mapping of codepoint ranges to script systems
3. A number of special-case lists of script systems for languages which use more
than one script system (essentially the CJK languages)
4. A homograph table, giving equivalence classes of visually confusable characters 
5. A mapping of ccTLDs to script systems via languages, via existing
machine-readable linguistic sources 

How much code is needed to implement this? Probably (judging by my Python test
programs, and allowing for a less concise language) between 1 and 3 pages of C++.

Note that all of the tables involved are likely to have order 100 entries or
less, and are easily compiled from existing sources. Note that none of them
dictate the character assignments within any language, allowing Unicode to add
new codepoints within a language, and if we know the pattern of future Unicode
assignments (which are pre-planned) we can be forward-compatible with new
updates to Unicode, even without the ability to update the tables. Add the
ability to update the tables, and we have a maintainable, forwards- and
backwards- compatible system which could effectively end the current spoofing
worries regarding IDN, and allow its continued rapid deployment, whilst
providing yet another strong incentive to use Mozilla products.

By the way, please don't treat any of this as a rejection of whitelist
techniques: no method is perfect, and attackers may be ably to find a way
through even the best-designed defences given enough time and ingenuity; this
proposal involves multiple layers of defence, and I think that adding a
whitelist scheme is another good way of aiming for the same objective of
preventing spoofing, in particular for users who are non-European and less
accent-aware.
You're absolutely right. If the user reads both English and Russian, we need
to watch out for homographs. Thanks for being so patient with me.
There's a lot of discussion going on here :-)

One idea which met with approval on the Mozilla security list was the following:

Most domain registrars have been correctly implementing the guidelines for
avoiding IDN-related spoofing problems. AIUI, the .jp registry even delayed
issuing IDN names for six months until the guidelines were finished.
Unfortunately, there are a few rather large exceptions to this - .com being one.

So, the suggestion is to have a blacklist of those TLDs, and display the IDN in
raw punycode form throughout the UI until such time as the registrars get their
act together. Later Firefox releases, or automatically-pushed updates, can
shrink (or expand) the blacklist.

This has many significant advantages. It's fairly simple to code, and doesn't
penalise IDN domain owners and registrars who have been doing the right thing.
It doesn't place any restrictions on what domains are allowed. It requires no
user configuration, and no assumptions about what characters a given user might
be familiar with. It involves no pop-ups. It places the blame and the
responsibility where it really belongs, and kills any homograph attacks stone dead.

Gerv
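A minimal Python sketch of this per-TLD display policy (the blacklist contents
below are purely illustrative, not a policy decision):

# TLDs whose registries are, hypothetically, not enforcing the guidelines.
IDN_TLD_BLACKLIST = {"com", "net", "org"}

def display_host(ace_host):
    tld = ace_host.rsplit(".", 1)[-1].lower()
    if tld in IDN_TLD_BLACKLIST:
        return ace_host                                 # show raw Punycode
    return ace_host.encode("ascii").decode("idna")      # show the IDN form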
(In reply to comment #135)
> There's a lot of discussion going on here :-)
> 
> One idea which met with approval on the Mozilla security list was the following:
> 
> Most domain registrars have been correctly implementing the guidelines for
> avoiding IDN-related spoofing problems. AIUI, the .jp registry even delayed
> issuing IDN names for six months until the guidelines were finished.
> Unfortunately, there are a few rather large exceptions to this - .com being one.
> 
> So, the suggestion is to have a blacklist of those TLDs, and display the IDN in
> raw punycode form throughout the UI until such time as the registrars get their
> act together. Later Firefoxe releases, or automatically-pushed updates, can
> shrink (or expand) the blacklist.
> 
> This has many significant advantages. It's fairly simple to code, and doesn't
> penalise IDN domain owners and registrars who have been doing the right thing.
> It doesn't place any restrictions on what domains are allowed. It requires no
> user configuration, and no assumptions about what characters a given user might
> be familiar with. It involves no pop-ups. It places the blame and the
> responsibility where it really belongs, and kills any homograph attacks stone
dead.
> 
> Gerv
> 


Neat. Alternatively, you can have a _whitelist_ of TLDs which are known to be
following the ICANN / IANA rules. This is more "politically" neutral, avoids the
issues associated with a blacklist, and yet will act in the same way as a strong
incentive for non-conformant TLD registries to follow best practices. This also
deals better with new TLD allocations.
Whether we go for a white or a blacklist probably depends on getting a much
better view of how widespread the problem is. What I'm hearing from the IDN
community is that most people are playing by the rules - it's just a few
high-profile registrars and TLDs which aren't. If that's the case, a blacklist
is probably good - we do want to send a message. After all, their negligence has
put our users at risk.

On the other hand, if the picture is more mixed than I understand, then perhaps
a whitelist approach might be better. 

Gerv
I would tend to agree that if you're going to have a list of TLDs, then a
white list would be better since we can't anticipate whether new TLDs will
be served by registrars that follow rules.

Also, I think the solution discussed here should still be worked on, even
if Mozilla decides to use the TLD black/white list solution in the interim
since we don't know what Microsoft is going to release when they get around
to it. If they implement the ideas discussed here or some other ideas and
end up supporting IDNs in *.com, then we will probably want to be able to
start supporting it in short order too (via auto-updates or whatever).

Finally, I still think we should seriously consider doing something about
the font in the status and location bars. If expanded width is indeed a
concern, how about my idea of using a good font for the domain name part
only? Maybe this part could be separated out to a different bug number.
Eric: I'm not sure what you mean by "what Microsoft will do" - IE doesn't
support IDNs, and I've not seen it mentioned among any of the things they plan
to do for the next release. In any case, that's ages away - in the Longhorn
timeframe. Hopefully, by then, registrars will have sorted their lives out and
all this will be but a distant memory.

I don't know who writes the IDN plugin for IE. I haven't heard any comments from
them on the situation.

I personally don't think we need to change the status bar font - but you are
right, that should be a separate bug. It's not IDN-specific.

Gerv
(In reply to comment #139)
> Eric: I'm not sure what you mean by "what Microsoft will do" - IE doesn't
> support IDNs, and I've not seen it mentioned among any of the things they plan
> to do for the next release. In any case, that's ages away - in the Longhorn
> timeframe. Hopefully, by then, registrars will have sorted their lives out and
> all this will be but a distant memory.

OK, you're probably right.

> I don't know who writes the IDN plugin for IE. I haven't heard any comments from
> them on the situation.

I found an IDN plug-in for IE at a domain owned by Verisign:

http://www.idnnow.com/index.jsp

> I personally don't think we need to change the status bar font - but you are
> right, that should be a separate bug. It's not IDN-specific.

OK, bug 282079.
> I found an IDN plug-in for IE at a domain owned by Verisign:

That plug-in was developed by VeriSign, who are energetic about making it easily
available to registries wishing to implement IDN. (Please note that the IDN
policies for a TLD are set by the *registry*, not the *registrars*.) The
registries that prefer to recommend the Mozilla browsers to their clients are
probably despairing of the discussion in the present forum, which appears eager
to have Mozilla voluntarily take itself off the IDN market. VeriSign, whose .com
policies triggered the current concern, has everything to gain by their plug-in
becoming the only game in town.
> VeriSign ... has everything to gain by their plug-in becoming the only game
> in town.

I should have mentioned that there is also an open source IDN plug-in for IE.
This is presented as alpha, among other things, because it leaks punycode in the
status line. As things are now developing, this may end up being a strong
feature. It can also be toggled on and off, which, together with the status line
display of punycode, may be all that's really called for to ease present
concerns. The IDN-OSS plug-in performs as described here even when stacked on
top of the VeriSign plug-in (although other negative interaction cannot be
discounted) and probably deserves some attention from the participants in the
present thread. The VeriSign plug-in is available at http://idnnow.com and the
open-source one at http://idn.isc.org.
I agree, the prospect of a proprietary plug-in monopolizing the market is the
worst of all possible worlds. 

I now think the best short-term solution to the spoofing problem is the one
proposed by Gerv, namely that domains run by non-standards-compliant registrars
get their Punycode made visible -- it's neat, easy to code, and does not require
disabling IDN support.

I continue to believe that stricter IDN filtering rules, both at the registrar
and browser, as suggested in my earlier proposals, are necessary in the medium-
and longer term. However, based on input off-line from a number of people, I now
believe that this can best be achieved by working with IANA / ICANN and the
registry community, so that we do not have to take responsibility for an ad-hoc
non-standard implementation, but can instead be seen to be implementing a
solution based on authoritative standards.

Incidentally, the discussion in this bug has kicked off a related discussion in
the Unicode mailing list, where it has been mentioned that there is now a
proposal to create an "official" homograph list.
> I now think the best short-term solution to the spoofing problem is ... that
> domains run by non-standards-compliant registrars get their Punycode made
> visible -- it's neat, easy to code, and does not require disabling IDN 
> support.

How do you propose determining the identity of the registrar?
What standards is registrar behavior to be judged against?
Rather than just showing the punycode in the status bar - which many people
either don't have turned on or don't notice, and which may even be altered by a
script on the website (unless that ability has been disabled by the user) - how
about displaying an information bar at the top, just like the popup blocker
does, that explains that it is an internationalised domain name, notes the
possible security implications, and shows the punycode version.

It should provide a more information link/button and the option to add the site
to trusted and untrusted lists.  This could be used in conjunction with any of
the other proposed checks so that it's not shown for every single IDN, just
those that mozilla detects as the likely candidates for spoofs.
I like the idea in comment 145. It may harm valid IRIs a bit, but they are not
widely deployed, and I guess the option is pref-controlled so people can turn it
off.

(By "valid IRIs" I mean IRIs that are registered without the intention to spoof
users.)
> (Please note that the IDN
> policies for a TLD are set by the *registry*, not the *registrars*.)

Well that's good, because it's easy for us to determine the registry (just look
at the TLD), but hard for us to determine the registrar (requires a WHOIS).

If the policies for .com do not protect against phishing, then we should not
display IDN domains in their full form in that TLD, because to do so is a
security risk. It's as simple as that.

I don't quite understand how Mozilla not displaying IDN for .com gives Verisign
a monopoly on anything. But if Verisign want to have a monopoly on putting their
customers at risk of phishing, let them.

I strongly believe that whatever solution we implement should allow full,
uncrippled and first class implementation of IDN in those cases, whatever they
may be, where we have established that there is no more risk than in the ASCII
domain name space.

(In reply to comment #145)
> Rather than just showing the punycode in the status bar - which many people
> either don't have turned on or don't notice, and which may even be altered by
> a script on the website (unless that ability has been disabled by the user) -

My suggestion is not to only show the punycode in the status bar, but to use it
everywhere for TLDs which have poor homograph control policies.

The status bar is always-on in Firefox, unless the user specifically disables
it. This is a security feature. The security area of the status bar (to the
right) cannot be altered by script.

> how about
> displaying an information bar at the top, just like the popup blocker does,
> that explains that it is an internationalised domain name, notes the possible
> security implications, and shows the punycode version.

A strong characteristic of a good solution is that it does not discriminate
against all IDN domain names. This solution, in its plain form, does.

There is definitely value in using a phishing detection heuristic to display
such a bar - but that's fixing the more general phishing problem, not just the
homograph one.

Gerv
> I don't quite understand how Mozilla not displaying IDN for .com gives
> Verisign a monopoly on anything.

Take a look at the documentation provided by the TLDs that support IDN.
Prominent in every such text is a clear reference to the need for an
IDN-compliant browser, and a list of available alternatives. Such lists are
almost always headed by the VeriSign IE plug-in and Mozilla, often listing no
further alternatives. Regardless of VeriSign's IDN policies in .com, their IE
plug-in is a sound implementation of IDNA. The same goes for Mozilla. What do
you think the maintainers of the TLD documentation are going to do if one of
these two decides that it is no longer going to provide rigorous support for IDNA?

> if Verisign want to have a monopoly on putting their customers at risk of
> phishing, let them.

How are you defining the concept of VeriSign customer?  Someone who expects to
be able to use the Unicode form of an IDN in .com?  Someone who is using IE for
the task?
(In reply to comment #148)
> Take a look at the documentation provided by the TLDs that support IDN.

Could you give links to such documentation for a few different TLDs?

> Prominent in every such text is a clear reference to the need for an
> IDN-compliant browser, and a list of available alternatives. Such lists are
> almost always headed by the VeriSign IE plug-in and Mozilla, often listing no
> further alternatives. Regardless of VeriSign's IDN policies in .com, their IE
> plug-in is a sound implementation of IDNA. The same goes for Mozilla. What do
> you think the maintainers of the TLD documentation are going to do if one of
> these two decides that it is no longer going to provide rigorous support for IDNA?

So your argument is "People won't use or recommend your browser if you try and
protect them from security problems"?

Are you saying that providing "Warning! This could be a scam!" information on a
subset of IDN names is "rigorous support", whereas allowing all IDNs except
those in known-risky TLDs is not?

I would hope that, in a few months, the problematic registrars will see the
writing on the wall and fix their policies. By the time IDN use becomes
widespread, everything will be sweetness and light again. However, some pressure
needs to be put on them to achieve this aim.

If we accept responsibility for the problem, say "Yeah, you keep on registering
what domains you like. We'll try and sort out the phishing problem at our end",
then we're opening ourselves up to massive and unnecessary liability and bad
publicity every time a bug is found in e.g. our embedded homograph tables (if
that's the solution chosen - it's an example).

> > if Verisign want to have a monopoly on putting their customers at risk of
> > phishing, let them.
> 
> How are you defining the concept of VeriSign customer?  Someone who expects to
> be able to use the Unicode form of an IDN in .com?  Someone who is using IE for
> the task?

Someone who registers a domain in .com - like Paypal, Inc. or Bank Of America.
These companies are put at greater risk of damaged reputations and irate
customers with monetary losses because of Verisign's (and the other .com
registrars') lack of control over domain registration.

Gerv
>> Take a look at the documentation provided by the TLDs that support IDN.
> 
> Could you give links to such documentation for a few different TLDs?

Most of it is in the national languages of ccTLD registries. Dot-com provides a
good example of the way a large commercial gTLD is doing this, but given their
proprietary interest in browser support, they only point users in the direction
of their own plug-in:

http://verisign.com/products-services/naming-and-directory-services/naming-services/internationalized-domain-names/index.html

The other end of the gTLD scale -- small domain, non-profit operation -- is
dot-museum, http://about.museum/idn/. Their list of supported languages contains
links to a number of ccTLD IDN support sites.

> Are you saying that providing "Warning! This could be a scam!" information on
> a subset of IDN names is "rigorous support", whereas allowing all IDNs except
> those in known-risky TLDs is not?

I believe the warning text to be an excellent means for balancing the two
considerations. What I don't want to see happen is the resolution of IDNs made
conditional.
 
> I would hope that, in a few months, the problematic registrars will see the
> writing on the wall and fix their policies.

Again, registrars are not responsible for the IDN policies of TLDs.

> By the time IDN use becomes widespread, everything will be sweetness and
> light again. However, some pressure needs to be put on them to achieve this
> aim.

The question of who needs to apply what kind of pressure to whom, and how that
might effectively be done, goes way, way, beyond the scope of the present
discussion. The whom is, however, not the TLD registrars.
> I believe the warning text to be an excellent means for balancing the two
> considerations. What I don't want to see happen is the resolution of IDNs made
> conditional.

So people are not going to recommend Mozilla if we disable IDN in problematic
TLDs, but they are if some (random, to the user) uses of an IDN pop up a scary
warning message?

> > I would hope that, in a few months, the problematic registrars will see the
> > writing on the wall and fix their policies.
> 
> Again, registrars are not responsible for the IDN policies of TLDs.

The link's not hard to understand. This is the way we put pressure on:

- disable IDN for problematic TLDs
- fewer people register IDN names in those TLDs, because
- registrars get less money
- registrars either get together to solve it, or put pressure on the registry
- registry or registrars implement sensible policies
- we lift the block.

How else do you suggest that we persuade them to sort their acts out?

> The question of who needs to apply what kind of pressure to whom, and how that
> might effectively be done, goes way, way, beyond the scope of the present
> discussion. 

It's precisely the present discussion - because if we are going to take
responsibility in the browser for solving this problem (which would be contra to
Opera's stance, and my understanding of current mozilla.org staff opinion) then
our course of action is going to be very different to that if we are making it
clear it's a registry problem.

Gerv 
My apologies; in a rush to catch Match of the Day, I left that message
incomplete and unnecessarily brusque. Attempt 2 at the middle section:

> Here's how we establish a link:
> 
> - disable IDN for problematic TLDs
> - fewer people register IDN names in those TLDs, because they appear ugly
> - registrars get less money
> - registrars either get together to solve it, or put pressure on the registry
> - registry or registrars implement sensible policies
> - we lift the block.
> 
> How else do you suggest that we persuade them to sort their acts out?

Gerv
How do we determine which TLDs are "safe" (or unsafe)? There's a fairly short
list registered with IANA (comment 35).  Other comments state belief that other
TLDs are responsible (e.g. au,de). Some suggest going by statements made by the
registries themselves, but Verisign says the right things
(http://verisign.com/products-services/naming-and-directory-services/naming-services/internationalized-domain-names/page_001394.html#01000006)
while "paypal.com" in a mixture of latin and cyrillic shows .com is broken.
It is difficult to decide on a black (or white) list of TLDs. However:

(In reply to comment #135)
> This has many significant advantages. It's fairly simple to code, and doesn't
> penalise IDN domain owners and registrars who have been doing the right thing.
> [It doesn't place any restrictions on what domains are allowed.] It requires no
> user configuration, and no assumptions about what characters a given user might
> be familiar with. It involves no pop-ups. It places the blame and the
> responsibility where it really belongs, and kills any homograph attacks stone
dead.

I have another proposal that meets all of the above criteria, except for the
one enclosed in [ and ] which I feel is inappropriate to begin with.

We simply use the subset of IANA IDN tables that we deem safe as a filter.
If a domain name contains characters outside the TLD's table, we present
the Punycode form of the name in the UI.

The safe IANA tables are ones that either have a single language or have
multiple languages but do not have homographs. The JP table is a good
example of a safe table.

On the other hand, the .biz German table is unsafe because it implies that
other languages such as Russian might be registered in the future. This
means that .biz allows homographs, so it doesn't pass our test.

This places the pressure where it belongs, i.e. on the guidelines authors,
the IANA registry and the domain registries.

Mozilla could either start with the very small number of safe IANA IDN
tables (putting some pressure on registries that haven't registered their
table yet) or a larger number of tables that we come up with on our own
(which the TLD registries can use for their IANA submission if they wish).
TLDs without tables would default to US-ASCII, a safe set.

Mozilla would enlarge its set of tables via new releases, auto-updates or
user-intervention-less secure downloads.

This means that our rules would never allow Cyrillic small letter 'a' to
be used in a *.com domain name, but that's OK because the Latin small
letter 'a' looks the same to a human and its character code should not
be a concern.
(In reply to comment #154)
> This places the pressure where it belongs, i.e. on the guidelines authors,
> the IANA registry and the domain registries.

It also pressures the domain registrars.

> This means that our rules would never allow Cyrillic small letter 'a' to
> be used in a *.com domain name, but that's OK because the Latin small
> letter 'a' looks the same to a human and its character code should not
> be a concern.

Sorry, what I meant to say is that we would allow any domain name but we
would display the Punycode form if it didn't follow the TLD's rules.

I should add that my proposal provides for a second filter. If the domain
registrar fails to filter a new domain name or to remove an old spoof,
then our filter will catch it.
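(A sketch of this table-based filter in Python; the table below is a
deliberately tiny illustrative excerpt -- real tables would come from the IANA
IDN registry, and the real JP table also includes kanji:)

ASCII_DNS = set("abcdefghijklmnopqrstuvwxyz0123456789-")
SAFE_TABLES = {
    "jp": ASCII_DNS | {chr(c) for c in range(0x3041, 0x30FF)},  # kana excerpt
}

def follows_tld_table(host):
    # Every character of every name label must be in the TLD's table;
    # TLDs without a registered table default to US-ASCII, a safe set.
    labels = host.lower().split(".")
    allowed = SAFE_TABLES.get(labels[-1], ASCII_DNS)
    return all(ch in allowed for label in labels[:-1] for ch in label)

def display_form(host):
    # Outside the table: present the Punycode form in the UI.
    return host if follows_tld_table(host) else host.encode("idna").decode("ascii")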
My most recent proposal is in some sense taking Mozilla's standards stance
to its logical extreme. I.e. Mozilla does not simply follow the ECMAScript
standard as is; it blocks pop-up windows.

My proposal does not simply follow the ICANN guidelines as is; it blocks
homographs.

However, a lot of people would point out that this is like the tail wagging
the dog. Me being the tail and ICANN, IETF, IANA, Unicode Consortium,
domain registries and registrars being the dog.

So it would probably be prudent for Mozilla to heed Neil's advice:

(In reply to comment #143)
> I continue to believe that stricter IDN filtering rules, both at the registrar
> and browser, as suggested in my earlier proposals, are necessary in the medium-
> and longer term. However, based on input off-line from a number of people, I now
> believe that this can best be achieved by working with IANA / ICANN and the
> registry community, so that we do not have to take responsibility for an ad-hoc
> non-standard implementation, but can instead be seen to be implementing a
> solution based on authoritative standards.

Having said that, the easiest way to come up with a list of TLDs for Gerv's
proposal is to first choose to use a white list of such TLDs and then to
take a closer look at the IANA IDN registry.

I propose to use the following list of TLDs initially: jp, kr and th.
(In reply to comment #156)
> My most recent proposal is in some sense taking Mozilla's standards stance
> to its logical extreme. I.e. Mozilla does not simply follow the ECMAScript
> standard as is; it blocks pop-up windows.

window.open is not part of any standard, _especially_ not ECMAScript.
(In reply to comment #157)

Chuckle. I really ought to just shut up... :-)
The ICANN guidelines include the following:

"top-level domain registries will (a) associate each registered
internationalized domain name with one language or set of languages"

I wonder if this language or set of languages can be looked up via DNS
itself. I.e. is there a DNS record for the language(s)?

Or are the registrars only expected to apply language rules at the time
of registration itself?
(In reply to comment #145)
> Rather than just showing the punycode in the status bar - which many people
> either don't have turned on or don't notice, and which may even be altered by
> a script on the website (unless that ability has been disabled by the user) -
> how about displaying an information bar at the top, just like the popup
> blocker does, that explains that it is an internationalised domain name,
> notes the possible security implications, and shows the punycode version.
> 
> It should provide a more information link/button and the option to add the site
> to trusted and untrusted lists.  This could be used in conjunction with any of
> the other proposed checks so that it's not shown for every single IDN, just
> those that mozilla detects as the likely candidates for spoofs.

I like this idea a lot. To me, it'll be the best method of alerting the user
without causing inconvenience, and this will also give them a better sense of
security.
(In reply to comment #156)
> I propose to use the following list of TLDs initially: jp, kr and th.

I guess many people would complain that this list is too short. There appear
to be quite a few IDNs registered around the world, e.g. Europe, China.

If Mozilla requires these TLD representatives to register their table with
IANA in order to be included in Mozilla's white list, they might be in too
much of a rush to compile the table and submit it, increasing the risk of
mistakes.

Perhaps we should instead have them point us at any existing tables they
have (e.g. the DE table mentioned here earlier) and have them state their
intent, in writing, to register with IANA.

This way, they can take their time to polish the table(s) and also their
implementations of filters at registrars, etc.
(In reply to comment #151)
> So people are not going to recommend Mozilla if we disable IDN in problematic
> TLDs, but they are if some (random, to the user) uses of an IDN pop up a
> scary warning message?

If someone is actively maintaining a list of IDNA-compliant applications, it
would be reasonable for them to remove an item from the list if it ceased to
fulfill all "must" requirements stated in the protocol, or was otherwise
rendered inapplicable to the entire TLD namespace. If useful functionality is
added to an IDNA-aware application, it would be equally reasonable for the
application to remain on the list.

> - disable IDN for problematic TLDs
> - fewer people register IDN names in those TLDs, because they appear ugly
> - registrars get less money
> - registrars either get together to solve it, or put pressure on the registry
> - registry or registrars implement sensible policies
> - we lift the block.
>
> How else do you suggest that we persuade them to sort their acts out?

Registrars provide automated front-ends to the TLD registries, with the policy
engines residing on the latter platform. Registrars may freely decide which TLDs
they wish to service, but then need to support each selected TLD fully.

Registrars compete fiercely with each other on a market that is still only a
shadow of what it once was. If you wish to teach individual registries a lesson
by somehow whipping their sales agents into compliance, you'd need to be able to
do this without leaving any remaining registration channel into a shunned TLD.
As it happens the only gTLD where there is real IDN money is .com. Network
Solutions (the largest of the registrars, previously doing business as VeriSign
Registrar) is certain to support this domain regardless of what any other
registrar may feel compelled to do -- and all the other registrars know it.
Note that if Mozilla is going to use published language tables, you're going to
have to look rather harder than just at the IANA registry.

The ICANN guidelines only require the registry to publish their language tables;
they can do so by any appropriate means (e.g. by placing them on their web site).
 Use of the IANA registration mechanism is entirely optional, so it would be
inappropriate to penalize those registries that haven't used it.
This was just posted on the Unicode mailing list:

From: 	Mark Davis <mark.davis@jtcsv.com>
To: 	Unicode Mailing List <unicode@unicode.org>, UnicoRe Mailing List
<unicore@unicode.org>
Subject: 	IDN Security
Date: 	Mon, 14 Feb 2005 09:20:06 -0800  (20:50 IRST)

There were a few items coming out of the UTC meeting in regards to IDN.

1. We will be adding to draft UTR #36: Security Considerations for the
Implementation of Unicode and Related Technology
(http://unicode.org/reports/tr36/). In particular, this will include more
background information and a set of specific recommendations for both
browsers and registrars; both to be refined over time.

2. The UTC has authorized the editorial committee to make updates to #36
between UTC meetings, to allow for faster turn-around in presenting both
background material and recommendations. We will try to incorporate ideas
presented on these lists and others, so suggestions are welcome.

3. The UTR had for some time recommended the development of data on visually
confusables, and we will be starting to collect data to test the feasibility
of different approaches. In regards to that, I'll call people's attention to
the chart on http://www.unicode.org/reports/tr36/idn-chars.html, that shows
the permissible IDN characters, ordered by script, then whether decomposable
or not, then according to UCA collation order. (These are characters after
StringPrep has been performed, so case-folding and normalization have
already been applied.)

&#8206;Mark
Does the IDN display only happen in the URL bar?
What about email headers? If I register a domain as "ao" + Cyrillic l + ".com"
and send an email as "ytang0648@ao" + Cyrillic l + ".com" to you, and you look
at it in Thunderbird, when you reply, will it go back to ytang0648@aol.com, or
may it go back to "ytang0648@ao" + Cyrillic l + ".com"?
about "mix set"- considering someone use "www." + "ÈZÈQÈUÈi" + ".com" for www.ebay.com 

all the characters in "ÈZÈQÈUÈi" are in Cyrillic block. Not a mix set. 
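(For what it's worth: IDNA requires the ACE (punycode) form on the wire, so a
reply goes back to whichever domain that ACE name identifies -- which for the
spoofed address is not aol.com. A quick round trip with Python's built-in IDNA
codec, using the Cyrillic letter named in the comment above:)

spoof = "ao\u043b.com"                          # "ao" + CYRILLIC SMALL LETTER EL
print(spoof.encode("idna") == "aol.com".encode("idna"))  # False: different domains
print("aol.com".encode("idna"))                 # b'aol.com'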
It's going to take longer to sort this out.  Minus for 1.0.1 and plus for 1.1.
Flags: blocking1.8b2+
Flags: blocking1.8b-
Flags: blocking-aviary1.1+
Flags: blocking-aviary1.0.1?
Flags: blocking-aviary1.0.1-
I think trying to map against specific glyphs that are similar is always going
to be error-prone and difficult unless all browsers standardise on a font across
platforms for display of URLs.

My suggestion for immediate response to this bug is as follows:

1) Use a different colour for the address bar for domains that are not in the
range provided by the user's default character encoding (even if this is ASCII
for, say, Japanese users). This treats all domains equally. What a user will need
to know is when a domain is not in their default encoding (otherwise they can
basically trust the glyphs I guess).

2) For domains covered by 1) above, also include the raw Unicode character codes
of the domain (as opposed to the friendly view) in brackets after the domain name.
Are these adequate visual cues?

3) The first time any site is visited that meets the condition 1) above, display
a warning to the user, explaining what has happened, and give them the option to
permanently disable this warning.

4) I think blacklists (other than country and international standards/guideline
based) should be handled at the proxy layer using real-time block lists. If
these things aren't handled locally in this way, there is a danger that
legitimate users who are unfortunately caught by them might consider that they
are being denied service.
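(A rough sketch of check 1) in Python, using the locale charset as a stand-in
for the user's default character encoding; note that a UTF-8 locale accepts
every character, so a real implementation would need per-language repertoires
instead:)

import locale

def outside_default_encoding(host):
    charset = locale.getpreferredencoding()      # e.g. "cp1252"
    try:
        host.encode(charset)                     # representable: leave alone
        return False
    except UnicodeEncodeError:
        return True                              # flag: recolour the address bar

# outside_default_encoding("www.p\u0430ypal.com") -> True under cp1252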
First off, here's an attachment showing how the IDN URI looked in my address
bar the first time I went to it.

Imagine my surprise when I saw it and said, "that doesn't work at all... what
the hell is that letter?"

I'm not espousing that we make different fonts appear with impossible to read
characters (my OS has upsettingly done that for me already by screwing with my
Intl. fonts it seems)... What I am saying is that we should notice how
different we can make something (and how obvious it can appear) when we use
different fonts, make things bold, etc.

Now, Paul Hoffman suggests using this same sort of solution in his blog
http://lookit.proper.com/archives/000302.html

We have to remember that making the IDN feature obnoxious will only breed
dissatisfaction, but making it noticeable will hopefully alert people to what is
happening.

Also, we should not (and i'd imagine "Must Not" in some spec somewhere)
manipulate the URI in such a way that renders it useless. In other words, we
cannot put extra characters in the Address bar like exclamation marks, etc.
While our browser might be able to handle them, we have to remember that people
bookmark, copy/paste, save, and physically write (using pen and paper) URIs all
the time. We can't have "but if I type it in my browser it works fine,
Grandma".

So setting colors, spacing, weight, etc. of various character sets by default
should seem to be a valid solution. Obviously, we should choose fonts that
match the user's localization as best as possible (i.e. Cyrillic users could
have the Cyrillic letters appear less intrusively since they'd most likely be
seeing them more.)


Also, as Paul Hoffman suggests, information about the specifics of the issue
should be presented to the user. I think we've all decided that dialogs do more
harm than good, but the new "Alert Bars" that firefox and thunderbird use when
installing software or loading remote images all seem like valid (less)
obtrusive notifications. And having them constantly wouldn't be necessary... we
could simply state to the user, "This page features mixed characters from
different languages. If this is unexpected, this page may be fraudulent." Then
we could have a button (much like the "allowed sites button" for firefox) that
would dismiss this message for particular combinations of character sets.

In this fashion, a user could easily deactivate the warning for certain
character set combinations that they commonly visit. Localizations could even
make the setting for them. Other users will still be notified (both in the
Address bar itself, and in the "Alert Bar" message)


Obviously, no solution will fully remove the danger of this type of spoof, but
a simple consistent system of alerting the user will sufficiently enable them
to make informed decisions.
(In reply to comment #169)
> Created an attachment (id=174352)
> Screenshot showing Mozilla's default rendering on my computer.
> 
> First off, here's an attachment showing how the IDN URI looked in my address
> bar the first time I went to it.
> 
> Imagine my suprise when I saw it and said, "that doesn't work at all... what
> the hell is that letter?"

Well, it didn't work well in your case because you used an X11core font build of
Mozilla on a Unix-like platform. Mozilla built with a modern font system (i.e.
Xft) on Unix as well as on Windows and Mac OS X makes it all but impossible to
tell Cyrillic 'a' from Latin 'a' because in some (truetype and opentype) fonts
covering both Latin and Cyrillic (there are many of them), a *single* glyph is
very likely to be shared by two letters so that they look 100% identical.

As demonstrated by your screenshot, what's been regarded as a hindrance to a
good looking rendering (an inflexible partitioning of characters into font
character sets in X11)  could give us a hint about a potential solution.

  
> [remainder of comment #169 snipped - see above]
Here is my proposal (I'm a native Russian speaker who often visits English,
German, Russian and Ukrainian sites).
In short: display the alphabet name next to the second-level part of the
domain. Let's take for example the abbreviation "pap" written in Latin and in
Cyrillic letters; it looks absolutely the same in Russian and Ukrainian, but I
can't blacklist it (remember, I read Russian, Ukrainian and English). Here is
how the Latin "pap" can be displayed in different domains:

pap.com
pap.de
pap.ru
pap.ua

Here is how pap written in Cyrillic letters can be displayed:
pap--Cyrillic.com
pap--Cyrillic.de
pap--Cyrillic.ru
pap--Cyrillic.ua

If we mix Cyrillic and Latin letters the browser will display them as
pap--UNKNOWN.com
pap--UNKNOWN.de
pap--UNKNOWN.ru
pap--UNKNOWN.ua

Good, but now we have a problem with the Ukrainian language: it
uses a Latin i. So we have to add a Ukrainian language detector.
For example, for the word pip where p is Cyrillic and i is
Latin, the browser will show:

pip--Ukrainian.com
pip--Ukrainian.de
pip--Ukrainian.ru
pip--Ukrainian.ua


URLs must be displayed this way everywhere, not only in the address
area! The alphabet detector can return three types of results: an
alphabet name, UNKNOWN and INVALID. When detectors for all Cyrillic
languages are completed, it will be possible to forbid any other mixes of
Cyrillic and Latin alphabets, so paypal with some Cyrillic letters
will simply be an error. You won't even be able to visit it.
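(A sketch of this alphabet-annotation idea in Python; deriving the script from
Unicode character names is a crude stand-in for real Unicode script data, and
it deliberately leaves out the Ukrainian special case discussed in the
following comments:)

import unicodedata

def script_of(ch):
    return unicodedata.name(ch, "UNKNOWN").split()[0]   # "LATIN", "CYRILLIC", ...

def annotate(label):
    scripts = {script_of(ch) for ch in label if ch.isalpha()}
    if not scripts or scripts == {"LATIN"}:
        return label                                      # the familiar case
    if len(scripts) == 1:
        return label + "--" + scripts.pop().capitalize()  # e.g. "pap--Cyrillic"
    return label + "--UNKNOWN"                            # mixed alphabets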
  I don't know about others, but I tend to ignore the banners at the top of a
page unless I have a specific reason to look at them.  I like the banner concept
(ie. that it's less annoying than a pop-up).
  In order to avoid a user simply ignoring the banner or mindlessly clicking
away a pop-up box, I think that a user should be unable to submit information to
a site without dismissing the banner.  I'm imagining that if the banner is not
dismissed, a pop-up box would appear saying something like: "Due to a potential
homograph attack, mozilla has blocked submitting data to this site.  Please follow _this
link_ for more information on homograph attacks.  To allow data submission,
please dismiss the homograph attack banner. [Cancel]"
  The wording probably needs work, but I hope it gets the idea across.  I think
the pop-up box has to have only one button on it in order to force a user to
read it.  I don't expect this kind of confirmation if the user has submitted
information to the site previously, but for the first visit I think it would be
acceptable.
(In reply to comment #172)
>   I don't know about others, but I tend to ignore the banners at the top of a
> page unless I have a specific reason to look at them.

When you are about to enter your credit card number, do you look at the
location bar? (And do you like Mozilla's default font there? :-)

>   In order to avoid a user simply ignoring the banner or mindlessly clicking
> away a pop-up box, I think that a user should be unable to submit information to
> a site without dismissing the banner.

What if the page says:

"Due to an increased number of identity theft cases on the Internet
recently, we strongly urge you to use registered mail to confirm your
password. Our secure address is P.O. Box 2369, Miami, Florida 34210."
(In reply to comment #45)
> ... For example, a different color ...

I've been arguing in Bug 22183 for a color-coded URL, which would (I believe)
help with the IDN issue as well. 
* https://bugzilla.mozilla.org/show_bug.cgi?id=22183#c233
* https://bugzilla.mozilla.org/show_bug.cgi?id=22183#c237

Then, today I found on boingboing.net a link to
http://lookit.proper.com/archives/000302.html which talks about having a
different background color for homographs in a tooltip. 

I think that a different background color (or style) would work well
in the URL bar.
The following announcement was posted to the mozilla.{seamonkey,security}
newsgroups recently:

http://weblogs.mozillazine.org/gerv/archives/007556.html
Here is some info about 3 IDN plug-ins for MSIE and whether MSIE might
support IDN in the future:

http://support.microsoft.com/?kbid=842848

I found the above link in the following:

http://www.w3.org/International/articles/idn-and-iri/
(In reply to comment #171)

Thank you for sending this info, especially the Ukrainian info.

Do you think that Cyrillic domain registrants might also wish to include
some "Latin" (ASCII) letters in their domain names? (For example, some
foreign names like "IBM" or some better example?)

Neil, would it be possible to use code points instead of ranges in your
proposal's script detection in order to support Ukrainian?
(In reply to comment #173)
> >   In order to avoid a user simply ignoring the banner or mindlessly clicking
> > away a pop-up box, I think that a user should be unable to submit information to
> > a site without dismissing the banner.
> 
> What if the page says:
> 
> "Due to an increased number of identity theft cases on the Internet
> recently, we strongly urge you to use registered mail to confirm your
> password. Our secure address is P.O. Box 2369, Miami, Florida 34210."

That's just suspicious to begin with ;)  Short of incredibly annoying actions on
every potential homograph page (such as blacking out the displayed page),
there's no real way (for mozilla) to stop that.

Given that the confirmation I gave will only (incorrectly) occur if all the
following are true:
  * The page is falsely detected as a homograph by mozilla
  * The user ignored the warning banner
  * The site requires the user to type in and submit information
    (as opposed to selecting choices from drop down boxes or clicking links)
  * It's the first time the user submits information to the site
I don't think the forced acknowledgement is unreasonable...
(In reply to comment #171)
> Good, but now we have a problem with the Ukrainian language: it
> uses a Latin i. So we have to add a Ukrainian language detector.
> For example, for the word pip where p is Cyrillic and i is
> Latin, the browser will show:

Actually, I don't understand this. The Unicode book says that the following
character exists:

0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
The test from secunia.com (http://www.paypаl.com/ where the second 'a' is fake)
works even when network.enableIDN is set to false, at least on SuSE 9.1, Firefox
1.0 i686.
(In reply to comment #179)
> (In reply to comment #171)
> > Good, but now we have a problem with the Ukrainian language: it
> > uses a Latin i. So we have to add a Ukrainian language detector.
> > For example, for the word pip where p is Cyrillic and i is
> > Latin, the browser will show:
> 
> Actually, I don't understand this. The Unicode book says that the following
> character exists:
> 
> 0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I

I can confirm that this character should indeed be used in Ukrainian. Someone
else on the IDN list also raised that Tajik mixes Cyrillic and Latin, which
again turned out to be wrong. As far as I know, the only language that mixed
Latin and Cyrillic
is a very old orthography of Kurdish, which is rarely used today. Most Kurdish
writers use either Arabic or Latin.
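(This is easy to confirm against the Unicode database, e.g. with Python's
unicodedata module -- so an all-Cyrillic label check needs no Latin-i
exception for Ukrainian:)

import unicodedata
print(unicodedata.name("\u0456"))
# CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I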
How about: if each label in the https domain name is composed of characters in
one of the languages the user understands, then in the status bar display
"unicode domain name + padlock", else display "punycode domain name + padlock".
At the top of the browser, the address is always in Unicode (to avoid cultural
offence), but there could be a small IDN symbol to the left of the favicon,
clickable to see the punycode and IP address.

If the accept-language list was empty, the browser locale language would be
used.  Otherwise the accept-language list should be used and the locale ignored.
 (If I'm in an internet cafe in Germany (I don't speak German), I add "en, eo,
fr" to the accept-languages (as I read English/Esperanto/French but not German).)

What characters are in each language?  For Europe, see
http://www.evertype.com/alphabets/index.html .  Each domain label should be in
one and only one of the user's understood languages (but edge conditions do
exist, e.g. see http://www.toysrus.co.uk -- could the r be Cyrillic?).

cheers,
Aaron
http://lingvo.org/idnd
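(A sketch of this per-language test in Python; the repertoires are
illustrative stubs, not real alphabet data such as the evertype.com tables:)

LETTERS = "abcdefghijklmnopqrstuvwxyz0123456789-"
REPERTOIRES = {
    "en": set(LETTERS),
    "fr": set(LETTERS) | set("àâæçéèêëîïôœùûüÿ"),
}

def label_ok(label, langs):
    # A label passes if it fits entirely within at least one of the
    # user's accept-language repertoires.
    return any(set(label.lower()) <= REPERTOIRES[lang]
               for lang in langs if lang in REPERTOIRES)

def host_ok(host, langs):
    return all(label_ok(label, langs) for label in host.split("."))

# host_ok("www.toysrus.co.uk", ["en", "eo", "fr"]) -> True; the same name
# with a Cyrillic look-alike letter would fail every repertoire.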
Another way to protect the user against this
(and phishing attempts) is to check when
the server's DNS record was registered.

If it's less than (say) 1 week ago, then alert the user,
and warn that the server is new, and to be cautious
with personal / financial information.

Of course, this will encourage phishers to sit on
a newly registered site before using it....
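(A rough sketch of the registration-age check in Python; it shells out to a
Unix whois binary and assumes a "Creation Date:" line as found in .com
records -- formats vary by registry, so real code would need per-TLD parsing:)

import re
import subprocess
from datetime import datetime, timedelta, timezone

def is_newly_registered(domain, days=7):
    out = subprocess.run(["whois", domain],
                         capture_output=True, text=True).stdout
    match = re.search(r"Creation Date:\s*(\d{4}-\d{2}-\d{2})", out)
    if not match:
        return False                      # age unknown: don't warn
    created = datetime.strptime(match.group(1), "%Y-%m-%d")
    created = created.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - created < timedelta(days=days)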
A couple of comments.

1. White/blacklist of TLDs ignores other spoofing possibilities.  It is possible
for IDN characters to appear in more than just the domain name.  Combined with a
DNS cache poisoning attack, it is possible to then spoof a third-level hostname.
(e.g. &omega;&omega;&omega;.foobank.de)

2. Language detection by registrars seems pretty unreliable.  I registered a
Japanese-language IDN .com domain with a French registrar and they assigned the
language to be French.

3. I like the both the information bar and highlighting ideas.  Ideally, only
the non-ASCII characters will be highlighted in the host name.  Mousing over
would then provide more information.

-Chris
(In reply to comment #184)

Re: 1., the TLD white/black list proposal would simply apply to *all*
domains under those TLDs, so the 3rd level IDN domain name spoof you
mention is simply reduced to the DNS cache poisoning problem itself,
which I submit is outside the scope of this bug report. No?

Re: 2., when/how did you find out that the registrar assigned the French
language to your domain name? Thanks!
The article "Phishing - Browser-based Defences" at
http://www.gerv.net/security/phishing-browser-defences.html
makes a lot of good points. Here are two comments
about it, though.

First, I think it sells the colorizing of letters
a little short.  There are millions of users who would
NEVER visit a DNS site name that had mixed scripts.
Obviously, the millions of people who only read & write
English are unlikely to WANT to visit a site that uses
non-English letters, since such sites tend to not use English (!).
Only colorizing when they're mixed, and letting people
turn it off if it's "ugly", would help millions of people.
Yes, the colorblind & some users will not be helped, but
if you help the majority, it's unlikely phishers will perform
the attack.  A phishing attack that only works against
the colorblind is less likely to be attempted.

One simple solution: you can often guess the best default
("should I colorize or not?") based on the user's language
settings; then let the user set it differently as an option.

Second, I think you'll need more glyphs than 2
if you do the "symbol glyphs" approach (which is
an interesting idea!).  A phisher
could create a program that randomly morphs a
domain name in many different ways, trying to
find a good substitution, and then hashing the result
to see if the glyphs match.

The way to figure out the required number of glyphs
is to imagine a program that can create a large number of
"phish food" domain names from a given name
(substitute l for 1, substitute O for 0, do both, ...),
and see how many alternatives you can find for a
given name.  Here's a back-of-the-envelope calculation;
say a phished domain name has no more
than 10 DNS characters in the name, and on
average each character can be reasonably substituted
with 3 other characters. (These are guessed numbers,
but it should be possible to figure out REAL values
from these using common phished domains like
paypal.com, ebay.com, etc.). That means that the set of
alternatives is (3+1)^10-1, i.e., 1,048,575 alternatives -
about one million.  A two-char glyph only gives
64^2 = 4,096 hashes, so a phisher is almost
certain to find several alternatives with the same visual hash.
Four glyphs gives you 64^4 = 16,777,216... a phisher
only has approximately 1/16 chance of finding a match.
Five glyphs gives you 1,073,741,824... the phisher
has around a 0.1% chance of finding a match.
Shorter domain names are even harder to forge.

--- David A. Wheeler 
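(The back-of-the-envelope numbers above check out; spelled out in Python with
the same assumed figures -- 10 characters, 3 substitutes each, 64 possible
symbols per hash glyph:)

candidates = (3 + 1) ** 10 - 1        # 1,048,575 look-alike names
for glyphs in (2, 4, 5):
    hashes = 64 ** glyphs
    print(glyphs, hashes, candidates / hashes)
# 2 glyphs ->         4,096 hashes: ~256 expected collisions per target
# 4 glyphs ->    16,777,216 hashes: roughly a 1/16 chance of a collision
# 5 glyphs -> 1,073,741,824 hashes: roughly a 0.1% chance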
"the millions of people who only read & write
English are unlikely to WANT to visit a site that uses
non-English letters, since such sites tend to not use English (!)."

So some 350 million native English speakers (assuming none of them speaks a
foreign language) are a majority now? ;) But I'm in favor of some sort of color
code as well. It's quite similar to the glyphs Gerv is suggesting in that people
see a difference between their usual site and a phishing site. While a hash
can be "spoofed" with a hash collision or a similar-looking sign, the color
coding would only tell you about the character encoding. The difficulty will be
to compartmentalize the character encodings in such a way that it is unlikely
that two different encodings with the same color could be used to spoof a
similar-looking domain.
(In reply to comment #38)
> Created an attachment (id=173729)
> Proposed blacklist of Unicode code points that should never occur in URLs

Please don't blacklist the Runic alphabet, as proposed by your list!

In general, I'm against blacklisting by alphabet, because this introduces issues
of favoritism.  There are homograph attacks in almost every alphabet, especially
considering that many of them have common origins (Phoenician, etc.).

A selfish reason: I put up a site just for fun, (Thurisaz).com: xn--9ve.com

There's still users of the Runic alphabet out there (mostly scholarly/religious,
as with other "obsolete" alphabets).  Google "futhark".

Proposed solutions that are more inclusive (and have already been discussed in
this bug):

* For any characters that are not traditional domain name characters (A-Z, 0-9,
hyphen), loudly mark them: perhaps with a red background behind them.  Bounds on
min/max kerning distance would ensure a spammer couldn't sneak invisible
characters in there.
* Pop up a warning box that shows the user-visible domain name, and also the
encoded (xn--) domain name that it maps to.  Ask the user if they truly want to
go to the xn-- site.
* Any other solution, except a blacklist that imposes an outright ban on entire
alphabets....
(In reply to comment #185)

Re: #1, No, it is more than that.  Consider that perhaps the .de domain does IDN
well, checking for homographs in registered domains.  Mozilla then whitelists
IDNs for .de domains, considering that the TLD should be safe.  Registrars
however have no clue about third-level domains, and anyone running a DNS server
(or doing cache poisoning attacks) can create any third-level domain (and lower)
that they want.  So, although the .de TLD checks all the xxxxx.de domains to not
have homograph attacks, there is NO guarantee about homograph attacks further
down the chain, in perhaps yyIDNyy.xxxxx.de .

Re: #2, When I registered the domain, it said the language assignment was
French... but it doesn't appear in the published whois record.
On http://4t2.cc/mozilla/idn/ I have a small extension for Firefox that warns on
an IDN and shows the corresponding punycode. Could this be a way to prevent IDN
phishing?
(In reply to comment #190)
> On http://4t2.cc/mozilla/idn/ I have a small extension for Firefox that warns on
> an IDN and shows the corresponding punycode. Could this be a way to prevent
> IDN phishing?

Certainly, that interface is much like what I had in mind for my suggestion in
comment 145.  However, the message needs to be greatly improved.  A typical user
won't have a clue about IDN or punycode.  How about something more user friendly:

Warning: www.paypal.com contains some characters from international alphabets.
Some international characters look very similar to, or the same as, each other,
which may be used to spoof web site addresses. _More information_

I'm sure it can be improved a lot and the more information link/button could
reveal detailed information, much like Paul Hoffman suggested [1] as well as
providing more user friendly explanations.

[1] http://lookit.proper.com/
(in reply to comment 133)

If a domain name is a Russian word written entirely in homographically
equivalent Latin letters instead of the original Cyrillic letters, it does not
have to be a spoof. Please note: long before IDN was first presented, some
Russian sites already had domain names which contained only Latin letters
homographically equivalent to Cyrillic -- this was a pre-IDN hack to include
Russian words in ASCII domain names.

The Latin homographs are: A/a, B/b, C/c, E/e, H, K, M/m, n, O/o, P/p, T, u, X/x, y

Their respective Cyrillic equvalents (named according to
http://www.unicode.org/charts/PDF/U0400.pdf chart): A (capital/small), VE
(capital)/SOFT SIGN, ES (capital/small), IE (capital/small), EN (capital), KA
(capital), EM (capital)/TE (small, cursive variation), PE (small, cursive
variation), O (capital/small), ER (capital/small), TE (capital), I (small,
cursive variation), HA (capital/small), U (small)

Please also note that Cyrillic letter ZE is pretty much homographical to DIGIT
THREE, and BE (small) is more or less homographical to DIGIT SIX.

Cyrillic letter YERU is homographical to "bI" or "bl" (two symbols together).

So we have more than half of the alphabet -- if you carefully avoid Russian
letters GHE, DE, IO, ZHE, SHORT I, EL, EF, TSE, CHE, SHA, SHCHA, HARD SIGN, E,
YU, and YA, then you may write Russian words with Latin letters (either capital
or small).

There are 33 letters in Russian alphabet (see
http://learningrussian.com/alphabet.htm for details). Only 15 don't have
homographs in Latin or digits. Please note also: it is allowed by Russian rules
to use IE letter ('e' letter, don't think of MSIE) instead of IO letter in most
words. So, effectively, only 14 Russian letters cannot be presented
homographically in pure ASCII.



The Russian Alphabet (Unicode names --> ASCII homographs, if exist):

A   --> A/a
BE  --> 6
VE  --> B
GHE
DE
IE  --> E/e
IO  --> E/e (not allowed in some words)
ZHE
ZE  --> 3
I   --> u
SHORT I
KA  --> K
EL
EM  --> M
EN  --> H
O   --> O/o
PE  --> n
ER  --> P/p
ES  --> C/c
TE  --> T/m
U   --> y
EF
HA  --> X/x
TSE
CHE
SHA
SHCHA
HARD SIGN
YERU --> bI/bl
SOFT SIGN -> b
E
YU
YA



Some existing (registered and working) domain names using this technique (some I
knew of, some I've found just now by combining the letters enumerated above into
valid Russian words -- these homographs are also used to represent the Russian
words below in round brackets, because Bugzilla does not support Unicode yet,
AFAIK):

http://www.XAKEP.ru/ (Russian word 'XAKEP' means 'hacker')
http://www.PEKA.ru/ (Russian word 'PEKA' means 'river')
http://CTEHA.ru/ (Russian word 'CTEHA' means 'wall')
http://www.CblP.ru/ (Russian word 'CblP' means 'cheese')
http://www.TEMA.ru/ (Russian word 'TEMA' means 'theme'; and, back-replacing
IE-->IO, we get a variant of the name of this domain's owner)
http://ABTO.ru/ (Russian word 'ABTO' means 'auto')
http://3ByK.ru/ (Russian word '3ByK' means 'sound')
http://KOCMOHABT.ru/ (Russian word 'KOCMOHABT' is equivalent to 'astronaut')
http://MATPAC.ru/ (Russian word 'MATPAC' means 'mattress')
http://MEXA.ru/ (Russian word 'MEXA' means 'furs' -- yes, plural; singular form
http://MEX.ru/ is cybersquatted)
http://MPAMOP.ru/ (Russian word 'MPAMOP' means 'marble')
http://OXPAHA.ru/ (Russian word 'OXPAHA' means 'guard' or process of guarding)

And some not so useful (registered for sale or otherwise cybersquatted) domains:

http://www.BAHHA.ru/ (Russian word 'BAHHA' means 'bath')
http://CyKA.ru/ (Russian word 'CyKA' means 'bitch')
http://MOCKBA.ru/ (Russian word 'MOCKBA' means city of Moscow)
http://EBPO.ru/ (Russian word 'EBPO' means 'euro')
http://KOCMOC.ru/ (Russian word 'KOCMOC' means 'outer space')
http://KPACKA.ru/ (Russian word 'KPACKA' means 'paint, dye, colour')



This is a proof for two more or less separate ideas:

1) Full homography of a domain name can be a legacy of pre-IDN times, a basis
for someone's ethical and legal business, which must not be ruined.

2) We should not stop at considering only symbol-to-symbol homography; two
adjacent symbols of one alphabet may happen to look like a single glyph of
another alphabet.



The hunt for Russian domain names written in pure ASCII will continue.
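(A sketch of the symbol-level folding Sergey describes, in Python; the table
is only a partial excerpt of his list. Folding both names to a common
"skeleton" flags a homograph pair, but note his point 1): the same fold maps
legitimate pre-IDN names like CTEHA.ru onto themselves, so a match cannot by
itself mean "block". His point 2) -- multi-character homographs such as "bI"
for YERU -- needs sequence-level rules that a per-character table cannot
express:)

CYR_TO_LAT = {
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x", "\u044c": "b",
}

def skeleton(name):
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in name.lower())

# skeleton("p\u0430yp\u0430l") == "paypal" -> potential homograph pair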
Sergey: that's very useful - thanks :-)

Everyone else: I'm currently up to my eyeballs in IDN lists and blog posts and
emails. I'm trying to get on top of what everyone is saying this weekend, and
see what emerges.
> Russian words below in round brackets, because Bugzilla does not support 
> Unicode yet, AFAIK):

Well, all you have to do is set 'View | Character Encoding' to UTF-8 before
posting any comment with non-ASCII characters and do the same when viewing any
comment posted in UTF-8. We'd not have NCRs as in comment #133. Please,
everybody, set 'Character Encoding' to UTF-8 before *posting* comments with
non-ASCII characters here and in other bugs at bugzilla.mozilla.org. (Be aware
that changing 'character encoding' resets the content of a textarea - you would
lose everything you've written there, so before changing 'character encoding'
make sure to copy it to the clipboard or elsewhere.)


> http://www.XAKEP.ru/ (Russian word 'XAKEP' means 'hacker')

  'ХАКЕР' in Cyrillic 

One idea (as a part of *multiple* lines of defense): we may render characters
belonging to the 'minority' scripts of a given domain component in a
conspicuous color (and/or font) different from the color used to render
characters in the 'majority' script (the script with the largest count in a
given domain component). For 'pаypаl', where 'а' is Cyrillic, Cyrillic would
be the minority script while Latin would be the majority script.
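(A sketch of this minority-script idea in Python, again using Unicode
character names as a stand-in for real script properties:)

import unicodedata
from collections import Counter

def minority_positions(label):
    # Script of each character, then the positions of every character
    # outside the most common ("majority") script.
    scripts = [unicodedata.name(ch, "?").split()[0] for ch in label]
    majority = Counter(scripts).most_common(1)[0][0]
    return [i for i, s in enumerate(scripts) if s != majority]

# minority_positions("p\u0430yp\u0430l") -> [1, 4]: the two Cyrillic a's,
# which the UI could render in a conspicuous colour or font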
Thank you, Jungshik Shin, this helped.

Ok, now 19 more domains I've found last night (use UTF-8 to view Russian words
below).

Website: http://PEMOHT.ru/
Russian word: ремонт
Translation: 'repair' (noun)
Status: cybersquatted

Website: http://caxap.ru/
Russian word: сахар
Translation: 'sugar' (noun)
Status: used by sugar traders

Website: http://COK.ru/
Russian word: сок
Translation: 'juice' (noun)
Status: list of winners of some lottery drawing among apple juice customers

Website: http://COCKA.ru/
Russian word: соска
Translation: 'comforter, dummy teat'
Status: pornocybersquat

Website: http://coyc.ru/
Russian word: соус
Translation: 'sauce, gravy'
Status: sauce recipe list, FAQ, etc.

Domain: cyxapu.ru
Russian word: сухари
Translation: 'rusks, pieces of dried bread' (plural)
Status: DNS works, but no route to host

Website: http://www.MAKCu.ru/
Russian word: МАКСИ
Translation: this is a trademark that has no direct meaning and translation; it
is most likely derived from the word 'максимум', which means 'maximum' or 'at most'
Status: some cellphone-related business and FAQ

Website: http://yxo.ru/
Russian word: ухо
Translation: 'ear'
Status: webmail provider, hosting provider

Website: http://yKcyc.ru/
Russian word: уксус
Translation: 'vinegar'
Status: cybersquatted by international drug dealers

Website: http://xop.ru/
Russian word: хороший
Translation: 'good' or 'fine' (there's a kind of pun in this domain name:
Russian word 'хор' means 'chorus')
Status: furniture shop

Website: http://XPyCT.ru/
Russian word: хруст
Translation: 'crunch' (noun)
Status: website temporarily closed (probably it exceeded its bandwidth or
other hosting limit)

Website: http://KAPTA.ru/
Russian word: карта
Translation: 'map' or 'card'
Status: communication service card dealer

Website: http://KOBEP.ru/
Russian word: ковёр
Translation: 'carpet' or 'rug'
Status: cybersquatted

Website: http://MAPKA.ru/
Russian word: марка
Translation: '(postage-)stamp' or 'trade-mark' or 'brand'
Status: philatelic activity

Domain: HAyKA.ru
Russian word: наука
Translation: 'science' (noun)
Status: DNS works, but no route to host

Website: http://npoKaT.ru/
Russian word: прокат
Translation: 'hire' (noun)
Status: merchandise for hire

Website: http://PECTOPAH.ru/
Russian word: ресторан
Translation: 'restaurant'
Status: internet shop selling goods and services somehow related to restaurants

Website: http://CTAHOK.ru/
Russian word: станок
Translation: (noun) 'machine-tool' or 'lathe' or 'printing-press'
Status: somehow related to machine-building or machine works; not yet open

Website: http://TypucT.ru/
Russian word: турист
Translation: 'tourist' (noun)
Status: site is under construction
setting bug 237820 as a blocked meta tracker
Blocks: IDN
First let me give a quick reminder of the difference between registries
and registrars:  Each top-level domain has exactly one registry, who
maintains the list of all second-level domains therein.  A TLD may have
many registrars, who interface between the registry and the registrants
(customers).  It's the registry who sets and enforces the policies
regarding which names are allowed; the registrars have no control over
that.  So let's stop picking on the registrars.  :)

A good solution to the problem of homograph attacks is going to take
weeks or months (or longer) to develop.  Therefore it would be good to
immediately deploy something very simple to reduce the severity of the
problem.  I suggest:

1) Have a user-configurable set of TLDs for which domain names show
in ASCII form instead of readable form.  If the browser is following
the IDNA spec then it's already calling ToUnicode() before it ever
displays any domain name; therefore a simple hook or wrapper could be
used to make it call ToASCII() instead for certain TLDs.  The user could
choose whether to use a blacklist (show these TLDs in ASCII form) or
a whitelist (show all TLDs except these in ASCII form).  I think the
default should probably be to just blacklist .net and .com, because I
think those are the only target-rich TLDs whose registries admit IDNs
indiscriminately.  There might be other indiscriminate TLDs (.nu?), but
how many people have important trust relationships with sites in those
TLDs that phishers would be interested in?  In particular, I haven't
heard of problems with .org, which is not managed by Verisign.

2) Make it easy to switch a global setting between "always ASCII",
"always readable", and "use the TLD list".

This is obviously nowhere near a complete solution, but I think it
would improve the situation significantly, and it is very simple--there
is no fancy UI with colors and fonts to design, no character table to
design, and the code changes can be narrowly focused (in theory, but
I'm not familiar with the code).  Importantly, this measure would avoid
penalizing the communities centered around sites in responsible TLDs.
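
A minimal sketch of the TLD-list decision described above (the function name,
the example list, and the wiring are assumptions for illustration, not
existing Mozilla code):

  #include <iostream>
  #include <set>
  #include <string>

  // Decide, per TLD, whether a domain name may be displayed in readable
  // (Unicode) form. With a blacklist, listed TLDs are forced to ASCII/ACE
  // form; with a whitelist, only listed TLDs are shown readable.
  bool ShowAsUnicode(const std::string& host,
                     const std::set<std::string>& tlds,
                     bool listIsBlacklist) {
    std::string tld = host.substr(host.rfind('.') + 1);
    bool listed = tlds.count(tld) != 0;
    return listIsBlacklist ? !listed : listed;
  }

  int main() {
    std::set<std::string> blacklist = {"com", "net"};  // suggested default
    // When this returns false, the browser would display the ToASCII()
    // form instead of the ToUnicode() form.
    std::cout << ShowAsUnicode("www.example.com", blacklist, true)    // 0
              << ShowAsUnicode("www.example.org", blacklist, true);   // 1
  }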

Also, I'm surprised that comment #82 got no responses:

> IMO the proper fix for the ssl case (https://paypal.com) is to remove
> the UserTrust network certificate from the store.  Obviously they are
> not doing their job and therefore they shouldn't be trusted.

I don't really understand the SSL trust model, but this sounds like an
interesting idea.  What exactly is UserTrust's job?  What exactly does
the certificate they issued to the bogus paypal supposedly assert?

Switching topics, I'd like to say something about Nameprep, since people
have mentioned using it for various purposes that aren't clear to me.
Nameprep is intended as a generalization of tolower, which converts
uppercase ASCII letters to the corresponding lowercase ASCII letters,
and leaves other ASCII characters unchanged.  For ASCII domain names,
there are certain situations where it is appropriate to call tolower.
For IDNs, Nameprep plays the analogous role.  In fact, Nameprep behaves
exactly like tolower when its input is ASCII, so you can simply replace
tolower with Nameprep for all domain names.

The important point here is that Nameprep is appropriate *only* in
situations where tolower was already appropriate for ASCII domain names.
If you're in a situation where you wouldn't want to apply tolower to an
ASCII domain name, then you shouldn't be applying Nameprep to an IDN
either.

Usually tolower is not applied to domain names for display purposes; it
is used internally for doing case-insensitive comparisons.  Comparison
of IDNs is done using ToASCII followed by tolower, and ToASCII uses
Nameprep internally.
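
To make the tolower analogy concrete, here is a sketch of the ASCII case -- a
case-insensitive domain comparison, which is exactly the role Nameprep
generalizes for IDNs (hypothetical helper, not browser code):

  #include <algorithm>
  #include <cctype>
  #include <iostream>
  #include <string>

  // Case-insensitive comparison of ASCII domain names; Nameprep reduces to
  // exactly this when its input is ASCII.
  bool SameDomain(std::string a, std::string b) {
    auto lower = [](std::string& s) {
      std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
      });
    };
    lower(a);
    lower(b);
    return a == b;
  }

  int main() {
    std::cout << SameDomain("www.CS.Berkeley.EDU", "www.cs.berkeley.edu");  // 1
  }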

Firefox seems to use tolower or Nameprep for display:  When I type
http://www.CS.Berkeley.EDU/ into the location bar, it gets changed to
http://www.cs.berkeley.edu/.  I'm not sure that's consistent with the
spirit of the domain name specs.  If DNS servers and resolvers are
required to preserve case when possible, why is my browser altering it?
I suppose there might be a good reason, but I find this surprising (even
though I generally think domain names look better in all lower case).
"It's the registry who sets and enforces the policies
regarding which names are allowed; the registrars have no control over
that.  So let's stop picking on the registrars."

That's an oversimplification though. It is the registrars that apply (or fail to
apply) the policies in the first instance. The registries then enforce (or fail
to enforce) their policies with registrars that are not implementing them properly.

If the registry is failing to enforce good policy, then the position may well be
that some registrars are better than others. Even if the registry is doing the
enforcing, then there will be a lag between registrars allowing bad
registrations and the registry getting the registrar to correct the problem.  So
while it's no good picking on registrars exclusively, they do have a role to play.
Commercial registrars generally do not enforce policies, and simply rely on the
registry to perform the necessary checks. And rightfully so, since things can
get pretty hairy when it comes to IDN. After all, it is the registry's
responsibility to ensure that no rogue names exist in its database.

(In reply to comment #195)
> Website: http://PEMOHT.ru/

Sergey, thank you for all this info and good work! I would just like to point
out one thing regarding IDNs. They use a spec called stringprep that includes
lowercasing, so it may be more interesting for you to find existing Russian
domain names in ASCII that only contain lowercase. It is impossible to
register IDNs with uppercase Cyrillic (unless the registry is breaking the
IDN rules).
(In reply to comment #191)
> 
> Warning: www.paypal.com contains some characters from international alphabets. 
> Some international characters look very similar or the same as each other which
> may be used to spoof web site addresses. _More information_

Thanks for your feedback. In addition to the (extended) popup message and the
changed statusbar icon, I have also added an icon in the location bar.

http://4t2.cc/mozilla/idn/
(In reply to comment #197)
> Firefox seems to use tolower or Nameprep for display:  When I type
> http://www.CS.Berkeley.EDU/ into the location bar, it gets changed to
> http://www.cs.berkeley.edu/.  I'm not sure that's consistent with the
> spirit of the domain name specs.  If DNS servers and resolvers are
> required to preserve case when possible, why is my browser altering it?
> I suppose there might be a good reason, but I find this surprising (even
> though I generally think domain names look better in all lower case).

I don't really know why Firefox lowercases the ASCIIs. It might just have
fallen out of the IDN work. It may be against DNS conventions too. However,
one advantage is that you can more easily spot the difference between
capital I and lowercase l if you lowercase the name.
*** Bug 283013 has been marked as a duplicate of this bug. ***
Depends on: 283016
Ok, so rip this comment to shreds if you want (I can take it) -- or maybe it's
right -- or maybe it will spark a better thought from someone else -- but it's
worth thinking about.

I thought of it as I was waking up this morning; it seems simple (which, if it's
correct, would be nice), but it could be oversimplifying.

What I'm thinking is:
* Don't we just have to worry about MIXED character encodings?
* Can you spoof the string "paypal" (for example) without mixed encodings?
* What COULD you spoof without mixed encodings? (the Russian, maybe?)
> 
> I thought of it as I was waking up this morning; it seems simple (which, if it's
> correct, would be nice), but it could be oversimplifying.
> 
> What I'm thinking is:
> * Don't we just have to worry about MIXED character encodings?
> * Can you spoof the string "paypal" (for example) without mixed encodings?
> * What COULD you spoof without mixed encodings? (the Russian, maybe?)

Oh, well, off the top of my head:

asap, ascii, arab, arabia, arabic, arabs, archie, aries, asia, bach, ceo, cpu,
cpus, cray, crays, europe, ieee, jr, ok, os, ohio, pc, pcs, popek, popeks, rcs,
rsx, rick, roy, sccs, sr, usc, xeroxes, york, yorker, yorkers, yorks, aback,
abase, abaser, abases, abash, abashes, abbe, abbey, abbeys, abhor, abhorrer,
abhors, abjure, abjurer, abjures, abscess, abscesses, abscissa, abscissas,
absorb, absorber, absorbs, abuse, abuser, abusers, abuses, abyss, abysses,
acacia, access, accesses, accessories, accessory, accrue, accrues, accuracies,
accuracy, accuse, accuser, accusers, accuses, ace, acer, aces, ache, aches,
acre, acres, across, aerobic, aerobics, aerospace, ah, air, airer, airers,
airier, airs, airship, airships, airspace, airy, ajar, apace, ape, aper, apes,
apex, apexes, aphasia, aphasic, apiaries, apiary, apiece, apish, apocrypha,
appear, appearer, appearers, appears, appease, appeaser, appeases, appraise,
appraiser, appraisers, appraises, apprise, appriser, apprisers, apprises,
approach, approacher, approachers, approaches, apropos, apse, apses, apsis, arc,
arch, archaic, archbishop, archer, archers, archery, arches, arcs, are, area,
areas, ares, arise, ariser, arises, ark, arose, arouse, arouses, arrack, array,
arrayer, arrays, arrears, arroyo, arroyos, as, ascribe, ascribes, ash, asher,
ashes, ashore, ask, asker, askers, asks, asp, asper, asphyxia, aspic, aspire,
aspirer, aspires, ass, assay, assayer, assayers, asses, assess, assesses,
assessor, assessors, assure, assurer, assurers, assures, aura, auras, aurora,
auspice, auspices, auspicious, ax, axe, axer, axers, axes, axis, aye, ayer,
ayers, ayes, babe, babes, babies, baby, babyish, back, backache, backaches,
backer, backers, backpack, backpacker, backpackers, backpacks, backs, backspace, ...
(In reply to comment #205)

Or, with a tighter definition of "homograph", using only the really good ones:

asap, ascii, asia, ceo, ieee, os, pc, pcs, sccs, acacia, access, accesses, ace,
aces, apace, ape, apes, apex, apexes, apiece, appease, appeases, apse, apses,
apsis, as, asp, aspic, ass, assay, asses, assess, assesses, ax, axe, axes, axis,
aye, ayes, cap, cape, capes, caps, case, cases, cease, ceases, coax, coaxes,
cocoa, coo, coop, coops, cop, cope, copes, copies, cops, copse, copses, copy,
ease, eases, easy, epic, epics, escape, escapee, escapees, escapes, espies,
espy, essay, essays, excess, excesses, excise, excises, expose, exposes, eye,
eyepiece, eyepieces, eyes, ice, ices, icy, is, ix, jay, jeep, jeeps, joy, joys,
oasis, oops, oppose, opposes, ox, pa, pace, paces, papa, pas, pass, passe,
passes, pay, pays, pea, peace, peaces, peas, peep, peeps, pep, pi, pie, piece,
pieces, pies, pipe, pipes, ****, ****, poise, poises, pop, pope, popes,
poppies, poppy, pops, pose, poses, possess, possesses, pox, poxes, sap, saps,
say, says, ...

(In reply to comment #206)

And going the other way, to Russian, I can do:

ага, гарь, гор, гора, горах, горгор, горе, гору, грех, его, орех, ореха, рас,
раса, расе, рог, рога, рогах, рос, роса, росе, сер, сера, серо, серого, серое,
ссора, ссоре, ссору, сух, сухо, сухого, сухое, угас, ура, уха, ухо, уху, хаосе,
хор, царь, ...
(in reply to comment 200)

Ok, Erik, here they are (use UTF-8 to read Russian):

http://caxap.ru/ -- сахар
http://coyc.ru/  -- соус
cyxapu.ru        -- сухари (the last letter is homographic in some fonts)
http://yxo.ru/   -- ухо
http://xop.ru/   -- хор
http://nana.ru/  -- папа (homographic in some fonts)



(in reply to comment 204)

Yes, Zachariah, you can easily spoof a Russian IDN without mixing encodings --
see the above reply to comment 200.



(retyping comment 207 in UTF-8)

> ага, гарь, гор, гора, горах, горгор, горе, гору, грех, его,
> орех, ореха, рас, раса, расе, рог, рога, рогах, рос, роса,
> росе, сер, сера, серо, серого, серое, ссора, ссоре, ссору,
> сух, сухо, сухого, сухое, угас, ура, уха, ухо, уху, хаосе,
> хор, царь

Interesting. How do you spoof "царь" without mixing encodings?
Once again, to make it clearer for those who did not read the whole list of
bug comments (which is large already) -- the six existing domains enumerated
above are not spoofs: they reflect the old (pre-IDN) way of registering Russian
domain names, using the Latin alphabet instead of the homographically equivalent
Russian letters. They were useful before IDN and should remain useful, and not
be broken, after any anti-spoofing measure is implemented. Comment 194, with the
idea of majority/minority scripts, seems to be the right way of avoiding harm to
the existing pre-IDN homographs.
On the Unicode mailing list, Rick McGowan <rick@unicode.org> has announced that
a new revision of Draft UTR #36: Security Considerations for the Implementation
of Unicode and Related Technology, is now available at

        http://www.unicode.org/reports/tr36/tr36-2.html

and that comments for official consideration can be made at 

        http://www.unicode.org/reporting.html

The review period closes on May 3, 2005. 
The gTLD registries, several ccTLD registries, and ICANN have posted statements
about IDN abuse that are listed on a resource page that ICANN has just started:

          http://www.icann.org/topics/idn.html

They are also opening a new discussion forum which has a potential advantage
over all the others by being immediately visible to ICANN.
(In reply to comment #209)

Sergey, does .ru currently allow IDN registration? If so, what rules are there?
If not, are they thinking about IDN, and if so, what kind of rules? Thanks!
Unfortunately, I could not find a definitive answer. According to some docs
-- http://info.nic.ru/st/10/out_863.shtml for one -- the question is still
under discussion. However, if I go directly to https://www.nic.ru/dns/ and
enter something like xn--80aswg.ru to register in .Ru, the first three steps are
OK. (I did not finish the process, because I'm not going to spend $20 for the
proof of possibility.)
(translating the above to UTF-8)

http://президент.ru

http://кремль.ru

These are websites registered for the President of Russia; they may be a
technical exception.
(In reply to comment #212)

The text entry window in Bugzilla echoes the Cyrillic characters as they should
appear, but the posted comment uses numeric character references. (Bugzilla
bug?). The latter form is, in fact, the only one of the two that is legal in a
URL. Regardless of the promise that IRI has for remedying this, it still
highlights the need for an LDH format for communicating scripts across cultural
boundaries beyond which they are unlikely to be recognized. Punycode and NCR are
obvious candidates for this role, as far as appearance in a URL goes, and we can
debate which is the uglier. When we get around to printing IDN e-mail addresses
on business cards, the parallel communication of Punycode may prove a necessary
adjunct, with no competition in the aesthetics department.
(In reply to comment #216)

See comment #194 from Jungshik Shin. It is not a Bugzilla bug directly. However,
Bugzilla should add charset=UTF-8 to its HTTP Content-Type response header...
(In reply to comment #217)

I just filed bug 285255 to try to get bugzilla.mozilla.org to announce its
charset as UTF-8.
Erik, Sergey: bug 126266 explains why b.m.o. can't just set the charset to UTF-8.

Gerv
For the record, the new bug form on b.m.o is currently hacked to force UTF-8. 
We can't do that on show bug because of legacy data problems on existing bugs,
so if someone adds the first comment containing non-ascii characters at some
point later than the opening of a new bug it's going to be whatever charset
their browser used.  Please read bug 126266 before making any "but you can just
do ******" comments, and please make any such comments on that bug if you come
up with something that hasn't already been suggested and shot down there already :)
Dear all,

I just thought I'd try to summarize a number of the threads in this discussion
in one place; please bear with me if I'm repeating the obvious in some places.
I've broken the discussion into two parts: "global issues" and "threat analysis
and possible solutions".

== Global issues ==

1. The homograph problem is in the eye and brain of the user, and is therefore
necessarily a fuzzy and subjective problem.

2. Because of the above, we can therefore only _approximately_ solve this
problem.  However, that approximation can be very good indeed, and there's no
reason why we should not aim for near-perfection in a solution. _We should think
in terms of probabilities as engineering targets_.

3. Many parties are involved in this, and every one of them will have to
contribute to the solution. They each have different constituencies, policies,
interests, and technical constraints. Fortunately, the problem is also
multi-dimensional, and its solution can be sliced up in such a way that each
group can contribute something to the mix. Although none of these sub-solutions
can be perfect (see above), they can together provide multiple opportunities for
catching homographs, allowing a very high probability of the overall solution
working for any given TLD label.

4. Punycode display eliminates the homograph problem for many purposes, but also
defeats the usefulness of IDN at the same time: still, at least it does not
break links to IDN websites. It's the least-worst fix until we can do something
better. It may also have long-term dangers when IDNs become widespread (see below).

5. The homograph problem is a _combinatorial_ problem. Increasing the size of
the character set from 37 to 40,000 has caused a disproportionate exponential
explosion in the number of possible homograph combinations. Applying
restrictions in a number of intersecting ways will enable us to exponentially
_implode_ those possibilities again.

6. The consensus appears to be that only top and second-level labels matter: top
labels are not currently a problem (but they may be when "full" IDN arrives).
Users are by now well-accustomed to interpreting second-level labels as
identifying a commercial or other entity.

7. The above is good, because it means that we can make everything hinge on the
TLD registries as trust brokers. Doing things on a per-TLD basis allows the
registry part of the solution to scale horizontally, so registries with
effective policies can be unblocked ASAP. It also deals effectively with the
case of non-compliant registries, and market pressure (non-IE market share
heading for 10% and beyond) will do the rest.

8. No-one is talking about the timescale for a fix, or what the definition of "a
fix" would be. What is the expected timescale: a month, three months, six
months, a year, five years? Again, slicing the problem up will allow multiple
bodies to move forwards on multiple tracks, and the browser vendors can act as
gatekeepers for their users to decide what is "good enough".

== Threat analysis and possible solutions ==

Here are the major threats:

* Writing-system-mixing homographs: for example, Cyrillic 'a' in Latin 'paypal'.
 Partial solution:  make sure that individual domain names are allocated from
character sets without internal homographs. [Only needs internal inspection of
each sub-character-set for homographs, so vastly less work than checking the
whole Unicode set for homographs]. At the moment, the ICANN rules justify this
on the basis of language assignments, but it's really about forbidding
unnecessary script mixing. (Note that I say "writing system" here: a single
writing system can use several scripts: for example, Japanese uses four scripts,
but they are not mutually confusable).

* Non-writing-system-mixing homographs: for example, Cyrillic 'assay.tld' vs.
Latin 'assay.tld'. These are less easy to forge, as the structure of languages
provides some entropy that makes collisions less likely than with cross-script
attacks. However, they still exist, and we cannot rely on users to select "safe"
names. Partial solution: bundling at the registry. [Needs a global homograph
list, but is fairly tolerant of error in this list; for example, the above
contains homographs for 'a', 's', and 'y', three characters. If we had a
homograph list that was 95% accurate, we'd have a probability of 1-0.05**3 =
0.99987 of catching this. A list with 98% accuracy would have a 0.99999
probability of catching it. Clearly the rule here would be: if you have a high
value domain, make sure it has lots of different characters in it].

Note that we should distinguish 'blocking' bundling, where registration of new
homographs is blocked to anyone but the registrant of the 'root' name, from
'permissive' bundling, where all the homographs actually resolve to the same
place as the root name. In the case of 'grandfathered' names, we would need some
procedures to resolve conflicts: perhaps where two root names exist in a
homograph tree, neither of the registrants should be allowed to register new
names, or the first registrant should prevail?

Note also that bundling can also mop up the remaining within-writing-system
homographs, if, for example, a new exploit was later found (for example, on the
lines of "rn" for "m", something simple homograph tables could not catch).

[An aside: even if a super-paranoid browser could have the full
homograph-risk-detection algorithm built in, it would not solve the problem of
non-writing-system-mixing homographs, because it could not resolve which was the
"real" name, and which the "fake" name.]

* Attacks on protocol characters: fake slash, dot, hash, percent characters and
so on. This is a severe risk, that allows forgery of TLDs and other evil attacks
that subvert some of the other solutions above. Partial solution: make these
characters illegal at both the browser and registry end. (Belt and braces). How
can we know we've got them all? Someone's got to check really seriously through
the entire Unicode character set. However, it's only a book-length volume, with
most of it being CJK characters; you could do it in a few days, particularly if
you could a priori ignore many character ranges (see below). Caveat: what if
someone's language actually _requires_ a character that looks like a protocol
character: what do we do then? (This is where intersection with per-label
character set restriction may help).

* In general: any restrictions we can make on character repertoires, either by
conservative whitelisting or aggressive blacklisting (preferably both) reduce
the combinatorial possibilities for homograph attacks by many orders of
magnitude, as well as making the generation of accurate homograph lists much easier.

In particular, there are wide ranges of characters which exist only for
round-trip compatibility reasons with old character sets, such as the Videotex
characters, box graphics, dingbats, and presentation forms for various
alphabets. There is no reason why we should support these. Perhaps the Unicode
people can give us an official list of "deprecated" code points?

* Chinese characters are a special case, because of the tens of thousands of
characters in the CJK repertoire, as well as cultural concerns, such as
traditional/simplified and Japanese/Chinese versions of the same characters.
This is a _huge_ problem requiring scholarly expertise in oriental languages
that the people in this discussion do not have. There are groups working on
solving this: let's let them get on with solving it: their current solution
seems to revolve around bundling. Fortunately, CJK characters look so different
from other scripts that this should not stop us from attacking the homograph
problem for alphabetic scripts and syllabaries.

* Note that Punycode itself can be an attack vector: if users who really need to
access IDN sites become used to clicking on "xn--ASCII NONSENSE.tld", and don't
bother to understand or remember the ASCII nonsense, there is a chance that they
may be fooled into visiting "xn--OTHER ASCII NONSENSE.tld" at a later date:
particularly if they do not read the Latin alphabet as their native script. (For
example, to me, Thai script just looks like squiggles; it's entirely possible
that to many Thai people, Latin script may also look like squiggles). For this
reason, it makes sense to get the registries to sort their end as soon as
possible: Punycode is an excellent mitigation technique that works best for
Latin script readers in a world where > 99.9% of all domains are currently ASCII
LDH-only, but it is not a panacea for the long run, when I expect that at the
very least 50% of all domains will be IDNs.
>
> 2. Because of the above, we can therefore only _approximately_ solve this
problem.  However, that approximation can be very good indeed, and there's no
reason why we should not aim for near-perfection in a solution. _We should think
in terms of probabilities as engineering targets_.
>
> 3. Many parties are involved in this, and every one of them will have to
contribute to the solution. They each have different consituencies, policies,
interests, and technical constraints. Fortunately, the problem is also
multi-dimensional, and its solution can be sliced up in such a way that each
group can contribute something to the mix. Although none of these sub-solutions
can be perfect (see above), they can together provide multiple opportunities for
catching homographs, allowing a very high probability of the overall solution
working for any given TLD label.


Just to elaborate this point slightly more, this gets around two major
objections to a timely, workable solution:

* it means that no-one can duck out of providing their piece of the solution, on
the basis that someone else should solve the problem "perfectly" at their end;
since a layered solution is required, everyone must contribute to make the
overall reliability of the system as high as possible.
* it takes the teeth out of objections to other people's solutions on the basis
that they are not perfect, and that no solution can be implemented until it is
perfect. Proposed solutions can still be criticised by comparing them against
proposals for better solutions, but they cannot be stalled by comparing them
against hypothetical (but unspecified) perfect solutions.

My proposed reliability target? A five nines minimum requirement for SLDs with
three distinct letters; this corresponds to a reliability target of > 98% for
the global homograph list. So, out of the 11195 non-Han, non-Hangul characters
in Unicode 3.2, that's a target of no more than 223 missed between-script
homographs. If we can reduce that to (say) no more than 50, then the
three-character reliability estimate is roughly (1-(50/10000)**3) = 0.99999987,
which is almost seven nines.

Of course, Chinese is another matter entirely, but I believe that substantial
efforts are being devoted to provide a reliable solution for Chinese characters.

OK, you're probably getting bored now, but here are some calculations, of the
sort that I hope will cast some more light on the problem.

For different amounts of coverage of the homograph list, assuming perfect
bundling and statistical independence, using an English word list as a source of
statistics for an estimate of the relative probabilities of different numbers of
distinct characters in typical labels, and assuming there are currently 50
million domains registered (Source: http://www.whois.sc/internet-statistics/), I
get the following:

Homograph list reliability  Est. antispoof reliability  Est. vulnerable domains
95%                         99.998806%                  596
98%                         99.999761%                  119
99%                         99.999911%                   45
99.5%                       99.999962%                   19
99.75%                      99.999983%                    9
99.9%                       99.999993%                    3

Note that this takes the ultra-cautious definition that if "homograph list
reliability" is 95%, fully 5% of the remaining characters are uncaught homographs.

Note that at the bottom end, the stats are entirely dominated by domain names
with one and two distinct characters. Make the requirement that SLD labels need
to have at least two distinct characters, and I get:

Homograph list reliability  Est. antispoof reliability  Est. vulnerable domains
95%                         99.999124%                  438
98%                         99.999888%                   56
99%                         99.999974%                   13
99.5%                       99.999994%                    3
99.75%                      99.999998%                    0.76
99.9%                       99.999999%+                   0.12

Again, the stats are dominated by the names with the smallest number of distinct
characters.

Finally, making the requirement that SLD labels have at least _three_ distinct
characters (but are otherwise distributed as normal for English), I get:

Homograph list reliability  Est. antispoof reliability  Est. vulnerable domains
95%                         99.999723%                  139
98%                         99.999984%                    8
99%                         99.999998%                    0.95
99.5%                       99.999999%+                   0.11
99.75%                      99.999999%+                   0.014
99.9%                       99.999999%+                   0.00092
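
For reference, the per-label arithmetic behind these estimates: a label whose
k distinct characters all have uncaught homographs is spoofable with
probability (1-r)^k for homograph-list reliability r (assuming independence,
as above); the table values are higher because they aggregate over the English
word statistics, which are not reproduced here. A sketch of the k = 3 floor:

  #include <cmath>
  #include <cstdio>

  int main() {
    // Catch probability for a label with exactly three distinct
    // characters: 1 - (1 - r)^3.
    const double reliabilities[] = {0.95, 0.98, 0.99, 0.995, 0.9975, 0.999};
    for (double r : reliabilities)
      std::printf("list %7.2f%%  ->  per-label %10.6f%%\n", 100 * r,
                  100 * (1 - std::pow(1 - r, 3)));
  }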

Finally, as an illustration, I list below the numbers of distinct characters in
the top 100 domains, according to alexa.com, sorted by number of distinct
chars, and making reasonable assumptions about which part of the DN is allocated
by the TLD registrar. If I was the BBC, CNN, go.com, goo.com, or qq.com, I'd be
nervously eyeing up the Unicode tables.

Although 1 or 2-distinct character strings appear to be a disproportionately
large part of this list, the threat is not as big as it seems: apart from people
who insist on registering "aaaaaaa.tld" (which tend not to be memorable: was
that six 'a's or seven?), most diversity-poor labels will tend to be the very
short ones. One-letter ones are forbidden by RFC, so there are, for example,
only 52022 possible two and three-letter LDH domains currently available to be
registered, roughly 0.1% of the 50 million currently registered domains. Of
these, only 5402 have fewer than three distinct characters, about 0.01% of the
total registered domains, and only 36 of them will have only one distinct
character. If we assume all of these are registered and that we have only a 95%
accurate homograph list, then the expected number of spoofable domains in this
class will be approximately 36 * 0.05 + (5402-36) * 0.05 * 0.05 = 15 spoofable
domains. Making the homograph list 98% accurate reduces this to a more
comfortable 2.8 domains, and 99.5% accurate would reduce it to an expected 0.31
domains per TLD in this elite set of registrations, mostly consisting of the
risk associated with the single-character-repetition domains such as "aaa.tld".

So, in conclusion, a statistical approach to risk estimation and mitigation can
be very powerful, and (in principle) reduce spoofing risks to very low levels.
Most of the risk is concentrated in labels with low levels of character
diversity, information which may be useful to potential registrants who wish to
avoid spoofing. It would be interesting to perform this kind of analysis on some
real TLD registry data.

[data follows: "high value" targets marked as "***"]

= The top 100 websites, as per Alexa.com, sorted by number of distinct chars in
TLD-registrar-allocated label =

== Few distinct characters, look out! ==
www.qq.com, 1 distinct chars in 'qq'
www.bbc.co.uk, 2 distinct chars in 'bbc'
www.cnn.com, 2 distinct chars in 'cnn'
www.go.com, 2 distinct chars in 'go'
www.goo.ne.jp, 2 distinct chars in 'goo'

== >= 3 distinct characters, lower risk ==
www.126.com, 3 distinct chars in '126'
www.163.com, 3 distinct chars in '163'
www.aol.com, 3 distinct chars in 'aol' ***
www.ask.com, 3 distinct chars in 'ask'
www.avl.com.cn, 3 distinct chars in 'avl'
www.dell.com, 3 distinct chars in 'dell' ***
www.free.fr, 3 distinct chars in 'free'
www.msn.co.jp, 3 distinct chars in 'msn' ***
www.msn.com, 3 distinct chars in 'msn' ***
www.nba.com, 3 distinct chars in 'nba'
www.tom.com, 3 distinct chars in 'tom'
www.uol.com.br, 3 distinct chars in 'uol'
www.21cn.com, 4 distinct chars in '21cn'
www.3721.com, 4 distinct chars in '3721'
www.alibaba.com, 4 distinct chars in 'alibaba'
www.apple.com, 4 distinct chars in 'apple'
www.daum.net, 4 distinct chars in 'daum'
www.ebay.co.uk, 4 distinct chars in 'ebay' ***
www.ebay.com, 4 distinct chars in 'ebay' ***
www.ebay.com.cn, 4 distinct chars in 'ebay' ***
www.ebay.de, 4 distinct chars in 'ebay' ***
www.google.ca, 4 distinct chars in 'google' ***
www.google.co.jp, 4 distinct chars in 'google' ***
www.google.co.uk, 4 distinct chars in 'google' ***
www.google.com, 4 distinct chars in 'google' ***
www.google.de, 4 distinct chars in 'google' ***
www.google.es, 4 distinct chars in 'google' ***
www.google.fr, 4 distinct chars in 'google' ***
www.hkjc.com, 4 distinct chars in 'hkjc' ***
www.imdb.com, 4 distinct chars in 'imdb'
www.myway.com, 4 distinct chars in 'myway'
www.nate.com, 4 distinct chars in 'nate'
www.sina.com, 4 distinct chars in 'sina'
www.sina.com.cn, 4 distinct chars in 'sina'
www.sina.com.hk, 4 distinct chars in 'sina'
www.sohu.com, 4 distinct chars in 'sohu'
www.taobao.com, 4 distinct chars in 'taobao'
www.xanga.com, 4 distinct chars in 'xanga'
www.yahoo.co.jp, 4 distinct chars in 'yahoo' ***
www.yahoo.com, 4 distinct chars in 'yahoo' ***
www.about.com, 5 distinct chars in 'about'
www.aisex.com, 5 distinct chars in 'aisex'
www.allyes.com, 5 distinct chars in 'allyes'
www.amazon.com, 5 distinct chars in 'amazon' ***
www.atnext.com, 5 distinct chars in 'atnext'
www.baidu.com, 5 distinct chars in 'baidu'
www.china.com, 5 distinct chars in 'china'
www.gator.com, 5 distinct chars in 'gator'
www.hinet.net, 5 distinct chars in 'hinet'
www.lycos.com, 5 distinct chars in 'lycos' ***
www.match.com, 5 distinct chars in 'match'
www.naver.com, 5 distinct chars in 'naver'
www.sex141.com, 5 distinct chars in 'sex141'
www.yisou.com, 5 distinct chars in 'yisou'
www.adserver.com, 6 distinct chars in 'adserver'
www.comcast.net, 6 distinct chars in 'comcast'
www.download.com, 6 distinct chars in 'download'
www.hao123.com, 6 distinct chars in 'hao123'
www.hkflash.com, 6 distinct chars in 'hkflash'
www.neopets.com, 6 distinct chars in 'neopets'
www.overture.com, 6 distinct chars in 'overture'
www.passport.net, 6 distinct chars in 'passport'
www.pchome.com.tw, 6 distinct chars in 'pchome'
www.poptang.com, 6 distinct chars in 'poptang'
www.weather.com, 6 distinct chars in 'weather'
www.blogspot.com, 7 distinct chars in 'blogspot'
www.chinaren.com, 7 distinct chars in 'chinaren'
www.infoseek.co.jp, 7 distinct chars in 'infoseek'
www.livedoor.com, 7 distinct chars in 'livedoor'
www.myspace.com, 7 distinct chars in 'myspace'
www.netscape.com, 7 distinct chars in 'netscape'
www.nytimes.com, 7 distinct chars in 'nytimes'
www.pconline.com.cn, 7 distinct chars in 'pconline'
www.rakuten.co.jp, 7 distinct chars in 'rakuten'
www.sayclub.com, 7 distinct chars in 'sayclub'
www.webshots.com, 7 distinct chars in 'webshots'
www.casalemedia.com, 8 distinct chars in 'casalemedia'
www.craigslist.org, 8 distinct chars in 'craigslist'
www.fastclick.com, 8 distinct chars in 'fastclick'
www.friendster.com, 8 distinct chars in 'friendster'
www.mapquest.com, 8 distinct chars in 'mapquest'
www.mediaplex.com, 8 distinct chars in 'mediaplex'
www.microsoft.com, 8 distinct chars in 'microsoft' ***
www.net-offers.net, 8 distinct chars in 'net-offers'
www.xinhuanet.com, 8 distinct chars in 'xinhuanet'
www.coolmanmusic.com, 9 distinct chars in 'coolmanmusic'
www.doubleclick.com, 9 distinct chars in 'doubleclick'
www.netvigator.com, 9 distinct chars in 'netvigator'
www.newsgroup.com.hk, 9 distinct chars in 'newsgroup'
www.offeroptimizer.com, 9 distinct chars in 'offeroptimizer'
www.searchscout.com, 9 distinct chars in 'searchscout'
www.adultfriendfinder.com, 10 distinct chars in 'adultfriendfinder'
www.internet-optimizer.com, 10 distinct chars in 'internet-optimizer'
www.mywebsearch.com, 10 distinct chars in 'mywebsearch'
www.tribalfusion.com, 11 distinct chars in 'tribalfusion' 
Note that the calculations above assume the existence of a homograph table with
a certain _absolute_ uncaught error rate: that is, a 95% reliability means that
_fully 5%_ of the entire Unicode character set are homographs.

* Firstly, the number of characters that are potential homographs is almost
certainly less than 10% of the Unicode repertoire (ignoring CJK). If our actual
error rate is measured relative to a population of 10% of codepoints being
actual homographs, only getting 95% of _real_ homographs will actually mean that
only 0.5% of the Unicode codepoints will be uncaught homographs, corresponding
to a reliability of 99.5% in the charts above.

* Even if a given label consists entirely of homographs, the correct combination
of all of the necessary homographs may not be available in any other single
legal label's character set, because of character set restrictions on those
other labels. Any other combinatorial limit on the number of possible
combinations of characters will have the same effect. This may account for a
substantial improvement in the risk estimates. The calculations to go into this
in detail are somewhat involved, though. Still, nothing a computer algebra
system can't cope with.

* We can easily estimate both the number of homographs, and the reliability of
their classification, by using several different independently-compiled lists,
and performing capture-recapture analysis.

* Chance spoof rates will be lower again, due to the fact that you need _2_
spoof candidates to get a confusion pair, and stuff to do with Poisson
statistics and finite population sizes (p[2 or more spoofable] within a single
spoof-set is less than expectation rate, for values of expectation < 1); this
means that the legacy bundling conflict problem may not be as bad as feared,
even with some errors in the initial homograph list used to construct bundles.

* However, a more sophisticated analysis would take into account some of the
well-known human perception phenomena that may make names more spoofable, such
as a tendency to ignore minor typos when speed reading. We probably need to have
a better-defined criterion for the reasonable minimum perceptible difference
between two labels, probably in terms of the number of unique non-spoofed
characters that need to be present. This would tend to wind the figures back
upwards again.

Would anyone be interested in having a list of possible homographs constructed,
or a more detailed analysis performed?

Now it's time for coffee. 
If people attempt phishing with homograph attacks, they rely on people being
familiar with the name of a website. Of course, a homograph attack that spoofs
an unknown site is possible, but it would make little sense (in that
case, the homograph aspect of the fraud would not actually help the
perpetrators). There might be targeted phishing attacks against specific people
or groups of people, but this is very difficult to do on a large scale. So, for
untargeted phishing attacks that can easily be done with spamming, the
perpetrators have to use widely known websites. Also, the problem of spoofing
does not concern all kinds of websites equally. Therefore, it might make sense
to make a list of the most important potential victims of spoofing (maybe a few
hundred common websites, maybe a few thousand), and the browser should give a
strong warning and not proceed without user action when something looks like a
spoof of such a website.
I think this is not only a temporary solution: even if we have a relatively good
way of treating homographs in general, it would probably make sense to make such
a distinction. There can be legitimate reasons for mixing scripts (e.g.
XML-документы for collections of Russian XML documents) and there can be other
reasons for false alerts. Therefore, the warnings in the general case should not
be too disruptive for people who use these possibly legitimate domains. On the
other hand, if a domain name looks the same as a well-known website, such as
Paypal, the warning and disruption level should be different. I think, in the
end, using a list of major well-known websites will be useful, whatever the
solution for the general problem is.
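
A sketch of the lookup such a warning could use: fold known cross-script
homographs down to a Latin "skeleton" and compare against the list of
high-value sites. The four-entry table is an illustrative fragment, not a real
confusables list, and the check would only be applied to names that contain
non-ASCII characters, so the genuine sites are unaffected:

  #include <iostream>
  #include <map>
  #include <set>
  #include <string>

  // Fold a handful of known Cyrillic-Latin homographs to their Latin twins.
  std::string Skeleton(const std::u32string& host) {
    static const std::map<char32_t, char> kHomograph = {
        {0x0430, 'a'}, {0x0435, 'e'}, {0x043E, 'o'}, {0x0440, 'p'}};
    std::string out;
    for (char32_t c : host) {
      auto it = kHomograph.find(c);
      if (it != kHomograph.end())
        out += it->second;
      else if (c < 0x80)
        out += static_cast<char>(c);
      else
        out += '?';  // unknown non-ASCII stays visibly different
    }
    return out;
  }

  bool LooksLikeHighValueTarget(const std::u32string& host) {
    static const std::set<std::string> kTargets = {"www.paypal.com",
                                                   "www.ebay.com"};
    return kTargets.count(Skeleton(host)) != 0;
  }

  int main() {
    // The spoof with Cyrillic 'а' (U+0430) folds to "www.paypal.com", so
    // the browser could demand explicit confirmation before proceeding.
    std::cout << LooksLikeHighValueTarget(U"www.p\u0430ypal.com");  // 1
  }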

Of course, this presupposes that the homograph characters are known. Homographs
between Latin, Cyrillic, Greek and Coptic are basically known, but beyond that
it gets more difficult -- this is, of course, necessary both for the general
case and for this suggestion of treating spoofs of "major" websites specially.

I think that the practical problem is not as big as it may seem at first sight.
Theoretically, any homography between characters of any character set is a
problem, but practically, only cases where at least one element of the homograph
pair belongs to a widespread writing system really matter. Phishing attacks
targeting small groups of people are not very likely (though, on the other hand,
the TLD of e-mail addresses may in many cases facilitate country- and
language-specific phishing attacks). This still means that people have to go
through the whole book of Unicode characters, but they only have to check
whether a character looks similar to one of their own writing system.

Are there countries where the IDN system is already widely in use? At least in
Europe, this does not seem to be the case, neither in Central Europe nor in
countries where the Cyrillic alphabet is used. Therefore, looking for characters
that are homographs of Cyrillic characters or of non-ASCII Latin characters is
not urgent, because right now there are hardly any widespread IDN addresses that
many people are so familiar with that it would make sense to spoof them in a
phishing attack (and a Russian site with Latin characters that are homographic
with a Russian word is likely to be a legitimate one). So, what is urgent now is
determining homographs and near-homographs of the ASCII character set.

It seems that widespread adoption of IDN will take enough time that data about
homographs with additional characters - e.g. Cyrillic, non-ASCII Latin
characters in European languages - can be collected in the meantime. Then,
however, the problem will be bigger for Cyrillic than it is for ASCII
characters. While Cyrillic characters in the address of a website where nothing
Cyrillic is expected (e.g. in the language configuration of the browser) are in
any case suspicious, domains with Latin characters will probably remain common
in countries with the Cyrillic alphabet because they have been in use for such a
long time, and as has been pointed out in this discussion, many existing Russian
addresses are legitimate homographs of Russian words created with Latin
characters. Furthermore, there is Serbia, where both the Cyrillic and the Latin
alphabets are used. A good solution would, of course, be if registrars in these
countries prevented the registration of new domains that look the same as
existing ones (taking into account at least the Cyrillic and Latin alphabets;
homographs from other character sets could still be treated as suspicious by the
browser).

When the IDN system is widely adopted, good browsers for people in Eastern
Europe will probably have to display Latin and Cyrillic characters differently,
anyway, not only to prevent phishing attacks, but also just to avoid confusion,
especially with domains that are abbreviations and can easily be accidental
homographs.
(In reply to comment #224)

The idea of displaying Latin and Cyrillic characters differently in domain
names is very interesting. It seems to me that there may already be
conventions in ordinary documents (e.g. email, Web pages), such as the
hyphen between Latin and Cyrillic. I think one Russian person on the Unicode
mailing list even said he had trouble thinking of any examples *without* a
hyphen. Is the hyphen the only way it is done? The best way? What other ways
are used? Thanks.
(In reply to comment #225)
> (In reply to comment #224)
> 
> ... Is the hyphen the only way it is done? The best way? What other ways
> are used? Thanks.

I think someone already struck down color-coding different character sets (which
is what I would have liked, especially if it meant color-coding the whole URI,
not just the domain), but it's an example of an alternative way to tell them apart.
Depends on: 286534
Depends on: 286535
I'm glad that there's so much thought going into this issue, but I think much of
it isn't practical. To the end user, this is a browser-level issue, and we need
to treat it as such.
Next, the problem is much simpler. Forgetting about punycode, can you really
tell that www.paypal.com and www.paypa1.com are different? Can you do it in
Times New Roman? Now add in punycode and a multi-language, multi-charset
world, and we've got a headache. Somehow, the user needs to be alerted to the
possibility that they're going to a different site, but without being too intrusive.

Why not just have two URL boxes, or a URL box and a label next to it:
   www.paypal.com          www.xn--pypal-4ve.com
Somehow, get the UI to look nice, maybe a mouse-over or an alert bubble. For
users going to IDN sites, they need to deal with this. For everyone else, they
can ignore it. It's one better than "just display everything in punycode".

Another note-- most people see this as a problem with people thinking they're at
ASCII sites, but are actually at puny-coded sites, but if you're living in
Spain, going to www.aΙa.com (with GREEK CAPITAL LETTER IOTA, U+0399), but click
on a link that goes to www.aІa.com (with CYRILLIC CAPITAL LETTER
BYELORUSSIAN-UKRAINIAN I, U+0406), then they both look the same, and neither is
ASCII. With the above method, they'd see:
   www.aΙa.com               www.xn--aa-09b.com
They'd probably not know that they're going to the wrong address, but short of
maintaining a list of valid "similar" DNS entries, I don't see any general
solution to this problem.
This bug is on my plate for 1.8, but I'm not exactly working on a solution and
time's ticking.  I have many other important things to do for 1.8, and I'm
personally fine with the current solution of rendering only punycode because I
believe that the IDN spec is pretty broken (homographs of '/' considered valid
-- come on!).

If someone wants to champion a solution for Mozilla that would enable us to
safely enable IDN in some form, then by all means run with it.  I'll help where
I can, but I don't have the time to develop a solution myself.

I'm reducing the severity of this bug to minor because it only applies when the
default preferences are changed.  The original setting of critical was correct
for Firefox 1.0 and earlier Mozilla-based browsers, but it no longer applies.

I half expect my comments to raise a ruckus in this bug.  Please keep any
comments brief and constructive.  Already, this bug report has grown to a length
that would deter most from venturing to read it, let alone actually work on it.
 Not that there aren't plenty of great comments here... let's just keep it that
way ;-)
Severity: critical → minor
Priority: -- → P3
Target Milestone: mozilla1.8beta1 → mozilla1.8beta2
*** Bug 288667 has been marked as a duplicate of this bug. ***
(In reply to comment #228)
> If someone wants to champion a solution for Mozilla that would enable us to
> safely enable IDN in some form, then by all means run with it.  I'll help
> where I can, but I don't have the time to develop a solution myself.

What would need to be done? Would it be enough to maintain a whitelist of TLDs
in all.js, and then, in nsIDNService::Normalize
(http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsIDNService.cpp#253),
changing:

  if (mShowPunycode)
    return ConvertUTF8toACE(input, output);

to something like:

  if (mShowPunycode || !domainIsInWhitelist(input))
    return ConvertUTF8toACE(input, output);

If you think this would be enough, I might take a shot at it...
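
For what it's worth, a sketch of what the hypothetical domainIsInWhitelist()
might look like (the list contents, and reading them from a pref in all.js,
are assumptions, not existing Mozilla API):

  #include <set>
  #include <string>

  // Accept the Unicode display form only when the name's TLD is on a
  // whitelist of registries with sound anti-homograph policies; everything
  // else falls through to the ACE/punycode form.
  static bool domainIsInWhitelist(const std::string& host) {
    static const std::set<std::string> kWhitelist = {"jp", "kr", "de"};  // example
    std::string::size_type dot = host.rfind('.');
    if (dot == std::string::npos) return false;
    return kWhitelist.count(host.substr(dot + 1)) != 0;
  }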
(In reply to comment #0)

A coworker of mine, while reading the IDNA spec (RFC 3490), found a potentially insidious variant on
this vulnerability.  Apparently it is possible to encode the label separator (typically ASCII 0x2E, or '.'),
as well as the other valid label separators specified in RFC 3490, as an HTML entity embedded within
a URL.  This includes U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop),
and U+FF61 (halfwidth ideographic full stop).

A sample of this is here: http://www.sleepwalk.org/279099_test.html

Essentially this means that the following URLs all resolve to www.google.com:
http://www.google&#x002E;com
http://www.google&#x3002;com
http://www.google&#xFF0E;com
http://www.google&#xFF61;com

This is insidious because it's somewhat different from the homograph attack described in the bug.
Instead of a URL using punycode to look like something it isn't (and thus redirecting a user to a location
different from the expected destination), there is an underlying translation going on that makes these
different label separators equivalent!  This is bad because it makes it difficult for software to
programmatically parse URLs if there is a way to obscure the label separator.

Also consider that encoding the separator as an HTML entity is not the only way to obscure the URL.  A
malicious sender could simply insert one of these equivalent separators in UTF-8 (<E3><80><82> for
instance).  See the above sleepwalk.org URL for an example that also resolves to www.google.com.

It seems to be a bug that Firefox is treating these stop characters as equivalent.
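
A sketch of the separator normalization under discussion (the byte strings are
the UTF-8 encodings of U+3002, U+FF0E and U+FF61; illustrative only, not the
actual Gecko code path):

  #include <iostream>
  #include <string>

  // Replace the non-ASCII label separators from RFC 3490 with an ASCII
  // full stop, since the spec requires them to be treated equivalently.
  std::string NormalizeSeparators(std::string host) {
    const char* kDots[] = {"\xE3\x80\x82",   // U+3002
                           "\xEF\xBC\x8E",   // U+FF0E
                           "\xEF\xBD\xA1"};  // U+FF61
    for (const char* dot : kDots) {
      std::string::size_type pos;
      while ((pos = host.find(dot)) != std::string::npos)
        host.replace(pos, 3, ".");
    }
    return host;
  }

  int main() {
    // "www.google<U+3002>com" normalizes to "www.google.com".
    std::cout << NormalizeSeparators("www.google\xE3\x80\x82" "com") << '\n';
  }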
The IDN spec requires that we treat those characters like a period.  That we
display them in the status bar and URL bar in the normalized form is a good
thing:  it means that authors can't get the strange forms that mean the same
thing but look different into the URL bar or status bar.  So early normalization
is a good thing here, and I'm not sure why (or perhaps even whether) you think
otherwise (unless you're testing 1.0 rather than 1.0.1).
(In reply to comment #232)
> The IDN spec requires that we treat those characters like a period.  That we
> display them in the status bar and URL bar in the normalized form is a good
> thing:  it means that authors can't get the strange forms that mean the same
> thing but look different into the URL bar or status bar.  So early normalization
> is a good thing here, and I'm not sure why (or perhaps even whether) you think
> otherwise (unless you're testing 1.0 rather than 1.0.1).

I totally agree that the IDN spec requires these separator characters be declared equivalent by a
compliant application.  However, I disagree with Firefox's handling of data that contain Unicode
characters.  According to the IDN spec, properly-encoded (ACE form) domain names should always 
contain *only* ASCII characters.  Firefox is recognizing malformed domains -- those that contain 8-bit 
data.  

If my IDN was "www.google\u3002com", toASCII() would output "www.google.com."  This output is 
correct and is the ACE form that should appear in data accepted and interpreted by Firefox.  Firefox 
should only accept valid ACE-encoded domains in urls.  By recognizing malformed  (8-bit) domains, 
we're opening up a big hole.  A malicious user could easily obscure a domain in this way.

For a demo of the output of toASCII/toUnicode, see
http://www-950.ibm.com/software/globalization/icu/demo/domain



(In reply to comment #233)
> Firefox should only accept valid ACE-encoded domains in urls.

Why?  It defeats a significant part of the point of IDN if we require authors to
have the ACE in their HTML rather than the Unicode, and has no security
advantages whatsoever unless we're depending on view-source for security, which
we're not.

> By recognizing malformed (8-bit) domains, 
> we're opening up a big hole.  A malicious user could easily obscure a
> domain in this way.

What hole?  We normalize before showing a URL in the status bar, the URL bar, or
even copying to the clipboard (copy link location).  (See attachment 174532 to
test this.)  So there's no way the user will ever see the non-normalized form
unless they view the source of the HTML.
(In reply to comment #234)
> Why?  It defeats a significant part of the point of IDN if we require authors to
> have the ACE in their HTML rather than the Unicode, and has no security
> advantages whatsoever unless we're depending on view-source for security, which
> we're not.

> What hole?  We normalize before showing a URL in the status bar, the URL bar, or
> even copying to the clipboard (copy link location).  (See attachment 174532 to
> test this.)  So there's no way the user will ever see the non-normalized form
> unless they view the source of the HTML.

The primary hole that concerns me is in HTML email, specifically spam/phishing scams/etc.  Anti-spam 
software tends to look at URLs included in messages for suspect domains, from RBLs or other sources.  
By recognizing malformed domains in Firefox (as well as other browsers), we've just created an easy 
way for spammers to get around mail filters.

I suppose that the anti-spam community could modify their programs to parse these malformed 
domains.  

Note also that, according to the IDN spec, "domain name slots" should always contain ACE (ASCII) 
domain labels, the output of toASCII(), and this includes URIs in HTML data:

> A "domain name slot" is defined in this document to be a protocol
> element or a function argument or a return value (and so on)
> explicitly designated for carrying a domain name.  Examples of domain
> name slots include: the QNAME field of a DNS query; the name argument
> of the gethostbyname() library function; the part of an email address
> following the at-sign (@) in the From: field of an email message
> header; and the host portion of the URI in the src attribute of an
> HTML <IMG> tag.

(In reply to comment #235)

RFC 3490 (IDNA) section 3.1 requirement 2 appears to require the Punycode
(ASCII) form, as you say. However, there are implementations that support
numeric character references (e.g. &#x3002;) in domain names in URIs in HTML,
including Mozilla, Opera and i-Nav, I believe. They may have been supporting
this for a while, and there may now be quite a lot of HTML pages out there
that depend on this behavior, so I don't know how realistic it would be to
try to get the implementations to comply with this part of the spec.
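
To make the layering concrete: the numeric character reference is resolved by the HTML parser
before the hostname ever reaches the IDNA code. A Python 3 sketch for illustration, with
html.unescape standing in for the browser's entity handling (this is not Mozilla's actual pipeline):

>>> import html
>>> html.unescape("www.google&#x3002;com")
'www.google。com'
>>> html.unescape("www.google&#x3002;com").encode("idna")
b'www.google.com'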

In any case, this issue is separate from the homograph issue. If you would
like to pursue it, may I suggest filing a separate bug?
(In reply to comment #236)
> (In reply to comment #235)
> In any case, this issue is separate from the homograph issue. If you would
> like to pursue it, may I suggest filing a separate bug?

Done.  Filed as bug 289183. 

I have checked the recent archive of spam on our lab machines (I work at an anti-spam company) and 
have not seen this in the field.  Not yet.  I assume this is because IE doesn't incorrectly interpret these 
malformed domains (I guarantee if it worked this way in IE, we'd see it).  This is definitely the time to fix 
it in Firefox, before the spam starts coming in!
(In reply to comment #237)

Thanks for filing the new bug.

MSIE doesn't support IDNA yet. That's why I mentioned i-Nav (an IDN plug-in
for MSIE).
Another possible new issue related to this bug: see Erik van der Poel's comments
on the IDN mailing list, idn=at=ops=dot=ietf=dot=org.

According to Erik, U+1160, HANGUL JUNGSEONG FILLER, is displayed in IDNs by
Firefox (and presumably other Gecko-based products) as a wide space, and is
therefore a homograph for ASCII space. This is a potentially large security hole
for phishing/spoofing. (The same is apparently true of the Internet Explorer
plug-in).

In a reply to that, Soobok Lee states that U+1160 is not touched by NFC
normalization, and therefore gets through Nameprep/Stringprep. Apparently,
U+1160 is only meaningful in conjunction with Hangul characters, and he
recommends that a standalone U+1160 should always be deleted, regardless of what
the existing IDN standards say.

This also raises interesting questions about stray combining characters in general.
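
A quick check with Python's unicodedata module bears this out: a standalone U+1160 survives both
NFC and the NFKC form that Nameprep actually applies (illustrative snippet):

>>> import unicodedata
>>> unicodedata.normalize("NFC", "\u1160") == "\u1160"
True
>>> unicodedata.normalize("NFKC", "\u1160") == "\u1160"
True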
(In reply to comment #239)

Filed bug 289588 to address the U+1160 font display issue itself.

We need to watch IETF and Unicode to see how they respond to the Korean fillers
and leading combining marks in IDNA/Stringprep.
Depends on: 290275
Darin, is any more work here planned to happen in the next few days? If not,
then this probably needs to get pushed out to 1.8b3 or beyond. 
No, I have no plans to work on this for 1.8b2.  I'm not even sure that I will
have time for Gecko 1.8.  Help would be greatly appreciated.
Target Milestone: mozilla1.8beta2 → mozilla1.8beta3
darin doesn't have any time for this in beta2, and may not have time to get it
into 1.1. 
Flags: blocking1.8b3?
Flags: blocking1.8b2+
Flags: blocking1.8b-
Flags: blocking-aviary1.0.1-
Bug 286534 fixes part of this bug. 

We also need to have a small blacklist of characters which IDN allows but in
fact we never allow because they are confusable with URL delimiters. I don't
know if there is a bug for this yet. This will not cause significant
interoperability problems because none of those characters are in the character
tables of the TLDs which will be whitelisted.


Gerv
The character blacklist issue is bug 283016 and the IDN tracking bug is
bug 237820.
Blocks: sbb-
No longer blocks: sbb?
Flags: blocking1.8b3? → blocking1.8b3-
Can I reassign this bug to someone (gerv, jshin, ?) who is actually working on
this?  Thanks!
I'm not sure that this bug has significant remaining value, but I'll assign it
to me for the moment.

Gerv
Assignee: darin → gerv
Status: ASSIGNED → NEW
Whiteboard: [sg:fix] → [sg:spoof]
Flags: blocking-aviary1.5+
What is required for detecting mixed scripts as outlined in UTR#36?

Does Mozilla have an internal data structure that stores the properties and scripts
of Unicode characters?
jshin: are you able to answer the question in comment #248?

Gerv
gerv, 
The intl library has a (currently disabled) API for Unicode character properties, but it
doesn't have an API for script identification. gfx/src/win has an internal
routine for that, but it's not public yet. Perhaps we should move it to intl,
refine it, and make it accessible to others.
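
For what it's worth, until such an API exists, a rough cross-check is possible from character names
alone. The Python sketch below is illustrative only; the helper names are invented, and deriving a
"script" from the first word of the Unicode character name is just an approximation of the real
Scripts.txt property:

import unicodedata

def rough_script(ch):
    # Approximate the script from the first word of the character
    # name, e.g. "CYRILLIC" from "CYRILLIC SMALL LETTER A".
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def is_mixed_script(label):
    # Flag labels whose letters come from more than one script.
    scripts = {rough_script(c) for c in label if c.isalpha()}
    return len(scripts) > 1

print(is_mixed_script("p\u0430ypal"))  # True: Latin plus Cyrillic 'а'
print(is_mixed_script("paypal"))       # False: all Latin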
(In reply to comment #250)
> Perhaps we should move it to intl, refine it, and make it accessible to others.

Sounds good to me :)
Blocks: 316730
Cross-reference: see bug 316727 for mixed-script detection code, which I currently plan to use to trigger Punycoded display when incompatible scripts are mixed.

This is designed to be consistent with the version 2.0 ICANN IDN recommendations, which directly address homograph spoofing issues.
*** Bug 319397 has been marked as a duplicate of this bug. ***
Flags: testcase+
Flags: in-testsuite+ → in-testsuite?
Regarding blacklisting.
I would be sad if you blacklisted certain characters and/or prevented mixing.
I might want a subdomain in a certain language or with an odd character just for fun.
For example:
http://xn--7xa.m8y.org/
http://φ.m8y.org/
http://xn--h4h.m8y.org/
http://☠.m8y.org/
http://xn--j4j.m8y.org/
http://☢.m8y.org/

Heck.  There are a reasonable number of combinations like that already registered as domains.
They are harmless and fun.  Kind of a nice "extra" for browsers that can handle it.
I would prefer if options like notifying/colouring were not combined with blacklisting.
Or at least, if blacklisting didn't eliminate all non-linguistic symbols.

Thanks.
Oh. And regarding colouring and concerns for the visually impaired.
Even if there were no additional notification text, wouldn't someone using a screen reader get the actual character name for a spoof?
Like, if it looked like an 'i' but was a Unicode char, wouldn't it read the Unicode character name?

I don't know, not having a screen reader handy to test.
We now implement a whitelist of TLDs which have sensible practices.

Gerv
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Any pointers to a bug where that was implemented? Can't seem to find it.
(In reply to comment #257)
> Any pointers to a bug where that was implemented? Can't seem to find it.
> 

bug 286534

http://www.mozilla.org/projects/security/tld-idn-policy-list.html
There's now a useful tool for investigating spoofing at the Unicode Consortium site:

http://unicode.org/cldr/utility/confusables.jsp
Is there a reason why Mozilla wouldn’t whitelist some obviously harmless characters for all TLDs? I am thinking of Latin-1 characters 192 through 255 (excluding 215: ×) for a start. I mean, the current policy makes IDNs practically useless for the most common TLDs, and this would make them work at least for some of the most common Latin-based languages.
We don't have a character whitelist.

Gerv
I see, but why? It doesn’t seem like something that is hard to implement. If I had to guess, it would take some lines in nsIDNService::isInWhitelist, a key like 'network.IDN.whitelist_chars' in all.js, and some definitions.
Consider http://www.paypäl.com/, which uses only the characters you are proposing to whitelist. Yet in spite of being entirely made out of common Latin-1 characters, this is clearly a spoofing risk for http://www.paypal.com/

The registry is potentially in a position to prevent this sort of confusion, since it already knows which domains have already been issued, but a browser-based algorithm is not.
(In reply to comment #263)
> Consider http://www.paypäl.com/, which uses only the characters you are
> proposing to whitelist. Yet in spite of being entirely made out of common
> Latin-1 characters, this is clearly a spoofing risk for http://www.paypal.com/

This is not a particularly strong argument; otherwise you'd have to disallow 0/o/O or 1/I/l for being too similar. And German readers, for example, aren't likely to take an ä for an a anyway...

> The registry is potentially in a position to prevent this sort of confusion,
> since it already knows which domains have already been issued, but a
> browser-based algorithm is not.

Most homograph attacks are based on similar characters from different "scripts" or "alphabets", e.g. "a" (0x61) vs. "а" (0x430). That's why .eu IDNs must not mix Latin, Greek, and Cyrillic characters.
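
(An illustrative Python check: the pair renders identically in many fonts but is trivially
distinguishable at the codepoint level.)

>>> "a" == "\u0430"        # Latin a vs Cyrillic а
False
>>> [hex(ord(c)) for c in "a\u0430"]
['0x61', '0x430']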
> otherwise you'd have to disallow
> 0/o/O or 1/I/l for being too similar

You're right about that: it's actually quite a strong argument for disallowing those at the registry, _in addition_ to non-ASCII confusables. In my opinion, the registries should do exactly that.

Regarding German users: yes, German users might well be more likely to see the umlaut than others, but 

1) most Internet users are not literate in German, and 
2) even German-literate readers will generally read what they expect, if they are already expecting to read "paypal".

Finally, regarding "whole-script" confusables, you might want to take a look at this 

http://unicode.org/cldr/utility/confusables.jsp?a=paypal&n=on&x=on&s=on

for examples of how mere constraints on script mixing are not nearly enough to prevent confusion, even when substantial efforts have been made to restrict the character repertoire.
Hmm, I would have thought that paypäl is impossible to confuse with paypal; that’s why I said “obviously harmless characters”. Maybe that really is a question of being used to diacritics. But still, unicode.org does not list a and ä as confusables. Also, Opera and IE would show that kind of IDN. So this is probably not a clear case. I think it should be reconsidered.

Indeed, paypa1.com would be far more dangerous (if it weren’t registered to an anti-fraud company), but you wouldn’t want to disallow 1und1.com and the like, even though registrars don’t check if there is a site lundl.com (actually, there is).
There's still more that can be done here.

See http://www.idnnews.com/?p=7109

Chrome displays the fake www.аmazon.com as http://www.xn--mazon-3ve.com/ on hover, but Firefox still shows it as http://www.аmazon.com
I filed bug 750587 for Brad's concern.
This seems to be back, and is being publicly discussed.

https://www.wordfence.com/blog/2017/04/chrome-firefox-unicode-phishing/
There is also a report on SUMO (Support Mozilla), a question asking about the same Wordfence blog:
> https://support.mozilla.org/t5/Firefox/firefox-phishing-warning/m-p/1391610
The poster of that question marked it solved after using about:config to toggle the pref
> network.IDN_show_punycode
to true. That is the workaround suggested in the Wordfence blog.


I note that unlike the examples in comment 2
(In reply to Daniel Veditz [:dveditz] from comment #2)
> Created attachment 171916 [details]
> more examples
> 
> from a spreadfirefox.com blog I found out this morning about
> http://www.retrosynth.com/misc/phishing.html which plays with the same idea:
>   www.xn--amazn-mye.com
>   www.xn--micrsoft-qbh.com
>   www.xn--papal-fze.com
>  ....
where the fake and real URLs give distinct displays on mouseover.

The example from Wordfence gives a mouseover result from the fake URL that visually matches the genuine URL,
using
> <a href="https://www.xn--e1awd7f.com/" target="_blank">
to spoof
> https://www.еріс.com/

As an additional twist, they have also obtained an SSL cert from the Mozilla-affiliated https://letsencrypt.org/ for the fake site.
Agreed that this bug is not yet fixed. This URL also popped up on Hackaday: https://www.xn--80ak6aa92e.com/ which spoofs apple.com. The idea behind how Firefox deals with IDN is explained in https://wiki.mozilla.org/IDN_Display_Algorithm:

> Instead, we now augment our whitelist with something based on ascertaining whether all the characters in a label all come from the same script, or are from one of a limited and defined number of allowable combinations. The hope is that any intra-script near-homographs will be recognisable to people who understand that script. 
> We retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do. So an IDN is shown as Unicode if the TLD was on the whitelist or, if not, if it met the criteria above. 

The example I linked to uses only the Cyrillic alphabet and is thus displayed in Unicode form, per the single-script rule of the algorithm. 

Perhaps, even if you allow IDN labels, you need to visually distinguish them, for example by marking the domain in a different color.
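
As an aside, the raw ACE form makes the trick visible immediately. A sketch using Python 3's
built-in "punycode" codec (the codec handles only the Punycode encoding itself, so the "xn--" ACE
prefix has to be stripped by hand):

>>> b"80ak6aa92e".decode("punycode")
'аррӏе'
>>> [hex(ord(c)) for c in b"80ak6aa92e".decode("punycode")]
['0x430', '0x440', '0x440', '0x4cf', '0x435']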
This bug is old and is already resolved and marked as fixed. 

It is probably not productive to continue commenting further in this bug.

A newer and currently reopened bug covering the subject is Bug 1332714. Bugzilla is, however, not the best place for general discussion of complex issues involving languages and ICANN policy. Any attempt to mitigate these issues is likely to have downsides, like hitting legitimate sites with either blocks or display problems. 

There are always the standard Mozilla forums:
https://www.mozilla.org/about/forums/
https://www.mozilla.org/en-US/about/forums/#dev-security 
https://groups.google.com/forum/#!forum/mozilla.dev.security.policy