Closed Bug 279099 (punycode) Opened 20 years ago Closed 17 years ago

Protect against homograph attacks (spoofing using punycode IDNs)

Categories

(Core :: Networking, defect, P3)


Tracking


RESOLVED FIXED
mozilla1.8beta3

People

(Reporter: ericj, Assigned: gerv)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sg:spoof])

Attachments

(9 files, 3 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040913 Firefox/0.10
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040913 Firefox/0.10

Firefox (and other unnamed browsers) incorrectly handles punycode-encoded domain names. This allows attackers (namely phishers) to spoof the URLs of just about any domain name, including SSL certs.

Proof of concept URL: http://www.shmoo.com/testing_punycode/

The links are directed at "http://www.pаypal.com/", which the punycode handlers render as "www.xn--pypal-4ve.com". The domain was just registered, so the root servers may not have gotten it yet. Point your DNS servers at '216.254.21.212' if you have problems.

Here's what I think the bug is:
1. Firefox (and Mozilla) should warn the user if punycode is in use at all.
2. You should consider validating the SSL cert with the non-decoded version of the website.

Just in case it's not clear, an attack case could be an eBayer/phisher who includes links to PayPal in their auction. When the auction ends, the buyer clicks on the PayPal link (which is a punycode/proxy to the real PayPal), and the phisher proceeds to steal all of their private green bits.

I have not done any platform testing, or tested any other versions of Mozilla/Firefox/etc. I assume this bug is cross-platform.

The proof of concept URLs are hosted on a personal server, and as such, I'd like to have a chance to bring them down before this bug becomes public. Please email me at ericj@shmoo.com before marking this bug public.

/me goes and reads up on the mozilla bounty program.

Reproducible: Always
This bug impacts many other browsers, and I'm working on notifying them right now. Based on the critical nature of this bug, I believe it's best to:

1. not notify the public until all vendors have been notified & have had a chance to release updates
2. set a fixed date on which this vulnerability will become public (so no one company releases details before others have a chance to release updates). That date will be 2/5/05, unless folks convince me to delay this action.

Thanks,
Ericj
206.321.3411
Attached file more examples
From a spreadfirefox.com blog I found out this morning about http://www.retrosynth.com/misc/phishing.html which plays with the same idea:

www.xn--amazn-mye.com
www.xn--micrsoft-qbh.com
www.xn--papal-fze.com

These three were registered to Jesse C Lee (Witchita, KS) on Jan 8, 2005. The retrosynth page was last updated (created?) Jan 16, 2005, presumably by the site owner Cary Roberts in Mountain View, CA. What's the connection? What's the connection between retrosynth and the spreadfirefox blogger? This may already be widely known.
Darin: any ideas?
Assignee: firefox → darin
Status: UNCONFIRMED → NEW
Component: General → Networking
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → benc
Whiteboard: [sg:fix]
Version: unspecified → Trunk
Opera has responded:

Date: Thu, 20 Jan 2005 18:06:30 +0100
From: bug-161715-s10@bugs.opera.com
To: ericj@shmoo.com
Subject: Your bug report

Hello Eric,

What you illustrate is an inherent problem with IDNA and the international Unicode character set. On many systems success may depend on which fonts and languages the user has installed (and what is included in the default installation). There was a discussion about a similar issue in our forums a couple of days ago: <URL: http://groups.google.com/groups?threadm=tmgou051aaovjqh2isd5shkcel8rp4j96q%404ax.com >

Unfortunately, I do not believe your suggestion of warning the user about IDNA-encoded names in the name of secure servers is practicable. It might look that way when you are dealing with spoof sites such as your example, but it would be maddening for Chinese and Japanese websurfers; in fact it would also irritate many European (e.g. French, German and Scandinavian) surfers who are using languages with characters that will generate punycode servernames.

The problem of spoofing websites using IDNA is IMO best solved by the domain name registrars, by limiting on their side the character combinations they want to accept in a domain name. AFAIK such limitations are implemented in (e.g.) the Norwegian zone, but Verisign has not yet implemented something similar, which is understandable given the worldwide use of .com domains.

Please note that Wand or cookies will not be tricked by this kind of servernames.

--
Sincerely,
Yngve N. Pettersen
Senior Developer, Opera Software ASA
Email: yngve@opera.com    http://www.opera.com/
Phone: +47 24 16 42 60    Fax: +47 24 16 40 01
Component: Networking → Bookmarks
Product: Core → Firefox
Version: Trunk → 1.0 Branch
We should consider adding opera to the CC list on this bug: bug-161715-s10@bugs.opera.com Cheers, Eric
It turns out this attack was talked about several years ago; it was called the 'homograph attack': http://www.cs.technion.ac.il/~gabr/papers/homograph.html

The problem today is that several browsers support this right out of the box. This introduces a huge security risk for users. Filtering at the registrar level is possible, but VERY hard. They should not allow mixed-byte or multi-language encodings, and should consider blacklisting some of the chars from the punycode encode process.

However, as a user of Firefox, I see no method for me to disable punycode support. This is not just a browser bug - it's a standards bug. But early adoption means that Firefox & Co. need to deal with it at some level (even if it means disabling punycode support, or SSL + punycode support). I don't know what the right answer is. I'm just saying: "TODAY THIS IS A HUGE PROBLEM FOR FIREFOX SECURITY".
That opera address is not registered in bugzilla and can't be CC'd, but we have contacts at Opera and will work through those.
Summary: CRITICAL SECURITY VULN: punycode allows attackers to spoof urls/ssl certs → punycode allows attackers to spoof urls/ssl certs
After talking about this bug with a few other security folks, I have some ideas I'd like to share.

1. Different validation of SSL certs. Currently, the browser encodes the Unicode into punycode, loads the website, and validates that the punycode-encoded domain matches the SSL cert. I think this is a problem. The browser should validate the cert name against the raw Unicode text (you can generate SSL certs with Unicode CNs - I tested this).

2. Filtering should happen at both the browser level and the registrar level. Example filtering should include:
   A. Not mixing double-byte & single-byte chars in punycode-wrapped domain names. This makes it much harder to spoof domain names, as most other codepages don't have standard Latin in them.
   B. Validation of codepage. Ensure that all chars in a domain are part of ONE codepage set, not mixed.
   C. Don't allow bad Unicode chars (see MS Press "Writing Secure Code, 2nd Edition", page 379) such as non-shortest encoding of UTF8->punycode.
   D. Block some 'non-alpha' chars in other code pages. An example is Unicode 05B4, which looks like the Latin period '.'. IDN filtering is a complex subject, and is highly prone to errors.

3. Display a country flag next to the addressbar/domain name. Display icon or something showing the current language the domain is in.

4. Must-have feature: disable/enable IDN in all Mozilla products.

Anyway, I hope some of these ideas get some traction or result in some better solution... If I can assist in any way (testing, providing evil SSL certs, whatever) please let me know.

Cheers, Eric
> 3. Display a country flag next to the addressbar/domain name. Display icon or
> something showing the current language the domain is in.

hm? how would mozilla know the language?

> 4. Must have feature: Disable/enable IDN in all mozilla products.

network.enableIDN, this already exists
Both OmniWeb & Konqueror correctly discover that the SSL cert isn't valid for that domain. That means they are not checking the punycode-encoded version of the domain against the CN; they are checking the UTF-8 version of the domain against the CN of the cert. This is what I expect Firefox to do. I'll attach a screenshot shortly.

Also note that they display the alternate script with a different font, making it (more) clear that something phishy is going on.
If the behavior you describe means that IDN sites simply can't use SSL, then, sure, it would fix this bug, but that would be a pretty serious bug in itself. If it doesn't mean that they can't use SSL, then it doesn't help this bug at all.
The two attachments demonstrate the 'other' behavior that browsers have when validating CNs for IDN sites. OmniWeb & Konqueror validate the UTF-8 domain against the CN; Firefox/Mozilla, Safari, and any Gecko-powered browser validate the punycode-encoded domain against the CN. At this point, I'm not sure which one is correct; but there should be a correct method for using SSL with IDN. Perhaps this is because the existing RFCs don't really talk about SSL + IDN.
Flags: blocking-aviary1.0.1?
This bug really has two parts:

-- should we be expecting the domain names in SSL certs to be punycode-encoded, or raw Unicode?
-- how do we deal with homograph attacks using punycode-encoded domain names?

The first question should be quite easy to resolve, and if necessary, fix. I've filed it as bug 280839. Let's focus this bug on discussion of the second point, which will be much harder to address.
OS: Windows 2000 → All
Summary: punycode allows attackers to spoof urls/ssl certs → Protect against homograph attacks (spoofing using punycode IDNs)
Alias: punycode
*** Bug 281381 has been marked as a duplicate of this bug. ***
Group: security
*** Bug 281428 has been marked as a duplicate of this bug. ***
See http://www.unicode.org/Public/4.0-Update/Scripts-4.0.0.txt for a Unicode code-point to script mapping table.

Now consider the following algorithm as a first hack: we first divide the different Unicode script families into "potentially confusable" equivalence sets. For example, LATIN, CYRILLIC and GREEK are potentially confusable, as they each contain characters with lowercase glyphs that look like 'c' or 'a'. However, LATIN and ARABIC do not contain any similar characters, so they are not "potentially confusable". We put this information in a (suitably compressed) look-up table. This now leads naturally to a simple algorithm for spotting "stranger" characters in the context of another "potentially confusable" script (i.e. different script, but same script equivalence set); a sketch follows below.

Note that there are still more things to look out for:
* we should canonicalise the string with NAMEPREP first, since we can't rely on the registrar to do so
* font variant characters
* double-width and half-width characters
* expansion of ligatures, roman numerals etc.

Even then, some tricky but potentially dangerous cases are still left out, such as the fact that the ANGSTROM SIGN is in the LATIN script family, even though it is visually indistinguishable from LATIN CAPITAL LETTER A WITH RING ABOVE. This makes it very difficult to put a solution in place without creating a false sense of security.

On the other hand, the Unicode .pdf charts _do_ appear to contain a detailed cross reference of visually confusable characters, as do the charts in the Unicode book. However, I cannot find this information anywhere online. With the scripts information, and the cross-reference information, we could probably construct a serious character-level "confusion table" which would very effectively catch spoofing attacks.

Does anyone have any good contacts in the Unicode Consortium who could release this information to us in machine-readable form? (For example, letting us know the decrypt password for the existing character chart .pdfs would enable us to extract this information; the original data would be even better.)
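To make the idea concrete, here is a minimal sketch in Python. The ranges and the single equivalence set below are illustrative stand-ins hand-picked for this example; a real implementation would generate its tables from Scripts-4.0.0.txt:

# Sketch: flag labels that mix scripts drawn from one confusable set.
# The ranges and sets here are a tiny illustrative subset only.

SCRIPT_RANGES = [            # (first, last, script) -- illustrative
    (0x0041, 0x024F, "LATIN"),
    (0x0370, 0x03FF, "GREEK"),
    (0x0400, 0x04FF, "CYRILLIC"),
    (0x0600, 0x06FF, "ARABIC"),
]

# Scripts whose glyphs can be mistaken for one another.
CONFUSABLE_SETS = [{"LATIN", "GREEK", "CYRILLIC"}]

def script_of(ch):
    cp = ord(ch)
    for start, end, script in SCRIPT_RANGES:
        if start <= cp <= end:
            return script
    return None

def has_confusable_mix(label):
    """True if the label mixes scripts from one confusable equivalence set."""
    scripts = {script_of(ch) for ch in label} - {None}
    return any(len(scripts & s) >= 2 for s in CONFUSABLE_SETS)

# 'paypal' with a Cyrillic 'а' (U+0430) mixes LATIN and CYRILLIC:
assert has_confusable_mix("p\u0430ypal")
assert not has_confusable_mix("paypal")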
A spoofed domain name doesn't have to mix character sets. As an extreme example, you could use simply letters from the MATHEMATICAL SANS-SERIF SMALL series. Also, it's probably going to become quite common to mix sets, e.g. with mixed English-Japanese site names.
(In reply to comment #18)
> * font variant characters
> * double-width and half-width characters
> * expansion of ligatures, roman numerals etc.

Aren't these all taken care of by NFKC normalization (which we already do before display)?
OK, after some more grovelling around in the Unicode mailing list archive, I've found the following file: http://www.unicode.org/Public/UNIDATA/NamesList.txt This has the cross-reference data in it, giving both exact and approximate visual similarities between the characters, and also code-point equivalents for ligatures etc. Together with the script-family data, this is probably a good starting point for an anti-spoof algorithm.
After reading TR#15: yes, NFKC normalization won't hurt at all; we should do it as a first step, before anything else. Indeed, we should do a full NAMEPREP.

A question: DNS is case-insensitive, but some visual collisions are case-sensitive. For example, Greek capital Alpha collides with Latin capital A, but the lowercase versions do not collide. NAMEPREP implies NFKC normalization and the use of STRINGPREP tables B.1 (deletion of silly characters) and B.2 (case folding; RFC 3454 implies folding to lowercase). Should we look for collisions in both upper and lower case, or is it safe to restrict to lowercase only?
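For illustration, a quick Python check of what NFKC plus lowercase folding does to some of the cases mentioned above (a rough sketch; full NAMEPREP also applies the STRINGPREP mapping and prohibition tables):

import unicodedata

def fold(s):
    # Approximation of the NAMEPREP steps discussed above:
    # NFKC normalization followed by folding to lowercase.
    return unicodedata.normalize("NFKC", s).lower()

print(fold("\uFB01"))           # LATIN SMALL LIGATURE FI -> 'fi'
print(fold("\uFF21"))           # FULLWIDTH LATIN CAPITAL A -> 'a'
print(fold("\u0391") == "a")    # GREEK CAPITAL ALPHA folds to 'α', not 'a' -> False

The last line shows the point of the question: the capital-letter collision (Alpha vs A) disappears after lowercase folding, so it matters whether we check collisions before or after folding.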
> 4. Must have feature: Disable/enable IDN in all mozilla products.
> network.enableIDN, this already exists

...but it does not appear to be working. I set that in prefs.js:

user_pref("network.enableIDN", false);

restarted Firefox, went to http://www.shmoo.com/idn/, clicked on the URL and got 'meeow'. The about:config name/value shows: network.enableIDN false
This issue is being intensely discussed in the CAcert newsgroup. There *may* be some useful insight there. Subject: Bug in Mozilla based browsers could cause big security problems... Newsgroup: gmane.comp.security.cacert Thread: news://news.gmane.org:119/4207362A.2020208@cacert.org
-> core:networking
Component: Bookmarks → Networking
Product: Firefox → Core
Target Milestone: --- → mozilla1.8beta
Version: 1.0 Branch → Trunk
> At this point, I'm not sure which one is correct; but there should be a correct
> method for using ssl with IDN. Perhaps this is because the existing RFCs don't
> really talk about ssl + IDN.

I think it makes more sense to compare the punycode value of the hostname to the cert, since that is the value of the hostname used with DNS to resolve the IP address. It seems like a bug to me that KHTML and Opera do otherwise. As with many of the older internet specifications (DNS, HTTP, Cookies, etc.), IDN names are intended to be converted to punycode before being used. So, it is an odd choice to treat certs as somehow different.
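For reference, the ToASCII conversion under discussion is easy to reproduce with Python's built-in 'idna' codec (an RFC 3490 implementation); the hostname below is the one from the original report:

# The hostname from the original report: 'paypal' with a Cyrillic 'а'.
spoofed = "www.p\u0430ypal.com"

# ToASCII (nameprep + punycode), as performed before any DNS lookup:
print(spoofed.encode("idna"))                    # b'www.xn--pypal-4ve.com'

# The reverse mapping, as a browser does for display:
print(b"www.xn--pypal-4ve.com".decode("idna"))   # 'www.pаypal.com'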
> but does not appear to be working. See bug 261934. The bug was fixed recently on the trunk. The patch applies cleanly on the 1.7 branch.
Status: NEW → ASSIGNED
*** Bug 281439 has been marked as a duplicate of this bug. ***
(In reply to comment #21)
> OK, after some more grovelling around in the Unicode mailing list archive, I've
> found the following file: http://www.unicode.org/Public/UNIDATA/NamesList.txt
>
> This has the cross-reference data in it, giving both exact and approximate
> visual similarities between the characters, and also code-point equivalents for
> ligatures etc. Together with the script-family data, this is probably a good
> starting point for an anti-spoof algorithm.

An algorithm which looks purely at specific character pairs will remain a point of weakness. If a flaw in it leaves the user with no other protection, then each flaw, big or small, will be announced with all the gravitas of a full security vulnerability. The spreadfirefox people don't need this.

Detection of potential problems needs to operate on several levels, and I think we need a top-down approach, with warnings on by default and user-configurable, so that the browser is safe 'out of the box'. For example, a warning could be displayed:

1. the first time a new codeset is encountered in a URL
2. the first time a particular pair of codesets is used together in a URL.

The user may disable this warning for future encounters with that character set or combination of character sets, or may leave the warning enabled but create an exception for that particular site. This would catch almost all of the problem without getting into the detail of similar-appearing characters. Below this would sit the more detailed algorithm for flagging potentially ambiguous constructions; with such broad general protections in place, that could be implemented on a per-codeset-pair basis.
With respect, confirmation alerts do not make you "safe out of the box"; they merely make you *annoyingly* unsafe, since people don't read them. If mostly-reliable homograph attack detection turns out to be at all practical, I suggest a Thunderbird-style banner along the top of the page: "&brandShortName; thinks this site is a fraud. (Tell Me More) (Not a Fraud)". Disable form controls + applets + plug-ins unless "Not a Fraud" is clicked.
RFC 3490 section 10 (http://www.apps.ietf.org/rfc/rfc3490.html#sec-10) apparently outlines some high-level suggestions for dealing with this problem.
I heard about this (after hearing the initial warning in 2000) and have followed several sets of directions to disable it, from going to about:config and turning network.enableIDN off, to editing the lines mentioning IDN out of compreg.dat (which was then overwritten by Firefox), but have been unable to turn it off. I have restarted (all copies of) my Firefox 1.0 browser, so the settings should have taken effect. I use SuSE 9.1 and cannot tell the difference between the two URLs unless I watch the status bar while it's loading, meaning I would have to go out of my way to verify the authenticity of some sites... Is there a way to turn this off, since the other ways I have been told of don't seem to be working? The SpoofStick plugin doesn't help either. Any help would be appreciated.
> the network.enableIDN off to going to the compreg.dat and editing out the lines > mentioning IDN (which was then overwritten by firefox), but have been unable The preference is indeed broken. See bug 261934, which has the fix for the preference. You should be able to get around this problem by editing compreg.dat as suggested, just make sure that you edit the compreg.dat that lives in your Firefox profile directory. Keep in mind that Firefox re-generates compreg.dat whenever a new extension is installed, so this is not a great solution.
Here are some potentially interesting references on this issue:

* "The Homograph Attack", Communications of the ACM, 45(2):128, February 2002
  http://www.cs.technion.ac.il/~gabr/papers/homograph.html
* Method for detecting a homographic attack in a webpage by means of language identification and comparison
  http://www.priorartdatabase.com/IPCOM/000010253/
* Draft Unicode Technical Report #36, Security Considerations for the Implementation of Unicode and Related Technology
  http://www.unicode.org/reports/tr36/tr36-1.html
* IDN Language Table Registry
  http://www.iana.org/assignments/idn/
* IANA registered language table list
  http://www.iana.org/assignments/idn/registered.htm

Regarding the last link: note how the registered tables for Greek, Hebrew and Arabic do not include any Latin letters. On the other hand, the tables for Japanese, Thai and Korean _do_, but these scripts are sufficiently unlike Latin script that no confusion is likely to occur between their native characters and the Latin characters. As yet, there is no registered table for Cyrillic, but I doubt that it would need Latin characters in it.

There is also quite a lot of activity on the Unicode mailing list about this topic. http://www.unicode.org/consortium/distlist.html
Here are some more useful references:

* ICANN Briefing Paper on IDN Permissible Code Point Problems
  http://www.icann.org/committees/idn/idn-codepoint-paper.htm
* ICANN Input to the IETF on Permissible Code Point Problems
  http://www.icann.org/committees/idn/idn-codepoint-input.htm
*** Bug 281496 has been marked as a duplicate of this bug. ***
This is a proposed "blacklist" of valid Unicode character ranges which are unlikely to ever be used in any valid domain name in any language. The names of the ranges are those given by the Unicode Consortium. Note that this blacklist will not _of itself_ eliminate the homograph problem, but it will substantially reduce the number of possible characters available for homograph spoofing. At the moment, I make no proposal as to how the blacklist should be used; it's just a collection of character ranges containing characters that make no sense in any domain name, in any language. I would appreciate any comments regarding ranges that should be added to or taken out of this list. Note that the above assumes that NAMEPREP has been applied first to normalise the string prior to scanning for blacklisted characters.
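For illustration, a minimal sketch of how such a blacklist might be consulted once compiled into sorted ranges. The two ranges shown are illustrative stand-ins, not the proposed list itself (which is in the attachment):

import bisect

# Illustrative stand-ins only; the real list is in the attachment.
# Each entry is (first_codepoint, last_codepoint) of a blacklisted range.
BLACKLIST = sorted([
    (0x2190, 0x21FF),  # Arrows
    (0x2600, 0x26FF),  # Miscellaneous Symbols
])
STARTS = [r[0] for r in BLACKLIST]

def is_blacklisted(ch):
    cp = ord(ch)
    i = bisect.bisect_right(STARTS, cp) - 1
    return i >= 0 and BLACKLIST[i][0] <= cp <= BLACKLIST[i][1]

def label_ok(label):
    # Assumes the label has already been through NAMEPREP.
    return not any(is_blacklisted(ch) for ch in label)

assert not label_ok("smiley\u263Aface")   # U+263A WHITE SMILING FACE
assert label_ok("example")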
*** Bug 281507 has been marked as a duplicate of this bug. ***
A programmatic analysis of homographs from the Unicode data shows 2661 possible cross-script one-way clashes. Applying the blacklist in the attachment (but without the ideographic description characters, which will probably be needed) reduces this to only 462 cross-script one-way clashes. More analysis to follow.
Only for reference: Secunia Advisory SA14163 (Mozilla / Firefox / Camino IDN Spoofing Security Issue)
http://secunia.com/advisories/14163/
http://secunia.com/multiple_browsers_idn_spoofing_test/
I think I fixed it on my Mac. I took a reference from http://users.tns.net/~skingery/weblog/2005/02/permanent-fix-for-shmoo-group-exploit.html

They refer to a file called compreg.dat, but they locate it in the user profile data. I located one in my main install:

/Applications/Mozilla.app/Contents/MacOS/components

That is where I changed it. I even reset the about:config back to default. It seems to work. You change the one line of text in compreg.dat (scroll down to the [CONTRACTIDS] section) from:

@mozilla.org/network/idn-service;1,{62b778a6-bce3-456b-8c31-2865fbb68c91}

Change the 1 to a 0 so the line reads:

@mozilla.org/network/idn-service;0,{62b778a6-bce3-456b-8c31-2865fbb68c91}

This really worked under Mozilla for Mac. The "paypal" spoof no longer works in my Mozilla browser.
(In reply to comment #42) It is easier to update to current branch builds (or trunk if you want) and use the pref. See bug 281506 comment 1.
There are also ASCII characters that look very similar with some fonts: l (lowercase L), 1 (digit), I (uppercase i).
FYI, Unicode.org has a proposed draft tech report:

Proposed Draft Unicode Technical Report #36 (1.0 version dated 2004-10-12)
Security Considerations for the Implementation of Unicode and Related Technology

which includes a section on Visual Spoofing: http://www.unicode.org/reports/tr36/#visual_spoofing

It lists 2 recommendations:

(1) Cross-Script Spoofing: the user should be alerted to these cases by displaying mixed scripts with some special formatting to alert the user to the situation. For example, a different color and special boundary marks are used in Example 2c. A tool-tip can be displayed when the user moves the mouse over the address to display more information about the situation.

(2) Inadequate Rendering Support: browsers and similar programs should follow the Unicode Standard guidelines to avoid spoofing problems. There is a technical note, UTN #2: Rendering Combining Marks (http://www.unicode.org/notes/tn2/), which provides information as to how this can be implemented even in the absence of font support.
*** Bug 281474 has been marked as a duplicate of this bug. ***
Here's a workaround for Linux. I'm sure there's something similar on other OS's, but I don't have access to them to look. This disables all IDN service lookups as far as I can tell, which should help with the security issue at the moment until a more feasible solution can be found.

Open a terminal and type:

$ cd ~/.mozilla/firefox/

In that folder will be another folder whose name depends on your profile name; if you used the default, the folder will be foobar.default. Change to the *.default folder and type:

$ vim compreg.dat    (or vi, kvim, gvim, scite, etc.)

Now use vi's search function by typing:

/idn-service;1

You will find two locations that match. Highlight the 1 with the cursor, and use the 'r' key to replace the 1 with a 0. Do this for both locations, then go back to www.shmoo.com/idn and test; it won't allow you to navigate to the page. I've tried testing it on a few fake sites and it doesn't allow navigation to them.
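For anyone who wants to script the same change, a small Python sketch (assumptions: a single profile directory matching *.default, and the two 'idn-service;1' occurrences mentioned above; Firefox re-generates compreg.dat when extensions are installed, so the edit may need re-applying):

# Sketch: apply the same compreg.dat edit programmatically.
import glob, os

for path in glob.glob(os.path.expanduser(
        "~/.mozilla/firefox/*.default/compreg.dat")):
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(text.replace("idn-service;1", "idn-service;0"))
    print("patched", path)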
After more analysis of the Unicode cross-reference tables, I can see that an attempt to enumerate 100% of all possible homograph sets is probably not feasible without massive effort (although making equivalence classes from the crossrefs has found a great many). However, it has given me a lot more insight into the problem.

Homographs are generally unpopular within a single writing system. On the other hand, many simple symbols have been either re-used or re-invented in many alphabets. So the secret of homograph spoofing is mixing languages and/or symbol sets. This proposal suggests a method for detecting language mixing.

However, there is not a 1:1 correspondence between writing systems and code ranges. Some writing systems are split across a number of code ranges; others use characters from other writing systems -- for example, both Cyrillic and Japanese use the ASCII numerals. Nor is there a 1:1 mapping between writing systems and languages; for example, Japanese uses four distinct writing systems. However, we _should_ be able to map from _sets_ of code point ranges, some per-character attributes, and one small set of special-case characters, to the plausibility of a DNS label.

So how about the following algorithm for a single label in a domain name (sketched in code below):

1. Run the string through NAMEPREP.
2. If there are leading combining characters, reject as malformed.
3. Assign each character to a character range, according to the official Unicode code point ranges; except that the characters 0123456789 and HYPHEN are special, and go in a special range of their own.
4. If there are any characters from "blacklisted" code point ranges, reject the string as suspicious. A blacklist is a powerful way of limiting spoofers' options.
5. If there are any other Unicode punctuation characters apart from HYPHEN, reject as suspicious.
6. If there are any Unicode whitespace characters, reject as suspicious.
7. Now look at the set of character ranges used: are they compatible with a single writing system/language set? This would consist either of one range plus optional ASCII digits + HYPHEN, or any of a number of hard-coded sets dealing with cases such as Japanese and Chinese. If the set of ranges is not compatible with a single script, reject the string as suspicious.
8. If all the tests above pass, return OK.

This would certainly raise the bar for spoofers to jump over quite substantially, and would not be very code intensive; the script-lookup code is tiny, and the number of special cases rather small, even when considering obscure languages. If this looks plausible, we can then use the test homographs I've discovered, and the existing spoofing examples, to test the effectiveness of such an algorithm.

There are still other issues to look at, even if this is a possible solution:
* Forwards compatibility and future Unicode allocation policy
* RFC compliance
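A condensed Python sketch of steps 2-7 (step 1, NAMEPREP, is assumed already applied; the blacklist of step 4 and the full range table are stubbed out, with unicodedata categories standing in for the per-character attributes):

import unicodedata

def range_of(ch):
    # Stub: a real implementation would use the full Unicode range table
    # and hard-coded compatible sets (e.g. for Japanese and Chinese).
    if ch in "0123456789-":
        return "DIGITS+HYPHEN"               # step 3: special shared range
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F: return "LATIN"
    if 0x0400 <= cp <= 0x04FF: return "CYRILLIC"
    return "OTHER"

def check_label(label):
    if unicodedata.combining(label[0]):      # step 2
        return "malformed"
    for ch in label:
        cat = unicodedata.category(ch)
        if cat.startswith("P") and ch != "-":   # step 5 (step 4 omitted)
            return "suspicious"
        if ch.isspace():                        # step 6
            return "suspicious"
    ranges = {range_of(ch) for ch in label} - {"DIGITS+HYPHEN"}
    if len(ranges) > 1:                         # step 7
        return "suspicious"
    return "ok"

print(check_label("p\u0430ypal"))   # Latin + Cyrillic -> 'suspicious'
print(check_label("example-1"))     # -> 'ok'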
*** Bug 281578 has been marked as a duplicate of this bug. ***
I have attached a small Python program that analyzes internationalized domain names for cross-script homograph attacks, and tries to detect various possible kinds of spoofing.
This is a slightly more paranoid version of the previous IDN checker. Doubtless it still has many deficiencies. However, I would be interested in any comments about how it might be improved, or interesting counterexamples that it currently cannot detect.
Attachment #173811 - Attachment is obsolete: true
Blocks: 281496
After yet more thinking, I can identify three different categories of potential problem:

1. Homographs that cross language/writing system boundaries. We should be able to catch all or nearly all of these programmatically, eliminating a vast class of attacks.

2. Homographs within a single language/writing system. Unfortunately, the writing system most affected by this is the Latin writing system. For example, consider the problem of detecting spoofing using the many variants of the letter 'i'.

3. Confusion generated by _semantic_ duplicates. This is a major problem for the CJK writing system family.

TLDs which follow the single-language and label-filtration practices recommended by IANA will be more resistant to problem 2: for example, readers of domains within a Turkish-language domain will be sensitive to the difference between dotted and dotless i. This still fails to address problem 2 within multiple-script-system gTLDs, where readers will not in general be sensitive to homographs outside their own language's subset of their local writing system. For example, French readers who will easily notice cedillas on 'c's will probably not notice the i-variants which are outside the scope of their own language. (English readers, with no native accented letters, are of course worst off of all.)

How can problems 2 and 3 be solved? The current favourite solution is "bundling", where a single domain registration also registers all possible variants. However, this can only occur at the domain registrar end, or, at extra cost, by having domain registrants register all the possible variants of their domains.

There are several technical problems with this:
* how do you know that you have _exhaustively_ enumerated all possible variants? A spoofer only needs one missed variant, and they've won.
* backwards compatibility with existing registrations
* infrastructure problems; some names may potentially have thousands or even millions of variants to be bundled

And there are a number of business problems with this:
* the registrar incurs extra costs without extra revenue, providing a disincentive to do it at all
* generating bundles means fewer possible strings available to sell
* potential legal liability issues: have they missed a homograph? Once they've started along this route, should they include entries in the bundle to resolve '0'-'o', 'l'-'I' and other possible near-misses? What about characters which are homographs in one font, but not in another, or in one casing, but not in another? Should they generate possible simple typos in the bundle? And so on.

Ways forward:
* we should consider a programmatic approach for problem 1, which will nip in the bud a huge number of potential attacks in non-Latin writing systems
* problem 2 needs more thought, and it is also probably the most pressing problem, given that almost all existing domains are registered within the Latin writing system
* problem 3 is a matter for oriental-language experts, and I believe this is being discussed in the IDN community.
Regarding problem 2, homographs within the Latin writing system, here is a heuristic that will probably catch a great many current spoofing attempts. (This is a first hack at the logic, so please excuse any clunkiness; a sketch of the ASCIIfication step appears below.)

* We first assume that NAMEPREP, and the checks for cross-script spoofing (problem 1), have been applied and passed.
* Next, look at the length of the TLD name. If it is two characters long, I believe that IANA policy requires it to be a country-code TLD (ccTLD). In this case, assume that the ccTLD owner knows what they are doing with regard to language and character set filtration, and return OK.
* If the TLD name is not a ccTLD, it's a gTLD. In this case, we are rather less confident in the registrar doing the right thing. Now look at the whole FQDN.
* Since most legacy domain names are all-ASCII, "ASCIIfy" the whole FQDN. This means, for each character in the FQDN, changing it to the corresponding unaccented ASCII character (a transform which is easily programmatically computed from Unicode character names).
* Is the ASCIIfied FQDN identical to the FQDN? Then each of its labels is either ASCII, or belongs to just one non-Latin language (see the check for problem 1). In any case, if the ASCIIfied FQDN is identical to the FQDN, return OK.
* Now look up the "ASCIIfied" FQDN. If this name lookup returns a value, then the FQDN is _possibly_ spoofed. If the ASCIIfied name lookup fails, we return OK.

Note that this does not catch spoofing attempts for one non-ASCII Latin IDN domain against another non-ASCII Latin IDN domain; however, since most legacy domains, including most current high-value targets, are pre-IDN names, this will go a long way towards ameliorating problem 2.

Downside: an extra name lookup for non-ASCII Latin-script IDNs in gTLDs. Name lookup caching will, of course, greatly reduce this overhead.
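Here is a sketch of the ASCIIfy-and-lookup heuristic. One assumption: instead of parsing Unicode character names, it uses NFKD decomposition to find the unaccented base character, which agrees with the names-based transform for common accented Latin letters:

import socket
import unicodedata

def asciify(fqdn):
    """Replace accented Latin letters by their ASCII base; leave others alone."""
    out = []
    for ch in fqdn:
        base = unicodedata.normalize("NFKD", ch)[0]
        out.append(base if ord(base) < 128 else ch)
    return "".join(out)

def possibly_spoofed(fqdn):
    plain = asciify(fqdn)
    if plain == fqdn:
        return False        # already ASCII, or no Latin base characters
    try:
        socket.gethostbyname(plain.encode("idna").decode("ascii"))
    except socket.gaierror:
        return False        # ASCIIfied twin doesn't exist
    return True             # twin exists: flag for closer scrutiny

print(asciify("www.m\u00ECcrosoft.com"))   # -> 'www.microsoft.com'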
Another downside is that it means that if the domain owner registered both the accented and non-accented variants (very common in Europe!) their accented version will raise an alarm each time.
Ian, I was going to snap back with "yes, but if they resolve to the same IP address, we know they're the same!", but then I thought about round-robin DNS, HTTP redirect, Akamai, anycast, and so on... And then I thought about automated WHOIS queries, but only as a joke. Back to the drawing board with that last bit, then.
Hmm. I'm having quite a hard time working out what might work in general.

* Reverse DNS won't work for virtual hosted domains.
* Looking at the MX records won't work; spoofable, not always present.
* Looking at SPF records won't work; spoofable, not always present. (SPF works because although the _records_ might be copyable to another domain, the machines pointed to by those addresses remain under the control of the entity that controls the real domain.)
* Looking at the NS record chain won't work, because both domains might be farmed out to the same outsourced DNS provider.
* Even matching A records won't work if you are truly paranoid, since the two domains, one valid, one not, might both be hosted on the same outsourced virtual host provider's machine.
* And note that if you use an anonymizing DNS registrar who won't publish your details on WHOIS, even that opens you up to your spoofer registering with the same registrar.

Any ideas for a _reliable_ way of mapping domain names to controlling entities?
Nor, for that matter, after a bit of experimenting, can we say "let's just fetch and compare their PKI certificates", if present. Most commercial sites won't allow TLS/SSL connections on most URL paths.

Nor would it be useful to check "do they serve the same content?", even if this check was cheap to implement. The front page of a spoofed site might well be identical to that of the real site, with the scam being located on an inside page. Indeed, spoofed sites would be very likely to copy the exact content of their target's entry page, to make them more likely to resist human scrutiny.

My current thinking is on the lines of heuristics again: although any one of the DNS-based tests in the previous comment is possible for a spoofer to get around, what is the likelihood of them being able to accomplish several of these attacks at once? Also, most high-profile targets tend not to outsource their web-serving to shared virtual servers, although of course, we are then back to Akamai, which some of them _do_ use.
just keep in mind that if you do any UI fixes we'll need to do them for camino too, so the more you can keep in the backend the better. We'd also like this on the 1.7 branch so camino 0.8.x can take advantage.
OK, back to the case analysis. There are several layers to be considered here:

* The name level, where we consider a name as an identifier, a static entity in a vacuum.
* The DNS lookup level, where names _dynamically_ map to resources such as IP addresses, MX records and so on.
* The protocol level, where DNS names are only part of the overall name; for example, one HTTP URL might redirect to another, perhaps with a different version of the DNS name in it. Similarly, hosting companies bind content to URLs at this level.
* The authority level: who _controls_ the given resource. This is _not_ necessarily the entity that hosts it. This appears to be crucial. PKI certificates supposedly bind this level to the DNS name level.

Ideally, we want a strong link through the entire chain, or at least a strongly plausible set of links which make spoofing very difficult to do without forging many different links, and thus more likely to be detected by one of the entities involved in making the chain.
How about displaying "strange" domain names differently? By "strange", I mean ones that use characters outside the "normal" set for the user's language (which we know, because they are using a particular localized version, right?). This different display could be something as simple as placing a '!' before the offending characters or a certain amount of space. If we want to go even further, we could use a bold font or a larger font or both. IDNs are intended to allow people all over the world to use their own languages in domain names. They are not intended to allow domain owners to register "cute" misspellings (or spoofed names). So it's OK to penalize these "strange" domain names by displaying them differently in the URL and status bars (and an even more conspicuous display in security dialogs such as certs).
For the tactic of displaying characters outside the user's expected language in a distinctive way, the following expired Internet-Draft might be useful: http://www.alvestrand.no/ietf/lang-chars.txt

This seems to be mostly useful for Latin-based languages, and some Cyrillic-based languages. However, to quote from the Internet-Draft:

  There are a lot of languages in the world. Estimates vary between 500 and 6000, with some eternal conflicts about the difference between a language and a dialect guaranteeing that any list claiming to be authoritative will be the source of endless debate.

  Many of these languages have a writing system. Some have several. These are also likely to have changed over time, with the meaning of character symbols changing, the shape of the characters changing, or completely new characters being added, or old ones removed from the set. This means that even within a single language, a list of characters is likely to be controversial.

  These problems have made several experts in the field of languages and characters refuse to even consider the idea of working out such a list.

For other languages, we will probably have to use the Unicode code point ranges.

So, now we have three proposals for anti-spoofing techniques, which are each potentially complementary:

* detecting broken Unicode and cross-writing-system mixes in IDN labels
* attempting to detect possible spoofs by doing a DNS lookup on an accent-stripped version of the IDN, and checking if the two resources are the same if both lookups succeed
* displaying characters outside the user locale in a distinctive way, or otherwise providing a warning that they are being used

The first approach is linguistic; the second requires lookups; the third is GUI-based. Are there any other techniques which can be added to these?
Thinking about high-level requirements:

- We must not place excessive burden on users of non-Latin scripts.
- We must detect spoofed domains.
- We must try not to create situations where users just 'click yes' (this is what most users do today when getting SSL warnings).
- We must make the IDN spoofing solution accessible (for example, only shading background colors wouldn't meet this requirement).

Some ideas for meeting those requirements:

1. Assuming we can come up with a clear 'name' for each script (German, Russian, Latin, etc.), we can create a whitelist of scripts which users can manage. If a domain gets requested that's not in the user's whitelist, they get prompted/warned. Note that all that is really required then to spoof the domain is "Latin + Cyrillic", but it's a step in the right direction.

2. Heuristic matching of IDNs: create a database of commonly 'forged' chars between scripts. For example, Cyrillic lowercase a looks just like ASCII/Latin lowercase a. Mark these chars as 'suspect'. If a domain has either a very high ratio of suspect chars, or a very low ratio of suspect chars, warn the user. This gets rather nasty with Traditional Chinese/Japanese. Even if the 'suspect chars' db won't work (due to the effort required to populate it), you can do a similar thing by matching ratios of codepages: if it's 90% Latin & 10% Cyrillic, it's likely we are being spoofed. (A sketch of this ratio check follows below.)

3. You can detect the script/codepage of the target HTML webpage, in order to see if it matches the script(s) the domain uses.

4. For SSL-related sites, the browser could display the punycode version of the IDN next to the lock icon (today it displays the UTF-8).
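A minimal sketch of the codepage-ratio variant of point 2. The ranges and the 25% threshold are illustrative assumptions, not tuned values:

def script_of(ch):
    # Illustrative ranges only.
    cp = ord(ch)
    if 0x0041 <= cp <= 0x024F: return "LATIN"
    if 0x0400 <= cp <= 0x04FF: return "CYRILLIC"
    return "OTHER"

def minority_script_ratio(label):
    """Fraction of characters belonging to the least-used script."""
    counts = {}
    for ch in label:
        s = script_of(ch)
        counts[s] = counts.get(s, 0) + 1
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / len(label)

# One Cyrillic char among five Latin ones: a suspiciously low ratio.
ratio = minority_script_ratio("p\u0430ypal")
print(ratio)           # ~0.17
print(ratio < 0.25)    # True -> warn the user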
Yes, lists of characters used in languages are controversial, but we do not have to use authoritative lists, or even fixed ones. We should come up with an API that can hide the strategy in the implementation. The lists of chars or char ranges (if any) can change over time. For Japanese, for example, we might even consider testing whether a Unicode Han character is in one of their standard sets (i.e. JIS X 0208, JIS X 0212, etc). A Chinese character that falls outside the primary JIS set can be flagged with a different display. Note that this only affects the *display* of the domain name, not its actual lookup. So it doesn't have to be perfect or exhaustive.
We shouldn't put the punycode next to the lock icon in Firefox, the name must be readable. Yes, it would have helped in this paypal spoof case to be able to say "hey, random garbage, must not be the real paypal." But for users in regions where IDN is heavily used it isn't going to help anyone to have a large percentage of the legit sites display unreadable random garbage. It won't help the users tell whether they're on the right site, and it just makes the browser look broken.
Regarding a big homograph pair table: yes, that was my first idea, and I even attempted to compile one. After generating a _long_ list programmatically, I realised that it didn't cover many that I could see by eye from the code tables. Given that there are 12,886 alphabetic or symbol characters in Unicode 3.2 even discounting CJK, Hangul and so on (which would give a grand total of 95,156 if included), you would have to inspect n(n+1)/2 - n = 83,018,055 possible character pairs. Including CJK etc., that would give a ludicrous 4,527,284,590 possible pairs to inspect. Eliminating great swaths of pairs by character set alone does not work; some of the trickiest homographs are between semantically unrelated characters in quite unrelated character sets. Remember that in the long term, we have to consider not only the IDN-to-ASCII spoofing problem, but IDN-to-IDN spoofing.

On the other hand, restricting each label to a single writing system (or family of mutually compatible writing systems, in a few special cases) works rather well at eliminating the possible spoofing pairs, reducing the total by many orders of magnitude. Enforcing the limit is also firmly within the spirit of IANA's recommendations, which recommend that each label come from a single well-defined language. Blacklisting symbol and other specialized character sets makes the total smaller still.

Now, with all the characters in a single label limited to a single writing-system-group (to coin a term), we need only consider homographs _within a particular writing-system-group_, or the possibility of constructing a whole-label homograph. Whole-label homographs are possible in theory if an entire label consists entirely of homographs from the same script pair (consider faking ayayay.com using Cyrillic characters), but in practice the statistical properties of most languages are such as to make these unlikely (consider finding a script with homographs for each of 'g', 'o', 'l' and 'e' (google), or 'a', 'm', 'z', 'o', 'n' (amazon), or 'm', 'i', 'c', 'r', 'o', 's', 'f', 't' (microsoft)).

Once we've done this, we can then worry about confusable characters within the writing system of a particular label, and in particular, collisions with ASCII domain names. Only when we've done this should we add extra code to deal with specific dangerous character pairs we already know about.
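For the record, the pair counts above are n(n+1)/2 - n, i.e. the number of unordered pairs of distinct characters, which a two-line Python check confirms:

def pairs(n):
    # unordered pairs of distinct characters: n(n+1)/2 - n == n(n-1)/2
    return n * (n - 1) // 2

print(pairs(12886))   # 83018055      (without CJK, Hangul etc.)
print(pairs(95156))   # 4527284590    (all of Unicode 3.2)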
FWIW, I created a simple Firefox extension that effectively kills IDN support: http://friedfish.homeip.net/extensions/no-idn.xpi
(In reply to comment #62) > 3. You can detect the script/codepage the target HTML webpage is, in order to > see if it matches the script(s) the domain uses. HTML doesn't normally have any script labelling, but the "codepage" (i.e. charset) is sometimes indicated. When the charset is a universal one like UTF-8, it is harder to tell what language the document is written in. If ICANN and the like do not already have this recommendation, perhaps we could have them add that security-conscious sites should label their HTML documents with the language so that browsers can check that the domain name used to get to their site does not contain characters normally found outside their language.
(In reply to comment #65)
> On the other hand, restricting each label to a single writing system (or family
> of mutually compatible writing systems, in a few special cases) works rather
> well at eliminating the possible spoofing pairs, reducing the total by many
> orders of magnitude.

But some domains do have a few characters that are of a different writing system than the rest of the domain. For example, in literal form:

http://www.färgbolaget.nu
http://www.bücher.de
http://www.brændendekærlighed.com
http://www.räksmörgås.se
http://www.färjestadsbk.net
http://www.mäkitorppa.com
http://www.mäkitorppa.com (with 'a' + combining diaeresis)

In escaped form:

http://www.f&#x00E4;rgbolaget.nu
http://www.b&#x00FC;cher.de
http://www.br&#x00E6;ndendek&#x00E6;rlighed.com
http://www.r&#x00E4;ksm&#x00F6;rg&#x00E5;s.se
http://www.f&#x00E4;rjestadsbk.net
http://www.m&#x00E4;kitorppa.com
http://www.m&#x0061;&#x0308;kitorppa.com
*** Bug 281674 has been marked as a duplicate of this bug. ***
> But some domains do have a few characters that are of a different writing system
> than the rest of the domain. For example,

All of your examples consist of LATIN characters; there are no different writing systems in sight. See the IANA language tables referenced in comment 35.
How about showing something like this in the status bar when the user moves the mouse over a suspicious link:

http://www.payp!a!l.com/

If the user doesn't have the sidebar turned on and clicks on the link, then we can show a similar string in the location bar. However, cut/copy/paste of the location bar should not include the '!' characters. If the user doesn't have the location bar turned on, then they won't see the lock icon either, which means that they don't know or don't care about security anyway. Educating the user about security is also important.
(In reply to comment #71) > If the user doesn't have the sidebar turned on and clicks on the link, then > we can show a similar string in the location bar. I meant to say status bar, *not* sidebar. Oops...
(In reply to comment #71) > How about showing something like this in the status bar when the user moves > the mouse over a suspicious link: > > http://www.payp!a!l.com/ > > If the user doesn't have the [statusbar] turned on and clicks on the link, > then we can show a similar string in the location bar. I think it would be just as effective and more user friendly if suspicious characters were rendered in a different font and/or colour, much like Konqueror does as shown in comment 12. Adding superfluous characters like '!' would just be confusing.
(In reply to comment #73)
> I think it would be just as effective and more user friendly if suspicious
> characters were rendered in a different font and/or colour, much like Konqueror
> does as shown in comment 12. Adding superfluous characters like '!' would just
> be confusing.

I had a look at comment 12, but the font does not look that different, and the color doesn't either. I still like the '!' chars.

Maybe we should go even further and refuse to load a document from a domain name containing characters outside the user's language. After all, IE doesn't even support IDNs out-of-the-box. In countries where IDN is used a lot, we could support IDNs using characters from the user's language only. Or they could add languages that they need. Or we could load any document, but only after the user has selected an item buried deep inside lots of disclaimers.

If some domain registrars are not following the IDN guidelines, then the browser may be the last line of defense. We could send a strong message to these registrars by making it more difficult for users to reach the Web sites with names that they neglected to filter.
(In reply to comment #73)
> I think it would be just as effective and more user friendly if suspicious
> characters were rendered in a different font and/or colour, much like Konqueror
> does as shown in comment 12.

Can't just have the characters appear in different colors, because of users who are color blind. It's very hard for some or most of them to tell the difference in colors without staring at it... and who really analyzes the address bar and links anyway? I think there should be a small dialog box (I know, I know, we all hate them) like Thunderbird implemented for suspected phishing sites... also provide a link in the dialog box to explain what exactly they got the warning for.
(In reply to comment #53)
> Regarding problem 2, homographs within the Latin writing system, here is a
> heuristic that will probably catch a great many current spoofing attempts:
>
> * next, look at the length of the TLD name. If it is two characters long, I
> believe that IANA policy requires it to be a country-code TLD (ccTLD). In this
> case, assume that the ccTLD owner knows what they are doing with regard to
> language and character set filtration, and return OK.

This needs exceptions though, as the .cc ending, for example, is effectively used like a gTLD. Also, I don't think ccTLDs are safe: http://www.amazon.de/, for example, the German local variant of Amazon, could probably be spoofed. (Although admittedly, this particular URL doesn't contain any characters that could really be spoofed in German.)
(In reply to comment #60)
> How about displaying "strange" domain names differently? By "strange", I mean
> ones that use characters outside the "normal" set for the user's language
> (which we know, because they are using a particular localized version, right?).

Not right. I might be an exception, but I always try to get the English versions of programs, even though my native language is German. Given that the internet is mostly English, I feel more comfortable in an English environment. That doesn't mean, however, that I don't want to access http://www.öbb.at/ (the IDN URL of the homepage of the Austrian railway) once in a while, without getting warnings. Also, there aren't localizations for all languages out there.
I've been doing some more exploring, and I've found this interesting triple:

http://www.bücher.ch/ redirects you to a German online bookstore
http://www.bucher.ch/ takes you to the web site of Bucher Biotec AG

Both entirely valid domain names, registered by different entities. And before you think "ah, yes, but in German ü is another way of saying ue", consider that:

http://www.buecher.ch/ takes you to a Swiss online bookstore
A question: what do people consider as confusable characters within the Latin writing system? Let's try an easy example first: does the Latin Extended-A s-cedilla in "microşoft" jump out at the viewer if they are not looking carefully? Less noticeable, how about the Latin Extended-A dotless 'i' in "mıcrosoft"? Or the Latin-1 accented 'i' in "mìcrosoft"? (This last being the nastiest, as it is both the least visible, and also in the base Latin-1 set.)
(In reply to comment #77)
> Not right. I might be an exception, but I always try to get the English versions
> of programs, even though my native language is German. Given that the internet
> is mostly English, I feel more comfortable in an English environment. That
> doesn't mean, however, that I don't want to access http://www.öbb.at/ (the IDN
> URL of the homepage of the Austrian railway) once in a while, without getting
> warnings.

OK, so how about checking against a set of languages for certain TLDs? For ccTLDs that mostly use one language, we just check against that language. Countries that normally use more than one language would be checked against each language, but *not* the union of the languages.

There is an Internet Draft that discusses a "one domain label, one language" rule: http://www.ietf.org/internet-drafts/draft-klensin-reg-guidelines-05.txt

However, we can set stricter rules for ourselves if we think that that might protect the user from phishers. For example, we might have a "one FQDN, one language" rule.

> Also, there aren't localizations for all languages out there.

True. Those users are out of luck. But we can use other sources for the language(s), such as ccTLDs, as described above. We could also look at the set of languages preferred by the user. In Firefox, you can find these under General > Languages in the preferences/options.
See the list of attachments for some code I've written to enforce "one language, one label" -- this maps characters to code-point ranges, and sets of code-point ranges to writing systems, and thence to languages. This would at a stroke catch all of the Cyrillic/Latin alphabet exploits that are the subject of the recent announcements, and a lot of potential future nastiness between other pairs of script systems.

However, this does not quite slam the door shut, as there is still room for exploits within a single writing system; notably, this is worst within the Latin writing system with its many local variants. However, we may be able to do something there based on the user locale.

As a first question: can anyone think of any reason, ever, for allowing multiple writing systems in a single label, other than in the special exceptions given for Chinese, Japanese and Korean? Can anyone point to an existing legitimate domain name which breaks this rule?
IMO the proper fix for the ssl case (https://paypal.com) is to remove the UserTrust network certificate from the store. Obviously they are not doing their job and therefore they shouldn't be trusted.
(In reply to comment #81) > this maps characters to code-point ranges, and sets of code-point > ranges to writing systems, and thence to languages. Instead of using code-point ranges and writing systems, it might be better to use a set of characters for each language, as is done in the IANA registry. This would be more in the spirit of IANA.
Whilst this is a very complex issue, we seem to be moving towards a rough consensus about what needs to be done... How about this multi-layered set of fixes:

* Firstly, we make sure users can easily turn off IDNA entirely.

* Secondly, we ENFORCE "one language's writing-system-set, one label", and blacklist all symbols, dead-language scripts, and other exotica, roughly in the way I've coded in my example programs. This kills all cross-script exploits stone dead. This can be done at the name-normalization level, or even better at the DNS-lookup level, so that looking up these bogus names simply gives an error: we can give a distinct error code for this, if needed. Note that we can't enforce "one language, one FQDN"; consider "www.<something in Thai>.com" as an example.

* Thirdly, we display characters in domain names which are outside of the user's acceptable set, as defined in (for Firefox) Options > General > Languages, in some distinctive way: for example, by adding a question mark, so that, for an English user, "http://www.mìcrosoft.com/" would be displayed as "http://www.mì?crosoft.com/". Whilst this is 'soft security', not many people will mistake "mì?crosoft" for "microsoft". Note that by doing a simple text substitution on name display, this can quite easily work in every part of the GUI, without code being needed to do exotic font or style changes. (A sketch of this substitution is below.) Note that this involves more paranoid character set selection than simply referencing code pages; however, the Alvestrand Internet-Draft cited above seems to be a good reference for Latin languages, and these are the ones currently with the greatest risk exposure.

We could also use a page-top banner like the one used for popup blocking in Firefox, or spam blocking in Thunderbird, to warn the user: "This page is from a web address which contains unfamiliar symbols outside your preferred language(s): you might want to check if it is genuine" ... can anyone think of better wording for the banner?

And that's about as far as we can go, for now. Unfortunately, the "spotting the correct version by lookup" technique is dead in the water for the time being, due to existing allocations by registrars, and major technical problems in the concept itself. There are also all sorts of horrible semantic problems lurking with Chinese names; the Chinese Unicode and IDN community are well aware of this, and trying to find a solution through using IDN bundles. However, I think this is out of the scope of the immediate problem, which needs an immediate fix.
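A sketch of the third layer's text substitution, assuming the user's acceptable character set has already been computed from the language preferences (printable ASCII stands in here for an English user):

# Sketch of the display-substitution idea: flag each character outside
# the user's acceptable set by appending '?' after it. Display only;
# the underlying URL is untouched.
ACCEPTABLE = {chr(cp) for cp in range(0x20, 0x7F)}  # stand-in: English user

def display_form(hostname):
    return "".join(ch if ch in ACCEPTABLE else ch + "?" for ch in hostname)

print(display_form("www.m\u00ECcrosoft.com"))  # -> 'www.mì?crosoft.com'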
(In reply to comment #76) > Also, I don't think ccTLDs are safe. http://www.amazon.de/ for example, the > German local variant of Amazon, could probably be spoofed. .de domains only allow a very specific set of characters. (bottom of http://www.denic.de/de/richtlinien.html)
(In reply to comment #85) > .de domains only allow a very specific set of characters. (bottom of > http://www.denic.de/de/richtlinien.html) Have they tried to register this set with IANA?
(In reply to comment #84)
> make sure users can easily turn off IDNA entirely

And the default should probably be IDNA turned on.

> ENFORCE "one language's writing-system-set, one label", and
> blacklist all symbols, dead-language scripts, and other exotica

Why bother with blacklists when we can use whitelists a la IANA?

> This can be done at the name-normalization level, or even better at
> the DNS-lookup level

I'm not sure about doing it at the DNS level. If a user clicks on a link with suspicious characters in the domain name, we should probably give some kind of warning. But what about <img src="http://www.foo.com/image.gif">?

> Note that we can't
> enforce "one language, one FQDN", consider "www.<something in Thai>.com" as an
> example.

Yes, we can. Take a look at the Thai IDN registration at IANA. (I think they may have made a mistake by including Latin capital letters, though.)
> Why bother with blacklists when we can use whitelists a la IANA?

Please read my previous postings here, and the numerous links to background papers provided. This is a _hard_ problem, and it's clear the IDNA/registrar community has not been thinking about this hard enough, or we would not have got into this state in the first place.

If we want to do "hard" blocking using whitelists, there are a vast number of whitelists to draw up, and that will take lots and lots of time, and would postpone a fix almost indefinitely until a whitelist had been drawn up for every conceivable language. The IANA whitelists only cover a tiny number of languages, and even then are registrar-dependent (two different registrars could select different character sets for the same language, for example). In addition, the IANA whitelists are dependent on the language used _for the given label_, which is not 1:1 obtainable from the name of the TLD: for example, see the .info registration for the de: language. Neither is it available from the DNS.

In any case, it won't work. Consider the dotless-i and i-acute examples I cited before. Both are entirely valid strings in European languages that would pass a whitelist.

In the long run, both blacklists and carefully-selected whitelists would be a good idea; but for the moment, blacklists and "soft" whitelists as proposed above give the greatest security gain for the least investment in time and effort.

I think that some of the examples given above show that this is not a problem which is amenable to a perfect fix, given that human beings have chosen to use languages which contain easily-confusable characters in the same writing system. Nor is it possible to anticipate all possible attacks; during my research over just a couple of days, I've found a number of possible new attacks (all of which are dealt with in the latest version of my proposal, by the way). If I can do that in a couple of days, I'm sure there are many more left.

However, we can make life many, many orders of magnitude more difficult for spoofers by doing some relatively simple things, including defeating all of the existing known attacks. Sometimes "good" is better than "best", if "best" means waiting a long time first with the security hole still open. Remember that after we've rolled out a "good" solution in the very near future, we can always work on making it tighter, and aiming towards perfection in the long run.
(In reply to comment #88)
> > Why bother with blacklists when we can use whitelists a la IANA?
>
> Please read my previous postings here, and the numerous links to background
> papers provided.

Um, let's take an example. The spoofed paypal.com one. The gTLD doesn't give us any languages. Suppose the user uses US English, and hasn't changed Firefox's General > Languages setting. So the only language we can check against is en-US. It only contains Latin small letters, digits and a few others. So with my proposal, the status bar would show payp!a!l and if the user clicked it anyway, your top banner would appear with a warning and my '!' characters would appear in the location bar. Do you see any blacklists in this picture? (I may have missed references to blacklists in the RFCs and official guidelines. If so, please let me know the specific location, chapter and verse.)

> it's clear the IDNA/registrar
> community has not been thinking about this hard enough, or we would not have got
> into this state in the first place.

The IDNA community *has* thought about it. Look at all their RFCs, guidelines and IANA registrations. The problem is the registrars. They don't seem to care. The Secunia test case shouldn't even have been possible to register. And the other problem is our browser, of course. IDN was checked into the tree without whitelist checking.

> If we want to do "hard" blocking using whitelists, there are a vast number of
> whitelists to draw up, and that will take lots and lots of time, and would
> postpone a fix almost indefinitely until a whitelist had been drawn up for every
> conceivable language.

Nope. We could put the whitelists on the mozilla.org site *too*, and the fixed version of Firefox could download the most up-to-date versions. The initial whitelists would come with the product.

> The IANA whitelists only cover a tiny number of languages, and even then are
> registrar-dependent (two different registrars could select different character
> sets for the same language, for example). In addition, the IANA whitelists are
> dependent on the language used _for the given label_, which is not 1:1
> obtainable from the name of the TLD: for example, see the .info registration for
> the de: language. Neither is it available from the DNS.

My wording may have been too vague, but I didn't mean to say that we would use the IANA registrations themselves. We will have to come up with our own. I used words like "a la IANA" and "spirit".

> In any case, it won't work. Consider the dotless-i and i-acute examples I cited
> before. Both are entirely valid strings in European languages that would pass a
> whitelist.

Dotless-i and i-acute do not occur in *all* European languages.

> Sometimes "good" is better than "best", if "best" means waiting a long time
> first with the security hole still open. Remember that after we've rolled out a
> "good" solution in the very near future, we can always work on making it
> tighter, and aiming towards perfection in the long run.

We are in violent agreement here. :-)
I think we are settling down to a relatively small set of practical methods to prevent spoofing; perhaps we could start to write some code soon?

Here is another justification for blacklists. Just to give you nightmares, here is another scenario. As you know, NAMEPREP will map quite a lot of things to the lowercase ASCII letters. Now suppose that someone has coded up a spoofed address in just these characters, Punycoded it, and slipped it past a dumb registrar, perhaps using an automated domain transfer. Now, we have two different DNS names, foo.com and xn-<something>.com, both of which will map to the ASCII string foo.com after being un-Punycoded and (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner of xn-<something> can now pass out links to their site, and "foo.com" will be displayed in the browser bar in 100% correct ASCII. Disaster! So, what we need to do here is to apply some of the blacklist logic prior to using NAMEPREP, as well as applying blacklist/whitelist logic after. Belt and braces.

So, the methods seem to be boiling down to:

* Character range blacklists, both before and after NAMEPREP, for totally unreasonable characters, like Linear B, surrogates, control-image graphics and so on and so forth.

* Enforce the prevention of script-family mixing in labels, except as permitted in the CJK languages

* Make the data tables for these script-family lists auto-updatable, so we can fix it if we get it wrong, or if the Unicode standard changes

* Per-language strict whitelists for user-specified "accept" languages, to be worked out on a language-by-language basis and auto-updatable from the Mozilla site, which are used to add warning characters to displayed text in the GUI.

If this overall approach is OK by people, I can start generating data tables and Python proof-of-concept code ASAP. Or -- if this is not a good idea -- please explain why, and propose something else that's better!
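In the meantime, a hedged sketch of the first bullet; the ranges below are illustrative examples of "totally unreasonable" Unicode blocks, not a vetted blacklist:

    BLACKLIST_RANGES = [
        (0x2400, 0x243F, "Control Pictures"),      # control-image graphics
        (0x2600, 0x27BF, "Symbols and Dingbats"),
        (0x10000, 0x100FF, "Linear B"),
    ]

    def has_blacklisted(label):
        # Run this both before and after NAMEPREP, as argued above.
        return any(start <= ord(ch) <= end
                   for ch in label
                   for start, end, _name in BLACKLIST_RANGES)

Note that the check rejects the whole label rather than stripping characters, which avoids the per-character-removal trap described in the next comment.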
As per an earlier comment requesting chapter and verse on blacklists: Section 5 of the NAMEPREP RFC, RFC 3491, says:

>5. Prohibited Output
>
> This profile specifies prohibiting using the following tables from
> [STRINGPREP]:
>
> Table C.1.2
> Table C.2.2
> Table C.3
> Table C.4
> Table C.5
> Table C.6
> Table C.7
> Table C.8
> Table C.9

and these tables are defined in RFC 3454.

Also, this ICANN document: Internationalized Domain Names (IDN) Committee Input to the IETF on Permissible Code Point Problems http://www.icann.org/committees/idn/idn-codepoint-input.htm, which recommends a whitelist-based scheme, but then goes on to specify a _blacklist_ of what should not be included in any of the whitelists, as follows:

> ...at least the following sets of characters not be included, pending further
> analysis:
>
> * line and symbol-drawing characters,
> * symbols and icons that are neither alphabetic nor ideographic language
> characters, such as typographical and pictographic dingbats,
> * punctuation characters, and
> * spacing characters.

Also, I seem to remember some RFC language somewhere saying that application writers can apply their own extra constraints to IDNA interpretation... I'm still looking for that.

By the way, note that one way of interpreting the RFC is that these forbidden outputs are to be removed on a per-character basis: that would be a big mistake, as it would allow the domain www.micro<forbiddencharacter>soft.com to be registered, and then NAMEPREP will remove the character to generate a spoofed name...
(In reply to comment #86)
> (In reply to comment #85)
> > .de domains only allow a very specific set of characters. (bottom of
> > http://www.denic.de/de/richtlinien.html)
>
> Have they tried to register this set with IANA?

Interestingly, no they haven't registered it. Unlike the rather small list registered for de: by .info, this one contains a vast number of characters, including the eminently spoof-worthy accented and dotless 'i' variants, and characters like LATIN SMALL LIGATURE OE, LATIN SMALL LETTER T WITH CEDILLA, and LATIN SMALL LETTER KRA. Again, this raises the issue of how we should compile whitelists: just because this is the official .de registrar list for .de, does not make it a good list for spoof detection. On the other hand, a "soft" whitelist which does not include KRA, for example, will clearly flag that character as unusual in a domain name, but not prevent users going to a page containing it.
(In reply to comment #90)
> Just to give you nightmares, here is another scenario. As you know, NAMEPREP
> will map quite a lot of things to the lowercase ASCII letters. Now suppose that
> someone has coded up a spoofed address in just these characters, Punycoded it,
> and slipped it past a dumb registrar, perhaps using an automated domain
> transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> both of which will map to the ASCII string foo.com after being un-Punycoded and
> (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> of xn-<something> can now pass out links to their site, and "foo.com" will be
> displayed in the browser bar in 100% correct ASCII. Disaster!

I just tried moving the mouse over a link to www.xn--amazn-mye.com and Firefox showed the same string in the status bar. It did not un-Punycode it.

Am I misunderstanding your example?
*** Bug 281831 has been marked as a duplicate of this bug. ***
(In reply to comment #90)
> transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> both of which will map to the ASCII string foo.com after being un-Punycoded and
> (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> of xn-<something> can now pass out links to their site, and "foo.com" will be
> displayed in the browser bar in 100% correct ASCII. Disaster!

What we probably should do in that case is actually connect to foo.com. That doesn't seem to be the case now, but we also don't currently display foo.com in the URL bar.
(In reply to comment #93)
> (In reply to comment #90)
> > Just to give you nightmares, here is another scenario. As you know, NAMEPREP
> > will map quite a lot of things to the lowercase ASCII letters. Now suppose that
> > someone has coded up a spoofed address in just these characters, Punycoded it,
> > and slipped it past a dumb registrar, perhaps using an automated domain
> > transfer. Now, we have two different DNS names, foo.com and xn-<something>.com,
> > both of which will map to the ASCII string foo.com after being un-Punycoded and
> > (for reasons of caution) re-normalized with NAMEPREP. Unfortunately, the owner
> > of xn-<something> can now pass out links to their site, and "foo.com" will be
> > displayed in the browser bar in 100% correct ASCII. Disaster!
>
> I just tried moving the mouse over a link to www.xn--amazn-mye.com and Firefox
> showed the same string in the status bar. It did not un-Punycode it.
>
> Am I misunderstanding your example?

Yes you are. You are using U+043E, CYRILLIC SMALL LETTER O, which NAMEPREP normalizes to itself, not to LATIN SMALL LETTER O. By the way, I note that someone has already registered that URL. I'll see if I can manufacture an example.
(In reply to comment #95)
> What we probably should do in that case is actually connect to foo.com. That
> doesn't seem to be the case now, but we also don't currently display foo.com in
> the URL bar.

*If* we decide to decode a Punycoded domain name in a link clicked by the user, then we should also check whether it is converted back to the original when we run it through nameprep and punycode. If not, we should either warn the user or pass the original to DNS or both, since it is malformed.
(In reply to comment #97)
> (In reply to comment #95)
> > What we probably should do in that case is actually connect to foo.com. That
> > doesn't seem to be the case now, but we also don't currently display foo.com in
> > the URL bar.
>
> *If* we decide to decode a Punycoded domain name in a link clicked by the
> user, then we should also check whether it is converted back to the original
> when we run it through nameprep and punycode. If not, we should either warn
> the user or pass the original to DNS or both, since it is malformed.

Yes! That would work. Similarly if the user were to type in a Punycoded domain name in the browser bar.
Also, if the Punycoded name does not convert back to itself, then the original (malformed) Punycode should be displayed in the status bar when the user mouses over it.
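For concreteness, a minimal sketch of this round-trip check, using Python's built-in IDNA codec (which implements nameprep + punycode) purely as executable pseudocode, not as the proposed Mozilla implementation:

    def survives_roundtrip(ace_host):
        # ToUnicode the name, then ToASCII it again; a well-formed Punycoded
        # name must come back unchanged (compare RFC 3490 ToUnicode steps 6-7).
        try:
            unicode_form = ace_host.encode("ascii").decode("idna")
            return unicode_form.encode("idna").decode("ascii") == ace_host.lower()
        except UnicodeError:
            return False  # malformed: warn, and keep showing the raw Punycode

Names failing this check would be displayed, and looked up, in their original (malformed) Punycode form, as proposed above.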
Coming back to the black/white list issue, it seems that our disagreement was due to some confusion over nameprepping. Mozilla already runs the string through nameprep, I believe (mozilla/netwerk/dns/src). I was talking about what to do *after* nameprepping. I believe we only need whitelisting after nameprepping.
"Must have feature: Disable/enable IDN in all mozilla products." Disabling IDN should not just be a feature, but, since IDN is not in wide use, should simply be disabled as a default in all Mozilla products and released ASAP, possibly with a hidden pref to turn it on.
Tying up loose ends: What to do about <img src=...>. Should we just silently load the image even if the domain name contains characters outside the user's or ccTLD's language(s)? Of course, if it's <img src="https://..."> then we need to check the cert.
(In reply to comment #101)
> "Must have feature: Disable/enable IDN in all mozilla products."
>
> Disabling IDN should not just be a feature, but, since IDN is not in wide use,
> should simply be disabled as a default in all Mozilla products and released
> ASAP, possibly with a hidden pref to turn it on.

I agree. Disabling IDN by default is the right thing to do _until_ there is a verified working fix for the bulk of spoofing scenarios. It's quick, simple, and will sort the PR problem in a snap, whilst allowing users who want IDN to re-enable it by using a preference. Given the range of problems here, it looks like we could be talking for at least a couple of weeks before we reach consensus on a detailed set of fixes for all of the more general problems, implement them, make test cases, and validate them. We can't wait that long.

Perhaps this bug should be divided into a set of smaller bugs:

* disable IDN by default, and provide GUI for a user pref to turn it on, with an appropriate warning.

* roll this out to existing Mozilla Suite and Firefox users by automatic update.

This deals with the immediate pressing problem of poor security and hence bad PR.

* deal with cross-script IDN spoofing, blacklist "must not happen" characters in IDNs, more paranoid Unicode string syntax checking (eg. no leading combining marks)

* use locale-based whitelists for displaying IDNs to prevent same-script IDN visual spoofing

* deal properly with literal Punycode in domain names (ie. test for NAMEPREP/Punycode round-tripping, then treat as if entered as Unicode)

* put all this in Gecko 1.8 / Firefox 1.1

This generates good PR by being the first with a proper fix for spoofing. Now we have four quite discrete sub-tasks, and we can start solving them one-by-one, without any inter-problem interactions, instead of all together at once, which creates an apparently much larger problem.
(In reply to comment #102)
> Tying up loose ends: What to do about <img src=...>. Should we just silently
> load the image even if the domain name contains characters outside the user's
> or ccTLD's language(s)?
>
> Of course, if it's <img src="https://..."> then we need to check the cert.

I think so. After all, we are implicitly trusting the page content generator at this point, who is at liberty to point their image sources wherever they like. On the other hand, when we go "View Image", or otherwise inspect the image URL, it should go through the same URL display code that _will_ flag non-locale characters, just as if they were in a page link or a user-entered link.

By the way, I have now generated language whitelists for all the languages in the Internet-Draft, namely Afrikaans, Albanian, Basque, Breton, Bulgarian, Byelorussian, Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Faeroese, Finnish, French, Frisian, Gaelic, Galician, German, Greenlandic, Hungarian, Icelandic, Irish, Italian, Latin, Latvian, Lithuanian, Macedonian, Maltese, Norwegian, Polish, Portuguese, Rhaeto-Romance, Romanian, Russian, Sami, Serbian, Slovak, Slovenian, Sorbian, Spanish, Swedish, Turkish, Ukrainian, and Welsh.
*** Bug 281863 has been marked as a duplicate of this bug. ***
(In reply to comment #103)
> * deal with cross-script IDN spoofing, blacklist "must not happen" characters in
> IDNs, more paranoid Unicode string syntax checking (eg. no leading combining marks)

If you're talking about adding more steps to the DNS label conversion, I don't think you can do this and still claim conformance to the nameprep and punycode RFCs. This is an interoperability issue: The server sends us an HTML document, the user clicks on a link, we convert it from HTML to Unicode and then pass it through nameprep and punycode before sending it to the DNS server. The HTML document author assumes that we will adhere to the HTML, Unicode, Nameprep and Punycode specs. If we don't, we have an interop problem.

> * use locale-based whitelists for displaying IDNs to prevent same-script IDN
> visual spoofing

No, my language-based whitelist proposal is intended to alert the user to potential spoofing across scripts as well as within a single script. For example, the Cyrillic small letter 'a' is outside the en-US language *and* outside the Latin script. The i-acute is also outside en-US but inside Latin.
This is a series of character repertoires for various languages. Source: from expired Internet-Draft "Characters and character sets for various languages", by Harald Tveit Alvestrand, draft-alvestrand-lang-char-03.txt The characters given here for a language include the base set, the Required characters, and the Important characters. See the Internet-Draft for the definitions of these terms. One correction has been made: the entry for German contained a control character. This has been removed. This is a draft document generated from another draft; there is absolutely no guarantee of correctness or fitness for any use; this information is provided for research and entertainment purposes only.
(In reply to comment #106)
> (In reply to comment #103)
> > * deal with cross-script IDN spoofing, blacklist "must not happen" characters in
> > IDNs, more paranoid Unicode string syntax checking (eg. no leading combining marks)
>
> If you're talking about adding more steps to the DNS label conversion, I
> don't think you can do this and still claim conformance to the nameprep and
> punycode RFCs.
>
> This is an interoperability issue: The server sends us an HTML document, the
> user clicks on a link, we convert it from HTML to Unicode and then pass it
> through nameprep and punycode before sending it to the DNS server. The HTML
> document author assumes that we will adhere to the HTML, Unicode, Nameprep
> and Punycode specs. If we don't, we have an interop problem.

One person's interoperability problem is another person's security precaution. We are not the document author's agent; we are the _user_'s agent. Many document authors would like us to see their pop-up ads. Our users generally don't. We explicitly choose not to interoperate with standards-based ECMAScript behaviour by default in this case. Similarly, we should refuse by default to interoperate with URLs which contain domain names with no plausible origin in any human language, or syntactic brokenness in the structure of their Unicode character stream, in the spirit of IANA's recommendations, and in spite of some registrars' willingness to register essentially any pattern of bits registrants are willing to pay for.

Remember, IE has 100% non-interoperability with IDN, and it looks like we will probably have to go back to that too in the short run. What I'm proposing will enable all conceivable reasonable IDN labels to work, and auto-reject the three-dollar labels, _without_ the user needing to keep glancing back at the URL bar every click to see if they are about to fall down a hole.

Can you suggest a plausible reason for mixing writing systems in the same DNS label (ie not in the same _name_, just a single dot-delimited segment), other than in the "safe" combinations such as hiragana-katakana-kanji-latin? Or indeed to use dingbats, character graphics, musical notes, or cuneiform characters in a domain name?

Remember, if this turns out to be a long-term problem, we can always re-allow this behaviour at some time in the future: indeed, this is just the sort of behaviour to hide under a hidden pref or two with names like:

* dns.unicode.blacklist-bad-codepoints
* dns.unicode.prevent-script-mixing
Indeed, just to amplify my previous comments a bit more, this is _not_ a 100% solvable problem, given the current state of IDNA, and the apparent lack of coordination in the standards and domain registration communities to do anything about it. However, codepoint blacklisting and preventing script-mixing probably catch > 90% of all possible problems with zero cost in user attention, and intelligent locale-based display of URLs will catch perhaps a bit more than another 90%, to get a > 99% coverage of possible spoofing. That's about the best we can do at the moment, without the cooperation of registrars, or the creation of new protocols.

Notice that even with both proposals in effect, spelling a domain name with an i-acute instead of an 'i' will get past the browsers of anyone who has their browser set to read any language containing that character (for example, any of Danish, Faeroese, Icelandic, Greenlandic, Irish, Welsh, Dutch, Catalan, Spanish, Galician, Portuguese, Italian, Hungarian, Slovak, or Czech). Perhaps an even-longer-term solution is a special font for the URL bar with exaggerated accents and clearly different letter-forms for near-homographs within the same script family?
Added Thai, Arabic, Hebrew and Greek to the above document, based on IANA registry data for these languages given by the .pl registrar.
Attachment #173999 - Attachment is obsolete: true
It seems to me that the real problem is not IDN, but that it is not obvious when you are going to a new site. To fix that problem, I would recommend that next to the lock icon, a history icon should be placed. If the site is not in the history of the browser, then a 'new' icon could be displayed. If the site is in the history, then a 'history' icon could be displayed. And, if the user has manually approved the site, then a 'trusted' icon could be displayed. Another possibility is to display an icon in the location bar, and allow optional background and foreground colors on the location bar. This way, it is obvious that you are going to a new site, and if you thought that you were going to an old site, you should think twice.
As before, but now with lowercase letters only, as NAMEPREP will ensure we don't have to worry about capital letters.
Attachment #174006 - Attachment is obsolete: true
(In reply to comment #108)
> (1) dns.unicode.blacklist-bad-codepoints
> (2) dns.unicode.prevent-script-mixing

Let me add another to this list:

(3) dns.unicode.whitelist-good-lang-chars

I guess I thought that (3) would be a more fine-grained solution than (1) and (2), and would make the first two unnecessary. But perhaps people want to implement (1) and (2) to reject those domain names, while (3) allows them but displays them differently to alert the user.

The link click actions could be enumerated along a different axis:

(a) Silently refuse to perform the DNS lookup (and subsequent connection)
(b) Refuse DNS and connection, but warn the user
(c) Refuse DNS and connection, but warn user and allow user to connect
(d) Do DNS and connection, but indicate suspicious domain in location bar
(e) Do DNS and connection and don't indicate suspicious domain anywhere

So what is your proposal? Is it the following: (1), (2) -> (b) and (3) -> (d)

Or maybe you had other actions in mind?
After some more analysis, it appears that attacks between Latin scripts, and attacks from Cyrillic to Latin or vice-versa, are the main threats, with other threats having lesser numbers of homographs for attackers to play with, the risk exposure rising exponentially with the number of homographs available.

Just to get some order-of-magnitude insight, I looked at the number of English and Russian words which consist _entirely_ of letters which are homograms between the Latin script and the subset of Cyrillic script used in my Russian dictionary. These are "acxeoyp", and their Cyrillic counterparts. In each case, I get roughly the same figure, of around 0.00075 (0.075%) of the vocabulary being made up of such words; that's 24 out of 31801 words for Russian, and 35 out of 47158 words for English. Unfortunately, when you consider Cyrillic languages other than Russian, there are more spoofable characters, notably "ijs", which expands the English spoofable range by more than a factor of 4, to 183 out of 47158 words, 0.004 (0.4%) of the vocabulary.

By the way, compare this with the situation where we allow script mixing, where _every_ word containing _any_ of the homographs is a threat: that is to say, 99.7% of all words. Clearly allowing script mixing is a bad thing.

So, given that Netcraft assumes that there are approximately 60,000,000 domain names registered, and assuming that the word statistics are similar to those of my English dictionary, a 0.075% share would represent roughly 45,000 spoofable domains in the ASCII namespace, if we allow Cyrillic labels to spoof them. Similarly, a 0.4% share would represent 240,000 spoofable labels. On the other hand, the total number of Cyrillic labels to date is presumably rather small, and a 0.075% share of that rather smaller.

This leads me to make the following proposal (please bear with me, this is only the first hack at the logic):

We should consider special processing for Cyrillic labels in names if they consist _entirely_ of homographs for Latin letters, if they are in a "non-Cyrillic context". A "non-Cyrillic context" is a name with no other _unambiguous_ Cyrillic labels, which is not in a TLD where Cyrillic characters might be expected to have priority, such as a TLD for a country where Cyrillic script is the norm. There are two reasonable courses of action:

* either reject them as a probable spoofing attempt, or
* rewrite them to the equivalent Latin alphabet label, so that "what you see is what you get"

Similarly, when we _are_ in a "Cyrillic context", we should consider similar treatment for Latin labels consisting _entirely_ of homographs for Cyrillic characters. This can easily be extended to other languages, such as Greek and Coptic. It has the following advantages:

* it accepts the "grandfathering in" of the existing Latin namespace
* it does not prevent the use of _unambiguous_ labels from any writing system in any TLD
* it does not disadvantage any user of a ccTLD for a country with a different native script, instead giving that script priority locally, as users might expect
* it allows the use of any script in any TLD, when "full IDN" is available: the use of a TLD in a given script will automatically signal which script is to be given priority in name interpretation
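For anyone who wants to reproduce the word counts, here is the gist of the experiment as a Python sketch; the wordlist filename is a placeholder for whatever dictionary file you have to hand:

    HOMOGRAPH_LETTERS = set("acxeoyp")  # Latin letters with exact Cyrillic twins

    def spoofable_fraction(words):
        # Fraction of words spelled entirely from cross-script homographs.
        hits = [w for w in words if w and set(w) <= HOMOGRAPH_LETTERS]
        return len(hits) / len(words)

    with open("words.txt") as f:  # assumed: one lowercase word per line
        words = [line.strip() for line in f if line.strip()]
    print(spoofable_fraction(words))  # ~0.00074 for the English dictionary above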
Can we fix this issue by changing the UI? For example, display the raw DNS URL, but offer a tooltip displaying the IDN on mouse-over? I do not like the idea of whitelists / blacklists. (To solve the problem without them, we can disable the feature, which is not nice, but it solves the problem.) I do want to show people the IDN. But that does not mean we need to display it exactly where we display the URL. Can we display it by a different means? I once thought about putting the raw DNS at the end of the URL, but that won't work, since a long URL can push it out of the box. Using another color to display the IDN would be meaningful for people who know what it means, but it won't be a good solution for the general public. Adding another text field below the URL on the fly to display the IDN would work, but the UI would be strange. How about an IDN icon next to the URL bar: when people click it, it shows the IDN to the user, but we always display the raw DNS in the URL bar? People can still type the IDN into the URL bar, but then it is converted to the raw DNS and displayed there. If people click the IDN icon, a tooltip below the URL bar shows the IDN.
> How about an IDN icon next to the URL bar: when people click it, it shows the
> IDN to the user, but we always display the raw DNS in the URL bar?

or an option in the context menu...

The other thing we may want to do is, instead of using the blacklist to block the use of IDN, use it to show a warning dialog box when the user first enters that domain. This can be done by an observer which observes changes to the URL box: whenever the URL changes, scan the URL bar; if it is a new domain, different from the previous one, and any character in the URL falls into the blacklist, show a warning dialog to the user.
(In reply to comment #114) > * it does not prevent the use of _unambiguous_ labels from any writing system in > any TLD > * it allows the use of any script in any TLD, when "full IDN" is available: the > use of a TLD in a given script will automatically signal which script is to be > given priority in name interpretation This would not be in the spirit of the ICANN guidelines: http://www.icann.org/general/idn-guidelines-20jun03.htm I will now try a different stance. Bear with me: Since IDN is not widely used yet, this might be the time to decide that we simply will not accept non-ASCII characters in the US TLDs (i.e. .com, .org, etc). I mean, what business do these Cyrillic registrants have in .com, anyway? Why can't they just stay where they belong, in *.ru and the like? If we can get Microsoft to agree with this stance, and they decide to reject non-ASCII characters in the US TLDs too (when they get around to supporting IDN), then the world's dominant browsers (i.e. IE and Firefox) will effectively be enforcing the *correct* IDN rules. The browsers are the last line of defense. We are the user's agent. We will protect them from lazy/negligent registrars. -------------- On the other hand, if MSIE and Firefox *do* allow non-ASCII characters in US TLDs, then the flood-gates are open. We are making ourselves vulnerable. We are asking for trouble. Kinda like ActiveX in IE and *.exe in Outlook.
.com, .org, .net.... are *not* US TLDs. Cyrillic registrants can have them if they want, if their site is commercial, an organisation, linked with network activity....
(In reply to comment #118) > .com, .org, .net.... are *not* US TLDs > Cyrillic registrants can have them if they want, if there site is commercial, an > organisation, linked with network activity.... Couldn't agree more, the gTLDs are absolutely not US-specific in any way. And even if they were, don't forget the millions of US citizens who speak languages other than English and have a legitimate need for IDNs.
(In reply to comment #119) > Couldn't agree more, the gTLDs are absolutely not US-specific in any way. > And even if they were, don't forget the millions of US citizens who speak > languages other than English and have a legitimate need for IDNs. Continuing to try this other hat on: What characters are permitted in a person's name on a US driver's license? What characters are permitted in the name of a US corporation? Are there legitimate reasons to keep those rules in place? Do the non-English speakers in the US transcribe their personal/company names into English letters in some contexts? Should the ASCII-only rule only apply to *.us, leaving *.com open to the world (and the spoofers)? Should we go for a complex system where domain labels consisting entirely of homographs are caught? What happened to KISS (Keep It Simple, ...)?
(In reply to comment #109) > Perhaps an > even-longer term solution is a special font for the URL bar with exaggerated > accents and clearly different letter-forms for near-homographs within the same > script family? I think we need to do something like this. Even in English, there are characters that look very similar in some fonts (e.g. letter l and digit 1 but not capital letter I in domain names, since it's capital). One possible concern might be the expanded width of the URI in the location bar. This might be too complicated, but maybe we can use a different font for the domain name part of the URI, and keep the compact sans-serif in the rest.
Hardware: PC → All
(In reply to comment #120)
> (In reply to comment #119)
> > Couldn't agree more, the gTLDs are absolutely not US-specific in any way.
> > And even if they were, don't forget the millions of US citizens who speak
> > languages other than English and have a legitimate need for IDNs.
>
> Continuing to try this other hat on:
>
> What characters are permitted in a person's name on a US driver's license?
> What characters are permitted in the name of a US corporation? Are there
> legitimate reasons to keep those rules in place? Do the non-English speakers
> in the US transcribe their personal/company names into English letters in
> some contexts?
>
> Should the ASCII-only rule only apply to *.us, leaving *.com open to the
> world (and the spoofers)?
>
> Should we go for a complex system where domain labels consisting entirely
> of homographs are caught?
>
> What happened to KISS (Keep It Simple, ...)?

The answer to that question is: yes, as simple as possible, but _no simpler_. Unfortunately, this is not a simple problem, unless you can either detect or visually eliminate homographs.

Earlier comments referenced the ICANN rules. Registrars and registries are simply ignoring the ICANN rules; that's one of the principal reasons why we are in this mess. Unfortunately, we have no power to force them to do so. That's why I think my proposal is actually more "in the spirit" of the ICANN recommendation than the current status quo, where we have rules, but nobody follows them. Note that in the presence of an ICANN-compliant setup, my suggested strategy is completely invisible, and equivalent to the identity function.

And yes, I'm working on just such a "complex system". It's table driven, and will probably end up as one page of code, plus language tables many of which are already known. I've now tabulated the current "accidental spoofing" rates between the major Latin-script languages, and between Cyrillic and these languages, and I will continue to work on this proposal to take this new data into account.
(In reply to comment #122) > Unfortunately, this is not a simple problem Well, that depends on the chosen policy. If Mozilla decides to disable IDN by default for now, that is very simple. Another simple solution is to stick to ASCII characters in *.com (using a white list, naturally :-) > And yes, I'm working on just such as "complex system". I think it's great that you're doing all this work! Have you heard from any of the module owners at mozilla.org regarding the type of patch that they would like to consider? Darin, any thoughts?
(In reply to comment #123) > (In reply to comment #122) > > Unfortunately, this is not a simple problem > > Well, that depends on the chosen policy. If Mozilla decides to disable IDN > by default for now, that is very simple. Another simple solution is to stick > to ASCII characters in *.com (using a white list, naturally :-) > > > And yes, I'm working on just such as "complex system". > > I think it's great that you're doing all this work! Have you heard from any > of the module owners at mozilla.org regarding the type of patch that they > would like to consider? Darin, any thoughts? I agree, pushing out a fix that will turn off IDN by default is the best short-term option, unless a properly tested and audited fix can be deployed first. Then we can set IDN to be on by default in the next release (Firefox 1.1?). Please note that I have very little desire to code C++ at the moment -- however, I can provide "executable pseudocode" in the form of Python, as that is ideal as a rapid development and testing platform, and the test vectors for the Python testbed can be used for any eventual patch. One of the advantages of a table-driven approach is that it minimizes code size, and allows just this sort of development route.
FWIW, not necessarily an endorsement, but Mozilla does have some related code: mozilla/intl/unicharutil/util/nsCompressedCharMap.cpp mozilla/intl/uconv/public/nsICharRepresentable.h There are some tables of characters used in languages at fontconfig.org: fc-lang/*.orth We have to be careful with tables we get from elsewhere. They may include too many characters. We need to be conservative in our solution to the spoofing problem.
(In reply to comment #117)
> Since IDN is not widely used yet, this might be the time to decide that we
> simply will not accept non-ASCII characters in the US TLDs...
>
> If we can get Microsoft to agree with this stance, and they decide to reject
> non-ASCII characters in the US TLDs too (when they get around to supporting
> IDN), then the world's dominant browsers (i.e. IE and Firefox) will
> effectively be enforcing the *correct* IDN rules.

That sounds very much like trying to use a dominant position to enforce and create a de facto proprietary standard, which is just a bad idea.

(In reply to comment #122)
> Earlier comments referenced the ICANN rules. Registrars and registries are
> simply ignoring the ICANN rules; that's one of the principal reasons why we
> are in this mess. Unfortunately, we have no power to force them to do so.

It's a shame that most registrars don't follow the rules; however, that doesn't apply to all of them. I'm fairly certain, given the relatively strict guidelines enforced by the auDA for the registration of .au domains, which are followed by all accredited .au registrars, that such spoofed variants simply wouldn't be allowed. Thus, I believe, any request for a paypal.com.au variant, for example, would be rejected by the registrar, which, IMHO, is the right level to handle this problem. Though, I reluctantly agree, given the situation with other TLDs, that handling this at the UA level may just be something we have to accept.
Re: de facto standards, I was playing devil's advocate. Just ignore those comments. Tonal's doing some great work here, ensuring that Mozilla leads the way.
This is a table of some homograms for ASCII lowercase characters, with confusion distances. Note that this list is neither definitive nor exhaustive, and only covers the Cyrillic, Greek and Coptic, Latin-1 Supplement, Latin Extended-A and Latin Extended-B blocks, and is only provisional even within these tables. The main purpose of this table is to aid research into homograph spoofing, and to inspire other developers to inspect the Unicode code charts or Unicode rendering in other fonts and to extend this table.

Key to confusion distances:
0 => visually identical
1 => almost identical
2 => easily confusable at small font sizes
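To illustrate the table's shape (not its contents), here is a hypothetical fragment in Python form, mapping each code point to an (ASCII target, distance) pair; the distances below are my own guesses under the key above, not authoritative values:

    HOMOGRAMS = {
        "\u0430": ("a", 0),  # CYRILLIC SMALL LETTER A
        "\u043e": ("o", 0),  # CYRILLIC SMALL LETTER O
        "\u0440": ("p", 0),  # CYRILLIC SMALL LETTER ER
        "\u03bf": ("o", 0),  # GREEK SMALL LETTER OMICRON
        "\u0131": ("i", 1),  # LATIN SMALL LETTER DOTLESS I
        "\u00ed": ("i", 2),  # LATIN SMALL LETTER I WITH ACUTE
    }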
(In reply to comment #90)
> Just to give you nightmares, here is another scenario. As you know,
> NAMEPREP will map quite a lot of things to the lowercase ASCII
> letters. Now suppose that someone has coded up a spoofed address in
> just these characters, Punycoded it, and slipped it past a dumb
> registrar, perhaps using an automated domain transfer. Now, we have
> two different DNS names, foo.com and xn-<something>.com, both of
> which will map to the ASCII string foo.com after being un-Punycoded

This can't happen. See RFC3490, specifically steps 6 and 7 of ToUnicode. The algorithm for decoding a punycode-encoded domain name checks that it encodes to the same punycode as you started with, and leaves it unchanged (ie as punycode) if it doesn't. It would probably be good to verify that Mozilla implements ToUnicode correctly, though.
(In reply to comment #128) > Created an attachment (id=174139) [edit] > Experimental table of some homograms, with confusion distances I suspect GREEK SMALL LETTER GAMMA is confusable with y at small font sizes (in some fonts)
(In reply to comment #18)
> On the other hand, the Unicode .pdf charts _do_ appear to contain a detailed
> cross reference of visually confusable characters, as do the charts in the
> Unicode book.

Under "Cross References" "Explicit Inequality" the Unicode book says: "The two characters are not identical, although the glyphs that depict them are identical or very close." However, they do not seem to include cross references for *all* of the spoofs. For example, Cyrillic small 'a' does not have a cross ref. Maybe they would update the Unicode charts if you send them your info? You say your table is only provisional at this point, so maybe you would want to wait until it's more or less "ready".
Another solution to this problem is to pop up a dialog the first time the user clicks on a link containing a domain name with characters normally found outside the user's language. (We could still check in the homogram table solution too.) The dialog I have in mind would explain the issue and then allow the user to specify that the browser should allow certain other languages, chosen from a list that we can generate based on the characters found in the domain name. The user could even view the entire list of languages and select some from there, or just tell the browser to allow any language. The reason I'm mentioning this white list approach again is because I feel that the homogram approach is essentially a black list approach, and we cannot deploy a black list approach until it is complete, whereas we can start using white lists right away, even before they are complete. Over time, we can expand the white lists with characters that we deem "safe". But that's just me. I have no idea how others feel about this...
(In reply to comment #132)
> ...
>
> The user could even view the entire list of languages and select some
> from there, or just tell the browser to allow any language.
>
> The reason I'm mentioning this white list approach again is because I feel
> that the homogram approach is essentially a black list approach, and we
> cannot deploy a black list approach until it is complete, whereas we can
> start using white lists right away, even before they are complete. Over
> time, we can expand the white lists with characters that we deem "safe".
>
> But that's just me. I have no idea how others feel about this...

Erik, I see the blacklist and whitelist approaches as complementary, not competitive. Neither is 100% guaranteed to work. For example: consider a user who can read both Russian and English, and thus has chosen to accept URLs containing domain names in either script. Unfortunately, this also means that they will not be alerted if a domain name contains a label that is 100% Cyrillic characters, but exactly spoofs a Latin-script name, as this:

* is a 100% conformant IDN which follows the current IANA "one label, one language" policy precisely
* follows the "no-script mixing" principle suggested earlier
* follows the principle of no graphical characters suggested earlier
* is visually indistinguishable _by design_ from the Latin equivalent, and cannot be distinguished even by a bilingual Russian/English reader

Note that the reverse would also apply for a Latin spoof of a Cyrillic-script word. (Consider Latin versions of the Russian words орех, ореха, рас, раса, расе, рос, роса, росе, сер, сера, серо, серое, ссора, ссоре, ссору, сух, сухо, сухое, ура, уха, ухо, уху, хаосе, хор.)

This is where we introduce the idea of "script preference" for top-level domains. Supposedly, registries should filter each label they issue with a character set from a single specified language, and they should register their character sets and languages with IANA. However, in the absence of compliance with these rules, we can help them along a bit in cases where there is ambiguity. _This_ is where homograph tables, and the principle of assigning language/script family precedence to TLDs, are useful; we can close the door on all or nearly all of the possible spoofing options that remain. (Notice that we've already squeezed down the possible cases to a very small portion of the namespace, less than 0.1%, by applying the no-script-mixing and no-graphic-characters rules.) If a URI with a Cyrillic domain name label made up entirely of Latin homographs is in a domain which has "Latin-script precedence", we should treat it as potentially spoofed. Similarly vice-versa.
It is then a matter for browser design policy whether the domain name containing this label should be:

* treated as malformed, and lookups return an error
* generate a warning to the user, and prompt them as to whether they are really sure
* simply provide an on-screen warning banner
* or even attempt to guess the "correct" domain name (DON'T DO THIS LAST ONE! IT'S ONLY AN EXAMPLE!)

Even this algorithm is not perfect. But it can be very good. Note that the homograph tables don't need to be 100% perfect to reduce the last remaining options by many orders of magnitude. If there are (say) 4 distinct characters in a name, all of whose characters are spoofable, forbidding script-mixing requires every one of them to be spoofed. Now, only about 0.1% of the name repertoire will have all-spoofable characters (based on experiments with wordlists -- this is conservative, because many words are short), and a homograph table that only has 90% coverage will reduce the number of spoofing possibilities by a factor of 10**4 = 10,000. So, at the end of this process, we might expect 0.00001% of domains to be spoofable.

According to Netcraft, there are currently roughly 27,000,000 domains active. So, with a 0.00001% failure rate, we can expect roughly 2.7 domains to fall through the cracks and remain spoofable. Increasing the accuracy of the homograph table to 95%, if you believe this analysis, leaves an expected count of effectively zero sites left spoofable.

So, what are the tables we need to implement this?

1. A table of _assigned_ codepoint ranges containing characters that will not be used in any language
2. A mapping of codepoint ranges to script systems
3. A number of special-case lists of script systems for languages which use more than one script system (essentially the CJK languages)
4. A homograph table, giving equivalence classes of visually confusable characters
5. A mapping of ccTLDs to script systems via languages, via existing machine-readable linguistic sources

How much code is needed to implement this? Probably (judging by my Python test programs, and allowing for a less concise language) between 1 and 3 pages of C++. Note that all of the tables involved are likely to have order 100 entries or less, and are easily compiled from existing sources. Note that none of them dictate the character assignments within any language, allowing Unicode to add new codepoints within a language, and if we know the pattern of future Unicode assignments (which are pre-planned) we can be forward-compatible with new updates to Unicode, even without the ability to update the tables. Add the ability to update the tables, and we have a maintainable, forwards- and backwards-compatible system which could effectively end the current spoofing worries regarding IDN, and allow its continued rapid deployment, whilst providing yet another strong incentive to use Mozilla products.

By the way, please don't treat any of this as a rejection of whitelist techniques: no method is perfect, and attackers may be able to find a way through even the best-designed defences given enough time and ingenuity; this proposal involves multiple layers of defence, and I think that adding a whitelist scheme is another good way of aiming for the same objective of preventing spoofing, in particular for users who are non-European and less accent-aware.
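As a sanity check, the back-of-envelope estimate above in executable form (the inputs are the figures and assumptions stated in the paragraph, nothing more):

    domains = 27000000        # active domains (Netcraft figure cited above)
    all_homograph = 0.001     # ~0.1% of names consist entirely of spoofable characters
    per_char_miss = 0.10      # a 90%-coverage homograph table misses 1 character in 10
    chars_per_name = 4        # assumed spoofable characters per name
    print(domains * all_homograph * per_char_miss ** chars_per_name)  # approx. 2.7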
You're absolutely right. If the user reads both English and Russian, we need to watch out for homographs. Thanks for being so patient with me.
There's a lot of discussion going on here :-)

One idea which met with approval on the Mozilla security list was the following:

Most domain registrars have been correctly implementing the guidelines for avoiding IDN-related spoofing problems. AIUI, the .jp registry even delayed issuing IDN names for six months until the guidelines were finished. Unfortunately, there are a few rather large exceptions to this - .com being one.

So, the suggestion is to have a blacklist of those TLDs, and display the IDN in raw punycode form throughout the UI until such time as the registrars get their act together. Later Firefox releases, or automatically-pushed updates, can shrink (or expand) the blacklist.

This has many significant advantages. It's fairly simple to code, and doesn't penalise IDN domain owners and registrars who have been doing the right thing. It doesn't place any restrictions on what domains are allowed. It requires no user configuration, and no assumptions about what characters a given user might be familiar with. It involves no pop-ups. It places the blame and the responsibility where it really belongs, and kills any homograph attacks stone dead.

Gerv
(In reply to comment #135)
> Most domain registrars have been correctly implementing the guidelines for
> avoiding IDN-related spoofing problems. AIUI, the .jp registry even delayed
> issuing IDN names for six months until the guidelines were finished.
> Unfortunately, there are a few rather large exceptions to this - .com being one.
>
> So, the suggestion is to have a blacklist of those TLDs, and display the IDN in
> raw punycode form throughout the UI until such time as the registrars get their
> act together. Later Firefox releases, or automatically-pushed updates, can
> shrink (or expand) the blacklist.
>
> This has many significant advantages. It's fairly simple to code, and doesn't
> penalise IDN domain owners and registrars who have been doing the right thing.
> It doesn't place any restrictions on what domains are allowed. It requires no
> user configuration, and no assumptions about what characters a given user might
> be familiar with. It involves no pop-ups. It places the blame and the
> responsibility where it really belongs, and kills any homograph attacks stone dead.

Neat. Alternatively, you can have a _whitelist_ of TLDs which are known to be following the ICANN / IANA rules. This is more "politically" neutral, avoids the issues associated with a blacklist, and yet will act in the same way as a strong incentive for non-conformant TLD registries to follow best practices. This also deals better with new TLD allocations.
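A hedged sketch of how small this is in code; the TLD set below is purely illustrative, not a researched policy list:

    IDN_SAFE_TLDS = {"jp", "de", "no"}  # hypothetical whitelist of conformant registries

    def display_host(ace_host):
        # Show the Unicode form only for TLDs on the whitelist;
        # everywhere else, keep the raw Punycode visible.
        tld = ace_host.rsplit(".", 1)[-1].lower()
        if tld in IDN_SAFE_TLDS:
            try:
                return ace_host.encode("ascii").decode("idna")
            except UnicodeError:
                pass
        return ace_host

Either the blacklist or the whitelist variant reduces to a one-line set lookup at display time, plus an updatable data file.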
Whether we go for a white or a blacklist probably depends on getting a much better view of how widespread the problem is. What I'm hearing from the IDN community is that most people are playing by the rules - it's just a few high-profile registrars and TLDs which aren't. If that's the case, a blacklist is probably good - we do want to send a message. After all, their negligence has put our users at risk. On the other hand, if the picture is more mixed than I understand, then perhaps a whitelist approach might be better. Gerv
I would tend to agree that if you're going to have a list of TLDs, then a white list would be better since we can't anticipate whether new TLDs will be served by registrars that follow rules. Also, I think the solution discussed here should still be worked on, even if Mozilla decides to use the TLD black/white list solution in the interim since we don't know what Microsoft is going to release when they get around to it. If they implement the ideas discussed here or some other ideas and end up supporting IDNs in *.com, then we will probably want to be able to start supporting it in short order too (via auto-updates or whatever). Finally, I still think we should seriously consider doing something about the font in the status and location bars. If expanded width is indeed a concern, how about my idea of using a good font for the domain name part only? Maybe this part could be separated out to a different bug number.
Eric: I'm not sure what you mean by "what Microsoft will do" - IE doesn't support IDNs, and I've not seen it mentioned among any of the things they plan to do for the next release. In any case, that's ages away - in the Longhorn timeframe. Hopefully, by then, registrars will have sorted their lives out and all this will be but a distant memory.

I don't know who writes the IDN plugin for IE. I haven't heard any comments from them on the situation.

I personally don't think we need to change the status bar font - but you are right, that should be a separate bug. It's not IDN-specific.

Gerv
(In reply to comment #139)
> Eric: I'm not sure what you mean by "what Microsoft will do" - IE doesn't
> support IDNs, and I've not seen it mentioned among any of the things they plan
> to do for the next release. In any case, that's ages away - in the Longhorn
> timeframe. Hopefully, by then, registrars will have sorted their lives out and
> all this will be but a distant memory.

OK, you're probably right.

> I don't know who writes the IDN plugin for IE. I haven't heard any comments from
> them on the situation.

I found an IDN plug-in for IE at a domain owned by Verisign: http://www.idnnow.com/index.jsp

> I personally don't think we need to change the status bar font - but you are
> right, that should be a separate bug. It's not IDN-specific.

OK, bug 282079.
> I found an IDN plug-in for IE at a domain owned by Verisign: That plug-in was developed by VeriSign, who are energetic about making it easily available to registries wishing to implement IDN. (Please note that the IDN policies for a TLD are set by the *registry*, not the *registrars*.) The registries that prefer to recommend the Mozilla browsers to their clients are probably despairing of the discussion in the present forum, which appears eager to have Mozilla voluntarily take itself off the IDN market. VeriSign, whose .com policies triggered the current concern, has everything to gain by their plug-in becoming the only game in town.
> VeriSign ... has everything to gain by their plug-in becoming the only game > in town. I should have mentioned that there is also an open source IDN plug-in for IE. This is presented as alpha, among other things, because it leaks punycode in the status line. As things are now developing, this may end up being a strong feature. It can also be toggled on and off, which, together with the status line display of punycode, may be all that's really called for to ease present concerns. The IDN-OSS plug-in performs as described here even when stacked on top of the VeriSign plug-in (although other negative interaction cannot be discounted) and probably deserves some attention from the participants in the present thread. The VeriSign plug-in is available at http://idnnow.com and the open source at http://idn.isc.org.
I agree, the prospect of a proprietary plug-in monopolizing the market is the worst of all possible worlds. I now think the best short-term solution to the spoofing problem is the one proposed by Gerv, namely that domains run by non-standards-compliant registrars get their Punycode made visible -- it's neat, easy to code, and does not require disabling IDN support. I continue to believe that stricter IDN filtering rules, both at the registrar and browser, as suggested in my earlier proposals, are necessary in the medium- and longer term. However, based on input off-line from a number of people, I now believe that this can best be achieved by working with IANA / ICANN and the registry community, so that we do not have to take responsibility for an ad-hoc non-standard implementation, but can instead be seen to be implementing a solution based on authoritative standards. Incidentally, the discussion in this bug has kicked off a related discussion on the Unicode mailing list, where it has been mentioned that there is now a proposal to create an "official" homograph list.
> I now think the best short-term solution to the spoofing problem is ... that > domains run by non-standards-compliant registrars get their Punycode made > visible -- it's neat, easy to code, and does not require disabling IDN > support. How do you propose determining the identity of the registrar? What standards is registrar behavior to be judged against?
Rather than just showing the punycode in the status bar (which many people don't have turned on, don't notice, or which may even be altered by a script in the website, unless that ability has been disabled by the user), how about displaying the information bar at the top, just like the popup blocker, that explains that it is an internationalised domain name, notes the possible security implications, and shows the punycode version. It should provide a more-information link/button and the option to add the site to trusted and untrusted lists. This could be used in conjunction with any of the other proposed checks so that it's not shown for every single IDN, just those that Mozilla detects as the likely candidates for spoofs.
I like the idea in comment 145. It may harm valid IRIs a bit, but they are not widely deployed, and I guess the option would be pref-controlled so people can turn it off. (By valid IRIs I mean IRIs that are registered without the intention to spoof users.)
> (Please note that the IDN > policies for a TLD are set by the *registry*, not the *registrars*.) Well that's good, because it's easy for us to determine the registry (just look at the TLD), but hard for us to determine the registrar (requires a WHOIS). If the policies for .com do not protect against phishing, then we should not display IDN domains in their full form in that TLD, because to do so is a security risk. It's as simple as that. I don't quite understand how Mozilla not displaying IDN for .com gives Verisign a monopoly on anything. But if Verisign want to have a monopoly on putting their customers at risk of phishing, let them. I strongly believe that whatever solution we implement should allow full, uncrippled and first class implementation of IDN in those cases, whatever they may be, where we have established that there is no more risk than in the ASCII domain name space. (In reply to comment #145) > Rather than just showing the punycode in the status bar which many people either > don't have turned on, don't notice, or may even be altered by a script in the > website (unless that ability has been disabled by the user); My suggestion is not to only show the punycode in the status bar, but to use it everywhere for TLDs which have poor homograph control policies. The status bar is always-on in Firefox, unless the user specifically disables it. This is a security feature. The security area of the status bar (to the right) cannot be altered by script. > how about > displaying the information bar at the top, just like the popup blocker, that > explains that it is an internationalised domain name, the possible security > implications and show the punycode version. A strong characteristic of a good solution is that it does not discriminate against all IDN domain names. This solution, in its plain form, does. There is definitely value in using a phishing detection heuristic to display such a bar - but that's fixing the more general phishing problem, not just the homograph one. Gerv
> I don't quite understand how Mozilla not displaying IDN for .com gives > Verisign a monopoly on anything. Take a look at the documentation provided by the TLDs that support IDN. Prominent in every such text is a clear reference to the need for an IDN-compliant browser, and a list of available alternatives. Such lists are almost always headed by the VeriSign IE plug-in and Mozilla, often listing no further alternatives. Regardless of VeriSign's IDN policies in .com, their IE plug-in is a sound implementation of IDNA. The same goes for Mozilla. What do you think the maintainers of the TLD documentation are going to do if one of these two decides that it is no longer going to provide rigorous support for IDNA? > if Verisign want to have a monopoly on putting their customers at risk of > phishing, let them. How are you defining the concept of VeriSign customer? Someone who expects to be able to use the Unicode form of an IDN in .com? Someone who is using IE for the task?
(In reply to comment #148) > Take a look at the documentation provided by the TLDs that support IDN. Could you give links to such documentation for a few different TLDs? > Prominent in every such text is a clear reference to the need for an > IDN-compliant browser, and a list of available alternatives. Such lists are > almost always headed by the VeriSign IE plug-in and Mozilla, often listing no > further alternatives. Regardless of VeriSign's IDN policies in .com, their IE > plug-in is a sound implementation of IDNA. The same goes for Mozilla. What do > you think the maintainers of the TLD documentation are going to do if one of > these two decides that it is no longer going to provide rigorous support for IDNA? So your argument is "People won't use or recommend your browser if you try to protect them from security problems"? Are you saying that providing "Warning! This could be a scam!" information on a subset of IDN names is "rigorous support", whereas allowing all IDNs except those in known-risky TLDs is not? I would hope that, in a few months, the problematic registrars will see the writing on the wall and fix their policies. By the time IDN use becomes widespread, everything will be sweetness and light again. However, some pressure needs to be put on them to achieve this aim. If we accept responsibility for the problem, say "Yeah, you keep on registering what domains you like. We'll try and sort out the phishing problem at our end", then we're opening ourselves up to massive and unnecessary liability and bad publicity every time a bug is found in e.g. our embedded homograph tables (if that's the solution chosen - it's an example). > > if Verisign want to have a monopoly on putting their customers at risk of > > phishing, let them. > > How are you defining the concept of VeriSign customer? Someone who expects to > be able to use the Unicode form of an IDN in .com? Someone who is using IE for > the task? Someone who registers a domain in .com - like Paypal, Inc. or Bank Of America. These companies are put at greater risk of damaged reputations and irate customers with monetary losses because of Verisign's (and the other .com registrars') lack of control over domain registration. Gerv
>> Take a look at the documentation provided by the TLDs that support IDN. > > Could you give links to such documentation for a few different TLDs? Most of it is in the national languages of ccTLD registries. Dot-com provides a good example of the way a large commercial gTLD is doing this, but given their proprietary interest in browser support, they only point users in the direction of their own plug-in: http://verisign.com/products-services/naming-and-directory-services/naming-services/internationalized-domain-names/index.html The other end of the gTLD scale -- small domain, non-profit operation -- is dot-museum, http://about.museum/idn/. Their list of supported languages contains links to a number of ccTLD IDN support sites. > Are you saying that providing "Warning! This could be a scam!" information on > a subset of IDN names is "rigorous support", whereas allowing all IDNs except > those in known-risky TLDs is not? I believe the warning text to be an excellent means for balancing the two considerations. What I don't want to see happen is the resolution of IDNs made conditional. > I would hope that, in a few months, the problematic registrars will see the > writing on the wall and fix their policies. Again, registrars are not responsible for the IDN policies of TLDs. > By the time IDN use becomes widespread, everything will be sweetness and > light again. However, some pressure needs to be put on them to achieve this > aim. The question of who needs to apply what kind of pressure to whom, and how that might effectively be done, goes way, way, beyond the scope of the present discussion. The whom is, however, not the TLD registrars.
> I believe the warning text to be an excellent means for balancing the two > considerations. What I don't want to see happen is the resolution of IDNs made > conditional. So people are not going to recommend Mozilla if we disable IDN in problematic TLDs, but they are if some (random, to the user) uses of an IDN pop up a scary warning message? > > I would hope that, in a few months, the problematic registrars will see the > > writing on the wall and fix their policies. > > Again, registrars are not responsible for the IDN policies of TLDs. The link's not hard to understand. This is the way we put pressure on: - disable IDN for problematic TLDs - fewer people register IDN names in those TLDs, because - registrars get less money - registrars either get together to solve it, or put pressure on the registry - registry or registrars implement sensible policies - we lift the block. How else do you suggest that we persuade them to sort their acts out? > The question of who needs to apply what kind of pressure to whom, and how that > might effectively be done, goes way, way, beyond the scope of the present > discussion. It's precisely the present discussion - because if we are going to take responsibility in the browser for solving this problem (which would be contrary to Opera's stance, and my understanding of current mozilla.org staff opinion) then our course of action is going to be very different from the one we'd take if we make it clear it's a registry problem. Gerv
My apologies; in a rush to catch Match of the Day, I left that message incomplete and unnecessarily brusque. Attempt 2 at the middle section: > Here's how we establish a link: > > - disable IDN for problematic TLDs > - fewer people register IDN names in those TLDs, because they appear ugly > - registrars get less money > - registrars either get together to solve it, or put pressure on the registry > - registry or registrars implement sensible policies > - we lift the block. > > How else do you suggest that we persuade them to sort their acts out? Gerv
How do we determine which TLDs are "safe" (or unsafe)? There's a fairly short list registered with IANA (comment 35). Other comments state belief that other TLDs are responsible (e.g. au, de). Some suggest going by statements made by the registries themselves, but Verisign says the right things (http://verisign.com/products-services/naming-and-directory-services/naming-services/internationalized-domain-names/page_001394.html#01000006) while "paypal.com" in a mixture of Latin and Cyrillic shows .com is broken.
It is difficult to decide on a black (or white) list of TLDs. However: (In reply to comment #135) > This has many significant advantages. It's fairly simple to code, and doesn't > penalise IDN domain owners and registrars who have been doing the right thing. > [It doesn't place any restrictions on what domains are allowed.] It requires no > user configuration, and no assumptions about what characters a given user might > be familiar with. It involves no pop-ups. It places the blame and the > responsibility where it really belongs, and kills any homograph attacks stone dead. I have another proposal that meets all of the above criteria, except for the one enclosed in [ and ] which I feel is inappropriate to begin with. We simply use the subset of IANA IDN tables that we deem safe as a filter. If a domain name contains characters outside the TLD's table, we present the Punycode form of the name in the UI. The safe IANA tables are ones that either have a single language or have multiple languages but do not have homographs. The JP table is a good example of a safe table. On the other hand, the .biz German table is unsafe because it implies that other languages such as Russian might be registered in the future. This means that .biz allows homographs, so it doesn't pass our test. This places the pressure where it belongs, i.e. on the guidelines authors, the IANA registry and the domain registries. Mozilla could either start with the very small number of safe IANA IDN tables (putting some pressure on registries that haven't registered their table yet) or a larger number of tables that we come up with on our own (which the TLD registries can use for their IANA submission if they wish). TLDs without tables would default to US-ASCII, a safe set. Mozilla would enlarge its set of tables via new releases, auto-updates or user-intervention-less secure downloads. This means that our rules would never allow Cyrillic small letter 'a' to be used in a *.com domain name, but that's OK because the Latin small letter 'a' looks the same to a human and its character code should not be a concern.
(In reply to comment #154) > This places the pressure where it belongs, i.e. on the guidelines authors, > the IANA registry and the domain registries. It also pressures the domain registrars. > This means that our rules would never allow Cyrillic small letter 'a' to > be used in a *.com domain name, but that's OK because the Latin small > letter 'a' looks the same to a human and its character code should not > be a concern. Sorry, what I meant to say is that we would allow any domain name but we would display the Punycode form if it didn't follow the TLD's rules. I should add that my proposal provides for a second filter. If the domain registrar fails to filter a new domain name or to remove an old spoof, then our filter will catch it.
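A rough Python sketch of this table-based filter follows. The tables here are tiny illustrative stand-ins, not real IANA data (a real JP table would also cover kanji, for instance), and the function names are made up:

# Sketch of the per-TLD character table filter (illustrative tables only).
ASCII_LDH = set("abcdefghijklmnopqrstuvwxyz0123456789-")

TLD_TABLES = {
    # Stand-in for a registered JP table: ASCII plus kana only here;
    # the real table would also include kanji.
    "jp": ASCII_LDH | {chr(cp) for cp in range(0x3040, 0x3100)},
}

def passes_tld_table(hostname: str) -> bool:
    labels = hostname.lower().split(".")
    # TLDs without a registered table default to US-ASCII, a safe set.
    allowed = TLD_TABLES.get(labels[-1], ASCII_LDH)
    return all(ch in allowed for label in labels[:-1] for ch in label)

def display_form(hostname: str) -> str:
    if passes_tld_table(hostname):
        return hostname
    return hostname.encode("idna").decode("ascii")  # fall back to punycode

# Cyrillic 'a' is outside the default ASCII table for .com:
print(display_form("www.p\u0430ypal.com"))  # -> www.xn--pypal-4ve.com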
My most recent proposal is in some sense taking Mozilla's standards stance to its logical extreme. I.e. Mozilla does not simply follow the ECMAScript standard as is; it blocks pop-up windows. My proposal does not simply follow the ICANN guidelines as is; it blocks homographs. However, a lot of people would point out that this is like the tail wagging the dog. Me being the tail and ICANN, IETF, IANA, Unicode Consortium, domain registries and registrars being the dog. So it would probably be prudent for Mozilla to heed Neil's advice: (In reply to comment #143) > I continue to believe that stricter IDN filtering rules, both at the registrar > and browser, as suggested in my earlier proposals, are necessary in the medium- > and longer term. However, based on input off-line from a number of people, I now > believe that this can best be achieved by working with IANA / ICANN and the > registry community, so that we do not have to take responsibility for an ad-hoc > non-standard implementation, but can instead be seen to be implementing a > solution based on authoritative standards. Having said that, the easiest way to come up with a list of TLDs for Gerv's proposal is to first choose to use a white list of such TLDs and then to take a closer look at the IANA IDN registry. I propose to use the following list of TLDs initially: jp, kr and th.
(In reply to comment #156) > My most recent proposal is in some sense taking Mozilla's standards stance > to its logical extreme. I.e. Mozilla does not simply follow the ECMAScript > standard as is; it blocks pop-up windows. window.open is not part of any standard, _especially_ not ECMAScript.
(In reply to comment #157) Chuckle. I really ought to just shut up... :-)
The ICANN guidelines include the following: "top-level domain registries will (a) associate each registered internationalized domain name with one language or set of languages" I wonder if this language or set of languages can be looked up via DNS itself. I.e. is there a DNS record for the language(s)? Or are the registrars only expected to apply language rules at the time of registration itself?
(In reply to comment #145) > Rather than just showing the punycode in the status bar (which many people > don't have turned on, don't notice, or which may even be altered by a script in > the website, unless that ability has been disabled by the user), how about > displaying the information bar at the top, just like the popup blocker, that > explains that it is an internationalised domain name, notes the possible > security implications, and shows the punycode version. > > It should provide a more-information link/button and the option to add the site > to trusted and untrusted lists. This could be used in conjunction with any of > the other proposed checks so that it's not shown for every single IDN, just > those that Mozilla detects as the likely candidates for spoofs. I like this idea a lot. To me, it'll be the best method of alerting the user without causing inconvenience, and this will also give them a better sense of security.
(In reply to comment #156) > I propose to use the following list of TLDs initially: jp, kr and th. I guess many people would complain that this list is too short. There appear to be quite a few IDNs registered around the world, e.g. Europe, China. If Mozilla requires these TLD representatives to register their table with IANA in order to be included in Mozilla's white list, they might be in too much of a rush to compile the table and submit it, increasing the risk of mistakes. Perhaps we should instead have them point us at any existing tables they have (e.g. the DE table mentioned here earlier) and have them state their intent, in writing, to register with IANA. This way, they can take their time to polish the table(s) and also their implementations of filters at registrars, etc.
(In reply to comment #151) > So people are not going to recommend Mozilla if we disable IDN in problematic > TLDs, but they are if some (random, to the user) uses of an IDN pop up a > scary warning message? If someone is actively maintaining a list of IDNA-compliant applications, it would be reasonable for them to remove an item from the list if it ceased to fulfill all "must" requirements stated in the protocol, or was otherwise rendered inapplicable to the entire TLD namespace. If useful functionality is added to an IDNA-aware application, it would be equally reasonable for the application to remain on the list. > - disable IDN for problematic TLDs > - fewer people register IDN names in those TLDs, because they appear ugly > - registrars get less money > - registrars either get together to solve it, or put pressure on the registry > - registry or registrars implement sensible policies > - we lift the block. > > How else do you suggest that we persuade them to sort their acts out? Registrars provide automated front-ends to the TLD registries, with the policy engines residing on the latter platform. Registrars may freely decide which TLDs they wish to service, but then need to support each selected TLD full out. Registrars compete fiercely with each other on a market that is still only a shadow of what it once was. If you wish to teach individual registries a lesson by somehow whipping their sales agents into compliance, you'd need to be able to do this without leaving any remaining registration channel into a shunned TLD. As it happens, the only gTLD where there is real IDN money is .com. Network Solutions (the largest of the registrars, previously doing business as VeriSign Registrar) is certain to support this domain regardless of what any other registrar may feel compelled to do -- and all the other registrars know it.
Note that if Mozilla is going to use published language tables, you're going to have to look rather harder than just at the IANA registry. The ICANN guidelines only require the registry to publish their language tables; they can do so by any appropriate means (e.g. by placing them on their web site). Use of the IANA registration mechanism is entirely optional, so it would be inappropriate to penalize those registries that haven't used it.
This was just posted on the Unicode mailing list: From: Mark Davis <mark.davis@jtcsv.com> To: Unicode Mailing List <unicode@unicode.org>, UnicoRe Mailing List <unicore@unicode.org> Subject: IDN Security Date: Mon, 14 Feb 2005 09:20:06 -0800 (20:50 IRST) There were a few items coming out of the UTC meeting in regards to IDN. 1. We will be adding to draft UTR #36: Security Considerations for the Implementation of Unicode and Related Technology (http://unicode.org/reports/tr36/). In particular, this will include more background information and a set of specific recommendations for both browsers and registrars; both to be refined over time. 2. The UTC has authorized the editorial committee to make updates to #36 between UTC meetings, to allow for faster turn-around in presenting both background material and recommendations. We will try to incorporate ideas presented on these lists and others, so suggestions are welcome. 3. The UTR had for some time recommended the development of data on visually confusables, and we will be starting to collect data to test the feasibility of different approaches. In regards to that, I'll call people's attention to the chart on http://www.unicode.org/reports/tr36/idn-chars.html, that shows the permissible IDN characters, ordered by script, then whether decomposable or not, then according to UCA collation order. (These are characters after StringPrep has been performed, so case-folding and normalization have already been applied.) Mark
Does the IDN display only happen in the URL bar? How about email headers? If I register a domain as "ao" + Cyrillic l + ".com" and send an email as "ytang0648@ao" + Cyrillic l + ".com" to you, and you look at it in Thunderbird, when you reply, will it go back to ytang0648@aol.com or might it go back to ytang0648@ao + Cyrillic l + ".com"?
about "mix set"- considering someone use "www." + "ÈZÈQÈUÈi" + ".com" for www.ebay.com all the characters in "ÈZÈQÈUÈi" are in Cyrillic block. Not a mix set.
It's going to take longer to sort this out. Minus for 1.0.1 and plus for 1.1.
Flags: blocking1.8b2+
Flags: blocking1.8b-
Flags: blocking-aviary1.1+
Flags: blocking-aviary1.0.1?
Flags: blocking-aviary1.0.1-
I think trying to map against specific glyphs that are similar is always going to be error-prone and difficult unless all browsers standardise on a font across platforms for display of URLs. My suggestion for an immediate response to this bug is as follows: 1) Use a different colour for the address bar for domains that are not in the range provided by the user's default character encoding (even if this is ASCII for, say, Japanese users). This treats all domains equally. What a user will need to know is when a domain is not in their default encoding (otherwise they can basically trust the glyphs, I guess). 2) For domains covered by 1) above, also include the raw Unicode character codes of the domain (as opposed to the friendly view) in brackets after the domain name. Are these adequate visual clues? 3) The first time any site is visited that meets condition 1) above, display a warning to the user, explaining what has happened, and give them the option to permanently disable this warning. 4) I think blacklists (other than country and international standards/guideline based) should be handled at the proxy layer using real-time block lists. If these things aren't handled locally in this way, there is a danger that legitimate users who are unfortunately caught by them might consider that they are being denied service.
First off, here's an attachment showing how the IDN URI looked in my address bar the first time I went to it. Imagine my surprise when I saw it and said, "that doesn't work at all... what the hell is that letter?" I'm not espousing that we make different fonts appear with impossible-to-read characters (my OS has upsettingly done that for me already by screwing with my Intl. fonts it seems)... What I am saying is that we should notice how different we can make something (and how obvious it can appear) when we use different fonts, make things bold, etc. Now, Paul Hoffman suggests using this same sort of solution in his blog http://lookit.proper.com/archives/000302.html We have to remember that making the IDN feature obnoxious will only breed dissatisfaction, but making it noticeable will hopefully alert people to what is happening. Also, we should not (and I'd imagine "Must Not" in some spec somewhere) manipulate the URI in such a way that renders it useless. In other words, we cannot put extra characters in the Address bar like exclamation marks, etc. While our browser might be able to handle them, we have to remember that people bookmark, copy/paste, save, and physically write (using pen and paper) URIs all the time. We can't have "but if I type it in my browser it works fine, Grandma". So setting colors, spacing, weight, etc. of various character sets by default would seem to be a valid solution. Obviously, we should choose fonts that match the user's localization as best as possible (i.e. Cyrillic users could have the Cyrillic letters appear less intrusively since they'd most likely be seeing them more.) Also, as Paul Hoffman suggests, information about the specifics of the issue should be presented to the user. I think we've all decided that dialogs do more harm than good, but the new "Alert Bars" that Firefox and Thunderbird use when installing software or loading remote images all seem like valid (less) obtrusive notifications. And having them constantly wouldn't be necessary... we could simply state to the user, "This page features mixed characters from different languages. If this is unexpected, this page may be fraudulent." Then we could have a button (much like the "allowed sites" button for Firefox) that would dismiss this message for particular combinations of character sets. In this fashion, a user could easily deactivate the warning for certain character set combinations that they commonly visit. Localizations could even make the setting for them. Other users will still be notified (both in the Address bar itself, and in the "Alert Bar" message). Obviously, no solution will fully remove the danger of this type of spoof, but a simple consistent system of alerting the user will sufficiently enable them to make informed decisions.
(In reply to comment #169) > Created an attachment (id=174352) [edit] > Screenshot showing Mozilla's default rendering on my computer. > > First off, here's an attachment showing how the IDN URI looked in my address > bar the first time I went to it. > > Imagine my surprise when I saw it and said, "that doesn't work at all... what > the hell is that letter?" Well, it didn't work well in your case because you used an X11 core font build of Mozilla on a Unix-like platform. Mozilla built with a modern font system (i.e. Xft) on Unix, as well as on Windows and Mac OS X, makes it all but impossible to tell Cyrillic 'a' from Latin 'a', because in some (TrueType and OpenType) fonts covering both Latin and Cyrillic (there are many of them), a *single* glyph is very likely to be shared by the two letters so that they look 100% identical. As demonstrated by your screenshot, what's been regarded as a hindrance to good-looking rendering (an inflexible partitioning of characters into font character sets in X11) could give us a hint about a potential solution.
Here is my proposal (I'm a native Russian who often visits English, German, Russian and Ukrainian sites). In short: display the alphabet name next to the second-level part of the domain. Let's take for example the abbreviation "pap" written in English and Cyrillic letters; it looks absolutely the same in Russian and Ukrainian, but I can't blacklist it (remember, I read Russian, Ukrainian and English). Let's look how Latin pap can be displayed in different domains: pap.com pap.de pap.ru pap.ua Here is how pap written in Cyrillic letters can be displayed: pap--Cyrillic.com pap--Cyrillic.de pap--Cyrillic.ru pap--Cyrillic.ua If we mix Cyrillic and Latin letters the browser will display them as pap--UNKNOWN.com pap--UNKNOWN.de pap--UNKNOWN.ru pap--UNKNOWN.ua Good, but now we have a problem with the Ukrainian language: it uses Latin i. So we have to add a Ukrainian language detector. For example, for the word pip where p is Cyrillic and i is Latin the browser will show: pip--Ukrainian.com pip--Ukrainian.de pip--Ukrainian.ru pip--Ukrainian.ua URLs must be displayed this way everywhere, not only in the address area! The alphabet detector can return three types of result: an alphabet name, UNKNOWN, or INVALID. When detectors for all Cyrillic languages are completed, it will be possible to forbid any other mixes of Cyrillic and Latin alphabets, so paypal with some Cyrillic letters will simply be an error. You won't even be able to visit it.
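A toy Python version of the proposed detector (without the Ukrainian special case, which a later comment suggests may be unnecessary given U+0456). The Unicode-block ranges here are a crude illustrative stand-in for real script data:

# Toy alphabet detector for a single domain label (illustrative only):
# returns the bare label for pure Latin, label--ScriptName for a single
# non-Latin script, and label--UNKNOWN for a mix of scripts.
def script_of(ch: str) -> str:
    cp = ord(ch)
    if cp < 0x80:
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if 0x0370 <= cp <= 0x03FF:
        return "Greek"
    return "Other"  # a real detector would use Unicode Scripts.txt data

def annotate(label: str) -> str:
    scripts = {script_of(ch) for ch in label if ch.isalpha()}
    if not scripts or scripts == {"Latin"}:
        return label
    if len(scripts) == 1:
        return label + "--" + scripts.pop()
    return label + "--UNKNOWN"

print(annotate("pap"))                  # pap
print(annotate("\u0440\u0430\u0440"))   # all-Cyrillic 'pap' -> ...--Cyrillic
print(annotate("p\u0430p"))             # Latin/Cyrillic mix -> ...--UNKNOWN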
I don't know about others, but I tend to ignore the banners at the top of a page unless I have a specific reason to look at them. I like the banner concept (i.e. that it's less annoying than a pop-up). In order to avoid a user simply ignoring the banner or mindlessly clicking away a pop-up box, I think that a user should be unable to submit information to a site without dismissing the banner. I'm imagining that if the banner is not dismissed, a pop-up box would appear, saying something like: "Due to a potential homograph attack, Mozilla has blocked submitting data to this site. Please follow _this link_ for more information on homograph attacks. To allow data submission, please dismiss the homograph attack banner. [Cancel]" The wording probably needs work, but I hope it gets the idea across. I think the pop-up box has to have only one button on it in order to force a user to read it. I don't expect this kind of confirmation if the user has submitted information to the site previously, but for the first visit I think it would be acceptable.
(In reply to comment #172) > I don't know about others, but I tend to ignore the banners at the top of a > page unless I have a specific reason to look at them. When you are about to enter your credit card number, do you look at the location bar? (And do you like Mozilla's default font there? :-) > In order to avoid a user simply ignoring the banner or mindlessly clicking > away a pop-up box, I think that a user should be unable to submit information to > a site without dismissing the banner. What if the page says: "Due to an increased number of identity theft cases on the Internet recently, we strongly urge you to use registered mail to confirm your password. Our secure address is P.O. Box 2369, Miami, Florida 34210."
(In reply to comment #45) > ... For example, a different color ... I've been arguing in Bug 22183 for a color-coded URL, which would (I believe) help with the IDN issue as well. * https://bugzilla.mozilla.org/show_bug.cgi?id=22183#c233 * https://bugzilla.mozilla.org/show_bug.cgi?id=22183#c237 Then, today I found on boingboing.net a link to http://lookit.proper.com/archives/000302.html which talks about having a different background color for homographs in a tooltip. I think that the different background color (or style or color) would work well in the URL bar.
The following announcement was posted to the mozilla.{seamonkey,security} newsgroups recently: http://weblogs.mozillazine.org/gerv/archives/007556.html
Here is some info about 3 IDN plug-ins for MSIE and whether MSIE might support IDN in the future: http://support.microsoft.com/?kbid=842848 I found the above link in the following: http://www.w3.org/International/articles/idn-and-iri/
(In reply to comment #171) Thank you for sending this info, especially the Ukrainian info. Do you think that Cyrillic domain registrants might also wish to include some "Latin" (ASCII) letters in their domain names? (For example, some foreign names like "IBM" or some better example?) Neil, would it be possible to use code points instead of ranges in your proposal's script detection in order to support Ukrainian?
(In reply to comment #173) > > In order to avoid a user simply ignoring the banner or mindlessly clicking > > away a pop-up box, I think that a user should be unable to submit information to > > a site without dismissing the banner. > > What if the page says: > > "Due to an increased number of identity theft cases on the Internet > recently, we strongly urge you to use registered mail to confirm your > password. Our secure address is P.O. Box 2369, Miami, Florida 34210." That's just suspicious to begin with ;) Short of incredibly annoying actions on every potential homograph page (such as blacking out the displayed page), there's no real way (for Mozilla) to stop that. Given that the confirmation I described will only (incorrectly) occur if all the following are true: * The page is falsely detected as a homograph by Mozilla * The user ignored the warning banner * The site requires the user to type in and submit information (as opposed to selecting choices from drop-down boxes or clicking links) * It's the first time the user submits information to the site I don't think the forced acknowledgement is unreasonable...
(In reply to comment #171) > Good, but now we have a problem with Ukraine language: it > uses Latin i. So we have to add Ukrain language detector. > For example for the word pip where p is Cyrillic and i is > Latin the browser will show: Actually, I don't understand this. The Unicode book says that the following character exists: 0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
The test from secunia.com (http://www.paypаl.com/ where the second 'a' is fake) works even when network.enableIDN is set to false, at least with Firefox 1.0 on SuSE 9.1 i686.
(In reply to comment #179) > (In reply to comment #171) > > Good, but now we have a problem with the Ukrainian language: it > > uses Latin i. So we have to add a Ukrainian language detector. > > For example, for the word pip where p is Cyrillic and i is > > Latin the browser will show: > > Actually, I don't understand this. The Unicode book says that the following > character exists: > > 0456 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I I can confirm that this character should indeed be used in Ukrainian. Someone else on the IDN list also claimed that Tajik mixes Cyrillic and Latin, which turned out to be wrong as well. As far as I know, the only language that mixed Latin and Cyrillic is a very old orthography of Kurdish, which is rarely used today. Most Kurdish writers use either Arabic or Latin script.
How about: if each label in the https domain name is composed of characters in one of the languages the user understands, then in the status bar display the "unicode domain name padlock", else display the "punycode domain name padlock". At the top of the browser, the address is always in Unicode (to avoid cultural offence), but there could be a small IDN symbol to the left of the favicon, clickable to see the punycode and IP address. If the accept-language list was empty, the browser locale language would be used. Otherwise the accept-language list should be used and the locale ignored. (If I'm in an internet cafe in Germany (I don't speak German), I add "en, eo, fr" to the accept-languages, as I read English/Esperanto/French but not German.) What characters are in each language? For Europe, see http://www.evertype.com/alphabets/index.html . Each domain label should be in one and only one of the user's understood languages (but edge conditions do exist, e.g. see http://www.toysrus.co.uk -- could the r be Cyrillic?). cheers, Aaron http://lingvo.org/idnd
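A compact Python sketch of that per-label rule. The alphabets below are tiny illustrative stubs; real per-language data would come from sources like the evertype.com tables mentioned above:

# Sketch: a label qualifies for the "unicode domain name padlock" only
# if all its letters fit one single language from the user's
# Accept-Language list.
ALPHABETS = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "de": set("abcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00fc\u00df"),
    "ru": set("\u0430\u0431\u0432\u0433\u0434\u0435\u0451\u0436\u0437"
              "\u0438\u0439\u043a\u043b\u043c\u043d\u043e\u043f\u0440"
              "\u0441\u0442\u0443\u0444\u0445\u0446\u0447\u0448\u0449"
              "\u044a\u044b\u044c\u044d\u044e\u044f"),
}

def label_understood(label: str, accept_languages: list) -> bool:
    letters = {ch for ch in label.lower() if ch.isalpha()}
    return any(letters <= ALPHABETS[lang]
               for lang in accept_languages if lang in ALPHABETS)

print(label_understood("paypal", ["en", "eo", "fr"]))       # True
print(label_understood("p\u0430ypal", ["en", "eo", "fr"]))  # False (mixed)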
Another way to protect the user against this (and phishing attempts generally) is to check when the server's DNS record was registered. If it's less than (say) 1 week ago, then alert the user, and warn that the server is new and that they should be cautious with personal / financial information. Of course, this will encourage phishers to sit on a newly registered site before using it...
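A sketch of that age heuristic in Python. Here lookup_creation_date is a hypothetical stub, since registration dates would have to come from WHOIS, for which there is no standard programmatic interface:

# Registration-age heuristic (sketch). lookup_creation_date is a stub;
# a real implementation would parse WHOIS output for the domain.
from datetime import datetime, timedelta, timezone

def lookup_creation_date(domain: str) -> datetime:
    return datetime(2005, 1, 28, tzinfo=timezone.utc)  # placeholder value

def looks_brand_new(domain: str, threshold_days: int = 7) -> bool:
    age = datetime.now(timezone.utc) - lookup_creation_date(domain)
    return age < timedelta(days=threshold_days)  # new -> warn the user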
A couple of comments. 1. A white/blacklist of TLDs ignores other spoofing possibilities. It is possible for IDN characters to appear in more than just the domain name. Combined with a DNS cache poisoning attack, it is possible to then spoof a third-level hostname (e.g. ωωω.foobank.de). 2. Language detection by registrars seems pretty unreliable. I registered a Japanese-language IDN .com domain with a French registrar and they assigned the language to be French. 3. I like both the information bar and highlighting ideas. Ideally, only the non-ASCII characters would be highlighted in the host name. Mousing over would then provide more information. -Chris
(In reply to comment #184) Re: 1., the TLD white/black list proposal would simply apply to *all* domains under those TLDs, so the 3rd level IDN domain name spoof you mention is simply reduced to the DNS cache poisoning problem itself, which I submit is outside the scope of this bug report. No? Re: 2., when/how did you find out that the registrar assigned the French language to your domain name? Thanks!
The article "Phishing - Browser-based Defences" at http://www.gerv.net/security/phishing-browser-defences.html makes a lot of good points. Here are two comments about it, though. First, I think it sells the colorizing of letters a little short. There are millions of users who would NEVER visit a DNS site name that had mixed scripts. Obviously, the millions of people who only read & write English are unlikely to WANT to visit a site that uses non-English letters, since such sites tend to not use English (!). Only colorizing when they're mixed, and letting people turn it off if it's "ugly", would help millions of people. Yes, the colorblind & some users will not be helped, but if you help the majority, it's unlikely phishers will perform the attack. A phishing attack that only works against the colorblind is less likely to be attempted. One simple solution: you can often guess the best default ("should I colorize or not?") based on the user's language settings; then let the user set it differently as an option. Second, I think you'll need more glyphs than 2 if you do the "symbol glyphs" approach (which is an interesting idea!). A phisher could create a program that randomly morphs a domain name among many different ways, trying to find a good substitution, and then hashing the result to see if the glyphs match. The way to figure out the required number of glyphs is to imagine a program that can create a large number of "phish food" domain names from a given name (substitute l for 1, substitute O for 0, do both, ...), and see how many alternatives you can find for a given name. Here's a back-of-the-envelope calculation; say a phished domain name has no more than 10 characters DNS chararacters in the name, and on average each character can be reasonably substituted with 3 other characters. (These are guessed numbers, but it should be possible to figure out REAL values from these using common phished domains like paypal.com, ebay.com, etc.). That means that the set of alternatives is (3+1)^10-1, i.e., 1,048,575 alternatives - about one million. A two-char glyph only gives 64^2 = 4,096 hashes, so a phisher is almost certain to find several alternatives with the same visual hash. Four glyphs gives you 64^4=16,777,216.. a phisher only has approximately 1/16 chance of finding a match. Five glyphs gives you 1,073,741,824... the phisher has around a 0.1% chance of finding a match. Shorter domain names are even harder to forge. --- David A. Wheeler
"the millions of people who only read & write English are unlikely to WANT to visit a site that uses non-English letters, since such sites tend to not use English (!)." So some 350 million native english speakers (assuming none of them speaks a foreign language) are a majority now? ;) But I'm in favor of some sort of color code as well. It's quite similar to the glyphs Gerv is suggesting in that people see a difference between their usual site and a phishing site. While a hash can can be "spoofed" with a hash collision or a similar looking sign the color coding would only tell you about the character encoding. The difficulty will be to compartmentalize the character encodings in a way that it is unlikely that two different encodings with the same color could be used to spoof a similar looking domain.
(In reply to comment #38) > Created an attachment (id=173729) [edit] > Proposed blacklist of Unicode code points that should never occur in URLs Please don't blacklist the Runic alphabet, as proposed by your list! In general, I'm against blacklisting by alphabet, because this introduces issues of favoritism. There are homograph attacks in almost every alphabet, especially considering that many of them have common origins (Phoenician, etc.). A selfish reason: I put up a site just for fun, (Thurisaz).com: xn--9ve.com There are still users of the Runic alphabet out there (mostly scholarly/religious, as with other "obsolete" alphabets). Google "futhark". Proposed solutions that are more inclusive (and have already been discussed in this bug): * For any characters that are not traditional domain name characters (A-Z, 0-9, hyphen), loudly mark them: perhaps with a red background behind them. Bounds on min/max kerning distance would ensure a spammer couldn't sneak invisible characters in there. * Pop up a warning box that shows the user-visible domain name, and also the decoded (xn--) domain name that it maps to. Ask the user if they truly want to go to the xn-- site. * Any other solution, except a blacklist that imposes an outright ban on entire alphabets....
(In reply to comment #185) Re: #1, No, it is more than that. Consider that perhaps the .de domain does IDN well, checking for homographs in registered domains. Mozilla then whitelists IDNs for .de domains, considering that the TLD should be safe. Registrars however have no clue about third-level domains, and anyone running a DNS server (or doing cache poisoning attacks) can create any third-level domain (and lower) that they want. So, although the .de TLD checks all the xxxxx.de domains to not have homograph attacks, there is NO guarantee about homograph attacks further down the chain, in perhaps yyIDNyy.xxxxx.de. Re: #2, When I registered the domain, it said the language assignment was French, but it doesn't appear in the published whois record.
At http://4t2.cc/mozilla/idn/ I have a small extension for Firefox that warns on an IDN and shows the corresponding punycode. Could this be a way to prevent IDN phishing?
(In reply to comment #190) > At http://4t2.cc/mozilla/idn/ I have a small extension for Firefox that warns on > an IDN and shows the corresponding punycode. Could this be a way to prevent IDN phishing? Certainly, that interface is much like what I had in mind for my suggestion in comment 145. However, the message needs to be greatly improved. A typical user won't have a clue about IDN or punycode. How about something more user-friendly: Warning: www.paypal.com contains some characters from international alphabets. Some international characters look very similar or identical to each other, which may be used to spoof web site addresses. _More information_ I'm sure it can be improved a lot, and the more-information link/button could reveal detailed information, much like Paul Hoffman suggested [1], as well as providing more user-friendly explanations. [1] http://lookit.proper.com/
(in reply to comment 133) If a domain name is a Russian word written totally in Latin homographically equivalent letters instead of the original Cyrillic letters, this does not have to be a spoof. Please note: long before IDN was first presented, some Russian sites already had domain names which contained only Latin letters homographically equivalent to Cyrillic -- this was a pre-IDN hack to include Russian words in ASCII domain names. The Latin homographs are: A/a, B/b, C/c, E/e, H, K, M/m, n, O/o, P/p, T, u, X/x, y Their respective Cyrillic equivalents (named according to the http://www.unicode.org/charts/PDF/U0400.pdf chart): A (capital/small), VE (capital)/SOFT SIGN, ES (capital/small), IE (capital/small), EN (capital), KA (capital), EM (capital)/TE (small, cursive variation), PE (small, cursive variation), O (capital/small), ER (capital/small), TE (capital), I (small, cursive variation), HA (capital/small), U (small) Please also note that Cyrillic letter ZE is pretty much homographical to DIGIT THREE, and BE (small) is more or less homographical to DIGIT SIX. Cyrillic letter YERU is homographical to "bI" or "bl" (two symbols together). So we have more than half of the alphabet -- if you carefully avoid the Russian letters GHE, DE, IO, ZHE, SHORT I, EL, EF, TSE, CHE, SHA, SHCHA, HARD SIGN, E, YU, and YA, then you may write Russian words with Latin letters (either capital or small). There are 33 letters in the Russian alphabet (see http://learningrussian.com/alphabet.htm for details). Only 15 don't have homographs in Latin or digits. Please note also: it is allowed by Russian rules to use the IE letter ('e' letter, don't think of MSIE) instead of the IO letter in most words. So, effectively, only 14 Russian letters cannot be represented homographically in pure ASCII. The Russian alphabet (Unicode names --> ASCII homographs, if they exist): A --> A/a BE --> 6 VE --> B GHE DE IE --> E/e IO --> E/e (not allowed in some words) ZHE ZE --> 3 I --> u SHORT I KA --> K EL EM --> M EN --> H O --> O/o PE --> n ER --> P/p ES --> C/c TE --> T/m U --> y EF HA --> X/x TSE CHE SHA SHCHA HARD SIGN YERU --> bI/bl SOFT SIGN -> b E YU YA Some existing (registered and working) domain names using this technique (some I knew of, some I've found right now combining the above enumerated letters to form valid Russian words -- these homographs are also used to represent Russian words below in round brackets, because Bugzilla does not support Unicode yet, AFAIK): http://www.XAKEP.ru/ (Russian word 'XAKEP' means 'hacker') http://www.PEKA.ru/ (Russian word 'PEKA' means 'river') http://CTEHA.ru/ (Russian word 'CTEHA' means 'wall') http://www.CblP.ru/ (Russian word 'CblP' means 'cheese') http://www.TEMA.ru/ (Russian word 'TEMA' means 'theme'; and, back-replacing IE-->IO, we get a variant of the name of this domain's owner) http://ABTO.ru/ (Russian word 'ABTO' means 'auto') http://3ByK.ru/ (Russian word '3ByK' means 'sound') http://KOCMOHABT.ru/ (Russian word 'KOCMOHABT' is equivalent to 'astronaut') http://MATPAC.ru/ (Russian word 'MATPAC' means 'mattress') http://MEXA.ru/ (Russian word 'MEXA' means 'furs' -- yes, plural; the singular form http://MEX.ru/ is cybersquatted) http://MPAMOP.ru/ (Russian word 'MPAMOP' means 'marble') http://OXPAHA.ru/ (Russian word 'OXPAHA' means 'guard' or the process of guarding) And some not so useful (registered for sale or otherwise cybersquatted) domains: http://www.BAHHA.ru/ (Russian word 'BAHHA' means 'bath') http://CyKA.ru/ (Russian word 'CyKA' means 'bitch') http://MOCKBA.ru/ (Russian word 'MOCKBA' means the city of Moscow)
http://EBPO.ru/ (Russian word 'EBPO' means 'euro') http://KOCMOC.ru/ (Russian word 'KOCMOC' means 'outer space') http://KPACKA.ru/ (Russian word 'KPACKA' means 'paint, dye, colour') This proves two more or less separate ideas: 1) Full homography of a domain name can be a legacy of pre-IDN times, the basis of someone's ethical and legal business, which must not be ruined. 2) We should not stop at considering only symbol-to-symbol homography; two adjacent symbols of one alphabet may happen to look like a single glyph of another alphabet. The hunt for Russian domain names written in pure ASCII will continue.
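For reference, the single-character pairs enumerated above can be captured in a small lookup table. This Python sketch keeps only the lowercase, non-cursive pairs (registries lowercase names via stringprep anyway, as noted further down; font-dependent cursive pairs like m, n, u and the two-character YERU case are omitted):

# ASCII characters with lowercase Cyrillic lookalikes, per the list above.
ASCII_TO_CYRILLIC = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "b": "\u044c",  # SOFT SIGN
    "c": "\u0441",  # ES
    "e": "\u0435",  # IE
    "o": "\u043e",  # O
    "p": "\u0440",  # ER
    "x": "\u0445",  # HA
    "y": "\u0443",  # U
    "3": "\u0437",  # ZE
    "6": "\u0431",  # BE
}

def has_full_cyrillic_reading(label: str) -> bool:
    """True if every character of an ASCII label has a Cyrillic lookalike,
    i.e. the label could be a pre-IDN Russian name like caxap.ru."""
    return all(ch in ASCII_TO_CYRILLIC for ch in label)

print(has_full_cyrillic_reading("caxap"))   # True ('sugar')
print(has_full_cyrillic_reading("paypal"))  # False ('l' has no lookalike)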
Sergey: that's very useful - thanks :-) Everyone else: I'm currently up to my eyeballs in IDN lists and blog posts and emails. I'm trying to get on top of what everyone is saying this weekend, and see what emerges.
> Russian words below in round brackets, because Bugzilla does not support > Unicode yet, AFAIK): Well, all you have to do is set 'View | Character Encoding' to UTF-8 before posting any comment with non-ASCII characters, and do the same when viewing any comment posted in UTF-8. We'd then not have NCRs as in comment #133. Please, everybody, set 'Character Encoding' to UTF-8 before *posting* comments with non-ASCII characters here and in other bugs at bugzilla.mozilla.org. (Be aware that changing 'character encoding' resets the content of a textarea - you would lose everything you've written there, so before changing 'character encoding' make sure to copy it to the clipboard or elsewhere.) > http://www.XAKEP.ru/ (Russian word 'XAKEP' means 'hacker') 'ХАКЕР' in Cyrillic One idea (as a part of *multiple* lines of defense): we may render characters belonging to the 'minority' scripts of a given domain component in a conspicuous color (and/or font) different from the color used to render characters in the 'majority' script (the script with the largest count in a given domain component). For 'pаypаl' where 'а' is Cyrillic, Cyrillic would be the minority script while Latin would be the majority script.
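A Python sketch of that majority/minority rule. The bracket markers stand in for whatever conspicuous color or font the UI would actually use, and the block ranges are a crude stand-in for real script data:

# Highlight characters whose script differs from the label's majority
# script (sketch; brackets stand in for a conspicuous color).
from collections import Counter

def script_of(ch: str) -> str:
    cp = ord(ch)
    if cp < 0x80:
        return "Latin"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    return "Other"

def highlight_minority(label: str) -> str:
    majority = Counter(script_of(ch) for ch in label).most_common(1)[0][0]
    return "".join(ch if script_of(ch) == majority else "[" + ch + "]"
                   for ch in label)

# The two Cyrillic a's get bracketed: p[а]yp[а]l
print(highlight_minority("p\u0430yp\u0430l"))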
Thank you, Jungshik Shin, this helped. Ok, now 19 more domains I've found last night (use UTF-8 to view Russian words below). Website: http://PEMOHT.ru/ Russian word: ремонт Translation: 'repair' (noun) Status: cybersquatted Website: http://caxap.ru/ Russian word: сахар Translation: 'sugar' (noun) Status: used by sugar traders Website: http://COK.ru/ Russian word: сок Translation: 'juice' (noun) Status: list of winners of some lottery drawing among apple juice customers Website: http://COCKA.ru/ Russian word: соска Translation: 'comforter, dummy teat' Status: pornocybersquat Website: http://coyc.ru/ Russian word: соус Translation: 'sauce, gravy' Status: sauce recipe list, FAQ, etc. Domain: cyxapu.ru Russian word: сухари Translation: 'rusks, pieces of dried bread' (plural) Status: DNS works, but no route to host Website: http://www.MAKCu.ru/ Russian word: МАКСИ Translation: this is a trademark that has no direct meaning and translation; it is most likely derived from the word 'максимум', which means 'maximum' or 'at most' Status: some cellphone-related business and FAQ Website: http://yxo.ru/ Russian word: ухо Translation: 'ear' Status: webmail provider, hosting provider Website: http://yKcyc.ru/ Russian word: уксус Translation: 'vinegar' Status: cybersquatted by international drug dealers Website: http://xop.ru/ Russian word: хороший Translation: 'good' or 'fine' (there's a kind of pun in this domain name: Russian word 'хор' means 'chorus') Status: furniture shop Website: http://XPyCT.ru/ Russian word: хруст Translation: 'crunch' (noun) Status: website temporarily closed (probably it exceeded its bandwidth limit or other rent limit) Website: http://KAPTA.ru/ Russian word: карта Translation: 'map' or 'card' Status: communication service card dealer Website: http://KOBEP.ru/ Russian word: ковёр Translation: 'carpet' or 'rug' Status: cybersquatted Website: http://MAPKA.ru/ Russian word: марка Translation: '(postage-)stamp' or 'trade-mark' or 'brand' Status: philatelic activity Domain: HAyKA.ru Russian word: наука Translation: 'science' (noun) Status: DNS works, but no route to host Website: http://npoKaT.ru/ Russian word: прокат Translation: 'hire' (noun) Status: merchandise for hire Website: http://PECTOPAH.ru/ Russian word: ресторан Translation: 'restaurant' Status: internet shop selling goods and services somehow related to restaurants Website: http://CTAHOK.ru/ Russian word: станок Translation: (noun) 'machine-tool' or 'lathe' or 'printing-press' Status: somehow related to machine-building or machine works; not yet open Website: http://TypucT.ru/ Russian word: турист Translation: 'tourist' (noun) Status: site is under construction
setting bug 237820 as a blocked meta tracker
Blocks: 237820
First let me give a quick reminder of the difference between registries and registrars: Each top-level domain has exactly one registry, who maintains the list of all second-level domains therein. A TLD may have many registrars, who interface between the registry and the registrants (customers). It's the registry who sets and enforces the policies regarding which names are allowed; the registrars have no control over that. So let's stop picking on the registrars. :) A good solution to the problem of homograph attacks is going to take weeks or months (or longer) to develop. Therefore it would be good to immediately deploy something very simple to reduce the severity of the problem. I suggest: 1) Have a user-configurable set of TLDs for which domain names show in ASCII form instead of readable form. If the browser is following the IDNA spec then it's already calling ToUnicode() before it ever displays any domain name; therefore a simple hook or wrapper could be used to make it call ToASCII() instead for certain TLDs. The user could choose whether to use a blacklist (show these TLDs in ASCII form) or a whitelist (show all TLDs except these in ASCII form). I think the default should probably be to just blacklist .net and .com, because I think those are the only target-rich TLDs whose registries admit IDNs indiscriminately. There might be other indiscriminate TLDs (.nu?), but how many people have important trust relationships with sites in those TLDs that phishers would be interested in? In particular, I haven't heard of problems with .org, which is not managed by Verisign. 2) Make it easy to switch a global setting between "always ASCII", "always readable", and "use the TLD list". This is obviously nowhere near a complete solution, but I think it would improve the situation significantly, and it is very simple--there is no fancy UI with colors and fonts to design, no character table to design, and the code changes can be narrowly focused (in theory, but I'm not familiar with the code). Importantly, this measure would avoid penalizing the communities centered around sites in responsible TLDs. Also, I'm surprised that comment #82 got no responses: > IMO the proper fix for the ssl case (https://paypal.com) is to remove > the UserTrust network certificate from the store. Obviously they are > not doing their job and therefore they shouldn't be trusted. I don't really understand the SSL trust model, but this sounds like an interesting idea. What exactly is UserTrust's job? What exactly does the certificate they issued to the bogus paypal supposedly assert? Switching topics, I'd like to say something about Nameprep, since people have mentioned using it for various purposes that aren't clear to me. Nameprep is intended as a generalization of tolower, which converts uppercase ASCII letters to the corresponding lowercase ASCII letters, and leaves other ASCII characters unchanged. For ASCII domain names, there are certain situations where it is appropriate to call tolower. For IDNs, Nameprep plays the analogous role. In fact, Nameprep behaves exactly like tolower when its input is ASCII, so you can simply replace tolower with Nameprep for all domain names. The important point here is that Nameprep is appropriate *only* in situations where tolower was already appropriate for ASCII domain names. If you're in a situation where you wouldn't want to apply tolower to an ASCII domain name, then you shouldn't be applying Nameprep to an IDN either. 
Usually tolower is not applied to domain names for display purposes; it is used internally for doing case-insensitive comparisons. Comparison of IDNs is done using ToASCII followed by tolower, and ToASCII uses Nameprep internally.

Firefox seems to use tolower or Nameprep for display: when I type http://www.CS.Berkeley.EDU/ into the location bar, it gets changed to http://www.cs.berkeley.edu/. I'm not sure that's consistent with the spirit of the domain name specs. If DNS servers and resolvers are required to preserve case when possible, why is my browser altering it? I suppose there might be a good reason, but I find this surprising (even though I generally think domain names look better in all lower case).
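[To make suggestion 1) above concrete, here is a minimal sketch of the display hook as standalone C++ rather than Gecko code; the ToASCII()/ToUnicode() stubs and the TLD-list plumbing are assumptions for illustration only.]

#include <set>
#include <string>

// Stand-ins for the real IDNA (RFC 3490) conversions; in Gecko these
// would correspond to nsIDNService's ACE/UTF-8 converters.
static std::string ToASCII(const std::string& host)   { /* ... */ return host; }
static std::string ToUnicode(const std::string& host) { /* ... */ return host; }

// Return the last label of a host name, e.g. "com" for "www.paypal.com".
static std::string ExtractTLD(const std::string& host)
{
    std::string::size_type dot = host.rfind('.');
    return (dot == std::string::npos) ? host : host.substr(dot + 1);
}

enum ListMode { BLACKLIST, WHITELIST };

// Decide how to display a host: readable (ToUnicode) or raw ACE (ToASCII),
// driven by a user-configurable TLD list as in suggestion 1).
std::string DisplayForm(const std::string& host,
                        const std::set<std::string>& tlds,  // e.g. {"com", "net"}
                        ListMode mode)
{
    bool listed = tlds.count(ExtractTLD(host)) != 0;
    bool showAscii = (mode == BLACKLIST) ? listed : !listed;
    return showAscii ? ToASCII(host) : ToUnicode(host);
}

[With the suggested default blacklist of {"com", "net"}, a .com IDN would display in its "xn--" form while a .de IDN would stay readable.]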
"It's the registry who sets and enforces the policies regarding which names are allowed; the registrars have no control over that. So let's stop picking on the registrars." That's an oversimplification though. It is the registrars that apply (or fail to apply) the policies in the first instance. The registries then enforce (or fail to enforce) their policies with registrars that are not implementing them properly. If the registry is failing to enforce good policy, then the position may will be that some registrars are better than others. Even if the registry is doing the enforcing, then there will be a lag between registrars allowing bad registrations and the registry getting the registrar to correct the problem. So while it's no good picking on registrars exclusively, they do have a role to play.
Commercial registrars generally do not enforce policies, and simply rely on the registry to perform the necessary checks. And rightfully so, since things can get pretty hairy when it comes to IDN. After all, it is the registry's responsibility to ensure that no rogue names exist in its database.
(In reply to comment #195) > Website: http://PEMOHT.ru/ Sergey, thank you for all this info and good work! I would just like to point out one thing regarding IDNs. They use a spec called stringprep that includes lowercasing, so it may be more interesting for you to find existing Russian domain names in ASCII that only contain lowercase. It is impossible to register IDNs with uppercase Cyrillic (unless the registry is breaking the IDN rules).
(In reply to comment #191) > > Warning: www.paypal.com contains some characters from international alphabets. > Some international characters look very similar or the same as each other which > may be used to spoof web site addresses. _More information_ Thanks for your feedback. In addition to the (extended) popup message and the changed statusbar icon, I have also added an icon in the location bar. http://4t2.cc/mozilla/idn/
(In reply to comment #197) > Firefox seems to use tolower or Nameprep for display: When I type > http://www.CS.Berkeley.EDU/ into the location bar, it gets changed to > http://www.cs.berkeley.edu/. I'm not sure that's consistent with the > spirit of the domain name specs. If DNS servers and resolvers are > required to preserve case when possible, why is my browser altering it? > I suppose there might be a good reason, but I find this surprising (even > though I generally think domain names look better in all lower case). I don't really know why Firefox lowercases the ASCIIs. It might just have fallen out of the IDN work. It may be against DNS conventions too. However, one advantage is that you can more easily spot the difference between capital I and lowercase l if you lowercase the name.
*** Bug 283013 has been marked as a duplicate of this bug. ***
Depends on: 283016
OK, so rip this comment to shreds if you want (I can take it) -- or maybe it's right -- or maybe it will spark a better thought from someone else -- but it's worth thinking about.

I thought of it as I was waking up this morning; it seems simple (which, if it's correct, would be nice), but it could be oversimplifying. What I'm thinking is:

* Don't we just have to worry about MIXED character encodings?
* Can you spoof the string "paypal" (for example) without mixed encodings?
* What COULD you spoof without mixed encodings? (the Russian, maybe?)
> I thought of it as I was waking up this morning; it seems simple (which, if
> it's correct, would be nice), but it could be oversimplifying. What I'm
> thinking is:
>
> * Don't we just have to worry about MIXED character encodings?
> * Can you spoof the string "paypal" (for example) without mixed encodings?
> * What COULD you spoof without mixed encodings? (the Russian, maybe?)

Oh, well, off the top of my head: asap, ascii, arab, arabia, arabic, arabs, archie, aries, asia, bach, ceo, cpu, cpus, cray, crays, europe, ieee, jr, ok, os, ohio, pc, pcs, popek, popeks, rcs, rsx, rick, roy, sccs, sr, usc, xeroxes, york, yorker, yorkers, yorks, aback, abase, abaser, abases, abash, abashes, abbe, abbey, abbeys, abhor, abhorrer, abhors, abjure, abjurer, abjures, abscess, abscesses, abscissa, abscissas, absorb, absorber, absorbs, abuse, abuser, abusers, abuses, abyss, abysses, acacia, access, accesses, accessories, accessory, accrue, accrues, accuracies, accuracy, accuse, accuser, accusers, accuses, ace, acer, aces, ache, aches, acre, acres, across, aerobic, aerobics, aerospace, ah, air, airer, airers, airier, airs, airship, airships, airspace, airy, ajar, apace, ape, aper, apes, apex, apexes, aphasia, aphasic, apiaries, apiary, apiece, apish, apocrypha, appear, appearer, appearers, appears, appease, appeaser, appeases, appraise, appraiser, appraisers, appraises, apprise, appriser, apprisers, apprises, approach, approacher, approachers, approaches, apropos, apse, apses, apsis, arc, arch, archaic, archbishop, archer, archers, archery, arches, arcs, are, area, areas, ares, arise, ariser, arises, ark, arose, arouse, arouses, arrack, array, arrayer, arrays, arrears, arroyo, arroyos, as, ascribe, ascribes, ash, asher, ashes, ashore, ask, asker, askers, asks, asp, asper, asphyxia, aspic, aspire, aspirer, aspires, ass, assay, assayer, assayers, asses, assess, assesses, assessor, assessors, assure, assurer, assurers, assures, aura, auras, aurora, auspice, auspices, auspicious, ax, axe, axer, axers, axes, axis, aye, ayer, ayers, ayes, babe, babes, babies, baby, babyish, back, backache, backaches, backer, backers, backpack, backpacker, backpackers, backpacks, backs, backspace, ...
(In reply to comment #205) Or, with a tighter definition of "homograph", using only the really good ones: asap, ascii, asia, ceo, ieee, os, pc, pcs, sccs, acacia, access, accesses, ace, aces, apace, ape, apes, apex, apexes, apiece, appease, appeases, apse, apses, apsis, as, asp, aspic, ass, assay, asses, assess, assesses, ax, axe, axes, axis, aye, ayes, cap, cape, capes, caps, case, cases, cease, ceases, coax, coaxes, cocoa, coo, coop, coops, cop, cope, copes, copies, cops, copse, copses, copy, ease, eases, easy, epic, epics, escape, escapee, escapees, escapes, espies, espy, essay, essays, excess, excesses, excise, excises, expose, exposes, eye, eyepiece, eyepieces, eyes, ice, ices, icy, is, ix, jay, jeep, jeeps, joy, joys, oasis, oops, oppose, opposes, ox, pa, pace, paces, papa, pas, pass, passe, passes, pay, pays, pea, peace, peaces, peas, peep, peeps, pep, pi, pie, piece, pieces, pies, pipe, pipes, ****, ****, poise, poises, pop, pope, popes, poppies, poppy, pops, pose, poses, possess, possesses, pox, poxes, sap, saps, say, says, ...
(In reply to comment #206) And going the other way, to Russian, I can do: ага, гарь, гор, гора, горах, горгор, горе, гору, грех, его, орех, ореха, рас, раса, расе, рог, рога, рогах, рос, роса, росе, сер, сера, серо, серого, серое, ссора, ссоре, ссору, сух, сухо, сухого, сухое, угас, ура, уха, ухо, уху, хаосе, хор, царь, ...
(In reply to comment 200) OK, Erik, here they are (use UTF-8 to read the Russian):

http://caxap.ru/ -- сахар
http://coyc.ru/ -- соус
cyxapu.ru -- сухари (the last letter is homographic in some fonts)
http://yxo.ru/ -- ухо
http://xop.ru/ -- хор
http://nana.ru/ -- папа (homographic in some fonts)

(In reply to comment 204) Yes, Zachariah, you can easily spoof a Russian IDN without mixing encodings -- see the above reply to comment 200.

(Quoting comment 207)
> ага, гарь, гор, гора, горах, горгор, горе, гору, грех, его,
> орех, ореха, рас, раса, расе, рог, рога, рогах, рос, роса,
> росе, сер, сера, серо, серого, серое, ссора, ссоре, ссору,
> сух, сухо, сухого, сухое, угас, ура, уха, ухо, уху, хаосе,
> хор, царь

Interesting. How do you spoof "царь" without mixing encodings?
Once again, to make it clearer for those who did not read the whole list of bug comments (which is large already): the six existing domains enumerated above are not spoofs. They come from the old (pre-IDN) practice of registering Russian domain names using the Latin alphabet in place of homographically equivalent Russian letters. They were useful before IDN and should remain useful -- and not be broken -- after any anti-spoofing measure is implemented. Comment 194, with the idea of majority/minority scripts, seems to be the right way of avoiding harm to the existing pre-IDN homographs.
On the Unicode mailing list, Rick McGowan <rick@unicode.org> has announced that a new revision of Draft UTR #36, "Security Considerations for the Implementation of Unicode and Related Technology", is now available at

http://www.unicode.org/reports/tr36/tr36-2.html

and that comments for official consideration can be made at

http://www.unicode.org/reporting.html

The review period closes on May 3, 2005.
The gTLD registries, several ccTLD registries, and ICANN have posted statements about IDN abuse that are listed on a resource page that ICANN has just started: http://www.icann.org/topics/idn.html They are also opening a new discussion forum which has a potential advantage over all the others by being immediately visible to ICANN.
(In reply to comment #209) Sergey, does .ru currently allow IDN registration? If so, what rules are there? If not, are they thinking about IDN, and if so, what kind of rules? Thanks!
Unfortunately, I could not find a definitive answer. According to some docs -- http://info.nic.ru/st/10/out_863.shtml for one -- the problem is still under discussion. However, if I go directly to https://www.nic.ru/dns/ and enter something like xn--80aswg.ru to register in .Ru, the first three steps are OK. (I did not finish the process, because I'm not going to spend $20 just to prove the possibility.)
(translating the above to UTF-8)

http://президент.ru
http://кремль.ru

These are websites registered for the President of Russia. This may be a technical exception.
(In reply to comment #212) The text entry window in Bugzilla echoes the Cyrillic characters as they should appear, but the posted comment uses the numeric character references. (Bugzilla bug?) The latter form is, in fact, the only one of the two that is legal in a URL. Regardless of the promise that IRI has for remedying this, it still highlights the need for an LDH format for communicating scripts across cultural boundaries beyond which they are unlikely to be recognized. Punycode and NCR are obvious candidates for this role, as far as appearance in a URL goes, and we can debate which is the uglier. When we get around to printing IDN e-mail addresses on business cards, the parallel communication of Punycode may prove a necessary adjunct, with no competition in the aesthetics department.
(In reply to comment #216) See comment #194 from Jungshik Shin. It is not a Bugzilla bug directly. However, Bugzilla should declare charset=UTF-8 in its HTTP Content-Type response header...
(In reply to comment #217) I just filed bug 285255 to try to get bugzilla.mozilla.org to announce its charset as UTF-8.
Erik, Sergey: bug 126266 explains why b.m.o. can't just set the charset to UTF-8. Gerv
For the record, the new bug form on b.m.o is currently hacked to force UTF-8. We can't do that on show bug because of legacy data problems on existing bugs, so if someone adds the first comment containing non-ascii characters at some point later than the opening of a new bug it's going to be whatever charset their browser used. Please read bug 126266 before making any "but you can just do ******" comments, and please make any such comments on that bug if you come up with something that hasn't already been suggested and shot down there already :)
Dear all,

I just thought I'd try to summarize a number of the threads in this discussion in one place; please bear with me if I'm repeating the obvious in some places. I've broken the discussion into two parts: "global issues" and "threat analysis and possible solutions".

== Global issues ==

1. The homograph problem is in the eye and brain of the user, and is therefore necessarily a fuzzy and subjective problem.

2. Because of the above, we can therefore only _approximately_ solve this problem. However, that approximation can be very good indeed, and there's no reason why we should not aim for near-perfection in a solution. _We should think in terms of probabilities as engineering targets_.

3. Many parties are involved in this, and every one of them will have to contribute to the solution. They each have different constituencies, policies, interests, and technical constraints. Fortunately, the problem is also multi-dimensional, and its solution can be sliced up in such a way that each group can contribute something to the mix. Although none of these sub-solutions can be perfect (see above), they can together provide multiple opportunities for catching homographs, allowing a very high probability of the overall solution working for any given TLD label.

4. Punycode display eliminates the homograph problem for many purposes, but also defeats the usefulness of IDN at the same time: still, at least it does not break links to IDN websites. It's the least-worst fix until we can do something better. It may also have long-term dangers when IDNs become widespread (see below).

5. The homograph problem is a _combinatorial_ problem. Increasing the size of the character set from 37 to 40,000 has caused a disproportionate exponential explosion in the number of possible homograph combinations. Applying restrictions in a number of intersecting ways will enable us to exponentially _implode_ those possibilities again.

6. The consensus appears to be that only top and second-level labels matter: top labels are not currently a problem (but they may be when "full" IDN arrives). Users are by now well-accustomed to interpreting second-level labels as identifying a commercial or other entity.

7. The above is good, because it means that we can make everything hinge on the TLD registries as trust brokers. Doing things on a per-TLD basis allows the registry part of the solution to scale horizontally, so registries with effective policies can be unblocked ASAP. It also deals effectively with the case of non-compliant registries, and market pressure (non-IE market share heading for 10% and beyond) will do the rest.

8. No-one is talking about the timescale for a fix, or what the definition of "a fix" would be. What is the expected timescale: a month, three months, six months, a year, five years? Again, slicing the problem up will allow multiple bodies to move forwards on multiple tracks, and the browser vendors can act as gatekeepers for their users to decide what is "good enough".

== Threat analysis and possible solutions ==

Here are the major threats:

* Writing-system-mixing homographs: for example, Cyrillic 'a' in Latin 'paypal'. Partial solution: make sure that individual domain names are allocated from character sets without internal homographs. [Only needs internal inspection of each sub-character-set for homographs, so vastly less work than checking the whole Unicode set for homographs].
At the moment, the ICANN rules justify this on the basis of language assignments, but it's really about forbidding unnecessary script mixing. (Note that I say "writing system" here: a single writing system can use several scripts: for example, Japanese uses four scripts, but they are not mutually confusable.)

* Non-writing-system-mixing homographs: for example, Cyrillic 'assay.tld' vs. Latin 'assay.tld'. These are less easy to forge, as the structure of languages provides some entropy that makes collisions less likely than with cross-script attacks. However, they still exist, and we cannot rely on users to select "safe" names. Partial solution: bundling at the registry. [Needs a global homograph list, but is fairly tolerant of error in this list; for example, the above contains homographs for 'a', 's', and 'y', three characters. If we had a homograph list that was 95% accurate, we'd have a probability of 1-0.05**3 = 0.99987 of catching this. A list with 98% accuracy would have a 0.99999 probability of catching it. Clearly the rule here would be: if you have a high-value domain, make sure it has lots of different characters in it].

Note that we should distinguish 'blocking' bundling, where registration of new homographs is blocked to anyone but the registrant of the 'root' name, from 'permissive' bundling, where all the homographs actually resolve to the same place as the root name. In the case of 'grandfathered' names, we would need some procedures to resolve conflicts: perhaps where two root names exist in a homograph tree, neither of the registrants should be allowed to register new names, or the first registrant should prevail? Note also that bundling can mop up the remaining within-writing-system homographs if, for example, a new exploit is later found (for example, on the lines of "rn" for "m", something simple homograph tables could not catch).

[An aside: even if a super-paranoid browser could have the full homograph-risk-detection algorithm built in, it would not solve the problem of non-writing-system-mixing homographs, because it could not resolve which was the "real" name, and which the "fake" name.]

* Attacks on protocol characters: fake slash, dot, hash, percent characters and so on. This is a severe risk that allows forgery of TLDs and other evil attacks that subvert some of the other solutions above. Partial solution: make these characters illegal at both the browser and registry end. (Belt and braces.) How can we know we've got them all? Someone's got to check really seriously through the entire Unicode character set. However, it's only a book-length volume, with most of it being CJK characters; you could do it in a few days, particularly if you could a priori ignore many character ranges (see below). Caveat: what if someone's language actually _requires_ a character that looks like a protocol character: what do we do then? (This is where intersection with per-label character set restriction may help.)

* In general: any restrictions we can make on character repertoires, either by conservative whitelisting or aggressive blacklisting (preferably both), reduce the combinatorial possibilities for homograph attacks by many orders of magnitude, as well as making the generation of accurate homograph lists much easier. In particular, there are wide ranges of characters which exist only for round-trip compatibility reasons with old character sets, such as the Videotex characters, box graphics, dingbats, and presentation forms for various alphabets.
There is no reason why we should support these. Perhaps the Unicode people can give us an official list of "deprecated" code points?

* Chinese characters are a special case, because of the tens of thousands of characters in the CJK repertoire, as well as cultural concerns, such as traditional/simplified and Japanese/Chinese versions of the same characters. This is a _huge_ problem requiring scholarly expertise in oriental languages that the people in this discussion do not have. There are groups working on solving this: let's let them get on with solving it; their current solution seems to revolve around bundling. Fortunately, CJK characters look so different from other scripts that this should not stop us from attacking the homograph problem for alphabetic scripts and syllabaries.

* Note that Punycode itself can be an attack vector: if users who really need to access IDN sites become used to clicking on "xn--ASCII NONSENSE.tld", and don't bother to understand or remember the ASCII nonsense, there is a chance that they may be fooled into visiting "xn--OTHER ASCII NONSENSE.tld" at a later date: particularly if they do not read the Latin alphabet as their native script. (For example, to me, Thai script just looks like squiggles; it's entirely possible that to many Thai people, Latin script may also look like squiggles.) For this reason, it makes sense to get the registries to sort out their end as soon as possible: Punycode is an excellent mitigation technique that works best for Latin-script readers in a world where > 99.9% of all domains are currently ASCII LDH-only, but it is not a panacea for the long run, when I expect that at the very least 50% of all domains will be IDNs.
> 2. Because of the above, we can therefore only _approximately_ solve this
> problem. However, that approximation can be very good indeed, and there's no
> reason why we should not aim for near-perfection in a solution. _We should
> think in terms of probabilities as engineering targets_.
>
> 3. Many parties are involved in this, and every one of them will have to
> contribute to the solution. They each have different constituencies,
> policies, interests, and technical constraints. Fortunately, the problem is
> also multi-dimensional, and its solution can be sliced up in such a way that
> each group can contribute something to the mix. Although none of these
> sub-solutions can be perfect (see above), they can together provide multiple
> opportunities for catching homographs, allowing a very high probability of
> the overall solution working for any given TLD label.

Just to elaborate this point slightly more, this gets around two major objections to a timely, workable solution:

* it means that no-one can duck out of providing their piece of the solution, on the basis that someone else should solve the problem "perfectly" at their end; since a layered solution is required, everyone must contribute to make the overall reliability of the system as high as possible.

* it takes the teeth out of objections to other people's solutions on the basis that they are not perfect, and that no solution can be implemented until it is perfect. Proposed solutions can still be criticised by comparing them against proposals for better solutions, but they cannot be stalled by comparing them against hypothetical (but unspecified) perfect solutions.

My proposed reliability target? A five-nines minimum requirement for SLDs with three distinct letters; this corresponds to a reliability target of > 98% for the global homograph list. So, out of the 11195 non-Han, non-Hangul characters in Unicode 3.2, that's a target of no more than 223 missed between-script homographs. If we can reduce that to (say) no more than 50, then the three-character reliability estimate is roughly (1-(50/10000)**3) = 0.99999987, which is almost seven nines. Of course, Chinese is another matter entirely, but I believe that substantial efforts are being devoted to provide a reliable solution for Chinese characters.

OK, you're probably getting bored now, but here are some calculations, of the sort that I hope will cast some more light on the problem. For different amounts of coverage of the homograph list, assuming perfect bundling and statistical independence, using an English word list as a source of statistics for an estimate of the relative probabilities of different numbers of distinct characters in typical labels, and assuming there are currently 50 million domains registered (source: http://www.whois.sc/internet-statistics/), I get the following:

Homograph list reliability   Est. antispoof reliability   Est. vulnerable domains
95%                          99.998806%                   596
98%                          99.999761%                   119
99%                          99.999911%                    45
99.5%                        99.999962%                    19
99.75%                       99.999983%                     9
99.9%                        99.999993%                     3

Note that this takes the ultra-cautious definition that if "homograph list reliability" is 95%, fully 5% of the remaining characters are uncaught homographs. Note that at the bottom end, the stats are entirely dominated by domain names with one and two distinct characters. Make the requirement that SLD labels need to have at least two distinct characters, and I get:

Homograph list reliability   Est. antispoof reliability   Est. vulnerable domains
95%                          99.999124%                   438
98%                          99.999888%                    56
99%                          99.999974%                    13
99.5%                        99.999994%                     3
99.75%                       99.999998%                     0.76
99.9%                        99.999999%+                    0.12

Again, the stats are dominated by the names with the smallest number of distinct characters. Finally, making the requirement that SLD labels have at least _three_ distinct characters (but are otherwise distributed as normal for English), I get:

Homograph list reliability   Est. antispoof reliability   Est. vulnerable domains
95%                          99.999723%                   139
98%                          99.999984%                     8
99%                          99.999998%                     0.95
99.5%                        99.999999%+                    0.11
99.75%                       99.999999%+                    0.014
99.9%                        99.999999%+                    0.00092

Finally, as an illustration, I list below the numbers of distinct characters in the top 100 domains, according to alexa.com, sorted by number of distinct chars, and making reasonable assumptions about which part of the DN is allocated by the TLD registrar. If I were the BBC, CNN, go.com, goo.com, or qq.com, I'd be nervously eyeing up the Unicode tables.

Although 1- or 2-distinct-character strings appear to be a disproportionately large part of this list, the threat is not as big as it seems: apart from people who insist on registering "aaaaaaa.tld" (which tend not to be memorable: was that six 'a's or seven?), most diversity-poor labels will tend to be the very short ones. One-letter ones are forbidden by RFC, so there are, for example, only 52022 possible two- and three-letter LDH domains currently available to be registered, roughly 0.1% of the 50 million currently registered domains. Of these, only 5402 have less than three identical characters, about 0.01% of the total registered domains, and only 36 of them will have only one distinct character. If we assume all of these are registered and that we have only a 95% accurate homograph list, then the expected number of spoofable domains in this class will be approximately 36 * 0.05 + (5402-36) * 0.05 * 0.05 = 15 spoofable domains. Making the homograph list 98% accurate reduces this to a more comfortable 2.8 domains, and 99.5% accuracy would reduce it to an expected 0.31 domains per TLD in this elite set of registrations, mostly consisting of the risk associated with single-character-repetition domains such as "aaa.tld".

So, in conclusion, a statistical approach to risk estimation and mitigation can be very powerful, and (in principle) reduce spoofing risks to very low levels. Most of the risk is concentrated in labels with low levels of character diversity, information which may be useful to potential registrants who wish to avoid spoofing. It would be interesting to perform this kind of analysis on some real TLD registry data.

[data follows: "high value" targets marked as "***"]

= The top 100 websites, as per Alexa.com, sorted by number of distinct chars in TLD-registrar-allocated label =

== Few distinct characters, look out! ==
www.qq.com, 1 distinct chars in 'qq'
www.bbc.co.uk, 2 distinct chars in 'bbc'
www.cnn.com, 2 distinct chars in 'cnn'
www.go.com, 2 distinct chars in 'go'
www.goo.ne.jp, 2 distinct chars in 'goo'

== >= 3 distinct characters, lower risk ==

www.126.com, 3 distinct chars in '126'
www.163.com, 3 distinct chars in '163'
www.aol.com, 3 distinct chars in 'aol' ***
www.ask.com, 3 distinct chars in 'ask'
www.avl.com.cn, 3 distinct chars in 'avl'
www.dell.com, 3 distinct chars in 'dell' ***
www.free.fr, 3 distinct chars in 'free'
www.msn.co.jp, 3 distinct chars in 'msn' ***
www.msn.com, 3 distinct chars in 'msn' ***
www.nba.com, 3 distinct chars in 'nba'
www.tom.com, 3 distinct chars in 'tom'
www.uol.com.br, 3 distinct chars in 'uol'
www.21cn.com, 4 distinct chars in '21cn'
www.3721.com, 4 distinct chars in '3721'
www.alibaba.com, 4 distinct chars in 'alibaba'
www.apple.com, 4 distinct chars in 'apple'
www.daum.net, 4 distinct chars in 'daum'
www.ebay.co.uk, 4 distinct chars in 'ebay' ***
www.ebay.com, 4 distinct chars in 'ebay' ***
www.ebay.com.cn, 4 distinct chars in 'ebay' ***
www.ebay.de, 4 distinct chars in 'ebay' ***
www.google.ca, 4 distinct chars in 'google' ***
www.google.co.jp, 4 distinct chars in 'google' ***
www.google.co.uk, 4 distinct chars in 'google' ***
www.google.com, 4 distinct chars in 'google' ***
www.google.de, 4 distinct chars in 'google' ***
www.google.es, 4 distinct chars in 'google' ***
www.google.fr, 4 distinct chars in 'google' ***
www.hkjc.com, 4 distinct chars in 'hkjc' ***
www.imdb.com, 4 distinct chars in 'imdb'
www.myway.com, 4 distinct chars in 'myway'
www.nate.com, 4 distinct chars in 'nate'
www.sina.com, 4 distinct chars in 'sina'
www.sina.com.cn, 4 distinct chars in 'sina'
www.sina.com.hk, 4 distinct chars in 'sina'
www.sohu.com, 4 distinct chars in 'sohu'
www.taobao.com, 4 distinct chars in 'taobao'
www.xanga.com, 4 distinct chars in 'xanga'
www.yahoo.co.jp, 4 distinct chars in 'yahoo' ***
www.yahoo.com, 4 distinct chars in 'yahoo' ***
www.about.com, 5 distinct chars in 'about'
www.aisex.com, 5 distinct chars in 'aisex'
www.allyes.com, 5 distinct chars in 'allyes'
www.amazon.com, 5 distinct chars in 'amazon' ***
www.atnext.com, 5 distinct chars in 'atnext'
www.baidu.com, 5 distinct chars in 'baidu'
www.china.com, 5 distinct chars in 'china'
www.gator.com, 5 distinct chars in 'gator'
www.hinet.net, 5 distinct chars in 'hinet'
www.lycos.com, 5 distinct chars in 'lycos' ***
www.match.com, 5 distinct chars in 'match'
www.naver.com, 5 distinct chars in 'naver'
www.sex141.com, 5 distinct chars in 'sex141'
www.yisou.com, 5 distinct chars in 'yisou'
www.adserver.com, 6 distinct chars in 'adserver'
www.comcast.net, 6 distinct chars in 'comcast'
www.download.com, 6 distinct chars in 'download'
www.hao123.com, 6 distinct chars in 'hao123'
www.hkflash.com, 6 distinct chars in 'hkflash'
www.neopets.com, 6 distinct chars in 'neopets'
www.overture.com, 6 distinct chars in 'overture'
www.passport.net, 6 distinct chars in 'passport'
www.pchome.com.tw, 6 distinct chars in 'pchome'
www.poptang.com, 6 distinct chars in 'poptang'
www.weather.com, 6 distinct chars in 'weather'
www.blogspot.com, 7 distinct chars in 'blogspot'
www.chinaren.com, 7 distinct chars in 'chinaren'
www.infoseek.co.jp, 7 distinct chars in 'infoseek'
www.livedoor.com, 7 distinct chars in 'livedoor'
www.myspace.com, 7 distinct chars in 'myspace'
www.netscape.com, 7 distinct chars in 'netscape'
www.nytimes.com, 7 distinct chars in 'nytimes'
www.pconline.com.cn, 7 distinct chars in 'pconline'
www.rakuten.co.jp, 7 distinct chars in 'rakuten'
www.sayclub.com, 7 distinct chars in 'sayclub'
www.webshots.com, 7 distinct chars in 'webshots'
www.casalemedia.com, 8 distinct chars in 'casalemedia'
www.craigslist.org, 8 distinct chars in 'craigslist'
www.fastclick.com, 8 distinct chars in 'fastclick'
www.friendster.com, 8 distinct chars in 'friendster'
www.mapquest.com, 8 distinct chars in 'mapquest'
www.mediaplex.com, 8 distinct chars in 'mediaplex'
www.microsoft.com, 8 distinct chars in 'microsoft' ***
www.net-offers.net, 8 distinct chars in 'net-offers'
www.xinhuanet.com, 8 distinct chars in 'xinhuanet'
www.coolmanmusic.com, 9 distinct chars in 'coolmanmusic'
www.doubleclick.com, 9 distinct chars in 'doubleclick'
www.netvigator.com, 9 distinct chars in 'netvigator'
www.newsgroup.com.hk, 9 distinct chars in 'newsgroup'
www.offeroptimizer.com, 9 distinct chars in 'offeroptimizer'
www.searchscout.com, 9 distinct chars in 'searchscout'
www.adultfriendfinder.com, 10 distinct chars in 'adultfriendfinder'
www.internet-optimizer.com, 10 distinct chars in 'internet-optimizer'
www.mywebsearch.com, 10 distinct chars in 'mywebsearch'
www.tribalfusion.com, 11 distinct chars in 'tribalfusion'
Note that the calculations above assume the existence of a homograph table with a certain _absolute_ uncaught error rate: that is, a 95% reliability means that _fully 5%_ of the entire Unicode character set are homographs.

* Firstly, the number of characters that are potential homographs is almost certainly less than 10% of the Unicode repertoire (ignoring CJK). If our actual error rate is measured relative to a population of 10% of codepoints being actual homographs, catching only 95% of _real_ homographs will actually mean that only 0.5% of the Unicode codepoints will be uncaught homographs, corresponding to a reliability of 99.5% in the charts above.

* Even if a given label consists entirely of homographs, the correct combination of all of the necessary homographs may not be available in any other single legal label's character set, because of character set restrictions on those other labels. Any other combinatorial limit on the number of possible combinations of characters will have the same effect. This may account for a substantial improvement in the risk estimates. The calculations to go into this in detail are somewhat involved, though. Still, nothing a computer algebra system can't cope with.

* We can easily estimate both the number of homographs, and the reliability of their classification, by using several different independently-compiled lists, and performing capture-recapture analysis.

* Chance spoof rates will be lower again, due to the fact that you need _2_ spoof candidates to get a confusion pair, and stuff to do with Poisson statistics and finite population sizes (p[2 or more spoofable] within a single spoof-set is less than the expectation rate, for values of expectation < 1); this means that the legacy bundling conflict problem may not be as bad as feared, even with some errors in the initial homograph list used to construct bundles.

* However, a more sophisticated analysis would take into account some of the well-known human perception phenomena that may make names more spoofable, such as a tendency to ignore minor typos when speed reading. We probably need a better-defined criterion for the reasonable minimum perceptible difference between two labels, probably in terms of the number of unique non-spoofed characters that need to be present. This would tend to wind the figures back upwards again.

Would anyone be interested in having a list of possible homographs constructed, or a more detailed analysis performed? Now it's time for coffee.
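[The fixed-k arithmetic behind these estimates is easy to check mechanically. A minimal self-contained sketch follows; it reproduces only the single-k figures quoted above, not the weighting over the distinct-character distribution used for the full tables.]

#include <cmath>
#include <cstdio>

// If a fraction 'miss' of characters are uncaught homographs, a label with
// k distinct characters is spoofable only when homographs for all k were
// missed, i.e. with probability miss^k; antispoof reliability is 1 - miss^k.
int main()
{
    const double missRates[] = { 0.05, 0.02, 0.01, 0.005, 0.0025, 0.001 };
    for (double miss : missRates) {
        double reliability = 1.0 - std::pow(miss, 3);  // k = 3 distinct chars
        std::printf("list %.2f%% accurate -> %.7f%% reliability (k=3)\n",
                    100.0 * (1.0 - miss), 100.0 * reliability);
    }
    // The "almost seven nines" example: 50 misses out of ~10000 characters.
    std::printf("1 - (50/10000)^3 = %.8f\n",
                1.0 - std::pow(50.0 / 10000.0, 3));
    return 0;
}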
If people attempt phishing with homograph attacks, they rely on people being familiar with the name of a website. Of course, a homograph attack consisting in spoofing an unknown site is possible, but it would make little sense (in that case, the homograph aspect of the fraud would not actually help the perpetrators). There might be targeted phishing attacks against specific people or groups of people, but this is very difficult to do on a large scale. So, for untargeted phishing attacks that can easily be done with spamming, the perpetrators have to use widely known websites. Also, the problem of spoofing does not concern all kinds of websites equally.

Therefore, it might make sense to make a list of the most important potential victims of spoofing (maybe a few hundred common websites, maybe a few thousand), and the browser should give a strong warning and not proceed without user action when something looks like a spoof of such a website. I think this is not only a temporary solution; even if we have a relatively good way of treating homographs in general, it would probably make sense to make such a distinction. There can be legitimate reasons for mixing scripts (e.g. XML-документы for collections of Russian XML documents) and there can be other reasons for false alerts. Therefore, the warnings in the general case should not be too disruptive for people who use these possibly legitimate domains. On the other hand, if a domain name looks the same as a well-known website, such as Paypal, the warning and disruption level should be different. I think, in the end, using a list of major well-known websites will be useful, whatever the solution for the general problem is.

Of course, this presupposes that homograph characters are known. Homographs between Latin, Cyrillic, Greek and Coptic are basically known, but then it gets more difficult - this is, of course, necessary both for the general case and for this suggestion of treating spoofs of "major" websites specially. I think that the practical problem is not as big as it may seem at first sight. Theoretically, any homography between characters of any character set is a problem, but practically, only cases where at least one element of the homograph pair belongs to a widespread writing system really matter. Phishing attacks targeting small groups of people are not very likely (though, on the other hand, the TLD of e-mail addresses may in many cases facilitate country- and language-specific phishing attacks). This still means that people have to go through the whole book of Unicode characters, but they only have to check whether a character looks similar to one of their own writing system.

Are there countries where the IDN system is already widely in use? At least in Europe, this does not seem to be the case, neither in Central Europe nor in countries where the Cyrillic alphabet is used. Therefore, looking for characters that are homographs of Cyrillic characters or non-ASCII Latin characters is not urgent, because right now there are hardly any widespread IDN addresses many people are so familiar with that it would make sense to spoof them in a phishing attack (and a Russian site with Latin characters that are homographic with a Russian word is likely to be a legitimate one). So, what is urgent now is determining homographs and near-homographs of the ASCII character set. It seems that widespread adoption of IDN will take enough time that data about homographs with additional characters - e.g. Cyrillic and non-ASCII Latin characters in European languages - can be collected in the meantime.

Then, however, the problem will be bigger for Cyrillic than it is for ASCII characters. While Cyrillic characters in the address of a website where nothing Cyrillic is expected (e.g. in the language configuration of the browser) are in any case suspicious, domains with Latin characters will probably remain common in countries with the Cyrillic alphabet because they have been in use for such a long time, and, as has been pointed out in this discussion, many existing Russian addresses are legitimate homographs of Russian words created with Latin characters. Furthermore, there is Serbia, where both the Cyrillic and the Latin alphabets are used. A good solution would, of course, be if registrars in these countries prevented the registration of new domains that look the same as existing ones (taking into account at least the Cyrillic and Latin alphabets; homographs from other character sets could still be treated as suspicious by the browser). When the IDN system is widely adopted, good browsers for people in Eastern Europe will probably have to display Latin and Cyrillic characters differently anyway, not only to prevent phishing attacks, but also just to avoid confusion, especially with domains that are abbreviations and can easily be accidental homographs.
(In reply to comment #224) The idea of displaying Latin and Cyrillic characters differently in domain names is very interesting. It seems to me that there may already be conventions in ordinary documents (e.g. email, Web pages), such as the hyphen between Latin and Cyrillic. I think one Russian person on the Unicode mailing list even said he had trouble thinking of any examples *without* a hyphen. Is the hyphen the only way it is done? The best way? What other ways are used? Thanks.
(In reply to comment #225) > (In reply to comment #224) > > ... Is the hyphen the only way it is done? The best way? What other ways > are used? Thanks. I think someone already struck down color-coding different character sets (which is what I would have liked, especially if it meant color-coding the whole URI, not just the domain), but it's an example of an alternative way to tell them apart.
Depends on: 286534
Depends on: 286535
I'm glad that there's so much thought going into this issue, but I think much of it isn't practical. To the end user, this is a browser-level issue, and we need to treat it as such.

Next, the problem is much simpler than it looks. Forgetting about punycode, can you really tell that www.paypal.com and www.paypa1.com are different? Can you do it in Times New Roman? Now add in punycode, and a multi-language, multi-charset world, and we've got a headache. Somehow, the user needs to be alerted to the possibility that they're going to a different site, but without being too intrusive. Why not just have two URL boxes, or a URL box and a label next to it:

www.paypal.com
www.xn--pypal-4ve.com

Somehow, get the UI to look nice, maybe a mouse-over or an alert bubble. For users going to IDN sites, they need to deal with this. For everyone else, they can ignore it. It's one better than "just display everything in punycode".

Another note -- most people see this as a problem of people thinking they're at ASCII sites when they're actually at punycoded sites. But suppose you're living in Spain, going to www.aΙa.com (with GREEK CAPITAL LETTER IOTA, U+0399), and you click on a link that goes to www.aІa.com (with CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I, U+0406): they both look the same, and neither is ASCII. With the above method, they'd see:

www.aΙa.com
www.xn--aa-09b.com

They'd probably not know that they're going to the wrong address, and short of maintaining a list of valid "similar" DNS entries, I don't see any general solution to this problem.
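[A sketch of how a browser could produce the second, punycode line for such a display, using ICU's IDNA API -- an assumption for illustration only: this UTS #46 API postdates the comment above, and Gecko's own converter is nsIDNService::ConvertUTF8toACE.]

#include <cstdio>
#include <unicode/uidna.h>

// Convert a UTF-8 host name to its ACE ("xn--...") form so that both the
// readable and the punycode spellings can be shown side by side.
int main()
{
    UErrorCode status = U_ZERO_ERROR;
    UIDNA* idna = uidna_openUTS46(UIDNA_DEFAULT, &status);
    UIDNAInfo info = UIDNA_INFO_INITIALIZER;
    const char* host = "www.p\xD0\xB0ypal.com";  // the 'а' is Cyrillic U+0430
    char ace[256];
    int32_t len = uidna_nameToASCII_UTF8(idna, host, -1, ace, sizeof(ace),
                                         &info, &status);
    if (U_SUCCESS(status) && info.errors == 0)
        std::printf("%s -> %.*s\n", host, (int)len, ace);
        // prints: www.pаypal.com -> www.xn--pypal-4ve.com
    uidna_close(idna);
    return 0;
}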
This bug is on my plate for 1.8, but I'm not exactly working on a solution, and time's ticking. I have many other important things to do for 1.8, and I'm personally fine with the current solution of rendering only punycode, because I believe that the IDN spec is pretty broken (homographs of '/' considered valid -- come on!). If someone wants to champion a solution for Mozilla that would enable us to safely enable IDN in some form, then by all means run with it. I'll help where I can, but I don't have the time to develop a solution myself.

I'm reducing the severity of this bug to minor because it only applies when the default preferences are changed. The original setting of critical was correct for Firefox 1.0 and earlier Mozilla-based browsers, but it no longer applies.

I half expect my comments to raise a ruckus in this bug. Please keep any comments brief and constructive. Already, this bug report has grown to a length that would deter most from venturing to read it, let alone actually work on it. Not that there aren't plenty of great comments here... let's just keep it that way ;-)
Severity: critical → minor
Priority: -- → P3
Target Milestone: mozilla1.8beta1 → mozilla1.8beta2
*** Bug 288667 has been marked as a duplicate of this bug. ***
(In reply to comment #228)
> If someone wants to champion a solution for Mozilla that would enable us to
> safely enable IDN in some form, then by all means run with it. I'll help
> where I can, but I don't have the time to develop a solution myself.

What would need to be done? Would it be enough to maintain a whitelist of TLDs in all.js, and then, in nsIDNService::Normalize (http://lxr.mozilla.org/seamonkey/source/netwerk/dns/src/nsIDNService.cpp#253), change:

  if (mShowPunycode)
    return ConvertUTF8toACE(input, output);

to something like:

  if (mShowPunycode || !domainIsInWhitelist(input))
    return ConvertUTF8toACE(input, output);

If you think this would be enough, I might take a shot at it...
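[One possible shape for the proposed helper -- hypothetical code, not the patch that eventually landed via bug 286534 -- keyed off per-TLD boolean prefs of the (assumed) form network.IDN.whitelist.<tld>.]

#include "nsCOMPtr.h"
#include "nsIPrefBranch.h"
#include "nsIPrefService.h"
#include "nsServiceManagerUtils.h"
#include "nsString.h"

// Hypothetical: take the TLD of |host| and look up a boolean pref named
// network.IDN.whitelist.<tld>; absent prefs leave |value| at PR_FALSE.
static PRBool domainIsInWhitelist(const nsACString& host)
{
    nsCAutoString tld(host);
    tld.Trim(".");                   // tolerate a trailing root dot
    PRInt32 dot = tld.RFindChar('.');
    if (dot != kNotFound)
        tld.Cut(0, dot + 1);         // keep only the last label

    nsCAutoString pref(NS_LITERAL_CSTRING("network.IDN.whitelist."));
    pref.Append(tld);

    PRBool value = PR_FALSE;
    nsCOMPtr<nsIPrefBranch> prefs = do_GetService(NS_PREFSERVICE_CONTRACTID);
    if (prefs)
        prefs->GetBoolPref(pref.get(), &value);
    return value;
}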
(In reply to comment #0)

A coworker of mine, while reading the IDNA spec (RFC 3490), found a potentially insidious variant on this vulnerability. Apparently it is possible to encode the label separator (typically ASCII 0x2E, or '.'), as well as the other valid label separators specified in RFC 3490, as an HTML entity embedded within a URL. The valid separators are U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop), and U+FF61 (halfwidth ideographic full stop). A sample of this is here:

http://www.sleepwalk.org/279099_test.html

Essentially this means that the following URLs all resolve to www.google.com:

http://www.google&#x002E;com
http://www.google&#x3002;com
http://www.google&#xFF0E;com
http://www.google&#xFF61;com

This is insidious because it's somewhat different from the homographic attack described in the bug. Instead of a URL using punycode to look like something it isn't (and thus redirecting a user to a location different from the expected destination), there is an underlying translation going on that makes these different label separators equivalent! This is bad because it makes it difficult for software to programmatically parse URLs if there is a way to obscure the label separator.

Also consider that encoding the separator as an HTML entity is not the only way to obscure the URL. A malicious sender could simply insert one of these equivalent separators in UTF-8 (<E3><80><82> for instance). See the above sleepwalk.org URL for an example that also resolves to www.google.com. It seems to be a bug that Firefox is treating these stop characters as equivalent.
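[For reference, RFC 3490 section 3.1 lists exactly those four label separators, so a parser -- say, in a mail filter -- can canonicalize them before matching. A minimal sketch, assuming UTF-8 input; the function name is illustrative.]

#include <string>

// Map the three non-ASCII label separators of RFC 3490 (U+3002, U+FF0E,
// U+FF61) to '.' so that downstream parsing sees one canonical form.
std::string NormalizeLabelSeparators(const std::string& utf8Host)
{
    static const char* const kSeparators[] = {
        "\xE3\x80\x82",  // U+3002 IDEOGRAPHIC FULL STOP
        "\xEF\xBC\x8E",  // U+FF0E FULLWIDTH FULL STOP
        "\xEF\xBD\xA1",  // U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP
    };
    std::string out = utf8Host;
    for (const char* sep : kSeparators) {
        std::string::size_type pos;
        while ((pos = out.find(sep)) != std::string::npos)
            out.replace(pos, 3, ".");  // each separator is 3 bytes in UTF-8
    }
    return out;
}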
The IDN spec requires that we treat those characters like a period. That we display them in the status bar and URL bar in the normalized form is a good thing: it means that authors can't get the strange forms that mean the same thing but look different into the URL bar or status bar. So early normalization is a good thing here, and I'm not sure why (or perhaps even whether) you think otherwise (unless you're testing 1.0 rather than 1.0.1).
(In reply to comment #232)
> The IDN spec requires that we treat those characters like a period. That we
> display them in the status bar and URL bar in the normalized form is a good
> thing: it means that authors can't get the strange forms that mean the same
> thing but look different into the URL bar or status bar. So early normalization
> is a good thing here, and I'm not sure why (or perhaps even whether) you think
> otherwise (unless you're testing 1.0 rather than 1.0.1).

I totally agree that the IDN spec requires these separator characters to be declared equivalent by a compliant application. However, I disagree with Firefox's handling of data that contains Unicode characters. According to the IDN spec, properly encoded (ACE form) domain names should always contain *only* ASCII characters. Firefox is recognizing malformed domains -- those that contain 8-bit data. If my IDN were "www.google\u3002com", toASCII() would output "www.google.com". This output is correct and is the ACE form that should appear in data accepted and interpreted by Firefox.

Firefox should only accept valid ACE-encoded domains in URLs. By recognizing malformed (8-bit) domains, we're opening up a big hole. A malicious user could easily obscure a domain in this way. For a demo of the output of toASCII/toUnicode, see http://www-950.ibm.com/software/globalization/icu/demo/domain
(In reply to comment #233)
> Firefox should only accept valid ACE-encoded domains in URLs.

Why? It defeats a significant part of the point of IDN if we require authors to have the ACE in their HTML rather than the Unicode, and it has no security advantages whatsoever unless we're depending on view-source for security, which we're not.

> By recognizing malformed (8-bit) domains,
> we're opening up a big hole. A malicious user could easily obscure a
> domain in this way.

What hole? We normalize before showing a URL in the status bar, the URL bar, or even copying to the clipboard (copy link location). (See attachment 174532 [details] to test this.) So there's no way the user will ever see the non-normalized form unless they view the source of the HTML.
(In reply to comment #234)
> Why? It defeats a significant part of the point of IDN if we require authors
> to have the ACE in their HTML rather than the Unicode, and it has no security
> advantages whatsoever unless we're depending on view-source for security,
> which we're not.
> What hole? We normalize before showing a URL in the status bar, the URL bar,
> or even copying to the clipboard (copy link location). (See attachment 174532
> [details] [edit] to test this.) So there's no way the user will ever see the
> non-normalized form unless they view the source of the HTML.

The primary hole that concerns me is in HTML email, specifically spam/phishing scams/etc. Anti-spam software tends to look at URLs included in messages for suspect domains, from RBLs or other sources. By recognizing malformed domains in Firefox (as well as other browsers), we've just created an easy way for spammers to get around mail filters. I suppose that the anti-spam community could modify their programs to parse these malformed domains. Note also that according to the IDN spec, "domain name slots" should always contain ACE (ASCII) domain labels, the output of toASCII(), and this includes URIs in HTML data:

> A "domain name slot" is defined in this document to be a protocol
> element or a function argument or a return value (and so on)
> explicitly designated for carrying a domain name. Examples of domain
> name slots include: the QNAME field of a DNS query; the name argument
> of the gethostbyname() library function; the part of an email address
> following the at-sign (@) in the From: field of an email message
> header; and the host portion of the URI in the src attribute of an
> HTML <IMG> tag.
(In reply to comment #235) RFC 3490 (IDNA) section 3.1 requirement 2 appears to require the Punycode (ASCII) form, as you say. However, there are implementations that support numeric character references (e.g. &#x3002;) in domain names in URIs in HTML, including Mozilla, Opera and i-Nav, I believe. They may have been supporting this for a while, and there may now be quite a lot of HTML pages out there that depend on this behavior, so I don't know how realistic it would be to try to get the implementations to comply with this part of the spec. In any case, this issue is separate from the homograph issue. If you would like to pursue it, may I suggest filing a separate bug?
(In reply to comment #236) > (In reply to comment #235) > In any case, this issue is separate from the homograph issue. If you would > like to pursue it, may I suggest filing a separate bug? Done. Filed as bug 289183. I have checked the recent archive of spam on our lab machines (I work at an anti-spam company) and have not seen this in the field. Not yet. I assume this is because IE doesn't incorrectly interpret these malformed domains (I guarantee if it worked this way in IE, we'd see it). This is definitely the time to fix it in Firefox, before the spam starts coming in!
(In reply to comment #237) Thanks for filing the new bug. MSIE doesn't support IDNA yet. That's why I mentioned i-Nav (an IDN plug-in for MSIE).
Another possible new issue related to this bug: see Erik van der Poel's comments on the IDN mailing list, idn=at=ops=dot=ietf=dot=org. According to Erik, U+1160, HANGUL JUNGSEONG FILLER, is displayed in IDNs by Firefox (and presumably other Gecko-based products) as a wide space, and is therefore a homograph for the ASCII space. This is a potentially large security hole for phishing/spoofing. (The same is apparently true of the Internet Explorer plug-in.)

In a reply to that, Soobok Lee states that U+1160 is not touched by NFC normalization, and therefore gets through Nameprep/Stringprep. Apparently, U+1160 is only meaningful in conjunction with Hangul characters, and he recommends that a standalone U+1160 should always be deleted, regardless of what the existing IDN standards say. This also raises interesting questions about stray combining characters in general.
(In reply to comment #239) Filed bug 289588 to address the U+1160 font display issue itself. We need to watch IETF and Unicode to see how they respond to the Korean fillers and leading combining marks in IDNA/Stringprep.
Depends on: 290275
Darin, is any more work here planned to happen in the next few days? If not, then this probably needs to get pushed out to 1.8b3 or beyond.
No, I have no plans to work on this for 1.8b2. I'm not even sure that I will have time for Gecko 1.8. Help would be greatly appreciated.
Target Milestone: mozilla1.8beta2 → mozilla1.8beta3
darin doesn't have any time for this in beta2, may not have time to get it in to 1.1.
Flags: blocking1.8b3?
Flags: blocking1.8b2+
Flags: blocking1.8b-
Flags: blocking-aviary1.0.1-
Bug 286534 fixes part of this bug. We also need to have a small blacklist of characters which IDN allows but in fact we never allow because they are confusable with URL delimiters. I don't know if there is a bug for this yet. This will not cause significant interoperability problems because none of those characters are in the character tables of the TLDs which will be whitelisted. Gerv
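[For illustration only -- the actual character list belongs to bug 283016 -- such a blacklist check could be as small as this, with a few example code points that are confusable with URL delimiters.]

#include <algorithm>
#include <cstddef>
#include <iterator>

// Illustrative blacklist of characters confusable with URL delimiters;
// the real list is the subject of bug 283016.
static const char32_t kDelimiterConfusables[] = {
    0x2044,  // FRACTION SLASH, confusable with '/'
    0x2215,  // DIVISION SLASH, confusable with '/'
    0xFF0F,  // FULLWIDTH SOLIDUS, confusable with '/'
    0xFF03,  // FULLWIDTH NUMBER SIGN, confusable with '#'
    0xFF1F,  // FULLWIDTH QUESTION MARK, confusable with '?'
};

// Return true if the decoded (UTF-32) label contains none of the
// blacklisted code points.
bool LabelIsSafe(const char32_t* label, std::size_t length)
{
    const char32_t* begin = std::begin(kDelimiterConfusables);
    const char32_t* end = std::end(kDelimiterConfusables);
    for (std::size_t i = 0; i < length; ++i)
        if (std::find(begin, end, label[i]) != end)
            return false;
    return true;
}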
The character blacklist issue is bug 283016 and the IDN tracking bug is bug 237820.
Blocks: sbb-
No longer blocks: sbb?
Flags: blocking1.8b3? → blocking1.8b3-
Can I reassign this bug to someone (gerv, jshin, ?) who is actually working on this? Thanks!
I'm not sure that this bug has significant remaining value, but I'll assign it to me for the moment. Gerv
Assignee: darin → gerv
Status: ASSIGNED → NEW
Whiteboard: [sg:fix] → [sg:spoof]
Flags: blocking-aviary1.5+
What is required for detecting mixed scripts as outlined in UTR #36? Does Mozilla have an internal data structure that stores the properties and script of Unicode characters?
jshin: are you able to answer the question in comment #248? Gerv
gerv, the intl library has a currently disabled API for Unicode character properties, but it doesn't have an API for script identification. gfx/src/win has an internal routine for that, but it's not public yet. Perhaps we have to move it to intl, refine it, and make it accessible to others.
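[As a sketch of what such a script-identification check could look like once exposed -- written here against ICU's public uscript API purely as an assumption, not the internal gfx/src/win routine mentioned above.]

#include <set>
#include <unicode/uscript.h>
#include <unicode/utypes.h>

// Flag a label that mixes two or more real scripts. COMMON and INHERITED
// characters (digits, combining marks, ...) take on the surrounding
// script, so they are not counted.
bool LabelMixesScripts(const UChar32* label, int32_t length)
{
    std::set<UScriptCode> scripts;
    for (int32_t i = 0; i < length; ++i) {
        UErrorCode status = U_ZERO_ERROR;
        UScriptCode sc = uscript_getScript(label[i], &status);
        if (U_FAILURE(status))
            continue;
        if (sc != USCRIPT_COMMON && sc != USCRIPT_INHERITED)
            scripts.insert(sc);
    }
    return scripts.size() > 1;  // e.g. a Cyrillic 'а' inside Latin "paypal"
}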
(In reply to comment #250) > Perhaps, we have to move it to > intl, refine it and make it accessible by others. Sounds good to me :)
Blocks: 316730
Cross-reference: see bug 316727 for mixed-script detection code, which I currently plan to use to trigger Punycoded display when incompatible scripts are mixed. This is designed to be consistent with the version 2.0 ICANN IDN recommendations, which directly address homograph spoofing issues.
*** Bug 319397 has been marked as a duplicate of this bug. ***
Flags: testcase+
Flags: in-testsuite+ → in-testsuite?
Regarding blacklisting: I would be sad if you blacklisted certain characters and/or prevented mixing. I might want a subdomain in a certain language, or with an odd character, just for fun. For example:

http://xn--7xa.m8y.org/ http://φ.m8y.org/
http://xn--h4h.m8y.org/ http://☠.m8y.org/
http://xn--j4j.m8y.org/ http://☢.m8y.org/

Heck, there are a reasonable number of combinations like that already registered as domains. They are harmless and fun, kind of a nice "extra" for browsers that can handle it. I would prefer that options like notifying/colouring not be combined with blacklisting, or at least that blacklisting not eliminate all non-linguistic symbols. Thanks.
Oh, and regarding colouring and concerns for the visually impaired: even if there were no additional notification text, wouldn't someone using a screen reader get the actual character name for a spoof? If it looked like an i but was a different Unicode character, wouldn't the reader speak that character's name? I don't know, not having a screen reader handy to test.
We now implement a whitelist of TLDs which have sensible practices. Gerv
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Any pointers to a bug where that was implemented? Can't seem to find it.
(In reply to comment #257)
> Any pointers to a bug where that was implemented? Can't seem to find it.

Bug 286534: http://www.mozilla.org/projects/security/tld-idn-policy-list.html
There's now a useful tool for investigating spoofing at the Unicode Consortium site: http://unicode.org/cldr/utility/confusables.jsp
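The same confusables data is also exposed programmatically through ICU's spoof-checking API, so the tool's checks can be reproduced in code. A minimal sketch (this is ICU's API, not something Gecko ships):

#include <stdio.h>
#include <unicode/uspoof.h>

int main(void) {
  UErrorCode status = U_ZERO_ERROR;
  USpoofChecker* sc = uspoof_open(&status);  // default checks enabled
  if (U_FAILURE(status)) {
    return 1;
  }
  // "pаypal" with Cyrillic U+0430 (UTF-8: 0xD0 0xB0) vs ASCII "paypal".
  const char* fake = "p\xD0\xB0ypal";
  const char* real = "paypal";
  int32_t result = uspoof_areConfusableUTF8(sc, fake, -1, real, -1, &status);
  if (U_SUCCESS(status) && result != 0) {
    printf("confusable, flags 0x%x\n", result);  // nonzero means confusable
  }
  uspoof_close(sc);
  return 0;
}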
Is there a reason why Mozilla wouldn't whitelist some obviously harmless characters for all TLDs? I am thinking of Latin-1 characters 192 through 255 (excluding 215, ×) for a start. The current policy makes IDNs practically useless in the most common TLDs, and this would make them work at least for some of the most common Latin-based languages.
We don't have a character whitelist. Gerv
I see, but why? It doesn't seem like something that is hard to implement. If I had to guess, it would take some lines in nsIDNService::isInWhitelist, a key like 'network.IDN.whitelist_chars' in all.js, and some definitions.
Consider http://www.paypäl.com/, which uses only the characters you are proposing to whitelist. Despite being made entirely of common Latin-1 characters, it is clearly a spoofing risk for http://www.paypal.com/. The registry is potentially in a position to prevent this sort of confusion, since it already knows which domains have been issued, but a browser-based algorithm is not.
(In reply to comment #263)
> Consider http://www.paypäl.com/, which uses only the characters you are
> proposing to whitelist. Despite being made entirely of common Latin-1
> characters, it is clearly a spoofing risk for http://www.paypal.com/.

This is not a particularly strong argument; otherwise you'd have to disallow 0/o/O or 1/I/l for being too similar. And German readers, for example, aren't likely to take an ä for an a anyway...

> The registry is potentially in a position to prevent this sort of confusion,
> since it already knows which domains have been issued, but a browser-based
> algorithm is not.

Most homograph attacks are based on similar characters from different scripts or alphabets, e.g. "a" (0x61) vs. "а" (0x430). That's why .eu IDNs must not mix Latin, Greek and Cyrillic characters.
> otherwise you'd have to disallow
> 0/o/O or 1/I/l for being too similar

You're right about that: it's actually quite a strong argument for disallowing those at the registry, _in addition_ to non-ASCII confusables. In my opinion, the registries should do exactly that.

Regarding German users: yes, German users might well be more likely to notice the umlaut than others, but 1) most Internet users are not literate in German, and 2) even German-literate readers will generally read what they expect if they are already expecting to read "paypal".

Finally, regarding "whole-script" confusables, you might want to take a look at
http://unicode.org/cldr/utility/confusables.jsp?a=paypal&n=on&x=on&s=on
for examples of how mere constraints on script mixing are not nearly enough to prevent confusion, even when substantial efforts have been made to restrict the character repertoire.
Hmm, I would have thought that paypäl is impossible to confuse with paypal, which is why I said "obviously harmless characters". Maybe that really is a question of being used to diacritics. But still, unicode.org does not list a and ä as confusables, and Opera and IE would show that kind of IDN. So this is probably not a clear-cut case, and I think it should be reconsidered. Indeed, paypa1.com would be far more dangerous (if it weren't registered to an anti-fraud company), but you wouldn't want to disallow 1und1.com and the like, even though registrars don't check whether there is a site lundl.com (actually, there is).
There's still more that can be done here. See http://www.idnnews.com/?p=7109: Chrome displays the fake www.аmazon.com as http://www.xn--mazon-3ve.com/ on hover, but Firefox still shows it as http://www.аmazon.com.
I filed bug 750587 for Brad's concern.
This seems to be back and being publicly discussed. https://www.wordfence.com/blog/2017/04/chrome-firefox-unicode-phishing/
There is also a report on SUMO (Support Mozilla) asking about the same wordfence blog:
https://support.mozilla.org/t5/Firefox/firefox-phishing-warning/m-p/1391610

The poster of that question marked it solved after using about:config to toggle the pref network.IDN_show_punycode to true, which is the workaround suggested in the wordfence blog.

I note that, unlike the examples in comment 2

(In reply to Daniel Veditz [:dveditz] from comment #2)
> Created attachment 171916 [details]
> more examples
>
> from a spreadfirefox.com blog I found out this morning about
> http://www.retrosynth.com/misc/phishing.html which plays with the same idea:
> www.xn--amazn-mye.com
> www.xn--micrsoft-qbh.com
> www.xn--papal-fze.com
> ....

where the fake and real URLs display distinctly on mouseover, the wordfence example gives a mouseover display for the fake URL that visually matches the genuine one. It uses

<a href="https://www.xn--e1awd7f.com/" target="_blank">

to spoof

https://www.еріс.com/

As an additional twist, they have also obtained an SSL cert for the fake site from the Mozilla-affiliated https://letsencrypt.org/.
Agreed that this bug is not yet fixed. This URL also popped up on Hackaday:
https://www.xn--80ak6aa92e.com/
which spoofs apple.com.

The idea behind how Firefox deals with IDN is explained in https://wiki.mozilla.org/IDN_Display_Algorithm:

> Instead, we now augment our whitelist with something based on ascertaining whether all the characters in a label all come from the same script, or are from one of a limited and defined number of allowable combinations. The hope is that any intra-script near-homographs will be recognisable to people who understand that script.
> We retain the whitelist as well, because a) removing it might break some domains which worked previously, and b) if a registry submits a good policy, we have the ability to give them more freedom than the default restrictions do.

So an IDN is shown as Unicode if its TLD is on the whitelist or, failing that, if it meets the criteria above. The example I linked to uses only the Cyrillic alphabet and is therefore displayed as Unicode under the single-script rule of the algorithm. Perhaps, even if you allow IDN labels, you need to visually distinguish them, for example by marking the domain in a different color.
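Restating the wiki's decision procedure as code may make it clearer. This is a paraphrase under the assumption that the two helpers exist (TldIsWhitelisted is hypothetical; LabelMixesScripts is sketched in an earlier comment), not the actual nsIDNService implementation:

#include <string>
#include <unicode/utypes.h>

// Hypothetical helpers assumed by this sketch:
bool TldIsWhitelisted(const std::string& tld);               // registry whitelist lookup
bool LabelMixesScripts(const UChar* label, int32_t length);  // see earlier sketch

enum class IdnDisplay { kUnicode, kPunycode };

IdnDisplay ChooseDisplay(const std::string& tld,
                         const UChar* label, int32_t length) {
  // 1. A registry with an accepted anti-spoofing policy keeps Unicode
  //    display unconditionally (the retained whitelist).
  if (TldIsWhitelisted(tld)) {
    return IdnDisplay::kUnicode;
  }
  // 2. Otherwise show Unicode only if every character in the label comes
  //    from a single script (the real rule also allows a limited set of
  //    script combinations, omitted here).
  if (!LabelMixesScripts(label, length)) {
    return IdnDisplay::kUnicode;
  }
  // 3. Everything else falls back to the raw xn-- Punycode form.
  return IdnDisplay::kPunycode;
}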
This bug is old and is already resolved and marked as fixed, so it is probably not productive to keep commenting here. A newer and currently reopened bug covering the subject is bug 1332714.

Bugzilla is, however, not the best place for general discussion of complex issues involving languages and ICANN policy. Any action to mitigate these issues is likely to have downsides, such as hitting legitimate sites with blocks or display problems. There are always the standard Mozilla forums:

https://www.mozilla.org/about/forums/
https://www.mozilla.org/en-US/about/forums/#dev-security
https://groups.google.com/forum/#!forum/mozilla.dev.security.policy