Open Bug 1376641 Opened 7 years ago Updated 10 months ago

Dealing with confusables in domain names

Categories

(Firefox :: Address Bar, enhancement, P3)

Tracking Status
firefox57 --- wontfix

People

(Reporter: annevk, Unassigned)

References

(Blocks 1 open bug)

Details

Here's an idea for dealing with confusables in domain names:

1. We compute a canonical domain name for each entry in history. The exact details are TBD, but consider removing hyphens, removing diacritics, replacing "1" with "l", "rn" with "m", etc. (See also https://unicode.org/reports/tr39/#Confusable_Detection.)
2. We compute a canonical domain name _destination_ for the domain the user is navigating to.
3. We compare _destination_ with each canonical domain name in history.
4. If there's a match and the corresponding actual domain names are different, we ask the user if they meant to navigate to a different URL. We can still navigate, since their cookies et al. should be safe unless the user takes action.

The dialog should allow for navigating to the address in history (and if the user takes that action we shouldn't put _destination_ into history), but also allow for the user to somehow state that it's not a duplicate, in which case we shouldn't show the dialog again. (That probably makes the above model a little more involved, but hopefully the general idea is clear. A rough sketch of steps 1-3 follows below.)
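
A minimal sketch of steps 1-3, assuming a tiny hand-rolled confusable map in the spirit of UTS #39 skeleton matching; a real implementation would use the full confusables.txt data, and every name below is illustrative only:

import unicodedata

# Tiny illustrative subset of a confusable map; UTS #39's confusables.txt
# defines the real thing.
CONFUSABLE_MAP = {"rn": "m", "1": "l", "0": "o"}

def canonical(domain):
    """Steps 1/2: compute a canonical ("skeleton") form of a domain name."""
    # Strip diacritics: NFKD-decompose, then drop combining marks.
    s = unicodedata.normalize("NFKD", domain.lower())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = s.replace("-", "")                   # remove hyphens
    for src, dst in CONFUSABLE_MAP.items():  # fold confusable sequences
        s = s.replace(src, dst)
    return s

def confusable_matches(destination, history):
    """Step 3: history entries whose canonical form collides with _destination_."""
    dest = canonical(destination)
    return [h for h in history if h != destination and canonical(h) == dest]

print(confusable_matches("goog1e.com", ["google.com", "mozilla.org"]))
# ['google.com'] -> step 4: ask the user whether they meant google.com
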
Blocks: 1332714
It seems to me that you want to deal with in-Latin-script confusables, rather than with whole-string cross-script confusables, so I'm not sure why this should affect bug 1332714 (which, by the way, seemed to be not a bug but intended behaviour, so I'm not sure why it was reopened).

More importantly, you cannot deal with confusables with one-way normalizations: the algorithm you propose would be able to detect that "goog1e.com" is a potential spoof of "google.com", but would not be able to detect that "formulal.com" is a potential spoof of "formula1.com". You would instead need to compute, every time, the entire set of combinations that can be created by replacing each single letter with all of its potentially confusable counterparts, and check all of them - and they quickly number in the hundreds or thousands.

Moreover:
- this algorithm (in the complex version) already exists, it's the IDN variant mechanism, and it should be applied just once by TLD registries at registration, rather than each and every time by every Internet user whenever they encounter a domain name;
- however, at the policy level, no one ever thought that (e.g.) "1" and "l" should be considered variants, so that you cannot have legitimate domain names differing only by "1" in place of "l"; so your algorithm would not just be ineffective and CPU-heavy, but also generate false positives.
It wouldn't be CPU-heavy, since we only compare against sites the user has already visited. So if the user visited formula1.com and then gets directed to formulal.com, it would be a simple string comparison (against each history entry, which can be further optimized) after normalizing both sides, and then an alert of sorts to the user that they may be on the wrong site. (Sketched below.)
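
To illustrate, a sketch under the same assumptions as the earlier one: a single one-way fold applied to both sides makes detection a plain comparison in either direction, with no permutation explosion (the fold is again a made-up fragment of a real confusable table):

def fold(domain):
    """Tiny illustrative confusable fold ("1" -> "l", "rn" -> "m")."""
    return domain.lower().replace("1", "l").replace("rn", "m")

# Both spellings collapse to the same skeleton, so the spoof is caught in
# either direction, contrary to the one-way objection above:
assert fold("formulal.com") == fold("formula1.com")
assert fold("goog1e.com") == fold("google.com")

# Per-navigation cost is one fold plus one hash lookup against pre-folded
# history, not an enumeration of hundreds of permutations.
folded_history = {fold(h): h for h in ["formula1.com", "google.com"]}
destination = "formulal.com"
match = folded_history.get(fold(destination))
if match and match != destination:
    print("Possible spoof of " + match)
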
Hi,

I feel that this requires a single body to be formed that researches and maintains a publicly available list of
de-duped character sets for Unicode characters, determined by a visual computer-comparison algorithm run over
each rendered character. See the existing email communication below:

At least there will be a single source of truth, which everyone can use to determine the IDN unique de-duping (IDN-UD) domain name. Whether registrants choose to register those domain names, or ICANN blocks them, or the domain name resolution protocol is updated, or browsers and safe-browsing services decide to warn or block, everyone would be working from the same consistent set of similar-character maps.
Otherwise there is going to be a lot of confusion for the public, which then depends on just too many factors they don't know about yet: browser, safe browsing, virus guard, domain registrar, block/allow listing... there are just too many possibilities,
even for debugging or getting to the root of the problem of why a certain domain can't be accessed or why it's the wrong domain. This needs to be kept simple from a public perspective, regardless of the technical complexity for the industry in solving the issues.

It is just a matter of time until we have to start automatically comparing rendered characters for these types of spoofing, before the number of possible domain names explodes.


Hi,

Thank you for your reply!

Would a common registry of the ~20% of visually similar characters, managed by someone like Google or Mozilla,
not help? There we could at least all agree on which character sets are considered too similar to one another. The management and research of that registry would be made public
on a single website, where everyone, including those who register domain names and those looking to register similar domain names, would all agree to a common
set of homographs. Later on, everyone could at least move over to the research that has been done by this group, and it could maybe become the foundation of a new IETF standard.

Yes, since this only really affects fewer than about 3000 domains, it is a relatively small problem.
But it would be interesting to see how many different character sets there would be if this were done programmatically.

We have to solve this problem across four dimensions: font size range, platform, font type, and Unicode characters.
Then have a computer algorithm run the permutations of those four dimensions for all possible comparisons, brute force,
to produce a list of similar characters, indexed by a couple of different compound primary keys (sketched below):
Platform, Size Range, Font Type
Font, Size
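
A minimal sketch of that index, assuming the brute-force glyph comparison has already produced the groups; every name and entry here is hypothetical:

from collections import defaultdict

# Confusable groups keyed by the compound primary key named above:
# (platform, font size range, font type).
confusable_index = defaultdict(list)

def record_group(platform, size_range, font_type, group):
    """Store one set of mutually confusable characters for a configuration."""
    confusable_index[(platform, size_range, font_type)].append(frozenset(group))

# Illustrative entries: at small sizes in a sans-serif face, "l", "1" and
# "I" collide, and the pair "rn" reads as "m".
record_group("windows", "9-12pt", "sans-serif", {"l", "1", "I"})
record_group("windows", "9-12pt", "sans-serif", {"rn", "m"})

print(confusable_index[("windows", "9-12pt", "sans-serif")])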

I really think that having a single source of registry truth would be a good thing, because then everyone is in agreement,
whether it be browsers, domain name resolution, domain registrars, or those who choose to register all homographic
names for their domain name as well.

What are your thoughts on having a single source of truth?

Kind Regards,

Wesley Oliver


On Mon, May 7, 2018 at 2:14 PM, Xudong Zheng <7b2crpi3@slicealias.com> wrote:
Hi Wesley,

It's obvious that you've put a fair amount of thought in this. I'm not sure this issue is as simple as you make it out to be. At the end of the day, you are making tradeoffs between simplicity, accessibility, and security. The major browser vendors (Chrome and Firefox) have already made up their mind about how to deal with this, and the domain registrars have decided that this is an issue for browsers to deal with. I encourage you to look at Firefox's issue if you have not done so already - https://bugzilla.mozilla.org/show_bug.cgi?id=1332714 - especially considering that they decided the improvement in security isn't worth the cost in accessibility.

There are hundreds of registries, each of which has individual policies regarding how they deal with international domains. .com is very liberal due to its global nature. Others such as .es already limit themselves to a dozen or so characters. I'm not sure it's worthwhile to convince them to add further complexity to what I imagine to be an already complicated infrastructure.

On 05/07/2018 05:46 AM, Wesley Oliver wrote:
Hi,

I have included my friend at Google in this email.

Leendert, could you possibly tell me who at Google I can talk and chat with regarding
the following issues, someone who would potentially be able to assist in looking at mitigating
these IDN security issues, with regard to the IDN standard at the IETF level.

With regards to the following article:
https://www.xudongz.com/blog/2017/idn-phishing/

I would like to discuss a potential solution: a change to the standards for domain names, relating to what constitutes the unique identity of a domain name, which could then holistically solve
the remaining gaps for domain names written in the same language.

I would like to suggest this to the IETF and will have to find the working group that is responsible for domain name resolution and unique identification, so that we can develop a new backwards-compatible standard that mitigates the effects of these homograph and character-glyph visual spoofing flaws. First I would like to run the high-level concept past you, and then I will have to get into the domain name resolution standard to suggest how to change things.

https://www.chromium.org/developers/design-documents/idn-in-google-chrome

https://tools.ietf.org/html/rfc1035


-------------------
Browser solutions on top of what has already been implemented.
--------------------
A first, simple approach would be to display a flag in the browser before the URL, just like the HTTPS security indicator. The flag would be the symbol of the nearest matching country, or a flag for the Unicode character set of the language of the domain's homograph character glyphs, visually identifying the domain as unique and, potentially verified with optical character recognition, 80% distinguishable from others. This means each domain name now has another unique flag that can clearly be distinguished from other flags, where the flags themselves must also be 80% or more distinguishable from one another.


----------------------
IETF - international domain name, changing the definition of what constitutes a unique domain name.
----------------------

Similar to all existing approaches, and like the approaches above, this would classify all the Unicode letters into unique sets that are clearly distinguishable from one another by 80% or more. We can use OCR, machine learning, and many other simple techniques to do this for us in an automated way nowadays (a crude sketch follows below).
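
A crude sketch of that comparison, assuming Pillow is installed and a TrueType font exists at the hypothetical FONT_PATH; the pixel-agreement score stands in for a real perceptual or OCR-based metric:

from PIL import Image, ImageDraw, ImageFont

FONT_PATH = "DejaVuSans.ttf"  # hypothetical path; substitute any local font

def render_glyph(char, size=64):
    """Render one character as a black-on-white grayscale bitmap."""
    font = ImageFont.truetype(FONT_PATH, size)
    img = Image.new("L", (size, size), color=255)
    ImageDraw.Draw(img).text((0, 0), char, fill=0, font=font)
    return img

def similarity(a, b):
    """Crude score: fraction of pixels two rendered glyphs agree on."""
    pa = list(render_glyph(a).getdata())
    pb = list(render_glyph(b).getdata())
    return sum(1 for x, y in zip(pa, pb) if x == y) / len(pa)

# Pairs scoring above a chosen threshold (cf. the 80% figure above) would
# be grouped into one confusable set for this platform/font/size cell.
print(similarity("l", "1"), similarity("a", "b"))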

Yes, this does ignore the fact that there are different character sets, but a character set developed for IDN, or the best existing one (which you have basically already picked), will work. Given that one can only ask those who display domain names, or who write those tools, to comply with a visual representation standard, one would have to establish a body that certifies browsers as compliant with the implementations, with these logos clearly displayed to users/operators of browsers in future. This is similar to the DivX or Dolby Digital logos on music players, which the average Joe has become used to looking for. In which case we can then educate the public that their browser should bear the certification security logo, which needs to be clearly visible at all times, in the omnibox in Chrome or the address bar, for instance.

For improving domain name resolution for IDN de-duping, we are interested in the ~20% of characters that are visually similar,
from which we build a de-duping map.

Just like the URL and the domain name get URL-encoded, the domain name will now additionally be
encoded by passing it through the IDN unique de-duping process.

Domain name resolution servers will then check the IDN unique de-duping (IDN-UD) domain name against the list of registered IDN-UD domain names
in their registry (sketched below).
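
A minimal sketch of the proposed encode-and-check step, assuming the de-duping map has been published by the single source of truth; the map below is a tiny invented fragment:

# Illustrative fragment of a published de-duping map; the third key is
# Cyrillic "а" folding to Latin "a".
DEDUPE_MAP = {"1": "l", "0": "o", "\u0430": "a"}

def idn_ud_encode(domain):
    """Fold each character of a domain to its canonical representative."""
    return "".join(DEDUPE_MAP.get(ch, ch) for ch in domain.lower())

# Resolver side: the registry stores the IDN-UD form of every registered name.
registered = {idn_ud_encode("google.com"), idn_ud_encode("formula1.com")}

def collides(domain):
    """Would this name collide with an already-registered IDN-UD name?"""
    return idn_ud_encode(domain) in registered

print(collides("goog1e.com"))    # True: same IDN-UD form as google.com
print(collides("formulal.com"))  # True: same IDN-UD form as formula1.com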

The list of unique de-duping character map sets that would need to be used for this process should be maintained by a body, as an additional standard
which can continue to expand the 20% threshold as seen fit, across multiple font types, and change the thresholds accordingly.
This part of the process would require the development of a new standard whereby these sets can be securely downloaded and updated automatically
by domain name resolution servers in future. This is where I feel the bulk of the work in this exercise will sit: in the development of the protocol for
the secure exchange of this character set.

Please let me know your thoughts on this matter.

Kind Regards,

Wesley Oliver
See Also: → 1507582

As Anne indicated they were still hopeful about this idea, here are some reasons why I'm skeptical... (this is a hard problem space, and I'm not saying we definitely shouldn't do this - but I've got concerns and they do not appear to be in the bug yet, so...):

  • quite a few users disable history altogether. Would we just not protect them? That seems unfortunate;
  • as a variation of this point, it's unclear what should happen in private browsing mode. If we build on the non-private history, this is a significant fingerprinting risk, as an attacker can probably side-channel the lookup times and determine user history from that;
  • performance - history access is asynchronous (from a database on disk), and blocking navigation on iterating over all of history to see if it's potentially confusable with the current URI is likely to lead to a significant pageload penalty. Caching all the domains in memory may mitigate this a bit, but I'd still be worried about the combinatorial explosion of the entire set of history + all the permutations of the domain. It's hard to see how to optimize this, too. Keeping a local cascade of bloom filters based on history would potentially make lookups cheaper (a minimal sketch follows after this list), but then you need to add every permutation of every visited domain when it is visited, and computing all those will itself be expensive. Plus, you're then storing history in multiple places, so when removing history you need to update such a filter - which you can't do additively, the way you'd normally update a filter, because the stored-on-disk data must contain no trace of the removed history or it'd be a privacy risk. So you have to regenerate the entire filter, which is expensive;
  • what gets blocked diverges on a per-user basis, which makes it hard to troubleshoot or understand why the browser behaves the way it does. Having the same behaviour for everyone would be preferable;
  • the baseline assumption here is that the first visit is the correct one. That assumption seems shaky to me; sure, if you're phishing credentials for mytrustedsite.com, you want to appear as that site, but for lots of other personal data (SSN, address, credit card info, ...), just spoofing/phishing on anytrustedbrand.com is likely to work. In which case, again, we cannot protect the user if we're relying on history.
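
A minimal bloom-filter sketch for the idea in the performance bullet, assuming we store only one folded "skeleton" per visited domain rather than every permutation; sizes and hash counts are illustrative:

import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, hashes=4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# On each visit: history_filter.add(skeleton(domain)). On navigation, a hit
# for skeleton(destination) triggers the slower exact history check. Note
# the deletion problem from the bullet above: removing history still forces
# a full rebuild, since bits cannot be safely cleared.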

We could try to hybridize this with bug 1507582, which remedies some of these concerns, though some of the performance and potential privacy issues would still need to be addressed.

Severity: normal → S3