Closed Bug 289183 Opened 20 years ago Closed 20 years ago

href and src attributes in HTML should not be treated as IDN-aware

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

RESOLVED INVALID

People

(Reporter: dml, Assigned: smontagu)

Details

(Whiteboard: WONTFIX/INVALID?)

User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/125.5.6 (KHTML, like Gecko) Safari/125.12
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

A discussion of this issue originally took place in the bug comments of 279099.  

The Mozilla core accepts and parses internationalized domain names (IDNs) per RFC 3490.  However, 
malformed domains are also interpreted and displayed.  Domains whose label separators (the 
'.' in traditional non-IDN ASCII domains) are ideographic homographs are accepted even when 
improperly encoded.  That is, U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full 
stop), and U+FF61 (halfwidth ideographic full stop) are treated as equivalent.

Essentially this means that the following URLs all resolve to www.google.com:
http://www.google.com
http://www.google。com
http://www.google．com
http://www.google｡com
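
The equivalence described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not Mozilla's actual code; the helper name normalize_label_separators is hypothetical, and the code points are the four listed in this report.

```python
# The three non-ASCII label separators that RFC 3490 treats like U+002E.
IDNA_DOTS = "\u3002\uff0e\uff61"  # ideographic, fullwidth, halfwidth ideographic

def normalize_label_separators(hostname: str) -> str:
    """Replace every alternative label separator with an ASCII full stop."""
    return hostname.translate({ord(ch): "." for ch in IDNA_DOTS})

variants = [
    "www.google.com",        # U+002E
    "www.google\u3002com",   # U+3002
    "www.google\uff0ecom",   # U+FF0E
    "www.google\uff61com",   # U+FF61
]
for host in variants:
    print(repr(host), "->", normalize_label_separators(host))
```

All four variants map to the same ASCII hostname, which is why they resolve to the same site.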

According to RFC 3490, section 3.1 requirement 2, 

"Whenever a domain name is put into an IDN-unaware domain name slot (see section 2), it MUST 
contain only ASCII characters.  Given an internationalized domain name (IDN), an equivalent domain 
name satisfying this requirement can be obtained by applying the ToASCII operation (see section 4) to 
each label and, if dots are used as label separators, changing all the label separators to U+002E."

The result of the ToASCII operation, which produces an ASCII equivalent for a domain containing 
non-ASCII characters, is called the ACE (ASCII-compatible encoding) form.  The core is accepting URLs 
that are malformed and are not encoded in the ACE form.  In ACE form, all label separators should be 
equal to U+002E (full stop, '.').
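
For readers who want to see what the ACE form looks like, here is a small sketch using Python's built-in "idna" codec, which implements the RFC 3490 ToASCII operation. The helper name to_ace and the example hostnames are assumptions chosen for illustration.

```python
def to_ace(hostname: str) -> str:
    """Return the ACE form of a hostname (hypothetical helper name)."""
    # The codec applies ToASCII to each label and joins the labels with U+002E.
    return hostname.encode("idna").decode("ascii")

# A label with a non-ASCII character becomes an "xn--" label.
print(to_ace("b\u00fccher.example"))

# An ideographic full stop used as a separator is emitted as U+002E.
print(to_ace("www.google\u3002com"))
```

The output of to_ace contains only ASCII characters, so it is suitable for an IDN-unaware domain name slot as described in the quoted requirement.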

The problem here is one of obfuscation.  By obscuring the label separator, a malicious user (such as a 
spammer) could hide content from programmatic analysis, such as is done in anti-spam software which 
might perform RBL or other matching on the domain in a URL.  Typically URLs are key indicators of spam 
or phishing attempts; this feature could allow spammers to bypass mail filters while the HTML mail 
directs recipients to their web sites.

Reproducible: Always

Steps to Reproduce:
1. See example url (http://www.sleepwalk.org/279099_test.html)
2. Note how urls resolve (see examples in the Details section)
Actual Results:  
Domains containing malformed IDNs with homographs for label separators resolve.

Expected Results:  
IDNs should not resolve unless they are properly encoded in ACE format, which requires that all label 
separators are U+002E (full stop, '.').

See bug 279099 for further discussion.
"By obscuring the label separator, a malicious user (such as a spammer) could
hide content from programmatic analysis, such as is done in anti-spam software
which might perform RBL or other matching on the domain in a URL."

Which seems to be easy enough to solve: The anti-spam software should do the
same set of transformations before looking up or storing the URL/domain in an RBL.
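
A rough sketch of what "the same set of transformations" might look like in a filter follows. It is not taken from any real anti-spam product: the blocklist, the function names, and the spam.example domain are hypothetical, and a real RBL lookup would be a DNS query rather than a set membership test.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist standing in for an RBL.
BLOCKLIST = {"spam.example"}

def naive_is_listed(url: str) -> bool:
    """Look up the raw hostname without any IDN processing."""
    return urlsplit(url).hostname in BLOCKLIST

def idn_aware_is_listed(url: str) -> bool:
    """Convert the hostname to ACE form first, as an IDN-aware browser would."""
    host = urlsplit(url).hostname or ""
    try:
        ace = host.encode("idna").decode("ascii")
    except UnicodeError:
        return False  # not a valid IDN; a real filter might flag this instead
    return ace in BLOCKLIST

# U+3002 hides the label separator from the naive lookup.
obfuscated = "http://spam\u3002example/offer"
print(naive_is_listed(obfuscated), idn_aware_is_listed(obfuscated))
```

The naive lookup misses the obfuscated URL, while the lookup that converts to ACE form first matches it.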

What I don't really understand is:
"[RFC...]if dots are used as label separators, changing all the label separators
to U+002E."
Isn't that exactly what mozilla does and therefore correct behaviour?
I do not agree that this is a bug or issue at all.    
    
You say:    
"That is, U+002E (full stop), U+3002 (ideographic full stop), U+FF0E    
(fullwidth full stop), and U+FF61 (halfwidth ideographic full stop) are    
treated as equivalent."    
    
They are equivalent.    
   
You also quote from the RFC:   
"Whenever a domain name is put into an IDN-unaware domain name slot (see   
section 2), it MUST contain only ASCII characters.".   
   
True, but I argue that HTML sources are IDN-aware. Therefore, we are allowed   
to write those characters as separators, just as we are allowed to place   
non-ASCII characters in HREF and SRC instead of the ACE form.  
  
The proper solution here is to make the software that tries to analyse the
contents IDN-aware. If it is not, it might very well choke on other
non-ASCII characters in the hostname. The way I see it, separators are the
least of our worries, given that they cannot lead to phishing.
  
(In reply to comment #3)
> I do not agree that this is a bug or issue at all.    

Please read the comments in bug 279099.

> True, but I argue that HTML sources are IDN-aware. Therefore, we are allowed   
> to write those characters as separators, just as we are allowed to place   
> non-ASCII characters in HREF and SRC instead of the ACE form.  

You're not allowed to put non-ASCII characters in HREF and SRC fields, according to RFC 3490.  Domain 
name slots must be in the ACE form.

To quote from Section 2:

   "A "domain name slot" is defined in this document to be a protocol
   element or a function argument or a return value (and so on)
   explicitly designated for carrying a domain name.  Examples of domain
   name slots include: the QNAME field of a DNS query; the name argument
   of the gethostbyname() library function; the part of an email address
   following the at-sign (@) in the From: field of an email message
   header; and the host portion of the URI in the src attribute of an
   HTML <IMG> tag.  General text that just happens to contain a domain
   name is not a domain name slot; for example, a domain name appearing
   in the plain text body of an email message is not occupying a domain
   name slot."
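
To make the notion of a domain name slot in HTML concrete, the following sketch collects the host portion of href and src attributes and reports those that are not pure ASCII. It uses only the Python standard library; the class name SlotCollector, the function non_ascii_hosts, and the sample markup are hypothetical, and whether such hosts should be rejected is exactly what this bug is debating.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class SlotCollector(HTMLParser):
    """Collect the host portion of every href and src attribute."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                host = urlsplit(value).hostname
                if host:
                    self.hosts.append(host)

def non_ascii_hosts(html: str) -> list:
    """Return hosts that are not pure ASCII, i.e. not already in ACE form."""
    collector = SlotCollector()
    collector.feed(html)
    return [h for h in collector.hosts if not h.isascii()]

sample = (
    '<a href="http://www.example.com/">ok</a>'
    '<img src="http://www.google\u3002com/logo.png">'
)
print(non_ascii_hosts(sample))
```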
(In reply to comment #1)

> Which seems to be easy enough to solve: The anti-spam software should do the
> same set of transformations before looking up or storing the URL/domain in an RBL.

Yes.  I'm sure that anti-spam software will have to change to take this behavior into account.  I have 
already made changes in the software I work on.  However, it doesn't change the fact that this is still 
bad behavior.  Other software that analyzes HTML email or web content could be affected.  It's best to 
fix the problem at the source.  

As I mentioned in my comments on bug 279099, my corpus of spam does not show this to be a 
problem yet, which I'm sure is because these URLs do not resolve correctly in IE.  Now is the time to fix 
this before spammers start utilizing this obfuscation technique.
So what says that the href and src attributes in HTML are not allowed to be
IDN-aware?  I think they should be, and that this bug is invalid.

Also, resummarizing the bug since the issue you're harping on is not actually
the problem that you're complaining about.
Summary: malformed domain label limiters interpreted incorrectly (homographs for '.' in IDNs) → href and src attributes in HTML should not be treated as IDN-aware
RFC 3490 section 2 says:

   An "IDN-unaware domain name slot" is defined in this document to be
   any domain name slot that is not an IDN-aware domain name slot.
   Obviously, this includes any domain name slot whose specification
   predates IDNA.

HTML 4.01 is from 24 Dec 1999, RFC 3490 is from Mar 2003. Mozilla could comply
with RFC 3490 in both standard and quirks modes, in standard mode only, or in
neither mode.
So if I declare that we implement HTML 4.01.1, which is the standard published
today that is a copy of HTML 4.01, then we're OK?
:-)

Seriously, RFC 3490 is at the Proposed Standard Maturity Level. Maybe this area
could be cleared up for the next Maturity Level, Draft Standard.
This issue was actually discussed on the idn mailing list:

http://www.imc.org/idn/mail-archive/msg06179.html
http://www.imc.org/idn/entire-arch.txt

If you think about it from the point of view of an HTML author, you don't want
to put a non-Punycode IDN name in your HTML because many people are still
using non-IDNA-capable browsers, so they don't know how to convert the Unicode
domain name to one with Punycode labels. This may be one of the reasons why
RFC 3490 is written the way it is (i.e. requiring Punycode in IDNA-unaware
domain name slots).
(The deeper point behind what I was saying, however, is that HTML itself isn't
versioned, and we shouldn't treat it like it is.)
HTML5 will be (is) IDN-aware. I hereby announce that we are an HTML5 UA. Problem
solved. We should (and do, as I understand it) support IDN in HTML.
The question is whether to allow non-Punycode IDNs in HTML. How about looking
at some pros and cons of being permissive (i.e. allowing non-Punycode IDNs in
HTML)?

Pros:

P1. Consistent with other implementations of IDNA (e.g. Opera, i-Nav).
P2. Allows both Punycode and non-Punycode IDNs in HTML to work.

Cons:

C1. Some spam/phishing filters may not catch non-Punycode IDNs, allowing
tricky domain name encodings to pass through.
C2. HTML authors who only test their documents in permissive implementations
will not notice that their documents don't work in older or strict
implementations.

Of course, some of these pros and cons are more important than others. I have
not attempted to weigh them.

A beta of MSIE7 may be released this summer, possibly with IDNA support. That
may be another datapoint to consider (or not).
Um, you missed one major Pro, which outweighs everything else. It allows people
who don't use US ASCII domains to actually write their domain names in their
documents. There really is no argument to have here. HTML is IDN-aware.
True, that is another pro. So obvious, yet somehow I forgot to include it.
Whether people will actually write their non-ASCII domain names in a
non-Punycode encoding in HTML before many of their users are using IDNA
implementations is another matter. Maybe it doesn't matter to Mozilla.
That's a chicken-and-egg problem, Erik.  If no browsers allow it, people won't
use it.  If all browsers but Mozilla allow it, people will likely use it (and be
broken in Mozilla) or not use it (and curse Mozilla for keeping them from using
it).  If some browsers allow it and some don't, people will be likely to
recommend browsers that _do_ allow it, so they can use it.

So from this point of view, not implementing something like this is only ok if
other browsers also won't implement it or if implementing it causes serious
problems in some way (security issues, large engineering time investment,
whatever).  Otherwise we're just making people feel that Mozilla is broken and
breaking their content, for no good reason.
...and other browsers already implement this. As do we. This is really not an
issue IMHO. :-) WONTFIX/INVALID?
Whiteboard: WONTFIX/INVALID?
Yes, I'd say this is a WONTFIX as originally stated. We should allow people (who
are using an appropriate charset for their documents) to write IDN domains in
them in the form in which they are supposed to be typed/written.

Gerv
It may even be an INVALID. I can't seem to find anything in RFC 3490 that states
that a browser must not accept non-Punycode IDNs. It does state that non-Punycode
IDNs should not be placed in IDN-unaware domain name slots in section 3.1
requirement 2. However, section 6.1 almost seems to contradict this, since it
does not reiterate the "whose specification predates IDNA".
(In reply to comment #19)
> It may even be an INVALID. I can't seem to find anything in RFC 3490 that states
> that a browser must not accept non-Punycode IDNs. It does state that non-Punycode
> IDNs should not be placed in IDN-unaware domain name slots in section 3.1
> requirement 2. However, section 6.1 almost seems to contradict this, since it
> does not reiterate the "whose specification predates IDNA".

From RFC 3490:
> Whenever a domain name is put into an IDN-unaware domain name slot (see section 2), it MUST 
> contain only ASCII characters.
[snip]
> An "IDN-unaware domain name slot" is defined in this document to be any domain name slot that is
> not an IDN-aware domain name slot. Obviously, this includes any domain name slot whose
> specification predates IDNA.

Therefore, for example, as the src attribute in an img tag predates IDNA it is an IDN-unaware domain 
name slot and should only accept ASCII text.
Regarding comment 20, see comment 12.
Status: UNCONFIRMED → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
(In reply to comment #21)
> Regarding comment 20, see comment 12.

(Reviews earlier comment....)  Good point.  Ok, problem solved.
(In reply to comment #8)
> So if I declare that we implement HTML 4.01.1, which is the standard published
> today that is a copy of HTML 4.01, then we're OK?

RFC-3490 says that if the spec for a slot predates IDNA then the slot is
IDN-unaware (if P then Q).  You seem to be concluding that if the spec for a
slot postdates IDNA then the slot is IDN-aware (if not-P then not-Q), which
simply does not follow.
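
The logical point can be checked mechanically. The sketch below enumerates all truth assignments and looks for cases where "if P then Q" holds but "if not-P then not-Q" fails; the reading of P and Q in the comments follows the paragraph above, and the helper name implies is just for illustration.

```python
from itertools import product

def implies(a: bool, b: bool) -> bool:
    """Material implication: 'if a then b'."""
    return (not a) or b

# P: "the spec for the slot predates IDNA"
# Q: "the slot is IDN-unaware"
counterexamples = [
    (p, q)
    for p, q in product([False, True], repeat=2)
    if implies(p, q) and not implies(not p, not q)
]
print(counterexamples)
```

The single counterexample is P false and Q true: a slot whose spec postdates IDNA but which is still IDN-unaware is fully consistent with what the RFC says.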

RFC-3490 says 'An "IDN-aware domain name slot" is defined in this document to be
a domain name slot explicitly designated for carrying an internationalized
domain name as defined in this document.'

The latest HTML specs (HTML 4.01 and XHTML 1.1) both say that the href attribute
(and various others) contains a URI, and they cite RFC-2396, which defines the
host field of URIs and does not explicitly designate it for carrying IDNs (not
surprising, since 2396 < 3490).  Before RFC-3986 appeared, it was very clear
that the host field of a URI in an href attribute was an IDN-unaware slot, and
therefore non-ASCII IDNs must not be put there.

RFC-3986 obsoletes RFC-2396, and invites all sorts of names into the host field,
including non-domain-names.  The spec "delegates the issue of registered name
syntax conformance to the operating system of each application performing URI
resolution, and that operating system decides what it will allow for the purpose
of host identification".  At one point it mentions the possibility that a
reg-name could be an IDN and cites RFC-3490.  This is not quite the "explicit
designation" I had in mind, but I suppose it suffices to make URIs now
IDN-aware.  (Notice that this non-backward-compatible retroactive redefinition
of URIs deliberately bypasses the backward-compatibility protection designed
into IDNA, so that use of the backward-compatible representation is merely
recommended rather than required.)

However, even though URIs are now IDN-aware, they are still ASCII-only! 
RFC-3986 requires that all non-ASCII characters in all parts of a URI be
represented as percent-escaped UTF-8!  So they still aren't human-readable at
all.  If you want human-readable non-ASCII IDNs, you need IRIs (RFC-3987).

RFC-3987, unlike RFC-3986, does not obsolete or update the concept of URIs, and
does not retroactively alter any specs that refer to URIs.  It explicitly states
that IRIs are not to be inserted into places that expect URIs, unless the spec
explicitly invites IRIs as well.  The HTML specs don't (at least not yet).

However, even while the HTML and IDNA specs say that certain slots are only
allowed to contain certain things, and that writers must not put anything else
there, they don't say that readers must reject stuff that should never have been
written.  As far as I can tell, that's left to the discretion of the
implementation.  I think it's okay for a browser to accept IRIs wherever it
accepts URIs.  It would not be okay for an HTML editor to copy a non-ASCII IRI
verbatim from the user interface into the HTML file; it must convert the IRI to
a URI (using percent-escaped UTF-8 as described in RFC-3987).
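
As a rough illustration of the conversion an HTML editor would need to perform, the sketch below turns an IRI into an ASCII-only URI. It is a simplification under stated assumptions: the helper name iri_to_uri and the example IRI are hypothetical, userinfo is ignored, and the host is converted either with ToASCII (Python's "idna" codec) or with percent-escaped UTF-8, the two options RFC 3987 describes for the host component.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-escaping path, query, and fragment.
SAFE = "/?:@!$&'()*+,;=%"

def iri_to_uri(iri: str, punycode_host: bool = True) -> str:
    """Hypothetical helper: convert an IRI to an ASCII-only URI (no userinfo)."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    if punycode_host:
        host = host.encode("idna").decode("ascii")   # ToASCII per label
    else:
        host = quote(host, safe="")                  # percent-escaped UTF-8
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SAFE),
        quote(parts.query, safe=SAFE),
        quote(parts.fragment, safe=SAFE),
    ))

iri = "http://b\u00fccher.example/caf\u00e9"
print(iri_to_uri(iri))
print(iri_to_uri(iri, punycode_host=False))
```

Either result is pure ASCII; only the Punycode variant is also usable by software that knows nothing about IRIs or percent-escaped hostnames.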
(In reply to comment #19)
> I can't seem to find anything in RFC 3490 that states
> that a browser must not accept non-Punycode IDNs.

Me neither.  :)

> It does state that non-Punycode
> IDNs should not be placed in IDN-unaware domain name slots in section 3.1
> requirement 2.  However, section 6.1 almost seems to contradict this, since
> it does not reiterate the "whose specification predates IDNA".

Specifically what part of 6.1 do you think almost contradicts 3.1 requirement 2?
The 4th and 5th paragraphs of 6.1 almost contradict 3.1 requirement 2. When I
first read that, it was not immediately clear to me whether it was talking about
the encoding of the characters that would be visible after ToUnicode or the
encoding of the characters that appear after ToASCII.

Also, these paragraphs do not reiterate the "predates IDNA" thing, making it
even more likely to be misunderstood.

You may say that I misunderstand, and that may be true, but I still feel that
this part of the spec could be clearer.
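
The distinction between the two forms mentioned above can be seen by round-tripping a name through Python's "idna" codec, which implements both RFC 3490 operations. The example name is an assumption chosen for illustration.

```python
# ACE form: what ToASCII produces; always plain ASCII regardless of document charset.
ace = "xn--bcher-kva.example"

# Unicode form: what ToUnicode produces; its byte encoding depends on the charset
# of the document or protocol that carries it.
unicode_form = ace.encode("ascii").decode("idna")

print(unicode_form)
print(unicode_form.encode("utf-8"))    # bytes if the carrier uses UTF-8
print(unicode_form.encode("latin-1"))  # different bytes under ISO-8859-1
print(unicode_form.encode("idna"))     # back to the ACE form
```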