Closed
Bug 289183
Opened 20 years ago
Closed 20 years ago
href and src attributes in HTML should not be treated as IDN-aware
Categories
(Core :: Internationalization, defect)
RESOLVED
INVALID
People
(Reporter: dml, Assigned: smontagu)
Details
(Whiteboard: WONTFIX/INVALID?)
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/125.5.6 (KHTML, like Gecko) Safari/125.12
Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

A discussion of this issue originally took place in the bug comments of 279099.

The Mozilla core accepts and parses internationalized domain names (IDNs) per RFC 3490. However, domains that are malformed are interpreted and displayed. Domains that contain label separators (the '.' for traditional non-IDN ASCII domains) which are ideographic homographs are accepted when improperly encoded. That is, U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop), and U+FF61 (halfwidth ideographic full stop) are treated as equivalent. Essentially this means that the following URLs all resolve to www.google.com:

http://www.google.com
http://www.google。com
http://www.google．com
http://www.google｡com

According to RFC 3490, section 3.1 requirement 2, "Whenever a domain name is put into an IDN-unaware domain name slot (see section 2), it MUST contain only ASCII characters. Given an internationalized domain name (IDN), an equivalent domain name satisfying this requirement can be obtained by applying the ToASCII operation (see section 4) to each label and, if dots are used as label separators, changing all the label separators to U+002E."

The result of the ToASCII operation, which produces an ASCII equivalent for a domain containing non-ASCII characters, is called the ACE (ASCII compatible encoding) form. The core is accepting URLs that are malformed and are not encoded in the ACE form. In ACE form, all label separators should be equal to U+002E (full stop, '.').

The problem here is one of obfuscation. By obscuring the label separator, a malicious user (such as a spammer) could hide content from programmatic analysis, such as is done in anti-spam software which might perform RBL or other matching on the domain in a URL. Typically URLs are key indicators of spam or phishing attempts; this feature could allow spammers to bypass mail filters while the HTML mail directs recipients to their web sites.

Reproducible: Always

Steps to Reproduce:
1. See example URL (http://www.sleepwalk.org/279099_test.html)
2. Note how URLs resolve (see examples in the Details section)

Actual Results: Domains containing malformed IDNs with homographs for label separators resolve.

Expected Results: IDNs should not resolve unless they are properly encoded in ACE format, which requires that all label separators are U+002E (full stop, '.').

See bug 279099 for further discussion.
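To make the equivalence described above concrete, here is a minimal Python sketch that maps the four code points named in the report onto U+002E. The function name `normalize_separators` and the sample host names are illustrative assumptions only; this is not how the Mozilla core implements it.

```python
# Code points the report lists as being treated as label separators.
LABEL_SEPARATORS = {
    0x002E: "full stop",
    0x3002: "ideographic full stop",
    0xFF0E: "fullwidth full stop",
    0xFF61: "halfwidth ideographic full stop",
}

# Translation table: every recognized separator becomes U+002E.
_TO_FULL_STOP = {code_point: "." for code_point in LABEL_SEPARATORS}


def normalize_separators(host: str) -> str:
    """Return host with every recognized label separator replaced by '.'."""
    return host.translate(_TO_FULL_STOP)


# The four spellings of the same host from the report (hypothetical test data).
variants = [
    "www.google.com",
    "www.google\u3002com",
    "www.google\uff0ecom",
    "www.google\uff61com",
]

# All four spellings collapse to a single normalized host name.
normalized = {normalize_separators(v) for v in variants}
print(normalized)
```

A tool that compares host names only after this step sees one name rather than four, which is the behaviour the report observes in the browser.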
Comment 1•20 years ago
"By obscuring the label separator, a malicious user (such as a spammer) could hide content from programmatic analysis, such as is done in anti-spam software which might perform RBL or other matching on the domain in a url."

Which seems to be easy enough to solve: the anti-spam software should do the same set of transformations before looking up or storing the URL/domain in an RBL.

What I don't really understand is: "[RFC...]if dots are used as label separators, changing all the label separators to U+002E."

Isn't that exactly what Mozilla does, and therefore correct behaviour?
Comment 2•20 years ago
note bug 279099 comment 231 and the following ones
Comment 3•20 years ago
I do not agree that this is a bug or issue at all.
You say:
"That is, U+002E (full stop), U+3002 (ideographic full stop), U+FF0E
(fullwidth full stop), and U+FF61 (halfwidth ideographic full stop) are
treated as equivalent."
They are equivalent.
You also quote from the RFC:
"Whenever a domain name is put into an IDN-unaware domain name slot (see
section 2), it MUST contain only ASCII characters.".
True, but I argue that HTML sources are IDN-aware. Therefore, we are allowed
to write those characters as separators, just as we are allowed to place
non-ASCII characters in HREF and SRC instead of the ACE form.
The proper solution here is to make the software that tries to analyse the
contents IDN-aware. If they are not, they might very well choke on other
non-ASCII characters on the hostname. The way I see it, separators are the
least of our worries, given that they cannot lead to phishing.
| Reporter | ||
Comment 4•20 years ago
(In reply to comment #3)
> I do not agree that this is a bug or issue at all.

Please read the comments in bug 289183.

> True, but I argue that HTML sources are IDN-aware. Therefore, we are allowed
> to write those characters as separators, just as we are allowed to place
> non-ASCII characters in HREF and SRC instead of the ACE form.

You're not allowed to put non-ASCII characters in HREF and SRC fields, according to RFC 3490. Domain name slots must be in the ACE form. To quote from Section 2:

"A "domain name slot" is defined in this document to be a protocol element or a function argument or a return value (and so on) explicitly designated for carrying a domain name. Examples of domain name slots include: the QNAME field of a DNS query; the name argument of the gethostbyname() library function; the part of an email address following the at-sign (@) in the From: field of an email message header; and the host portion of the URI in the src attribute of an HTML <IMG> tag. General text that just happens to contain a domain name is not a domain name slot; for example, a domain name appearing in the plain text body of an email message is not occupying a domain name slot."
| Reporter | ||
Comment 5•20 years ago
(In reply to comment #1)
> Which seems to be easy enough to solve: The anti-spam software should do the
> same set of transformations before looking up or storing the URL/domain in an RBL.

Yes. I'm sure that anti-spam software will have to change to take this behavior into account. I have already made changes in the software I work on. However, it doesn't change the fact that this is still bad behavior. Other software that analyzes HTML email or web content could be affected. It's best to fix the problem at the source.

As I mentioned in my comments on bug 279099, my corpus of spam does not show this to be a problem yet, which I'm sure is because these URLs do not resolve correctly in IE. Now is the time to fix this, before spammers start utilizing this obfuscation technique.
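As a rough illustration of the filter-side change discussed in this comment, the following Python sketch canonicalizes a host name before a blocklist lookup. It relies on Python's built-in `idna` codec, which implements RFC 3490 and treats all four separator code points as dots. The `BLOCKLIST` contents and the function names are hypothetical, and this is not the reporter's actual software.

```python
# Hypothetical blocklist, stored in lowercase ACE form.
BLOCKLIST = {"spam.example"}


def ace_form(host: str) -> str:
    """Return the lowercase ASCII-compatible (ACE) form of a host name.

    The "idna" codec splits on U+002E, U+3002, U+FF0E and U+FF61 and
    joins the converted labels with U+002E.
    """
    return host.encode("idna").decode("ascii").lower()


def is_listed(host: str, blocklist=BLOCKLIST) -> bool:
    """Check a host against the blocklist after canonicalizing it."""
    return ace_form(host) in blocklist


# An obfuscated separator no longer hides the listed domain.
print(is_listed("spam\u3002example"))
print(is_listed("ham.example"))
```

Storing and comparing only the ACE form means the filter and the browser agree on which host a link points to, whatever separator the message used.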
Comment 6•20 years ago
So what says that the href and src attributes in HTML are not allowed to be IDN-aware? I think they should be, and that this bug is invalid.

Also, resummarizing the bug since the issue you're harping on is not actually the problem that you're complaining about.
Summary: malformed domain label limiters interpreted incorrectly (homographs for '.' in IDNs) → href and src attributes in HTML should not be treated as IDN-aware
Comment 7•20 years ago
RFC 3490 section 2 says:

An "IDN-unaware domain name slot" is defined in this document to be any domain name slot that is not an IDN-aware domain name slot. Obviously, this includes any domain name slot whose specification predates IDNA.

HTML 4.01 is from 24 Dec 1999, RFC 3490 is from Mar 2003.

Mozilla could comply with RFC 3490 in both standard and quirks modes, in standard mode only, or in neither mode.
Comment 8•20 years ago
So if I declare that we implement HTML 4.01.1, which is the standard published today that is a copy of HTML 4.01, then we're OK?
Comment 9•20 years ago
:-) Seriously, RFC 3490 is at the Proposed Standard Maturity Level. Maybe this area could be cleared up for the next Maturity Level, Draft Standard.
Comment 10•20 years ago
This issue was actually discussed on the idn mailing list:

http://www.imc.org/idn/mail-archive/msg06179.html
http://www.imc.org/idn/entire-arch.txt

If you think about it from the point of view of an HTML author, you don't want to put a non-Punycode IDN name in your HTML because many people are still using non-IDNA-capable browsers, so they don't know how to convert the Unicode domain name to one with Punycode labels. This may be one of the reasons why RFC 3490 is written the way it is (i.e. requiring Punycode in IDNA-unaware domain name slots).
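For readers unfamiliar with the conversion an author would have to perform, here is a small Python sketch using the standard `idna` codec (an RFC 3490 implementation). The host name `bücher.example` is an illustrative assumption, not a name taken from this bug.

```python
# A Unicode host name as an author might naturally write it.
unicode_name = "b\u00fccher.example"

# ToASCII on each label: the form that non-IDNA-capable browsers can resolve.
ace_name = unicode_name.encode("idna").decode("ascii")

# ToUnicode on each label: the form an IDNA-capable browser can display.
round_trip = ace_name.encode("ascii").decode("idna")

print(ace_name)
print(round_trip)
```

The ASCII form works in both old and new browsers, which is the backward-compatibility argument made in this comment; the Unicode form is readable but depends on the reader's browser doing the conversion.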
Comment 11•20 years ago
(The deeper point behind what I was saying, however, is that HTML itself isn't versioned, and we shouldn't treat it like it is.)
Comment 12•20 years ago
HTML5 will be (is) IDN-aware. I hereby announce that we are an HTML5 UA. Problem solved. We should (and do, as I understand it) support IDN in HTML.
Comment 13•20 years ago
The question is whether to allow non-Punycode IDNs in HTML. How about looking at some pros and cons of being permissive (i.e. allowing non-Punycode IDNs in HTML)?

Pros:
P1. Consistent with other implementations of IDNA (e.g. Opera, i-Nav).
P2. Allows both Punycode and non-Punycode IDNs in HTML to work.

Cons:
C1. Some spam/phishing filters may not catch non-Punycode IDNs, allowing tricky domain name encodings to pass through.
C2. HTML authors that only test their documents in permissive implementations will not notice that their documents don't work in older or strict implementations.

Of course, some of these pros and cons are more important than others. I have not attempted to weigh them.

A beta of MSIE7 may be released this summer, possibly with IDNA support. That may be another datapoint to consider (or not).
Comment 14•20 years ago
Um, you missed one major Pro, which outweighs everything else. It allows people who don't use US ASCII domains to actually write their domain names in their documents. There really is no argument to have here. HTML is IDN-aware.
Comment 15•20 years ago
True, that is another pro. So obvious, yet somehow I forgot to include it. Whether people will actually write their non-ASCII domain names in a non-Punycode encoding in HTML before many of their users are using IDNA implementations is another matter. Maybe it doesn't matter to Mozilla.
Comment 16•20 years ago
That's a chicken-and-egg problem, Erik. If no browsers allow it, people won't use it. If all browsers but Mozilla allow it, people will likely use it (and be broken in Mozilla) or not use it (and curse Mozilla for keeping them from using it). If some browsers allow it and some don't, people will be likely to recommend browsers that _do_ allow it, so they can use it. So from this point of view, not implementing something like this is only ok if other browsers also won't implement it or if implementing it causes serious problems in some way (security issues, large engineering time investment, whatever). Otherwise we're just making people feel that Mozilla is broken and breaking their content, for no good reason.
Comment 17•20 years ago
...and other browsers already implement this. As do we. This is really not an issue IMHO. :-) WONTFIX/INVALID?
Whiteboard: WONTFIX/INVALID?
Comment 18•20 years ago
Yes, I'd say this is a WONTFIX as originally stated. We should allow people (who are using an appropriate charset for their documents) to write IDN domains in them in the form in which they are supposed to be typed/written.

Gerv
Comment 19•20 years ago
It may even be an INVALID. I can't seem to find anything in RFC 3490 that states that a browser must not accept non-Punycode IDNs. It does state that non-Punycode IDNs should not be placed in IDN-unaware domain name slots in section 3.1 requirement 2. However, section 6.1 almost seems to contradict this, since it does not reiterate the "whose specification predates IDNA".
| Reporter | ||
Comment 20•20 years ago
(In reply to comment #19)
> It may even be an INVALID. I can't seem to find anything in RFC 3490 that states
> that a browser must not accept non-Punycode IDNs. It does state that non-Punycode
> IDNs should not be placed in IDN-unaware domain name slots in section 3.1
> requirement 2. However, section 6.1 almost seems to contradict this, since it
> does not reiterate the "whose specification predates IDNA".

From RFC 3490:

> Whenever a domain name is put into an IDN-unaware domain name slot (see section 2), it MUST
> contain only ASCII characters.

[snip]

> An "IDN-unaware domain name slot" is defined in this document to be any domain name slot that is
> not an IDN-aware domain name slot. Obviously, this includes any domain name slot whose
> specification predates IDNA.

Therefore, for example, as the src attribute in an img tag predates IDNA, it is an IDN-unaware domain name slot and should only accept ASCII text.
Comment 21•20 years ago
Regarding comment 20, see comment 12.
Status: UNCONFIRMED → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID
| Reporter | ||
Comment 22•20 years ago
(In reply to comment #21)
> Regarding comment 20, see comment 12.

(Reviews earlier comment....) Good point. Ok, problem solved.
Comment 23•20 years ago
(In reply to comment #8)
> So if I declare that we implement HTML 4.01.1, which is the standard published
> today that is a copy of HTML 4.01, then we're OK?

RFC-3490 says that if the spec for a slot predates IDNA then the slot is IDN-unaware (if P then Q). You seem to be concluding that if the spec for a slot postdates IDNA then the slot is IDN-aware (if not-P then not-Q), which simply does not follow.

RFC-3490 says 'An "IDN-aware domain name slot" is defined in this document to be a domain name slot explicitly designated for carrying an internationalized domain name as defined in this document.' The latest HTML specs (HTML 4.01 and XHTML 1.1) both say that the href attribute (and various others) contains a URI, and they cite RFC-2396, which defines the host field of URIs and does not explicitly designate it for carrying IDNs (not surprising, since 2396 < 3490). Before RFC-3986 appeared, it was very clear that the host field of a URI in an href attribute was an IDN-unaware slot, and therefore non-ASCII IDNs must not be put there.

RFC-3986 obsoletes RFC-2396, and invites all sorts of names into the host field, including non-domain-names. The spec "delegates the issue of registered name syntax conformance to the operating system of each application performing URI resolution, and that operating system decides what it will allow for the purpose of host identification". At one point it mentions the possibility that a reg-name could be an IDN and cites RFC-3490. This is not quite the "explicit designation" I had in mind, but I suppose it suffices to make URIs now IDN-aware. (Notice that this non-backward-compatible retroactive redefinition of URIs deliberately bypasses the backward-compatibility protection designed into IDNA, so that use of the backward-compatible representation is merely recommended rather than required.)

However, even though URIs are now IDN-aware, they are still ASCII-only! RFC-3986 requires that all non-ASCII characters in all parts of a URI be represented as percent-escaped UTF-8! So they still aren't human-readable at all.

If you want human-readable non-ASCII IDNs, you need IRIs (RFC-3987). RFC-3987, unlike RFC-3986, does not obsolete or update the concept of URIs, and does not retroactively alter any specs that refer to URIs. It explicitly states that IRIs are not to be inserted into places that expect URIs, unless the spec explicitly invites IRIs as well. The HTML specs don't (at least not yet).

However, even while the HTML and IDNA specs say that certain slots are only allowed to contain certain things, and that writers must not put anything else there, they don't say that readers must reject stuff that should never have been written. As far as I can tell, that's left to the discretion of the implementation. I think it's okay for a browser to accept IRIs wherever it accepts URIs. It would not be okay for an HTML editor to copy a non-ASCII IRI verbatim from the user interface into the HTML file; it must convert the IRI to a URI (using percent-escaped UTF-8 as described in RFC-3987).
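The conversion an HTML editor would need to perform, as described in this comment, can be sketched in Python roughly as follows. This is a simplified assumption-laden example: the helper name `iri_to_uri` and the sample IRI are hypothetical, the input is assumed to be an already well-formed IRI, and the full RFC-3987 mapping has additional rules that are not handled here.

```python
from urllib.parse import quote

# Characters with reserved meaning in a URI, plus '%' so that existing
# escapes are not encoded a second time. Unreserved ASCII characters
# (letters, digits, '-', '.', '_', '~') are always left alone by quote().
_KEEP = ":/?#[]@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Percent-escape the UTF-8 bytes of every non-ASCII character."""
    return quote(iri, safe=_KEEP)


# Hypothetical IRI with non-ASCII characters in both host and path.
print(iri_to_uri("http://b\u00fccher.example/stra\u00dfe"))
```

The output contains only ASCII, so it can be written into a slot that expects a URI, at the cost of the readability the comment points out.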
Comment 24•20 years ago
(In reply to comment #19)
> I can't seem to find anything in RFC 3490 that states
> that a browser must not accept non-Punycode IDNs.

Me neither. :)

> It does state that non-Punycode
> IDNs should not be placed in IDN-unaware domain name slots in section 3.1
> requirement 2. However, section 6.1 almost seems to contradict this, since
> it does not reiterate the "whose specification predates IDNA".

Specifically what part of 6.1 do you think almost contradicts 3.1 requirement 2?
Comment 25•20 years ago
The 4th and 5th paragraphs of 6.1 almost contradict 3.1 requirement 2. When I first read that, it was not immediately clear to me whether it was talking about the encoding of the characters that would be visible after ToUnicode or the encoding of the characters that appear after ToASCII. Also, these paragraphs do not reiterate the "predates IDNA" thing, making it even more likely to be misunderstood. You may say that I misunderstand, and that may be true, but I still feel that this part of the spec could be clearer.