Last Comment Bug 412457 - should unescape hostname first, then perform IDNA
: should unescape hostname first, then perform IDNA
Status: RESOLVED FIXED
: dev-doc-needed
Product: Core
Classification: Components
Component: Networking (show other bugs)
: unspecified
: All All
: -- critical with 1 vote (vote)
: mozilla45
Assigned To: Valentin Gosu [:valentin] (vacation until 1 Sept)
:
Mentors:
: 43659 309671 1023468 (view as bug list)
Depends on:
Blocks: url
  Show dependency treegraph
 
Reported: 2008-01-15 09:38 PST by Erik van der Poel
Modified: 2016-01-19 11:00 PST (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
wanted


Attachments
fix + unit test (2.65 KB, patch)
2008-06-11 05:40 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
new patch, validity of hostname is checked (2.93 KB, patch)
2008-06-12 11:20 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
added some tests (3.14 KB, patch)
2008-06-14 02:00 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
Added some tests and comments to unittest (4.71 KB, patch)
2008-06-16 06:08 PDT, Michal Novotny (:michal)
no flags Details | Diff | Splinter Review
minor change in comment (4.80 KB, patch)
2008-06-17 06:11 PDT, Michal Novotny (:michal)
cbiesinger: review-
Details | Diff | Splinter Review
should unescape hostname first, then perform IDNA (12.16 KB, patch)
2015-11-11 18:51 PST, Valentin Gosu [:valentin] (vacation until 1 Sept)
mcmanus: review+
Details | Diff | Splinter Review

Description Erik van der Poel 2008-01-15 09:38:24 PST
Host names in URIs (and IRIs) may have %-escaped octets in them. Firefox (and
other Mozilla software) should unescape the host name before applying the IDNA
(Internationalized Domain Names in Applications) process. For example:

<a href="http://&#x5341;%2ecom/">blah</a>

The above causes Firefox 2 to display an error message containing the string
xn--%2ecom-9b5j. Other browsers like MSIE 7 and Opera 9 will convert the above
domain name into xn--kkr.com and will succeed.

The steps should be:

1. convert incoming HTML to Unicode
2. resolve NCRs like &#x5341;
3. parse URI/IRI to get host name
4. unescape host name (%2e becomes .)
5. perform IDNA's ToASCII
6. perform DNS lookup
7. perform HTTP request with Host: header
Comment 1 Erik van der Poel 2008-01-15 09:42:06 PST
If step 4 yields invalid UTF-8 in some labels (label = between dots), skip step
5 or perform step 5 only on the valid UTF-8 labels.
Comment 2 Erik van der Poel 2008-01-24 09:13:36 PST
On 2nd thought, if step 4 yields malformed UTF-8, abort the lookup, and produce
an error message (or something). Firefox currently already returns an error if
there are any %-escapes in the host name, so aborting on malformed UTF-8 would
not be a step backwards.
Comment 3 Michal Novotny (:michal) 2008-06-11 05:40:24 PDT
Created attachment 324616 [details] [diff] [review]
fix + unit test

Host name is unescaped just before NormalizeIDN() is called.
Comment 4 Erik van der Poel 2008-06-11 09:31:04 PDT
Thanks for working on this. With this patch, what happens when the host name contains an escaped slash (i.e. %2F)?
Comment 5 Michal Novotny (:michal) 2008-06-11 16:40:33 PDT
Sequence %2f in the hostname will be unescaped and will be part of the hostname. F.e. URL "http://a%2fb/c" is parsed as follows (see mHost):

(gdb) p *url
$4 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = {
          _vptr.nsISupports = 0x235ee88}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> = {
        _vptr.nsISupports = 0x235ef98}, <No data fields>}, <No data fields>}, <nsISerializable> = {<nsISupports> = {
      _vptr.nsISupports = 0x235efb8}, <No data fields>}, <nsIClassInfo> = {<nsISupports> = {_vptr.nsISupports = 0x235efd4}, <No data fields>}, mRefCnt = {
    mValue = 1}, _mOwningThread = {mThread = 0x9bd5550}, 
  mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, 
      mData = 0x9cf7478 "http://a/b/c", mLength = 12, 
      mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort = -1, 
  mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 3}, 
  mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, 
  mHost = {mPos = 7, mLen = 3}, mPath = {mPos = 10, mLen = 2}, mFilepath = {
    mPos = 10, mLen = 2}, mDirectory = {mPos = 10, mLen = 1}, mBasename = {
    mPos = 11, mLen = 1}, mExtension = {mPos = 13, mLen = -1}, mParam = {
    mPos = 12, mLen = -1}, mQuery = {mPos = 12, mLen = -1}, mRef = {mPos = 12, 
    mLen = -1}, 
  mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x23269a2 "", mLength = 0, mFlags = 1}, <No data fields>}, 
  mParser = {mRawPtr = 0x9cc1eb8}, mFile = {mRawPtr = 0x0}, mHostA = 0x0, 
  mHostEncoding = 1, mSpecEncoding = 0, mURLType = 2, mMutable = 1, 
  mSupportsFileURL = 0, static gIDN = 0x9c9d1e8, static gCharsetMgr = 0x0, 
  static gInitialized = 1, static gEscapeUTF8 = 1, 
  static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0}


Connecting to such host fails at net_IsValidHostName(). But something like this is possible:

js> var newURI2 = ios.newURI("http://example.com%2fxxx/", null, null);
js> var newURI3 = ios.newURI(newURI2.spec, null, null);
js> newURI2.host
example.com/xxx
js> newURI3.host
example.com

Is this behaviour correct? Maybe there should be check if '/' is present after hostname is unescaped.
Comment 6 Erik van der Poel 2008-06-12 08:29:41 PDT
First of all, when you re-insert a host name back into a URL, you must first escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; is questionable in the http: case. (MSIE does not treat ; as a host/path terminator in http: nor ftp:.) However, see below.

Secondly, I suppose there are a couple of different places to check for bad characters in the host name. One might be just before a DNS lookup, but if an HTTP request is going through a proxy, the host is not looked up in DNS. So, checking for bad characters after unescaping and running it through IDNA might be good.
Comment 7 Michal Novotny (:michal) 2008-06-12 11:20:54 PDT
Created attachment 324833 [details] [diff] [review]
new patch, validity of hostname is checked 

(In reply to comment #6)
> First of all, when you re-insert a host name back into a URL, you must first
> escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ;
> is questionable in the http: case. (MSIE does not treat ; as a host/path
> terminator in http: nor ftp:.) However, see below.
> 
> Secondly, I suppose there are a couple of different places to check for bad
> characters in the host name. One might be just before a DNS lookup, but if an
> HTTP request is going through a proxy, the host is not looked up in DNS. So,
> checking for bad characters after unescaping and running it through IDNA might
> be good.

I added the check after unescaping and IDNA. No character that is allowed in hostname by net_IsValidHostName() needs escaping, so there is no nsEscapeCount()...
Comment 8 Erik van der Poel 2008-06-12 18:33:24 PDT
Looks good. I recommend adding a couple of tests:
http://%e5%8d%81.com -> http://xn--kkr.com
http://%80.com -> error (invalid UTF-8)
Comment 9 Michal Novotny (:michal) 2008-06-14 02:00:37 PDT
Created attachment 325078 [details] [diff] [review]
added some tests

(In reply to comment #8)
> Looks good. I recommend adding a couple of tests:
> http://%e5%8d%81.com -> http://xn--kkr.com
> http://%80.com -> error (invalid UTF-8)

Added. Erik, can you do the review?
Comment 10 Erik van der Poel 2008-06-14 10:06:14 PDT
I recommend adding a comment to each unit test. For example:

// Make sure escaped dot (%2e) is unescaped before applying IDNA

I also recommend adding unit tests for all of the host terminators :/;?#
I.e. %3A %2F %3B %3F %23

The code looks good, but I think we should ask a regular Mozilla reviewer to take a look.
Comment 11 Hixie (not reading bugmail) 2008-06-14 13:11:13 PDT
Should we really be unescaping %-escaped octets in URIs? In particular, the "." part in a URI must be literally given, according to the RFC, and the label part must be a set of "alphanum"s, not "escaped"s. This seems dangerous.
Comment 12 Hixie (not reading bugmail) 2008-06-14 13:13:12 PDT
Nevermind, I see the new RFCs allow it.
Comment 13 Hixie (not reading bugmail) 2008-06-14 13:19:42 PDT
The new RFCs don't treat a semicolon as a host delimiter, by the way -- the ; character is allowed in hostnames.

How should %-encoded bytes be treated? IE7 seems to just send the raw bytes to the DNS resolution layer as far as Erik's testing seems to show. Opera decodes the %-encoded bytes as UTF-8 and reencodes them as IDN before doing the query. Nobody else seems to do anything useful. What should we do?
Comment 14 Erik van der Poel 2008-06-14 16:43:27 PDT
The last paragraph of section 3.2.2 in the following RFC seems to suggest that Opera's interpretation is fine:

http://www.ietf.org/rfc/rfc3986.txt

I find it somewhat strange to send %-escaped text in DNS, since %-escapes are for URIs, not DNS.

I think it's fine to stop treating the semicolon as a host, port or path terminator, especially in http: and https:. However, for ftp: it may still be useful (though MSIE doesn't support it).
Comment 15 Michal Novotny (:michal) 2008-06-16 06:08:20 PDT
Created attachment 325254 [details] [diff] [review]
Added some tests and comments to unittest
Comment 16 Erik van der Poel 2008-06-16 08:19:27 PDT
> there should be only allowed characters in hostname after
> unescaping and applying IDNA

How about "after unescaping and attempting to apply IDNA. "\x80" is illegal in UTF-8, so IDNA fails, and 0x80 is illegal in DNS too."

Have you tried some actual hrefs in real HTML files? (I'm not familiar with the rest of the Mozilla code, so I'd like to be sure that real HTML files work properly, in addition to the JavaScript tests that you have written.)
Comment 17 Michal Novotny (:michal) 2008-06-17 06:11:05 PDT
Created attachment 325403 [details] [diff] [review]
minor change in comment

(In reply to comment #16)
> Have you tried some actual hrefs in real HTML files? (I'm not familiar with the
> rest of the Mozilla code, so I'd like to be sure that real HTML files work
> properly, in addition to the JavaScript tests that you have written.)

Yes, of course I did also some tests with URLs in HTML files and it worked correctly.
Comment 18 Erik van der Poel 2008-06-17 10:38:59 PDT
Looks good. Thanks!

By the way, MSIE 6 and 7 reject host names with %-escapes in the range %00-1F, and %2F, %40 and %5C. MSIE 6 rejects %25 under some circumstances and MSIE 7 rejects %3F. Firefox should probably be at least as careful as MSIE, if not much more careful.
Comment 19 Christian :Biesinger (don't email me, ping me on IRC) 2008-07-25 13:45:32 PDT
looks like this is a duplicate of bug 309671, although now that there's a patch here it probably shouldn't be marked as such
Comment 20 Christian :Biesinger (don't email me, ping me on IRC) 2008-07-25 14:27:09 PDT
Comment on attachment 325403 [details] [diff] [review]
minor change in comment

Don't you also have to change SetHost?


(Hm, if you're adding the hostname validity check here we might be able to remove it from the other places where we currently have it)
Comment 21 Gunnar Vestergaard 2011-07-25 04:18:59 PDT
Hello. I don't know if the issue I'm about to present should be a separate issue.

When an e-mail message is displayed in Thunderbird, the text body might contain a URI with an international domain name, such as
http://www.henriksørensen.dk/

I hope that your program will display the above URI with a LATIN SMALL LETTER O WITH STROKE. I am not fully confident with how text is displayed in different programs when sent from Bugzilla to you.

When I click on that URI in Thunderbird, it is sent to Firefox. Firefox tries to display the page at that URI but Firefox says Server not found.

Firefox tries to resolve the domain name. Here is output from the plugin HttpFox:
Host	www.henriks%c3%b8rensen.dk

The error code is NS_ERROR_UNKNOWN_HOST.

Here is output from HttpFox when I manually type the same URL in Firefox:
Host	www.xn--henriksrensen-hnb.dk

In the latter case the international domain name is correctly interpreted and handled by Firefox.

Shouldn't Firefox treat a URI equally whether the URI is sent from Thunderbird  or the URI is typed manually?

Then I changed which browser should handle HTTP URIs. I temporarily chose Safari. When I click on the URI in Thunderbird, the URI is sent to Safari. The correct host is found right away. I use Thunderbird and Firefox for Mac OS X.
Comment 22 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-10 05:59:17 PST
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba2bb2369ba4
Comment 23 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-11 18:51:57 PST
Created attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA
Comment 24 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-11 19:09:59 PST
Comment on attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA

This patch is based on code that has been r+d by Honza in Bug 1023468, but enough things have changed that it warrants a new review.
Changes:
* We do percent decoding of hostnames in BuildNormalizedSpec and SetHost.
* NormalizeIDN returns an nsresult. We abort the parsing if the call fails.
* NS_UnescapeURL is modified, so when called with the esc_Host flag, it only unescapes ASCII characters that are allowed in the hostname, and non-ascii characters. This is mostly so that hostnames containing % are not rejected, since they are part of the ID of some addons.
* I also change NS_UnescapeURL to only deref *(p+1) and *(p+2) only once. It's clearer this way.
Comment 25 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-11 19:10:59 PST
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c6a52809582
Comment 26 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-12 00:13:52 PST
*** Bug 1023468 has been marked as a duplicate of this bug. ***
Comment 27 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-11-18 06:29:13 PST
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc4f5b4bb374b95c36c818e666c2a087fe6f20ca
Bug 412457 - should unescape hostname first, then perform IDNA r=mcmanus
Comment 28 Carsten Book [:Tomcat] - PTO-back Sept 4th 2015-11-19 06:30:10 PST
https://hg.mozilla.org/mozilla-central/rev/cc4f5b4bb374
Comment 29 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-12-02 15:04:18 PST
This needs a dev-doc as it changes the way URL are parsed.

Previously percent encoded characters in the hostname wouldn't be decoded, and clicking on http://%65%78%61%6d%70%6c%65%2e%6f%72%67 would fail as the domain %65%78%61%6d%70%6c%65%2e%6f%72%67 didn't exist.
Now however, the hostname is percent decoded, thus correctly loading http://example.org
Note that for http://example.org%3a %3a wouldn't be decoded, as it's a character that isn't allowed in the hostname (the colon).
Comment 30 Valentin Gosu [:valentin] (vacation until 1 Sept) 2015-12-02 15:13:59 PST
*** Bug 309671 has been marked as a duplicate of this bug. ***
Comment 31 Valentin Gosu [:valentin] (vacation until 1 Sept) 2016-01-19 11:00:16 PST
*** Bug 43659 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.