The default bug view has changed. See this FAQ.

should unescape hostname first, then perform IDNA

RESOLVED FIXED in Firefox 45

Status

()

Core
Networking
--
critical
RESOLVED FIXED
9 years ago
a year ago

People

(Reporter: Erik van der Poel, Assigned: valentin)

Tracking

(Blocks: 1 bug, {dev-doc-needed})

unspecified
mozilla45
dev-doc-needed
Points:
---

Firefox Tracking Flags

(firefox45 fixed, status2.0 wanted)

Details

Attachments

(2 attachments, 4 obsolete attachments)

(Reporter)

Description

9 years ago
Host names in URIs (and IRIs) may have %-escaped octets in them. Firefox (and
other Mozilla software) should unescape the host name before applying the IDNA
(Internationalized Domain Names in Applications) process. For example:

<a href="http://&#x5341;%2ecom/">blah</a>

The above causes Firefox 2 to display an error message containing the string
xn--%2ecom-9b5j. Other browsers like MSIE 7 and Opera 9 will convert the above
domain name into xn--kkr.com and will succeed.

The steps should be:

1. convert incoming HTML to Unicode
2. resolve NCRs like &#x5341;
3. parse URI/IRI to get host name
4. unescape host name (%2e becomes .)
5. perform IDNA's ToASCII
6. perform DNS lookup
7. perform HTTP request with Host: header
(Reporter)

Comment 1

9 years ago
If step 4 yields invalid UTF-8 in some labels (label = between dots), skip step
5 or perform step 5 only on the valid UTF-8 labels.
(Reporter)

Comment 2

9 years ago
On 2nd thought, if step 4 yields malformed UTF-8, abort the lookup, and produce
an error message (or something). Firefox currently already returns an error if
there are any %-escapes in the host name, so aborting on malformed UTF-8 would
not be a step backwards.
Created attachment 324616 [details] [diff] [review]
fix + unit test

Host name is unescaped just before NormalizeIDN() is called.
Assignee: nobody → michal
Status: NEW → ASSIGNED
Attachment #324616 - Flags: review?(cbiesinger)
(Reporter)

Comment 4

9 years ago
Thanks for working on this. With this patch, what happens when the host name contains an escaped slash (i.e. %2F)?
Sequence %2f in the hostname will be unescaped and will be part of the hostname. F.e. URL "http://a%2fb/c" is parsed as follows (see mHost):

(gdb) p *url
$4 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = {
          _vptr.nsISupports = 0x235ee88}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> = {
        _vptr.nsISupports = 0x235ef98}, <No data fields>}, <No data fields>}, <nsISerializable> = {<nsISupports> = {
      _vptr.nsISupports = 0x235efb8}, <No data fields>}, <nsIClassInfo> = {<nsISupports> = {_vptr.nsISupports = 0x235efd4}, <No data fields>}, mRefCnt = {
    mValue = 1}, _mOwningThread = {mThread = 0x9bd5550}, 
  mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, 
      mData = 0x9cf7478 "http://a/b/c", mLength = 12, 
      mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort = -1, 
  mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 3}, 
  mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, 
  mHost = {mPos = 7, mLen = 3}, mPath = {mPos = 10, mLen = 2}, mFilepath = {
    mPos = 10, mLen = 2}, mDirectory = {mPos = 10, mLen = 1}, mBasename = {
    mPos = 11, mLen = 1}, mExtension = {mPos = 13, mLen = -1}, mParam = {
    mPos = 12, mLen = -1}, mQuery = {mPos = 12, mLen = -1}, mRef = {mPos = 12, 
    mLen = -1}, 
  mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x23269a2 "", mLength = 0, mFlags = 1}, <No data fields>}, 
  mParser = {mRawPtr = 0x9cc1eb8}, mFile = {mRawPtr = 0x0}, mHostA = 0x0, 
  mHostEncoding = 1, mSpecEncoding = 0, mURLType = 2, mMutable = 1, 
  mSupportsFileURL = 0, static gIDN = 0x9c9d1e8, static gCharsetMgr = 0x0, 
  static gInitialized = 1, static gEscapeUTF8 = 1, 
  static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0}


Connecting to such host fails at net_IsValidHostName(). But something like this is possible:

js> var newURI2 = ios.newURI("http://example.com%2fxxx/", null, null);
js> var newURI3 = ios.newURI(newURI2.spec, null, null);
js> newURI2.host
example.com/xxx
js> newURI3.host
example.com

Is this behaviour correct? Maybe there should be check if '/' is present after hostname is unescaped.
(Reporter)

Comment 6

9 years ago
First of all, when you re-insert a host name back into a URL, you must first escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; is questionable in the http: case. (MSIE does not treat ; as a host/path terminator in http: nor ftp:.) However, see below.

Secondly, I suppose there are a couple of different places to check for bad characters in the host name. One might be just before a DNS lookup, but if an HTTP request is going through a proxy, the host is not looked up in DNS. So, checking for bad characters after unescaping and running it through IDNA might be good.
Created attachment 324833 [details] [diff] [review]
new patch, validity of hostname is checked 

(In reply to comment #6)
> First of all, when you re-insert a host name back into a URL, you must first
> escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ;
> is questionable in the http: case. (MSIE does not treat ; as a host/path
> terminator in http: nor ftp:.) However, see below.
> 
> Secondly, I suppose there are a couple of different places to check for bad
> characters in the host name. One might be just before a DNS lookup, but if an
> HTTP request is going through a proxy, the host is not looked up in DNS. So,
> checking for bad characters after unescaping and running it through IDNA might
> be good.

I added the check after unescaping and IDNA. No character that is allowed in hostname by net_IsValidHostName() needs escaping, so there is no nsEscapeCount()...
Attachment #324616 - Attachment is obsolete: true
Attachment #324833 - Flags: review?(cbiesinger)
Attachment #324616 - Flags: review?(cbiesinger)
(Reporter)

Comment 8

9 years ago
Looks good. I recommend adding a couple of tests:
http://%e5%8d%81.com -> http://xn--kkr.com
http://%80.com -> error (invalid UTF-8)
Created attachment 325078 [details] [diff] [review]
added some tests

(In reply to comment #8)
> Looks good. I recommend adding a couple of tests:
> http://%e5%8d%81.com -> http://xn--kkr.com
> http://%80.com -> error (invalid UTF-8)

Added. Erik, can you do the review?
Attachment #324833 - Attachment is obsolete: true
Attachment #325078 - Flags: review?(cbiesinger)
Attachment #324833 - Flags: review?(cbiesinger)
(Reporter)

Comment 10

9 years ago
I recommend adding a comment to each unit test. For example:

// Make sure escaped dot (%2e) is unescaped before applying IDNA

I also recommend adding unit tests for all of the host terminators :/;?#
I.e. %3A %2F %3B %3F %23

The code looks good, but I think we should ask a regular Mozilla reviewer to take a look.
Should we really be unescaping %-escaped octets in URIs? In particular, the "." part in a URI must be literally given, according to the RFC, and the label part must be a set of "alphanum"s, not "escaped"s. This seems dangerous.
Nevermind, I see the new RFCs allow it.
The new RFCs don't treat a semicolon as a host delimiter, by the way -- the ; character is allowed in hostnames.

How should %-encoded bytes be treated? IE7 seems to just send the raw bytes to the DNS resolution layer as far as Erik's testing seems to show. Opera decodes the %-encoded bytes as UTF-8 and reencodes them as IDN before doing the query. Nobody else seems to do anything useful. What should we do?
(Reporter)

Comment 14

9 years ago
The last paragraph of section 3.2.2 in the following RFC seems to suggest that Opera's interpretation is fine:

http://www.ietf.org/rfc/rfc3986.txt

I find it somewhat strange to send %-escaped text in DNS, since %-escapes are for URIs, not DNS.

I think it's fine to stop treating the semicolon as a host, port or path terminator, especially in http: and https:. However, for ftp: it may still be useful (though MSIE doesn't support it).
Created attachment 325254 [details] [diff] [review]
Added some tests and comments to unittest
Attachment #325078 - Attachment is obsolete: true
Attachment #325254 - Flags: review?(cbiesinger)
Attachment #325078 - Flags: review?(cbiesinger)
(Reporter)

Comment 16

9 years ago
> there should be only allowed characters in hostname after
> unescaping and applying IDNA

How about "after unescaping and attempting to apply IDNA. "\x80" is illegal in UTF-8, so IDNA fails, and 0x80 is illegal in DNS too."

Have you tried some actual hrefs in real HTML files? (I'm not familiar with the rest of the Mozilla code, so I'd like to be sure that real HTML files work properly, in addition to the JavaScript tests that you have written.)
Created attachment 325403 [details] [diff] [review]
minor change in comment

(In reply to comment #16)
> Have you tried some actual hrefs in real HTML files? (I'm not familiar with the
> rest of the Mozilla code, so I'd like to be sure that real HTML files work
> properly, in addition to the JavaScript tests that you have written.)

Yes, of course I did also some tests with URLs in HTML files and it worked correctly.
Attachment #325254 - Attachment is obsolete: true
Attachment #325403 - Flags: review?(cbiesinger)
Attachment #325254 - Flags: review?(cbiesinger)
(Reporter)

Comment 18

9 years ago
Looks good. Thanks!

By the way, MSIE 6 and 7 reject host names with %-escapes in the range %00-1F, and %2F, %40 and %5C. MSIE 6 rejects %25 under some circumstances and MSIE 7 rejects %3F. Firefox should probably be at least as careful as MSIE, if not much more careful.
looks like this is a duplicate of bug 309671, although now that there's a patch here it probably shouldn't be marked as such
Comment on attachment 325403 [details] [diff] [review]
minor change in comment

Don't you also have to change SetHost?


(Hm, if you're adding the hostname validity check here we might be able to remove it from the other places where we currently have it)
Attachment #325403 - Flags: review?(cbiesinger) → review-

Updated

7 years ago
status2.0: --- → wanted

Comment 21

6 years ago
Hello. I don't know if the issue I'm about to present should be a separate issue.

When an e-mail message is displayed in Thunderbird, the text body might contain a URI with an international domain name, such as
http://www.henriksørensen.dk/

I hope that your program will display the above URI with a LATIN SMALL LETTER O WITH STROKE. I am not fully confident with how text is displayed in different programs when sent from Bugzilla to you.

When I click on that URI in Thunderbird, it is sent to Firefox. Firefox tries to display the page at that URI but Firefox says Server not found.

Firefox tries to resolve the domain name. Here is output from the plugin HttpFox:
Host	www.henriks%c3%b8rensen.dk

The error code is NS_ERROR_UNKNOWN_HOST.

Here is output from HttpFox when I manually type the same URL in Firefox:
Host	www.xn--henriksrensen-hnb.dk

In the latter case the international domain name is correctly interpreted and handled by Firefox.

Shouldn't Firefox treat a URI equally whether the URI is sent from Thunderbird  or the URI is typed manually?

Then I changed which browser should handle HTTP URIs. I temporarily chose Safari. When I click on the URI in Thunderbird, the URI is sent to Safari. The correct host is found right away. I use Thunderbird and Firefox for Mac OS X.
(Assignee)

Comment 22

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba2bb2369ba4
(Assignee)

Comment 23

a year ago
Created attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA
(Assignee)

Updated

a year ago
Assignee: michal.novotny → valentin.gosu
(Assignee)

Comment 24

a year ago
Comment on attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA

This patch is based on code that has been r+d by Honza in Bug 1023468, but enough things have changed that it warrants a new review.
Changes:
* We do percent decoding of hostnames in BuildNormalizedSpec and SetHost.
* NormalizeIDN returns an nsresult. We abort the parsing if the call fails.
* NS_UnescapeURL is modified, so when called with the esc_Host flag, it only unescapes ASCII characters that are allowed in the hostname, and non-ascii characters. This is mostly so that hostnames containing % are not rejected, since they are part of the ID of some addons.
* I also change NS_UnescapeURL to only deref *(p+1) and *(p+2) only once. It's clearer this way.
Attachment #8686406 - Flags: review?(mcmanus)
(Assignee)

Comment 25

a year ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c6a52809582
(Assignee)

Updated

a year ago
Duplicate of this bug: 1023468
Attachment #8686406 - Flags: review?(mcmanus) → review+
(Assignee)

Comment 27

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc4f5b4bb374b95c36c818e666c2a087fe6f20ca
Bug 412457 - should unescape hostname first, then perform IDNA r=mcmanus

Comment 28

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/cc4f5b4bb374
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox45: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
(Assignee)

Comment 29

a year ago
This needs a dev-doc as it changes the way URL are parsed.

Previously percent encoded characters in the hostname wouldn't be decoded, and clicking on http://%65%78%61%6d%70%6c%65%2e%6f%72%67 would fail as the domain %65%78%61%6d%70%6c%65%2e%6f%72%67 didn't exist.
Now however, the hostname is percent decoded, thus correctly loading http://example.org
Note that for http://example.org%3a %3a wouldn't be decoded, as it's a character that isn't allowed in the hostname (the colon).
Blocks: 906714
Keywords: dev-doc-needed
(Assignee)

Updated

a year ago
Duplicate of this bug: 309671
(Assignee)

Updated

a year ago
Duplicate of this bug: 43659
You need to log in before you can comment on or make changes to this bug.