Open Bug 1026938 Opened 10 years ago Updated 2 years ago

\ (backslash) should be encoded as %5C in the URL query

Categories

(Firefox :: Address Bar, defect, P3)

30 Branch
defect

Tracking

()

People

(Reporter: novamarco, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [necko-backlog])

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0 (Beta/Release)
Build ID: 20140605174243

Steps to reproduce:

we are developing a website with a search element.
When the user want to filter on a specific tag he clicks on a link.
A javascript will replace the location.href with the new filter and refresh the page.
Everythings works fine until the new filter contains a \.

The sample javascript is:

var url = location.href;
var fq = 'tag_dimension.Person:Elizabeth\\ Sherwood\\-Randall';
url += "&fq=" + encodeURIComponent(fq);
location.href = url;




Actual results:

suppose you have this url:

http://localhost:8080/cmlink/polopoly-post?q=us

REQUEST:

GET /cmlink/polopoly-post?q=us HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Cache-Control: max-age=0

RESPONSE:

HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Cache-Control: public, max-age=300, s-maxage=300
Expires: Wed, 18 Jun 2014 08:07:23 GMT
Content-Length: 14725
Server: Jetty(7.4.5.v20110725)


when you click on the new filter the url submitted is:

http://localhost:8080/cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth%5C%20Sherwood%5C-Randall

REQUEST:

GET /cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth%5C%20Sherwood%5C-Randall HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Referer: http://localhost:8080/cmlink/polopoly-post?q=us
Connection: keep-alive

RESPONSE:

HTTP/1.1 200 OK
Content-Type: text/html;charset=UTF-8
Cache-Control: public, max-age=300, s-maxage=300
Expires: Wed, 18 Jun 2014 08:08:05 GMT
Content-Length: 11248
Server: Jetty(7.4.5.v20110725)

firefox display the url like this:

http://localhost:8080/cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth\%20Sherwood\-Randall

When you reload the page the new url submitted is:

http://localhost:8080/cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth\%20Sherwood\-Randall

REQUEST:

GET /cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth\%20Sherwood\-Randall HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:30.0) Gecko/20100101 Firefox/30.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive

RESPONSE:

HTTP/1.1 500 java.net.URISyntaxException: Illegal character in query at index 83: http://localhost:8080/cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth\%20Sherwood\-Randall
Content-Type: text/html;charset=ISO-8859-1
Cache-Control: must-revalidate,no-cache,no-store
Content-Length: 8265
Server: Jetty(7.4.5.v20110725)

which is not the same as the original one.


Expected results:

the URL should not be decoded in the browser location url, it should display:

http://localhost:8080/cmlink/polopoly-post?q=us&fq=tag_dimension.Person%3AElizabeth%5C%20Sherwood%5C-Randall
Severity: normal → critical
Summary: %5C characters in url are replaced with \ in local url → %5C characters in url are replaced with \ in location url
QA Whiteboard: [DUPEME?]
Summary: %5C characters in url are replaced with \ in location url → %5C characters in url are replaced with \ in location bar
See Also: → 652186
Dao, could you take this?
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to Marco Nova from comment #0)
> HTTP/1.1 500 java.net.URISyntaxException: Illegal character in query at
> index 83:

This behavior isn't spec-compliant, is it? Seems like we should evangelize this as a server bug, or change our behavior as well as the URL spec...
Rank: 1
Product: Firefox → Core
I don't think this has much to do with bug 652186.

Also, I didn't mean to move this from Firefox to Core (yet).
Product: Core → Firefox
See Also: 652186
Rank: 1
As I see it, the URL spec already lists \ in the query as an invalid character. Looking at https://url.spec.whatwg.org/#query-state , and seeing how \ is not a URL code point... URLs with containing \ in the query, are actually invalid.

The problem seems to be not when refreshing the page with Ctrl-R, but when clicking the location bar and hitting Enter - it ends up loading the decoded URL.

As I see it, there are 2 possible solutions to this problem.
1. Make sure we don't decode \ (seems awkward, but should work).
2. If the contents of the url-bar haven't changed, reload the page, instead of loading the decoded URL (seems like a better long term solution).
We should really have a separate display URL and underlying URL. And never use the display URL to fetch something.
(In reply to Anne (:annevk) from comment #5)
> We should really have a separate display URL and underlying URL. And never
> use the display URL to fetch something.

That sounds like a good plan. Dão would have a better understanding of what's actually possible here, but how about this:

Show the decoded URL in the location bar, but as soon as the user clicks the bar, or maybe even on hover, replace the contents with the original URL. This should solve a lot of problems, including the ones with copying from the URL bar.
(In reply to Valentin Gosu [:valentin] from comment #4)
> As I see it, the URL spec already lists \ in the query as an invalid
> character. Looking at https://url.spec.whatwg.org/#query-state , and seeing
> how \ is not a URL code point... URLs with containing \ in the query, are
> actually invalid.

Do you have plans to make the URL parser comply with that? Creating a nsIURI from "http://x/?y= \" gives me "http://x/?y=%20\" rather than "http://x/?y=%20%5C".
Making the URL parser not accept that character would in no way fix this bug. We need to keep the url bar from turning valid URLs into invalid ones.
(In reply to Valentin Gosu [:valentin] from comment #8)
> Making the URL parser not accept that character would in no way fix this
> bug.

Why not?
(In reply to Dão Gottwald [:dao] from comment #9)
> (In reply to Valentin Gosu [:valentin] from comment #8)
> > Making the URL parser not accept that character would in no way fix this
> > bug.
> 
> Why not?

If we make the parser FAIL on invalid characters, the bug wouldn't be fixed, but instead the page wouldn't load at all when the user clicked the url-bar and hit enter.

If we made the parser forcefully encode the \ character, it would indeed get rid of THIS bug, but it wouldn't fix the underlying cause. The url-bar still decodes characters and turns valid characters into invalid ones (and sometimes it turns valid URLs into different valid URLs by doing percent decoding).
Also, having the parser encode the \ character is also something other browsers simply don't do. See bug 1152455 comment 6.
(In reply to Valentin Gosu [:valentin] from comment #10)
> If we made the parser forcefully encode the \ character, it would indeed get
> rid of THIS bug,

Yes, that's what I expected "treat as illegal" would mean. Just like with spaces.

> but it wouldn't fix the underlying cause. The url-bar still
> decodes characters and turns valid characters into invalid ones

Well, yes, so far we've been relying on the URL parser to normalize our illegal URLs. What you proposed in comment 6 isn't a great user experience when editing URLs, so we'd like to avoid that if at all possible.

> (and
> sometimes it turns valid URLs into different valid URLs by doing percent
> decoding).

That may indeed be a problem but it's kind of off-topic for this bug.

> Also, having the parser encode the \ character is also something other
> browsers simply don't do. See bug 1152455 comment 6.

So we don't comply with the spec and that's it? :( Is there some ongoing discussion as to whether the spec should be changed to treat \ as legal?
(In reply to Dão Gottwald [:dao] from comment #11)
> (In reply to Valentin Gosu [:valentin] from comment #10)
> > If we made the parser forcefully encode the \ character, it would indeed get
> > rid of THIS bug,
> 
> Yes, that's what I expected "treat as illegal" would mean. Just like with
> spaces.
> 
> > but it wouldn't fix the underlying cause. The url-bar still
> > decodes characters and turns valid characters into invalid ones
> 
> Well, yes, so far we've been relying on the URL parser to normalize our
> illegal URLs. What you proposed in comment 6 isn't a great user experience
> when editing URLs, so we'd like to avoid that if at all possible.
> 

The URL parser isn't meant to be used to correct invalid URLs, but to parse valid ones.
By automatically fixing invalid URLs we are only encouraging the spread of them on the internet, and it could lead to bad web-compat issues if other browsers start doing it too, but in slightly different ways.

Can you explain the reasoning behind the current user experience of the url-bar?
Do you have a list of the upsides and downsides the percent decoding of the URL is supposed to bring?

> > (and
> > sometimes it turns valid URLs into different valid URLs by doing percent
> > decoding).
> 
> That may indeed be a problem but it's kind of off-topic for this bug.

The following case is actually really bad:

Load URL: http://example.com/?x=%33&a=hello
I click on the URL bar and change hello to bye, and hit enter.
Now x is 3 instead of %33 - even though the user never touched it.

And it's the same bug as this one, only that the URL parser has no way of knowing what the user intended.
If we fix this case, than this bug no longer applies.

> 
> > Also, having the parser encode the \ character is also something other
> > browsers simply don't do. See bug 1152455 comment 6.
> 
> So we don't comply with the spec and that's it? :( Is there some ongoing
> discussion as to whether the spec should be changed to treat \ as legal?

I don't see doing percent encoding in the URL parser of invalid characters as "complying with the spec".
We are currently looking into using other URL parsing libraries as a backend to our implementation. We can't rely on the URL parser to fix such things for us, especially when the user doesn't do anything invalid. If the link he clicks on is correct, then using the url-bar to refresh the page shouldn't break anything.
(In reply to Valentin Gosu [:valentin] from comment #12)
> The URL parser isn't meant to be used to correct invalid URLs, but to parse
> valid ones.
> By automatically fixing invalid URLs we are only encouraging the spread of
> them on the internet,

We're not on the same page here. Clearly creating an nsIURI from a string does normalize the URL. It encodes thousands of illegal characters (but notably not the backslash). You seem to be painting a completely different picture which makes no sense to me.

Normalizing URLs on our side, e.g. when loading a URL from the location bar, theoretically ensures that invalid URLs do *not* spread on the internet, because they don't hit the internet in the first place.

> and it could lead to bad web-compat issues if other
> browsers start doing it too, but in slightly different ways.

That's why we have a spec...

> Can you explain the reasoning behind the current user experience of the
> url-bar?
> Do you have a list of the upsides and downsides the percent decoding of the
> URL is supposed to bring?

It's supposed to make URLs easier to read, understand and handle (e.g. edit) for end users.

> The following case is actually really bad:
> 
> Load URL: http://example.com/?x=%33&a=hello
> I click on the URL bar and change hello to bye, and hit enter.
> Now x is 3 instead of %33 - even though the user never touched it.
> 
> And it's the same bug as this one, only that the URL parser has no way of
> knowing what the user intended.
> If we fix this case, than this bug no longer applies.

Not really. If the user manually puts a backslash in the location bar, we'll still send that unencoded unless we fix nsIURI.
(In reply to Dão Gottwald [:dao] from comment #13)
> (In reply to Valentin Gosu [:valentin] from comment #12)
> > The URL parser isn't meant to be used to correct invalid URLs, but to parse
> > valid ones.
> > By automatically fixing invalid URLs we are only encouraging the spread of
> > them on the internet,
> 
> We're not on the same page here. Clearly creating an nsIURI from a string
> does normalize the URL. It encodes thousands of illegal characters (but
> notably not the backslash). You seem to be painting a completely different
> picture which makes no sense to me.
> 

If we change the parser to automatically encode a \ it gets from the URL bar, it will also accept and encode a slash it will get from clicking a URL in a web page.

> Normalizing URLs on our side, e.g. when loading a URL from the location bar,
> theoretically ensures that invalid URLs do *not* spread on the internet,
> because they don't hit the internet in the first place.

Copying from the URL bar gives you percent decoded URLs. So... I don't see how that stops me from spreading that URL on the internet.

> 
> > and it could lead to bad web-compat issues if other
> > browsers start doing it too, but in slightly different ways.
> 
> That's why we have a spec...

All browsers comply with the spec to a varying degree. Also servers, some of which comply with older RFCs, and are never going to be updated.

> 
> > Can you explain the reasoning behind the current user experience of the
> > url-bar?
> > Do you have a list of the upsides and downsides the percent decoding of the
> > URL is supposed to bring?
> 
> It's supposed to make URLs easier to read, understand and handle (e.g. edit)
> for end users.
> 
> > The following case is actually really bad:
> > 
> > Load URL: http://example.com/?x=%33&a=hello
> > I click on the URL bar and change hello to bye, and hit enter.
> > Now x is 3 instead of %33 - even though the user never touched it.
> > 
> > And it's the same bug as this one, only that the URL parser has no way of
> > knowing what the user intended.
> > If we fix this case, than this bug no longer applies.
> 
> Not really. If the user manually puts a backslash in the location bar, we'll
> still send that unencoded unless we fix nsIURI.

Are you saying the example I gave isn't a bug?

Also, if that's what the user wants to do (send a \ to the server), why should we stop them? But at the same time if he entered %5C, we should send %5C.

It seems to me that this issue is easily fixed in the Firefox code. Sure, we could force encode \ characters, but that doesn't make this issue go away, but the spec doesn't mandate that, and no other browsers do it. Feel free to open a bug on the spec if you feel differently.

I hope you understand my point of view. If not, maybe we could have a vidyo talk to clear things out.
(In reply to Valentin Gosu [:valentin] from comment #14)
> (In reply to Dão Gottwald [:dao] from comment #13)
> > (In reply to Valentin Gosu [:valentin] from comment #12)
> > > The URL parser isn't meant to be used to correct invalid URLs, but to parse
> > > valid ones.
> > > By automatically fixing invalid URLs we are only encouraging the spread of
> > > them on the internet,
> > 
> > We're not on the same page here. Clearly creating an nsIURI from a string
> > does normalize the URL. It encodes thousands of illegal characters (but
> > notably not the backslash). You seem to be painting a completely different
> > picture which makes no sense to me.
> > 
> 
> If we change the parser to automatically encode a \ it gets from the URL
> bar, it will also accept and encode a slash it will get from clicking a URL
> in a web page.

Just as with other illegal characters, yes.

> > Normalizing URLs on our side, e.g. when loading a URL from the location bar,
> > theoretically ensures that invalid URLs do *not* spread on the internet,
> > because they don't hit the internet in the first place.
> 
> Copying from the URL bar gives you percent decoded URLs. So... I don't see
> how that stops me from spreading that URL on the internet.

Copying from the URL bar gives you encoded URLs, unless you manually modified the value; at that point we assume you know what you're doing and the location bar just acts like any other text field. (There's only so far we can guide users, we can't prevent them from typing faulty URLs and spreading them wherever they may see fit.)

> > > and it could lead to bad web-compat issues if other
> > > browsers start doing it too, but in slightly different ways.
> > 
> > That's why we have a spec...
> 
> All browsers comply with the spec to a varying degree. Also servers, some of
> which comply with older RFCs, and are never going to be updated.

I'm not sure what this has to do with this bug anymore.

> > > The following case is actually really bad:
> > > 
> > > Load URL: http://example.com/?x=%33&a=hello
> > > I click on the URL bar and change hello to bye, and hit enter.
> > > Now x is 3 instead of %33 - even though the user never touched it.
> > > 
> > > And it's the same bug as this one, only that the URL parser has no way of
> > > knowing what the user intended.
> > > If we fix this case, than this bug no longer applies.
> > 
> > Not really. If the user manually puts a backslash in the location bar, we'll
> > still send that unencoded unless we fix nsIURI.
> 
> Are you saying the example I gave isn't a bug?

I'm saying if we fix that case, this bug still applies.

> Also, if that's what the user wants to do (send a \ to the server), why
> should we stop them?

We do this for thousands of other illegal characters, so I'm not sure you're serious with that question. Earlier you argued you want to encourage the spread of illegal URLs on the internet and now you're saying we should pass through what the user types without encoding it?
(In reply to Dão Gottwald [:dao] from comment #15)
> > Also, if that's what the user wants to do (send a \ to the server), why
> > should we stop them?
> 
> We do this for thousands of other illegal characters, so I'm not sure you're
> serious with that question. Earlier you argued you want to encourage the
> spread of illegal URLs on the internet and now you're saying we should pass
> through what the user types without encoding it?

Oops, should have been: earlier you argued you *don't* want to encourage the spread of illegal URLs
I'm moving this to networking, since:

* By not encoding \, we produce invalid URLs
* There's evidence that spec-compliant servers choke on the unencoded backslash
* The proposed front-end fix doesn't fully address that problem
* Even though other browsers don't automatically encode \, servers should handle %5C just fine (regardless of whether they accept \ or not)
Severity: critical → normal
OS: Mac OS X → All
Product: Firefox → Core
Hardware: x86 → All
Summary: %5C characters in url are replaced with \ in location bar → \ (backslash) should be encoded as %5C in the URL query
Component: Untriaged → Networking
(In reply to Dão Gottwald [:dao] from comment #17)
> I'm moving this to networking, since:
> 
> * By not encoding \, we produce invalid URLs
> * There's evidence that spec-compliant servers choke on the unencoded
> backslash

Specification-complaint does not trump compat these days. Rather, the specification is simply wrong if all clients do something else.


> * Even though other browsers don't automatically encode \, servers should
> handle %5C just fine (regardless of whether they accept \ or not)

We should not change our behavior to be more different from other browsers unless they change as well.
(In reply to Anne (:annevk) from comment #18)
> (In reply to Dão Gottwald [:dao] from comment #17)
> > I'm moving this to networking, since:
> > 
> > * By not encoding \, we produce invalid URLs
> > * There's evidence that spec-compliant servers choke on the unencoded
> > backslash
> 
> Specification-complaint does not trump compat these days. Rather, the
> specification is simply wrong if all clients do something else.

That's only half of the story, the spec isn't just for browsers. There are servers treating \ as illegal, so in this case I suspect changing browser behavior will be easier than changing the spec and getting servers updated to accept \.

> > * Even though other browsers don't automatically encode \, servers should
> > handle %5C just fine (regardless of whether they accept \ or not)
> 
> We should not change our behavior to be more different from other browsers
> unless they change as well.

I agree we should aim for cross-browser compatibility, but as I said I don't think that's the only factor that needs to be considered here. However we should definitely reach out to other browser vendors through the appropriate channels.
There are also servers treating \ and %5C differently and since all browsers match that it's more likely we'd break those.

The problem here is that the address bar code modifies URLs browsers (and some set of servers) treat as distinct.
(In reply to Anne (:annevk) from comment #20)
> There are also servers treating \ and %5C differently

Are there? Do you have an example?
The discussion derived quite a bit, from 

"FF decodes URI-encoded backslashes set from JS"

to 

"should we always encode backslashes?"

Firefox _creates_ broken URIs by decoding valid ones when they end up in the URL bar.

In chrome, this works fine and puts #%5C in the URL bar.

    window.location.hash = encodeURIComponent("\\")

In FF, it will put a non-encoded backslash. The resulting URL breaks other tools. 

Try to link to [foo](https://example.com/#\) or [foo](https://example.com/#\n) from a github issue, for example.

In the first case, the Markdown parser ignore the construct and outputs '[foo](https://example.com/#\)', and the second case, the backslash is dropped resulting in <a href ="https://example.com/#n">foo</a>.

This breaks sharing http://rollupjs.org snippets from Firefox, for example (the code is URI encoded in the hash part of the URL).

http://rollupjs.org/#%7B%22options%22%3A%7B%22format%22%3A%22cjs%22%2C%22moduleName%22%3A%22myBundle%22%2C%22globals%22%3A%7B%7D%2C%22moduleId%22%3A%22%22%7D%2C%22modules%22%3A%5B%7B%22name%22%3A%22main.js%22%2C%22code%22%3A%22%2F%2F%20Hello%2C%5Cn%2F%2F%20Mozilla%20folks%22%7D%5D%7D
Case in point, https://example.com/#\ even breaks the MDN linkifier.
Well s/MDN/Bugzilla web interface/ and sorry for the flood.
Assignee: nobody → valentin.gosu
Whiteboard: [necko-active]
No longer depends on: 1163959
Blocks: url
Blocks: 1064700
Whiteboard: [necko-active] → [necko-backlog]
An example of this problem occurs when sending a request to Apache Tomcat 8, which is very strict with URLs.   If you send %5c in a parameter value to Tomcat it is happy, but Firefox transforms this into the unencoded \ so when you copy and paste the URL into an email and send it to someone now the URL doesn't work because it isn't encoded properly and Tomcat throws and Exception in the log file "IllegalArgumentException: Invalid character found in the request".   I have to remember to always manually re-encode the URL parameters before copying and pasting since Firefox mangles them now.   How is this now a high priority issue?  Firefox is clearly breaking the web spec.
This is similar to bug 1163959.

If I type http://example.com/?bla=%5C\ and hit go, it goes to http://example.com/?bla=%5C\
If I copy the URL it copies http://example.com/?bla=%5C\
But if I just click the URL bar and hit enter (or click refresh), it goes to http://example.com/?bla=\\
Assignee: valentin.gosu → nobody
Component: Networking → Location Bar
Depends on: 1163959
Product: Core → Firefox
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
See Also: → 1421431
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.