Closed Bug 407172 Opened 17 years ago Closed 10 years ago

Apostrophe in URLs is automatically escaped by Firefox, but not Internet Explorer

Categories

(Core :: Networking, defect)

defect
Not set
major

RESOLVED DUPLICATE of bug 1040285

People

(Reporter: bwporter, Unassigned)

Details

(Keywords: html5, regression)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b1) Gecko/2007110904 Firefox/3.0b1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b1) Gecko/2007110904 Firefox/3.0b1

The location bars of Internet Explorer 6 and 7, Firefox 2.0.0.11, and Firefox 3.0b1 all display the URL as:

http://www.brandonporter.com/mozilla_bug/test'2.html

However, Firefox 3.0b1 escapes the apostrophe before sending the request to the server, whereas IE and Firefox 2.0.0.11 don't:

207.171.191.60 - - [13/Jul/2007:03:04:00 -0700] "GET /mozilla_bug/test'2.html HTTP/1.1" 200 14 "http://www.brandonporter.com/mozilla_bug/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
207.171.191.60 - - [13/Jul/2007:03:24:38 -0700] "GET /mozilla_bug/test'2.html HTTP/1.1" 200 14 "http://www.brandonporter.com/mozilla_bug/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
207.171.191.60 - - [13/Jul/2007:03:04:06 -0700] "GET /mozilla_bug/test%272.html HTTP/1.1" 200 14 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b1) Gecko/2007110904 Firefox/3.0b1"

Most web server installations map both of these to the same document, but some MediaWiki installations I've seen treat them as two separate MediaWiki pages.

RFC 1738 suggests that an apostrophe does not need to be URL encoded:

   Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
   reserved characters used for their reserved purposes may be used
   unencoded within a URL.
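
For illustration, the two request lines in the logs above differ only in how the apostrophe is written. A minimal sketch using Python's standard `urllib.parse` module (the variable names are illustrative and not part of the report):

```python
from urllib.parse import quote, unquote

raw_path = "/mozilla_bug/test'2.html"        # as sent by IE and Firefox 2
escaped_path = "/mozilla_bug/test%272.html"  # as sent by Firefox 3.0b1

# Percent-decoding the Firefox 3 form yields the other form.
decoded = unquote(escaped_path)

# quote() escapes the apostrophe by default; listing it in `safe` keeps it literal.
default_quoted = quote(raw_path)
apostrophe_kept = quote(raw_path, safe="/'")
```

Whether a server treats the two spellings as the same resource depends on whether it decodes the path before looking it up.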
 

Reproducible: Always

Steps to Reproduce:
1. Create a URL including an apostrophe
2. Tail the HTTP logs to see what URL the server sees from the browser
3. Compare Firefox 2.0, 3.0, and IE to see if they are the same
Actual Results:  

207.171.191.60 - - [13/Jul/2007:03:04:00 -0700] "GET /mozilla_bug/test'2.html HTTP/1.1" 200 14 "http://www.brandonporter.com/mozilla_bug/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
207.171.191.60 - - [13/Jul/2007:03:24:38 -0700] "GET /mozilla_bug/test'2.html HTTP/1.1" 200 14 "http://www.brandonporter.com/mozilla_bug/" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"
207.171.191.60 - - [13/Jul/2007:03:04:06 -0700] "GET /mozilla_bug/test%272.html HTTP/1.1" 200 14 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b1) Gecko/2007110904 Firefox/3.0b1"

Expected Results:  
Expect not to see single-quotes converted to %27 before sending to server.
Is this a known issue already?  Haven't seen any response.
I have verified this issue still exists in Firefox 3 Beta 2 with the Friday December 14th bug-hunt build.
I pulled the official Beta 2 build today and have confirmed this behavior is still happening.  

This is a significant change in behavior relative to previous browsers and appears to contradict the URI specification.  

Have I mis-filed this or do community bug reports just tend to go ignored by Mozilla?
The change was introduced in bug 376844. This is a Core:Networking issue, as the uri is passed unescaped to the docshell.
Component: Location Bar and Autocomplete → Networking
OS: Windows XP → All
Product: Firefox → Core
QA Contact: location.bar → networking
Hardware: PC → All
Version: unspecified → Trunk
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9?
Keywords: regression
Um, if that's the actual request line from the HTTP transmission, that's a MediaWiki bug, is it not (and we should get them to fix it)?  From RFC 2616:

  The Request-URI is transmitted in the format specified in section 3.2.1.
  If the Request-URI is encoded using the “% HEX HEX” encoding, the origin
  server MUST decode the Request-URI in order to properly interpret the
  request. Servers SHOULD respond to invalid Request-URIs with an
  appropriate status code.

Not decoding the hex encoding isn't an option.
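
A rough sketch of the difference this makes on the server side, assuming a hypothetical page store keyed by decoded path (`PAGES`, `lookup_raw` and `lookup_decoded` are invented for illustration and are not MediaWiki code):

```python
from urllib.parse import unquote

# Hypothetical page store keyed by decoded path; not taken from MediaWiki.
PAGES = {"/mozilla_bug/test'2.html": "test page"}


def lookup_raw(request_path):
    """Buggy variant: uses the Request-URI without decoding it."""
    return PAGES.get(request_path)


def lookup_decoded(request_path):
    """Decodes "% HEX HEX" escapes before resolving the resource."""
    return PAGES.get(unquote(request_path))
```

With the raw lookup, the `%27` form misses the page; with the decoding lookup, both spellings resolve to the same entry.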
If the server is expected to treat both as identical, as the RFC text suggests, then clearly that isn't happening with MediaWiki.

It still seems like somewhat shaky ground to choose to encode more conservatively than both the implied standard (prior browsers from multiple vendors) and the character set the RFC requires to be encoded.
Not necessarily disagreeing, but if we can get the MediaWiki people to fix it, the payoff in preventing some types of exploits may be worth the temporary incompatibility, especially if they fix it soon.  Could you file a bug in their bug tracker about this?  Also, are you sure the problem couldn't be reproduced in a user page on Wikipedia or similar?
I created a test page in MDC - both http://developer.mozilla.org/en/docs/User:Wladimir_Palant/Test'test and http://developer.mozilla.org/en/docs/User:Wladimir_Palant/Test%27test are handled correctly (and I even get redirected to the latter after creating the page). This is also expected since MediaWiki has to handle UTF8 in article titles - it cannot work properly without unescaping requests. My guess: the MediaWiki installations we are talking about are either using custom modifications or some buggy plugin.
I'll do some investigation of our internal wiki to try to understand at what layer this arises and see if I can find any reproducible cases on public wikis and then file bugs to the appropriate folks.
(In reply to comment #9)
> I'll do some investigation of our internal wiki to try to understand at what
> layer this arises and see if I can find any reproducible cases on public wikis
> and then file bugs to the appropriate folks.
> 

Brian - can you let us know how it goes?  Until then I'm removing this bug from the blocking list.
Flags: blocking1.9? → blocking1.9-
I do not have direct access to configuration files or other details of the installed MediaWiki instance that demonstrates this bug so I wasn't able to learn much.  I was able to ascertain that the version of MediaWiki deployed that exhibits this is 1.5.5 (relatively old).

Does anyone have a MediaWiki contact we could add to this bug to answer whether this was a known problem or not?  Otherwise, I'll try to reach out through their community channels.
Community channels would be the way to go; I don't remember seeing a formal contact mentioned anywhere before.  It's also conceivable it'd be in their bug tracker, but if it's not that of course proves nothing.
Blocks: 376844
Hi, This breaks an RFC and conflicts with a browser that has more users than you ... and you choose to lobby the web developers to change their web servers to break the RFC instead of correcting a problem you introduced to working code? Are you programmers, or just a bunch of managers and PR guys?

I was going to ask, where do I send the payola to get this fixed ... but I've been thinking, hey, you send me the money to change my server instead, OK?

http://www.ietf.org/rfc/rfc3986.txt

------------------------------------------------------------------------
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
reserved    = gen-delims / sub-delims
     gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
     sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                     / "*" / "+" / "," / ";" / "="

URIs that differ in the replacement of a reserved character with its
corresponding percent-encoded octet are not equivalent.
------------------------------------------------------------------------

thanks in advance for your genuine attention to this matter.
There is a duplicate bug here: https://bugzilla.mozilla.org/show_bug.cgi?id=434211
* Request for Comments: 3986 -- Uniform Resource Identifier (URI): Generic Syntax *

http://www.ietf.org/rfc/rfc3986.txt

From _2.1.  Percent-Encoding_

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

...

From _2.2.  Reserved Characters_

   URIs include components and subcomponents that are delimited by
   characters in the "reserved" set.  These characters are called
   "reserved" because they may (or may not) be defined as delimiters by
   the generic syntax, by each scheme-specific syntax, or by the
   implementation-specific syntax of a URI's dereferencing algorithm.
   If data for a URI component would conflict with a reserved
   character's purpose as a delimiter, then the conflicting data must be
   percent-encoded before the URI is formed.

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   The purpose of reserved characters is to provide a set of delimiting
   characters that are distinguishable from other data within a URI.
   URIs that differ in the replacement of a reserved character with its
   corresponding percent-encoded octet are not equivalent.  Percent-
   encoding a reserved character, or decoding a percent-encoded octet
   that corresponds to a reserved character, will change how the URI is
   interpreted by most applications.  Thus, characters in the reserved
   set are protected from normalization and are therefore safe to be
   used by scheme-specific and producer-specific algorithms for
   delimiting data subcomponents within a URI.

...

                                                                  any of
   those characters that are also in the reserved set are "reserved" for
   use as subcomponent delimiters within the component.

...

   URI producing applications should percent-encode data octets that
   correspond to characters in the reserved set unless these characters
   are specifically allowed by the URI scheme to represent data in that
   component.

From _2.4.  When to Encode or Decode_

   Under normal circumstances, the only time when octets within a URI
   are percent-encoded is during the process of producing the URI from
   its component parts.

...

   When a URI is dereferenced, the components and subcomponents
   significant to the scheme-specific dereferencing process (if any)
   must be parsed and separated before the percent-encoded octets within
   those components can be safely decoded, as otherwise the data may be
   mistaken for component delimiters.  The only exception is for
   percent-encoded octets corresponding to characters in the unreserved
   set, which can be decoded at any time.

In other words, you are preventing the use of apostrophe for delimiting purposes, for which it is reserved in the RFC.

And the RFC allows use of the sub-delims set *as data* not merely as delimiters, within both path segments, and in query data.

From _3.3.  Path_

      segment       = *pchar

...

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

...

   Aside from dot-segments in hierarchical paths, a path segment is
   considered opaque by the generic syntax.

ie, not to be transformed or interpreted, other than by the server.

From _3.4.  Query_

      query       = *( pchar / "/" / "?" )

From _6.2.1.  Simple String Comparison_

With regard to searching cached data for matches to newer requests,

   False negatives are caused by the production and use of URI aliases.
   Unnecessary aliases can be reduced, regardless of the comparison
   method, by consistently providing URI references in an already-
   normalized form (i.e., a form identical to what would be produced
   after normalization is applied, as described below).

ie, if there are two names for given filenames, you may be halving my cache efficiency for those files. Of course, as the number of possible 'valid' names increases for a given file, then the cache efficiency may be drastically reduced.

From _6.2.2.2.  Percent-Encoding Normalization_

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs.  In addition to the case
   normalization issue noted above, some URI producers percent-encode
   octets that do not require percent-encoding, resulting in URIs that
   are equivalent to their non-encoded counterparts.  These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.

Let me just repeat that last bit.

                                                      These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.
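
A minimal sketch of the normalization rule quoted from Section 6.2.2.2, assuming a hypothetical helper named `normalize_percent_encoding`. It decodes only escapes of unreserved characters and otherwise just uppercases the hex digits; it is not a full RFC 3986 normalizer:

```python
import re
import string

# Unreserved set from RFC 3986 section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri):
    """Decode escapes of unreserved characters; leave all others escaped."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Keep the escape, with uppercase hex digits.
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, uri)
```

Under this rule `%7E` becomes `~`, while `%27` is left alone, because the apostrophe is in the reserved set in RFC 3986.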

From _7.2.  Malicious Construction_

   When a URI contains percent-encoded octets that match the delimiters
   for a given resolution or dereference protocol (for example, CR and
   LF characters for the TELNET protocol), these percent-encodings must
   not be decoded before transmission across that protocol.

Apostrophe is *not* one of the delimiters for the HTTP protocol. So the RFC does not provide an excuse here to send encoded apostrophes.

From _7.3.  Back-End Transcoding_

                         Applications must split the URI into its
   components and subcomponents prior to decoding the octets, as
   otherwise the decoded octets might be mistaken for delimiters.
   Security checks of the data within a URI should be applied after
   decoding the octets.

ie, servers are supposed to decode the apostrophe, then sanity-check it. So they have to be able to handle a raw apostrophe, your encoding of it is not making any server safer. Yet, by always encoding it, you remove the use of the apostrophe for its reserved use, as a delimiter.
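
The split-then-decode order from Section 7.3 can be sketched as follows (both function names are hypothetical; the second one shows the ordering the RFC warns against):

```python
from urllib.parse import unquote


def path_segments(raw_path):
    """Split on the literal "/" delimiter first, then decode each segment."""
    return [unquote(segment) for segment in raw_path.split("/") if segment]


def path_segments_wrong(raw_path):
    """Decoding first lets an escaped "/" masquerade as a delimiter."""
    return [segment for segment in unquote(raw_path).split("/") if segment]
```

For a path such as `/wiki/AC%2FDC/it%27s`, the first function keeps `AC/DC` as one segment, while the second splits it in two. In both cases the apostrophe comes out decoded.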

If you manage to direct our programming practices away from public RFC standards, the door opens for all sorts of abuse.


* Request for Comments: 2616 -- Hypertext Transfer Protocol -- HTTP/1.1 *

http://www.ietf.org/rfc/rfc2616.txt

From _3.2.3 URI Comparison_

   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

* Request for Comments: 2396 -- Uniform Resource Identifiers (URI): Generic Syntax *

http://www.ietf.org/rfc/rfc2396.txt

Refers to RFC 2396 which was superseded by 3986, but let's see what 2396 said, just to see if you have been ignoring the problem, or did the goalposts shift on you?

From _2.3. Unreserved Characters_

      unreserved  = alphanum | mark

      mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

OK, the apostrophe was moved into the reserved set between 2396 and 3986.
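
As a quick check, the productions quoted from the two RFCs can be compared directly. This is a sketch with illustrative names; the character sets are copied from the grammar excerpts above:

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# "mark" production from RFC 2396 section 2.3.
RFC2396_UNRESERVED = ALNUM | set("-_.!~*'()")

# "unreserved" production from RFC 3986 section 2.3.
RFC3986_UNRESERVED = ALNUM | set("-._~")

# sub-delims production from RFC 3986 section 2.2.
RFC3986_SUB_DELIMS = set("!$&'()*+,;=")

# Characters that were unreserved in RFC 2396 but are not in RFC 3986.
moved_to_reserved = RFC2396_UNRESERVED - RFC3986_UNRESERVED
```

The difference is `!`, `*`, `'`, `(` and `)`, all of which appear in the sub-delims set of RFC 3986.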

But,

   Unreserved characters can be escaped without changing the semantics
   of the URI, but this should not be done unless the URI is being used
   in a context that does not allow the unescaped character to appear.

So, you shouldn't have been munging it. The apostrophe causes no conceivable problem whatsoever in the context of an HTTP header, which was specifically designed alongside the same set of guides as led to RFC 2396/3986. The wording is "does not allow", not "may be problematic", so even if you think the apostrophe dangerous, the superseded 2396 RFC is still not your friend.

From _2.4.2. When to Escape and Unescape_

                                               escaping or unescaping a
   completed URI might change its semantics.  Normally, the only time
   escape encodings can safely be made is when the URI is being created
   from its component parts; each component may have its own set of
   characters that are reserved, so only the mechanism responsible for
   generating or interpreting that component can determine whether or
   not escaping a character will change its semantics.
marginal, there is no point quoting RFCs here. See bug 376844, when we implemented this change we already knew that we don't have to escape apostrophes by the RFC. We still decided to do it because it takes care of an entire class of cross-site scripting vulnerabilities (meaning that even if the website is vulnerable this vulnerability won't affect Firefox users). This doesn't change the fact that the server is required to decode encoded characters (see comment 5).

What would be actually useful: please describe the use case we are breaking and why it cannot/shouldn't be fixed (at least not with less effort than finding and posting here tons of RFC quotations).
Ah yes, those 'useless' old RFCs.

wladimir, where once it was "don't have to escape apostrophes by the RFC" now it is "escaping apostrophes breaks the RFC". The effort above explains this, and includes the RFC drafters' several reasons why it is important to consistently apply the RFC.

I imagine if the drafters of the RFC saw fit to add apostrophe to the class of characters that are not to be escaped, they had a reason. I doubt my own usage case is going to convince you, that is why I quoted an RFC, and it's why we have those. 

Individual cases are rarely convincing, but to establish an inter-communication network, we need consistency, therefore we agree standards. I want firefox to support the current URI standard, RFC 3986. I want firefox 3 to apply the same behaviour that afaik all other browsers do.

I don't want to be told to code around something that is in an RFC. I want permanent URIs which can be relied on to find the same resource in some years time. For this, I look to a widely agreed standard, evolved openly over some decades and documented effectively and authoritatively. I do not look to some lax client-end bugfix for random scripting exploits that you have kludged together without consultation. I have no confidence at all that your browser's behaviour is going to be sustained and widely adopted, since it now directly contravenes stipulations of an RFC.

Perhaps it would help us all for you to provide specific case examples of the types of problems that you think are alleviated by excessively escaping the apostrophe. I am sure we can build at least as many examples of where it is harmful, although they may be more involved to explain, since they are likely manifested in things like over-complex code, shifting goalposts, and lack of faith in consultation processes, as much as they are in straight-out kiddie-script hacks and typos. I think the onus rests with you to justify firefox 3's behaviour, not with me to justify the wording of some RFCs.
It may benefit you to look at the current draft of the next iteration of 2616, which certainly clarifies (again) that you should NOT encode characters that are in the do-not-encode-these-characters list, and adds a lot of discussion of just why it is we use STANDARDS.

*HTTP/1.1, part 1: URIs, Connections, and Message Parsing -- draft-ietf-httpbis-p1-messaging-08 *

http://www.ietf.org/id/draft-ietf-httpbis-p1-messaging-08.txt

   One consequence of HTTP flexibility is that the protocol cannot be
   defined in terms of what occurs behind the interface.  Instead, we
   are limited to defining the syntax of communication, the intent of
   received communication, and the expected behavior of recipients.

If you want to say that firefox implements HTTP 1.1, then you must adhere to the HTTP 1.1 standard.

If you think you have a convincing case of why apostrophe should be an unreserved character in HTTP, then why isn't that reflected in the current RFC draft? You took the initiative to escape the apostrophe three years ago, have you not approached the RFC discussion at all with this supposedly crucial need?
sorry, my reading of the draft draft-ietf-httpbis-p1-messaging-08 was incorrect on whether it clarifies this case, but please consider the rest of my previous post here.

The RFC has not been changing in the direction that you are headed. Why not, if you have a strong case?
http://tools.ietf.org/wg/httpbis/trac/ticket/34

Internet Engineering Task Force (IETF) Hypertext Transfer Protocol Bis Ticket #34 "Out-of-date reference for URIs" (closed design: fixed)

   RFC2616 refers to RFC2396, RFC1630, RFC1738, and RFC1808, all of which have
   been obsoleted by RFC3986.

   If the reference is changed, it will important to remember to make a 
   decision about the "mark" production in the former (which are unreserved 
   characters) which have been moved to have reserved status in the latter.

   By referencing 2396, HTTP allows these characters unescaped in path 
   segments, etc. If 3986 is referenced as-is, they will be effectively 
   disallowed, thereby effectively making a number of existing HTTP URIs 
   invalid.

   This includes the exclamation mark ("!"), asterisk ("*"), single-quote 
   ("'"), and open and close parentheses ("(" and ")").

This is where the change in the draft of HTTP 1.1 supports my interpretation. Note the following annotation on the page:

   Changed 23 months ago by fielding@gbiv.com

   RFC3986 is the standard based on full knowledge of its effect on HTTP. IOW, 
   the decision has already been made.

HTTP 1.1 RFC 2616 refers to URI RFC 2396, but since URI RFC 3986 explicitly supersedes URI RFC 2396, that is the URI form which should be observed wrt HTTP 1.1 URI forms -- even though an updated release of the HTTP RFC has not yet been finalised.

btw I'm not convinced that the bug complaint #34 is complete or accurate in its description of effect of the change, but I think I and others have described well on this page how the client should behave wrt these characters. What the server does with the information it is passed is its business, which is also backed up in the RFCs.
Others have accepted that the RFC allows unencoded apostrophes in valid URIs:

* GNU Mailman Bug #310124 *

https://bugs.launchpad.net/mailman/+bug/310124

OK.

Here is yet another reason why you should not frutz with the characters: Encoded octets may be being used for representing data in encodings other than US ASCII. This is permitted, and exists. I can't point you to an example, but I am sure they exist. ie %27 may mean something other than apostrophe -- including non-character data (for example, we can put data onto a server in some cases using a GET, not just using a POST) -- to the server, and apostrophe may require some other octet number if you choose to translate it. You cannot know in advance what encoding system the server uses for GET request filenames, they are by definition in US-ASCII and described in the RFCs I already mentioned. There is no header which allows you to determine this in advance, to do so would require an additional transaction, prior to a GET. This kind of overhead would be unreasonable in HTTP.
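
A small sketch of the charset dependence described above, using Python's `urllib.parse` (variable names are illustrative). In the ASCII-compatible charsets shown here the octet 0x27 is still the apostrophe; the example only illustrates that the mapping from escaped octets to characters is decided by whoever decodes them:

```python
from urllib.parse import unquote, unquote_to_bytes

# At the octet level, %27 is always the single byte 0x27.
octet = unquote_to_bytes("%27")

# The same character can be escaped differently depending on the charset.
latin1_e_acute = unquote("%E9", encoding="latin-1")
utf8_e_acute = unquote("%C3%A9", encoding="utf-8")

# Decoding with the wrong charset does not recover the character;
# by default Python substitutes U+FFFD for the undecodable byte.
mismatch = unquote("%E9", encoding="utf-8")
```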

Here is the discussion of the percent-encoding issue from when URI RFC 2396 was being updated to URI RFC 3986. The third-last contribution (Tim Bray, 07 Mar 2003, URI-WG mailing list) to the discussion of this point (041-encoding Section 2 on encoding causes too much confusion) was noted as being accepted, with wording changed only to avoid duplication of other parts of the RFC. Perhaps this gives a clear summary of this issue. The most notable part is this:

http://labs.apache.org/webarch/uri/rev-2002/issues.html#041-encoding

   - ASCII characters which may legally appear in the component MUST
     appear directly as themselves, i.e. 'a' may not be encoded as %61.

Is that clear enough? It seems a pity to me that the final version of the RFC did not use such singularly clear language on this issue, but it says the same thing in more words.
One mis-statement in an earlier post of mine: "And the RFC allows use of the sub-delims set *as data* not merely as delimiters, within both path segments, and in query data."

Not really ... Well, in the encoded form, yes, the encoded values might be used as data. So if you encode in every case, then how to distinguish from the use as delims ... double-encode? That is explicitly forbidden, and would be needlessly complex.
Perhaps you should submit a bug-request to the upstream authority?

http://rfc-editor.org/errata_search.php?rfc=3986&rec_status=15&presentation=table

Not being purely sarcastic, I can't think of a reason to care one way or the other whether apostrophe is escaped ... just that each browser does the same thing with them, for the numerous reasons outlined above, chief for me being the provision of a single canonical URI for any given resource.

If you have really uncovered a bug in the URI syntax, I would like to know that some years from now, all functional browsers will be observing the same convention as yours.
marginal, how about you STOP SPAMMING THIS BUG HERE? This was now your seventh comment in a row. At least think what you want to write and write this all in one comment. This bug is certainly not the right place for RfC discussions, maybe mozilla.dev.platform is http://groups.google.de/group/mozilla.dev.platform/topics?hl=de&lnk
Comments about bug discussion etiquette aside, I think marginal makes a good point that interoperability matters.  On this particular issue, the browsers were interoperable and consistent with the RFC until Mozilla decided to change their behavior.

So long as no one browser has 90+ percent market share, interoperability will continue to be a very serious issue for developers.  To come up with examples of how encoding differences cause developer pain, we can simply look at the + vs %20 encoding pain.  
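
The `+` versus `%20` mismatch mentioned above can be reproduced with Python's `urllib.parse`, which exposes both conventions (a sketch; the sample strings are made up):

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b+c"

path_style = quote(text)       # space -> %20, plus -> %2B
form_style = quote_plus(text)  # space -> +,   plus -> %2B

# A literal "+" decodes differently depending on the convention.
as_path = unquote("a+b")
as_form = unquote_plus("a+b")
```

A producer and a consumer that disagree on the convention silently turn a plus sign into a space or the reverse.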

It seems that the argument here is that XSS issues trump interoperability, particularly if the interoperability challenge is considered "minor".  I'm going to question whether this really helps with XSS, where we know for certain it is inconsistent with other browsers and the RFCs.

The assumption that allows changing the default is that web servers treat the unencoded and encoded URLs as equivalent.  The further assumption is that web servers which don't treat them as equivalent (even though the specs say they shouldn't be considered equivalent) should be fixed.

If we take as an assumption motivating this change that most web servers are treating them as equivalent already and the others should be "fixed"... then how does this change help with cross-site scripting?  It seems that in order for this change to help XSS situations the assumption has to be the inverse... that servers treat the two encodings differently, with one encoding being more susceptible to XSS than the other.  This contradicts the assumption above that allows the change.

Consequently I'm not convinced the XSS benefit is significant relative to the long-term costs of standards divergence.

I would also argue that changing behavior in a way that's incompatible with both other browsers and the RFCs in order to automatically fix a set of systems with XSS problems seems like a very slippery slope.  How far do you go down this road?  Do you automatically rewrite someone's Javascript to fix their eval of a string in a form field (which is clearly an XSS hole and should be detectable through static or runtime analysis)?

Finally, I entirely agree with Marginal's point that if Mozilla still prioritizes being developer-friendly highly (which has been one of the keys to Firefox's success) then Mozilla should be an advocate for interoperability and should be pushing any divergence from accepted standards for security reasons or otherwise back through the relevant standards bodies.
OK, lemme put this another way.

I don't expect this case to be fixed anytime soon. I realise it's a case that many servers handle and that as a community project, people only take on the work they want to do. I doubt anyone wants to fix this, hence some reluctance to admit it's incorrect behaviour.

But please, I would like at least the admission that this change was a mistake and should be reverted if practical.

The reason I want this acknowledgment is not out of pride or to push some agenda (thanks for your kind comment about SPAM after all my work!)

The reason is, because, as a developer of web server software, I can see the other side of the issue, which most users of firefox may not. I can see the shaky uncertainty that this change makes for us, in developing software, and in naming web pages, for ordinary, less technical web developers.

When we name a web page, it's usually a consideration to think, "is this page going to be saved to the person's hard drive? if so, are they going to have trouble with the filename characters?" I for one, like if possible, to have the files have the same filename characters on the server, on the URI line, and when saved to the hard drive, and perhaps even in the title of the document.

Yes, apostrophe can cause headaches for the user. But actually, it is supported by all the filesystems in common use, as a valid filename character. And most of us use a GUI now, so most users can actually deal with apostrophe. It is only a headache for people who are trying to learn to develop command-line, scripts, and software skills. Mature developers and users do not see apostrophe as a headache, and we understand that if not one given escape character, then another, must be used. There is no way to obviate or minimise the problem; the best bet is simplicity, stability, and not adding complications that are not needed.

Anyway, what *does* a user see as a headache, if not a bare apostrophe in a filename? Actually, what you have done is what they see as the headache. The file that someone creates on a website might be called "Hi, I'm a file.html". The user grabs the file with wget perhaps. They get "Hi, I%27m a file.html". WTF? Where is my file? Is it corrupt?

So, OK, now they have to deal with that, because you won't fix things. Fine. Have it your way, there are other web browsers, choice is a marvelous thing.

But as an author of web pages, and as a developer, this raises a question.

When I go to create a new page, yes, I know now to avoid using apostrophe. But next week, Mozilla may choose to obviate some API bug by doing the same thing you have done with apostrophe, with some other 'reserved' character, eg plus sign.

Next week, I may find that my web pages where they had plus signs, now have escape sequences. And so on. For this reason, websites usually restrict the characters that appear in web page URIs. We choose a subset that are stable, which are not headaches for the web server, which the browser, though not absolutely required to, usually can be relied on to send in plain ASCII with no escaping. It just makes life simpler.

So for the sake of planning, I'd like to know, are Mozilla going to continue to eat away at the internet RFC consensus on what characters are 'reserved' and which are not? Or can you at least admit that this is a change that would be better if it had not occurred? Or are we left in the dark? In which case, gee thanks for ruining a really good platform that was the main force in bringing a stable development environment to the world's computing.

Am I limited to [a-z][0-9] hyphen dot underscore tilde, or what? Please give an indication of your intentions. We will evaluate whether you are consistent in applying them, and your users will stay or leave over time, according to the outcome. Trust me, this is not a small issue. This is the stuff internets are made of.
I don't think Bug#376844 made a valid case. In the wordpress example, the programmer should have done several things:

1. separate the URL into parts, using delimiters of his choice
2. percent-decode the part of interest
3. check that the part is valid
4. properly escape that part when embedding it into a HTML document

the programmer failed to do (2) and (4), both are fatal mistakes. Firefox's "help" will only work if both (2) and (4) are violated, but NOT when only one of them is violated. 

From my experience, it is far more likely that programmers will do (2) properly but fail to do (4). Everybody knows that they have to percent-decode their parameters, right? Therefore Firefox's encoding of apostrophe is futile in this case. The apostrophe will reappear in HTML if (4) is not done properly.

Firefox has to *rely on the programming error of failing to percent decode* for its trick to reach its intended purpose.

On the other hand, if a programmer failed to percent-decode a URL part, he probably failed to percent-encode the URL part too when producing his URLs. Two wrongs make a right, his code would've worked - but Firefox breaks it by automatically encoding apostrophes. Now whose fault is that? The programmer can legitimately argue that in his universe of all possible URL parts, percent-decoding/encoding isn't necessary, and Firefox's breaking RFC and screwing around with apostrophes is the real fault.

Now let's imagine a good programmer who understands basic things like (2) and (4). Firefox screws him too. In (1), the programmer should be able to choose delimiters of his choice, RFC permitting. But Firefox forbids using apostrophes as delimiters. OK, a sensible programmer probably should, and does, avoid using apostrophes as delimiters in his applications if a browser breaks them.

But what if he is writing a framework concerning URLs? What is he supposed to document? "You should not use apostrophes because one major browser violates the RFC, and it is not IE."?

This is not good for our confidence in Firefox, if it breaks the RFC willy-nilly with very poor justifications. Understanding RFC 3986 is hard enough (it sucks), but then you cannot even trust what you read and you have to test each character on each browser? We are all programmers, life is hard enough, let's not do this to each other.
(In reply to comment #25)
> I would also argue that changing behavior in a way that's incompatible with
> both other browsers and the RFCs in order to automatically fix a set of systems
> with XSS problems seems like a very slippery slope.  How far do you go do this
> road?  

<joking>Since single quotes are the root of all evil, Firefox should automatically convert them to &apos; in any received HTML. This will break some applications, but I don't see why anyone would insist on using single quotes as attribute delimiters instead of double quotes.</joking> (I can't help joking, but this analogy is way over the top and I don't think percent-encoding apostrophes in URLs is anything like that. No disrespect to anybody.)
The XML spec reiterates what the HTTP RFC says, that bytes should only be encoded iff there is no other way to send the request.

Furthermore, the URN spec does not put the apostrophe into a 'reserved' group of any type. It explicitly states that the apostrophe (0x27) must be used UNENCODED in any URN: "a character MUST NOT be "%"-encoded if the character is not a reserved character". <http://tools.ietf.org/html/rfc2141>

So whatever gain this incorrect behaviour is obtaining, if any, Firefox is going to have to learn to cope with the apostrophe character soon in order to be able to request resolution of URN ids into URIs from which documents may be retrieved. See  <http://tools.ietf.org/html/rfc3401> and friends.
Blocks: 434211
We should revisit this once the web URI spec work is done, and depending on the outcome of that...
Keywords: html5
The only example given in Bug#376844 does not justify the escaping whatsoever.

Suppose wordpress didn't fix that bug, and Firefox escapes apostrophes: wordpress would still have the same security hole. Note that the characters "<", ">" and " " are used in the example URL; they are escaped by all browsers, but still end up injected into the HTML. Firefox does not help in any way in that scenario.

Therefore, Firefox currently does not show any valid reason why it breaks RFC3986.
Firefox's URI implementation is based on RFC 2396, not 3986 yet.  We've tried to update to 3986 and encountered real-world compat issues that forced us to back the patches out....  Hence my comments about the web uri spec, which will involve some actual real-world compatibility testing of the specification.
Boris, any chance you could please provide a link to what specifically you mean by 'the web URI spec work'? I assume you mean something the Firefox team are working on? Or you mean finalisation of new/updated IETF RFCs?
Also it would be interesting to see a link to the 'real-world compat issues'...? Perhaps provide by email if you think it's off-topic to elaborate here.
http://github.com/abarth/url-spec/ is being worked on by Adam Barth (I believe he's a member of the Chrome team, but the goal is to have whatwg and the W3C adopt the result).  In particular, unlike the cited RFCs this specification would define how to parse strings that are not actually valid URLs, which is something web browsers have to do all the time.

I don't have links to the compat stuff off the top of my head; you can probably find them if you search for some of the utf-8-related bugs in the networking component.
This extra encoding broke AngularJS. See: https://github.com/angular/angular.js/issues/920

The problem is that we need to check if the location represented by an in-memory model is the same as the current location. What we do is we compose the url according to the rules described in rfc3986 and then we compare this url with location.href. This used to work great in older FF and still works great in all other browsers.

Given that location.href always automatically decodes any unnecessarily encoded characters (except for single quote in FF), I don't see why we should change our implementation to be compatible with FF.

So I agree with others that the inconsistency is a serious interoperability issue. We'll work around this, but I'm saddened by having to add a comment+hack to our code for FF; this is typically necessary only for IE.
> This used to work great in older FF

Um... how much older?  The behavior here hasn't changed since mid-2007 on the development trunk.  Which means it hasn't changed since Firefox 3.0 shipped.
You are right. I didn't realize that this went that far back. We received a bug report only recently, which is kind of surprising for such an old issue.

I also made one more incorrect statement: "Given that location.href always automatically decodes any unnecessarily encoded characters..." - this is not true, location.href doesn't do any automatic decoding AFAICT, Angular is doing this and I jumped the gun during my exploratory testing.

The issue comes down to that we expect that when we set location.href to something with a single quote we can later check to see if the value had changed since then by comparing the current location.href value with what we set it to. In FF this fails because of the unexpected encoding rules. Example:

var rfc3986EncodedUrl = "http://server/foo?bar's";
location.href = rfc3986EncodedUrl;
// ...
// later (e.g. onHashChange)
if (location.href !== rfc3986EncodedUrl) {
  // will be called even though the URLs are equal
}
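
One possible workaround, sketched here purely as an illustration, is to compare the two strings after mapping `%27` back to `'`. The helper name `sameUrlIgnoringApostropheEncoding` is hypothetical, and this is not necessarily the fix AngularJS adopted.

```javascript
// Hypothetical helper: treat "'" and "%27" as the same character when
// comparing two URL strings. Only the apostrophe is normalized; all other
// percent-encoding is left untouched.
function sameUrlIgnoringApostropheEncoding(a, b) {
  const normalize = (url) => url.replace(/%27/g, "'");
  return normalize(a) === normalize(b);
}

// What was assigned versus what Firefox reports back afterwards.
const assigned = "http://server/foo?bar's";
const reported = "http://server/foo?bar%27s";
console.log(sameUrlIgnoringApostropheEncoding(assigned, reported)); // true
```

The cost of this approach is that a URL which deliberately distinguishes `'` from `%27` can no longer be told apart from its counterpart.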
Right, I understand the issue.  That's why the bug is open....
(In reply to Boris Zbarsky (:bz) from comment #35)
> http://github.com/abarth/url-spec/ is being worked on by Adam Barth (I
> believe he's a member of the Chrome team, but the goal is to have whatwg and
> the W3C adopt the result).  In particular, unlike the cited RFCs this
> specification would define how to parse strings that are not actually valid
> URLs, which is something web browsers have to do all the time.
> ...

- RFC 3986 does not require escaping the apostrophe in path components. Nor does RFC 2616 or httpbis.

- RFC 3986 *does* require mapping the escaped apostrophe back to an apostrophe for recipients.

- Thus browsers do not need to escape it. I haven't read the full background on why the escaping was added, but it seems that it would only affect those recipients that do not process the URI as per RFC 3986.

- http://github.com/abarth/url-spec/ has been inactive for over a year. I believe there's a newer document in W3C space, but it is also inactive.

Proposal:

Compare with other implementations; and if at least one major UA does *not* do the escaping then consider removing it from Firefox.
Wladimir claims a great security benefit from percent-encoding the apostrophe:

  it will significantly reduce the number of exploitable XSS vulnerabilities.

  it takes care of an entire class of cross-site scripting vulnerabilities

without presenting diligent analysis. He quoted only one wordpress example in bug 376844. This is the attack story in that example:

user clicks a uri 
    /index.php/'%3E%3Cimg%20src=a%20onerror=alert(1)%3E%3C.php
upon receiving the request, php sets $PHP_SELF to
    index.php/'><img src=a onerror=alert(1)><.php
notice the auto percent-decoding.
then the application embeds part of $PHP_SELF directly in HTML without proper escaping. the HTML therefore contains injected javascript.

Percent-encoding the apostrophe does not change anything, the attack still works. Therefore the original justification for percent-encoding apostrophe is invalid.
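
A small JavaScript sketch of that point, using `decodeURIComponent` as a stand-in for the server-side percent-decoding (an assumption; PHP's own decoding is not reproduced here):

```javascript
// The URI from the example, with a literal apostrophe ...
const asTyped = "/index.php/'%3E%3Cimg%20src=a%20onerror=alert(1)%3E%3C.php";
// ... and the same URI with the apostrophe percent-encoded, as Firefox sends it.
const asSentByFirefox = asTyped.replace(/'/g, "%27");

// Stand-in for the automatic percent-decoding done on the server.
const decodedTyped = decodeURIComponent(asTyped);
const decodedFirefox = decodeURIComponent(asSentByFirefox);

// After decoding, both requests carry exactly the same payload.
console.log(decodedTyped === decodedFirefox); // true
console.log(decodedFirefox);
```

Once the server decodes the request, the `%27` is an apostrophe again and the injected markup is identical in both cases.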

It falls on Wladimir or his successor to present a more robust reason.
Single apostrophes were meant to be available for use as "subcomponent delimiters". The fact that FF automatically encodes them prevents such usage. It prevents such usage when a) single apostrophes have meaning to the application *and* b) single apostrophes can appear in user input.

For example, an application might form part of a URL with something like:

    "'" + escape("It's a string!") + "'"

That would produce:

    'It%27s a string!'

Reversing the procedure would then entail something like:

    unescape(parseValueBetweenSingleQuotes())

FF's encoding prevents that from working because the two possible meanings of a single apostrophe are no longer kept separate.

I believe the use case I'm describing is the use case intended by RFC 2396 (unreserved characters) and strengthened by the language of RFC 3986 (reserved for use as subcomponent delimiters).
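
A runnable sketch of the round trip, and of how it breaks, follows. The helpers `encodeValue`, `buildSegment` and `parseSegment` are hypothetical names, and the browser's behaviour is simulated with a plain string replacement rather than a real request.

```javascript
// Hypothetical encoder: percent-encodes everything encodeURIComponent does,
// plus the apostrophe, which encodeURIComponent leaves alone.
function encodeValue(value) {
  return encodeURIComponent(value).replace(/'/g, "%27");
}

// Wrap the encoded value in literal apostrophes used as delimiters.
function buildSegment(value) {
  return "'" + encodeValue(value) + "'";
}

// Reverse the procedure; returns null when the delimiters are missing.
function parseSegment(segment) {
  const match = /^'([^']*)'$/.exec(segment);
  return match ? decodeURIComponent(match[1]) : null;
}

const segment = buildSegment("It's a string!");
// Simulates the browser percent-encoding every apostrophe before sending.
const asSent = segment.replace(/'/g, "%27");

console.log(segment);               // 'It%27s%20a%20string!'
console.log(parseSegment(segment)); // It's a string!
console.log(asSent);                // %27It%27s%20a%20string!%27
console.log(parseSegment(asSent));  // null
```

In the string that is actually sent, the delimiting apostrophes and the apostrophe inside the data are all `%27`, so the receiver can no longer tell them apart.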
We just spent an hour debugging a nasty problem with our app that was caused by this issue. Mostly by the fact that even calling encodeURI in javascript doesn't produce the same URI as the one that FF sends to the server. If nothing else, you should at least be able to make the same URI with encodeURI as the one that will be sent by FF.

Thanks,
Ryan
The algorithm for encodeURI is fixed in the relevant spec, so we can't in fact change it.
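
For anyone hitting the same mismatch, a minimal sketch of the difference: `encodeURI` leaves the apostrophe alone, so matching what is described in this bug takes an extra replacement. The wrapper name `encodeURIWithApostrophe` is hypothetical, and it covers only the apostrophe, not any other difference between `encodeURI` and what a browser sends.

```javascript
const url = "http://server/foo?bar's";

// encodeURI never escapes the apostrophe, so the string comes back unchanged.
console.log(encodeURI(url) === url); // true

// Hypothetical wrapper: additionally percent-encode apostrophes.
function encodeURIWithApostrophe(u) {
  return encodeURI(u).replace(/'/g, "%27");
}

console.log(encodeURIWithApostrophe(url)); // http://server/foo?bar%27s
```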
Marking as duplicate of the newer bug since it has far less noise.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE