Closed Bug 61269 Opened 24 years ago Closed 23 years ago

'%' will get URL-escaped to "%25" if not followed by 2 hexadecimal digits

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla0.9.6

People

(Reporter: baldauf--2015--bugzilla.mozilla.org, Assigned: dougt)

References

()

Details

Attachments

(2 files, 1 obsolete file)

From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (Win98; U)
BuildID:    200111908

Mozilla escapes '%' characters in a URL to "%25" if the '%' is not followed by 
two hexadecimal digits [0-9a-fA-F]. This URL both may be contained in a <A 
HREF=>-Statement or may be typed in directly into the URL bar. This escaping is 
very arbitrary and should be avoided due to its arbitrarity and incompatibility 
with current browsers. Just never encode '%', because '%' already is the 
encoding character.


Reproducible: Always
Steps to Reproduce:
1. Go to http://justchat4.medium.net:4000/chat/test/URLTest.html
2. Click on the link on that page.


Actual Results:  You will see (using mozilla):

Your data supplied is %u20AC test â0AC.


Expected Results:  You should see (as with Netscape 4.7 and IE5):

Your data supplied is € test â0AC.

You will see that mozilla escapes "%u20AC" but it keeps "%e20AC", so escaping is 
very arbitrary. Mozilla should not "know better" how to handle such URLs, it 
just should pass them 1:1 to the server. Why? As a transition-mechanism, I have 
to use %uHHHH-escapes to denote UTF16-characters not found in ASCII. Because I 
cannot know wether browsers URL-charset is Latin1, UTF8 or something else (and 
because URLs don't offer a reliable way to detect it), I need to use this. 
Escaping with '%' but not with a hexadecimal character after that '%' is one of 
the remaining namespace in URLs to be used. This mechanism is site specific and 
free of namespace conflicts as the unreliable Latin1 vs UTF8 transition.

As browsers have to pass URLs as given (at least to the site which generated the 
URL), I consider this a bug.
From RFC 1738 (the RFC on Uniform Resource Locators):

   The character "%" is unsafe because it is used for encodings of other
   characters. [...]  All unsafe characters must always be encoded within
   a URL.

So in cases when the % is not obviously being used to encode another character,
it itself must be encoded.

Ideally, this would be done by the person writing the URL.  That is, you would
have in your page:

"%25u20AC test %25e20AC"
if what you want to get passed to the server is "%u20AC test %e20AC".

I would recommend that this be marked invalid as Mozilla's behavior is correct
per the RFC.
The character "%" is unsafe because it is used for encodings of other
characters. [...]  All unsafe characters must always be encoded within
a URL.

You know that this is a recursive definition. If you follow this thinking 
exactly, you would to have encode "%" to "%25", which already contains a '%',
so you would have to encode it again zo "%2525", this to "%252525" and so on,
leading to an infinitely long string. This can't be the right interpretation.

To get a non-infinitely long result, you have to interprete a "%" as an escape 
character and "%25" as the literal percent sign. If you interprete this way, you 
must not escape an escape sign. If you do that, escaping is impossible.

Why do I report this as bug? Because else there is _no_ clear way where theory 
and practice are consistent in the "transport" of an UTF16-value to the browser 
and back. When I receive "%da%bf" within an URL, the server _cannot_ know wether
this is latin-1 encoded (as browsers frequently do) or wether this is UTF-8 (as 
the standard recently says). To solve the problem, at least for the case when I 
have to pass UTF-16 strings from the server to the browser and back, I encode 
every UTF-16 value which is not within ASCII by "%uHHHH", where H is one hex 
digit. This is unambigous and clear. So when you think that '%' is an escape 
character to escape other characters and "%25" is the literal percent sign, you 
can be happy. My '%' escapes other characters. It's clear that escape characters 
must not be escaped within the same information representation layer. But this 
is exactly what mozilla does.

I may want to learn how I should transport UTF-16 characters in a namespace-safe 
fashion. UTF-8 over %HH is not namespace-safe, because it could be interpreted 
as latin-1. UTF-7 is not namespace-safe either. I do not want my server to make 
guesswork, because guessing can always be wrong. Give me a namespace-safe, 
unambiguous way to encode UTF-16 characters within URLs for all browsers. 
The %uHHHH-way is the only way I know of.

Note: all URLs point only back my server, no other server are involved with 
%uHHHH-URLs.
I also do not know where encoding and escaping the escape character '%' is 
useful, give me an example. :o)
mozilla trys to detect if a string given to it is already escaped or not,
because escaping an already escaped string is bad and over-"un"escaping is too. 

I know, there are certain cornercases where the algorithm failes, but it does
work most of the time and that is the best we can hope. There simply is no way
to do this right all the time. If you want your % characters untouched, then
escape them properly. mozilla provides an escaping mask that forces the escaping
of characters. This mask is used in cases when we know we are dealing with real
% signs. Maybe it can also be used when converting between different charsets. 

I don't think the current behaviour can be changed, it would break tons of
sites. Sorry.
Marking as Invalid as per comments.
Status: UNCONFIRMED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Sorry, but it won't break ton's of sites, because the behaviour of IE, Netscape 
Communicator 4.7 and Opera is "Don't escape a '%', because you won't get rid of 
the '%', it would appear in the resulting '%25'". You see that escaping the 
escape character within the same logical level is silly. There is no single site 
that would break, because that site would break when being viewed with every non 
mozilla-browser, too. Trust me or disprove it. :-)

Here's a patch:

diff -Nur netwerk/base/src/nsURLHelper.cpp.orig netwerk/base/src/nsURLHelper.cpp
--- netwerk/base/src/nsURLHelper.cpp.orig       Fri Dec  8 20:14:56 2000
+++ netwerk/base/src/nsURLHelper.cpp    Fri Dec  8 20:51:36 2000
@@ -121,11 +121,19 @@
           c2[0] = *(src+2);
       unsigned char c = *src++;
 
-      /* if the char has not to be escaped or whatever follows % is 
-         a valid escaped string, just copy the char */
-      if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) &&
-         PL_strpbrk(pc1, CheckHexChars) != 0 &&  
-         PL_strpbrk(pc2, CheckHexChars) != 0)) {
+      /* if the char has not to be escaped or is the '%' char, just copy the
+         char. Why not escaping the escape char '%'? Because
+         (1) escaping the escape char is silly, and 
+         (2) some people use the '%' to get access to some unused namespace
+             to work around the undefined URL-charset problem (RFC1630 tells
+             users to use Latin-1, while RFC2396 recommends UTF-8). To make
+             the URLs received from the server at least for that case
+             unambiguous in that the server wrote the URL without knowing
+             the charset the browser prefers, some servers write
+             %uHHHH-encoded UTF-16 chars into URLs referencing the server
+             itself. Encoding a '%' would break that short-term solution.
+      */
+      if (IS_OK(c) || (c == HEX_ESCAPE && !(forced)) {
                  tempBuffer[tempBufferPos++]=c;
       }
       else 
Status: RESOLVED → UNCONFIRMED
Hardware: PC → All
Resolution: INVALID → ---
Example.  I have a file with a % in its name.  I would like to link to it using
the actual name of the file and let the browser handle the escaping.

If I do that with NS 4.x it breaks: "Your browser sent a request that this
server could not understand."

With Mozilla I can retrieve the file just fine.  There is no reason to require
users to type escape codes into the url entry field if they want to go to such a
file.
Boris,

suppose your filename is "AbC%dEf.html" and this file is within the directory
accessible by "http://justchat4.medium.net:4000/chat/test/".

Will you refer this file with

http://justchat4.medium.net:4000/chat/test/AbC%dEf.html

or with

http://justchat4.medium.net:4000/chat/test/AbC%25dEf.html

?

Try it out, with Netscape 4.7 or Mozilla 0.6, in which cases will you succeed? 
You can try this with every server and every client out there. The behaviour 
will be identical, the first URL would not work, the second URL form would.

<QUOTE>
There is no reason to require users to type escape codes into the url entry 
field if they want to go to such a file.
</QUOTE>

There is a reason. It's the same reason why users should not write quotation 
marks '"', question marks '?', hash marks '#', exclamation marks '!', spaces ' ' 
, tabulators and certain other characters within their URLs. It is because all 
of those characters have a special meaning, regardless of the meaning, it is 
special. If the user does not want to trigger such a special meaning, she|he has 
to avoid to write those characters in literal form. The user may use the URL 
escape mechanism to denote such a special character in its literal meaning.
Here a cleaned version of the patch. The last patch contains a typo (missing 
right parenthesis). The resulting code also will be faster.

diff -Nur netwerk/base/src/nsURLHelper.cpp.orig 
netwerk/base/src/nsURLHelper.cpp
--- nsURLHelper.cpp.orig        Fri Dec  8 20:14:56 2000
+++ nsURLHelper.cpp     Mon Dec 11 22:58:38 2000
@@ -93,7 +93,6 @@
 
     int i = 0;
     char* hexChars = "0123456789ABCDEF";
-    static const char CheckHexChars[] = "0123456789ABCDEFabcdef";
     int len = PL_strlen(str);
     PRBool forced = PR_FALSE;
 
@@ -107,25 +106,23 @@
     char tempBuffer[100];
     unsigned int tempBufferPos = 0;
 
-    char c1[] = " ";
-    char c2[] = " ";
-    char* const pc1 = c1;
-    char* const pc2 = c2;
-
     for (i = 0; i < len; i++)
     {
-      c1[0] = *(src+1);
-      if (*(src+1) == '\0') 
-          c2[0] = '\0';
-      else
-          c2[0] = *(src+2);
       unsigned char c = *src++;
 
-      /* if the char has not to be escaped or whatever follows % is 
-         a valid escaped string, just copy the char */
-      if (IS_OK(c) || (c == HEX_ESCAPE && !(forced) && (pc1) && (pc2) &&
-         PL_strpbrk(pc1, CheckHexChars) != 0 &&  
-         PL_strpbrk(pc2, CheckHexChars) != 0)) {
+      /* if the char has not to be escaped or is the '%' char, just copy the
+         char. Why not escaping the escape char '%'? Because
+         (1) escaping the escape char is silly, and 
+         (2) some people use the '%' to get access to some unused namespace
+             to work around the undefined URL-charset problem (RFC1630 tells
+             users to use Latin-1, while RFC2396 recommends UTF-8). To make
+             the URLs received from the server at least for that case
+             unambiguous in that the server wrote the URL without knowing
+             the charset the browser prefers, some servers write
+             %uHHHH-encoded UTF-16 chars into URLs referencing the server
+             itself. Encoding a '%' would break that short-term solution.
+      */
+      if (IS_OK(c) || (c == HEX_ESCAPE && !(forced))) {
                  tempBuffer[tempBufferPos++]=c;
       }
       else 
Okay Xuan, I'm beginning to see your point. The cases Boris and I were talking
about can be done using the forced-URLmask. That is the mechanism that can be
used to force the escape of the % sign. Sometimes that has to be done, there is
no way around that. There are numerous examples with files that contain % signs
that have to escape the % to preserve the filename. But that mechanism is not
disturbed by your patch, at least not when the file is acessed over the file
selection box or the web location box or similar access methods. Direct access
to an url over the urlbar has to confirm the parsing rules of an url. Sometimes
that includes the usage of the escape char by the user.

We have to go through a massive amount of related closed bug reports and check
with every case. Of special interest are mail adresses that replace the @ with %
and so on. This will take some time and mine on the project is currently very
very limited, so any help is very much appreciated.

I have confirmed the report to make it more visible.
Status: UNCONFIRMED → NEW
Ever confirmed: true
I've checked against some of the old known bugs and couldn't find anything
negative using the patch. cc'ing ftang for some internationalisation/filename
encoding issues. Frank, can you take a look at this from your point and do some
tests with the patch applied?
Hi, folks!

For three months, nothing has been happening.
Is there any plan|timeframe to apply the patch to the main tree? There don't 
seem to be objections (anymore) so far.

I currently have no time to look after this stuff. I'm still running with the
patch for months now and haven't seen any problems. But I'm still not sure about
some of the character encoding aspects of this. 
<quote>
I currently have no time to look after this stuff. I'm still running with the
patch for months now and haven't seen any problems. 
</quote>

Maybe you do not see any problems because this patch makes mozilla behave like 
all other browsers out there. ;-)

Is it possible to apply this patch to the nightly build trees so that we have a 
greater test base?

<quote>
But I'm still not sure about some of the character encoding aspects of this. 
</quote>

What are your uncertainties about? You do not feel good because you leave the 
namespace of %xx (where x is not a hexadecimal character) open|undefined by 
current standards? Maybe I can help solving such problems. :-)
It's simple: I can't check all of the old bugs related with url parsing, some of
them involve special netscape test servers I have no access to, mainly some
internationalisation stuff, charset encoding and conversion, something like
that.

If someone could do some quick tests that there are no regressions in that area
I would agree with you that the patch should be applied to the tree.
mass move, v2.
qa to me.
QA Contact: tever → benc
benc can u verify andreas suggestion? thx
I can run a test suite, but I do not think I have sufficient experience to write
a test suite.

Can someone provide some technical backup for me here?
Target Milestone: --- → mozilla1.0
Keywords: makingtest, qawanted
Blocks: 35209
If the code cleanup in bug 94796 becomes active this patch has to move to
xpcom/io/nsEscape.cpp 
Attachment #20824 - Attachment is obsolete: true
*** Bug 93095 has been marked as a duplicate of this bug. ***
Keywords: patch
It seems some more explanation is necessary to get a r=/sr= on this bug:

What will happen with this patch applied? We will no longer escape a % which is
not a part of an escaped octet until we are forced to do so. Currently we escape
abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2:

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.

So everything seems fine. However it is not necessary to escape the % in this
case since there is no danger of misinterpretation. In fact there seem to be
several applications/cgis that build on not escaping a lonely %. Since there is
no danger we should drop escaping a lonely % in favour of compatibility with
other browsers/cgis. We can still handle urls that escape a lonely %, but we
will not do it ourselfs until forced with the esc_Forced flag.
<cite>
What will happen with this patch applied? We will no longer escape a % which is
not a part of an escaped octet until we are forced to do so. Currently we escape
abc%xx to abc%25xx. This is in conformance with rfc 2396, section 2.4.2:

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI.
</cite>

I think that this is _not_ in conformance with rfc 2396, because this
interpretation is recursive. If you say that it is okay to escape an "%" to
"%25", it must be okay to escape a "%25" to "%2525" and then to "%252525" and so
on. You see that a recursive interpretation of that rule does not work out.

There is an alternative view on that problem. See two representations of a URL,
one "plaintext" and one "ciphertext":

plaintext: {
  protocol: "http";
  host:     "bugzilla.mozilla.org";
  port:     null;
  rel_path: "/tes%test.html"
}

ciphertext: "http://bugzilla.mozilla.org/tes%25test.html"

The rule is - in this interpretation - to represent a "%" of the plaintext by a
"%25" within ciphertext. That is - in my opinion - much clearer than a recursive
interpretation.

Thus, representing a "%" with a "%25" happens only within a transformation from
plaintext to ciphertext.

Now, if you (as a browser) encounter some HTML text

  <FRAME SRC="abc/xy%u123456.html" />

what representation of URL is contained within the attribute "SRC" of tag
"FRAME"? It's ciphertext, not plaintext. Thus, it is not to be transformed into
ciphertext (because it is ciphertext already). And thus, a substitution to
something like "abc/xy%25u123456.html" must not happen.

Yet, it does happen, for a more or less good reason: People sometimes like to
use the ciphertext as plaintext. If they want to reference a folder "my files",
they write "my files" instead of "my%20files" into places where URLs are
requested, thus forming an URL which contains illegal characters (the space mark
' '). Mozilla is forgiving about this and applies the
plainttext-to-ciphertext-encoding onto such URL strings (which are, by format
definition, already ciphertext). The problem is that blindly applying such an
encoding twice results in formally correct but textual wrong URLs, which is, in
my opinion, a bug.

I hope that explanation helps to get a r=|sr=. If further explanation is needed:
just ask. :-)
The definition in rfc2396 is not really recursive because it says "be careful to
not escape a string more than once" or something similar. 

Escaping (nsStdEscape) in mozilla respects this and (until not forced by
esc_Forced) will not escape an already escaped octet. Otherwise an %20 would
also become %2520, %252520, and so on.

I think we should not waste our energies on this interpretation aspect of an rfc
that has already proven to be sometimes misleading (query handling!). I think we
all agree this patch should be applied, not depending on if the current
behaviour is correct according to rfc 2396 or not. 
I warmly agree. :-)
Comment on attachment 46106 [details] [diff] [review]
patch in the new nsEscape world

how about adding some of Xuan's comments from his first patch?  .. or at least reference this 
bug, so casual readers of the code will know why this is being done.
Attachment #46106 - Flags: superreview+
Comment on attachment 54507 [details] [diff] [review]
same patch as before, but including a comment referencing the bug

carryover sr= from darin's review.  also adding r=dougt
Attachment #54507 - Flags: superreview+
Attachment #54507 - Flags: review+
gagan, can you check this in?
Thanks Doug! I can check this in if you want. 
Hmmmm ... unless anyone objects, I will check this in over the weekend.
sounds great.  If you miss, I will check it in on monday.
Assignee: gagan → dougt
Target Milestone: mozilla1.0 → mozilla0.9.6
fix finally checked in. Wohowwww!!!
Status: NEW → RESOLVED
Closed: 24 years ago23 years ago
Resolution: --- → FIXED
*** Bug 194416 has been marked as a duplicate of this bug. ***
Keywords: qawanted
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: