61269 - '%' will get URL-escaped to "%25" if not followed by 2 hexadecimal digits

Reporter

Description


25 years ago

From Bugzilla Helper:
User-Agent: Mozilla/4.75 [en] (Win98; U)
BuildID: 200111908

Mozilla escapes '%' characters in a URL to "%25" if the '%' is not followed by two hexadecimal digits [0-9a-fA-F]. This happens both when the URL is contained in an <A HREF=> attribute and when it is typed directly into the URL bar. This escaping is arbitrary and should be avoided, both for that reason and because it is incompatible with current browsers. Just never encode '%', because '%' already is the encoding character.

Reproducible: Always

Steps to Reproduce:
1. Go to http://justchat4.medium.net:4000/chat/test/URLTest.html
2. Click on the link on that page.

Actual Results: You will see (using Mozilla): Your data supplied is %u20AC test â0AC.

Expected Results: You should see (as with Netscape 4.7 and IE5): Your data supplied is € test â0AC.

You will see that Mozilla escapes "%u20AC" but keeps "%e20AC", so the escaping is very arbitrary. Mozilla should not "know better" how to handle such URLs; it should just pass them 1:1 to the server.

Why? As a transition mechanism, I have to use %uHHHH escapes to denote UTF-16 characters not found in ASCII. Because I cannot know whether the browser's URL charset is Latin-1, UTF-8 or something else (and because URLs don't offer a reliable way to detect it), I need to use this. A '%' that is not followed by a hexadecimal digit is one of the few remaining unused namespaces in URLs. This mechanism is site-specific and free of namespace conflicts, unlike the unreliable Latin-1 vs. UTF-8 transition. As browsers have to pass URLs as given (at least to the site which generated the URL), I consider this a bug.
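For readers who want to reproduce the reported behaviour outside the browser, here is a minimal Python sketch of the rule as the description states it: a '%' is rewritten to "%25" unless two hexadecimal digits follow it. This is not Mozilla's code; the function name `escape_stray_percent` is hypothetical and the regular expression is only an assumption about how the rule could be expressed.

```python
import re

# A '%' that is NOT followed by two hex digits (negative lookahead).
_STRAY_PERCENT = re.compile(r"%(?![0-9a-fA-F]{2})")


def escape_stray_percent(url: str) -> str:
    """Rewrite every stray '%' to '%25', leaving well-formed %HH escapes alone."""
    return _STRAY_PERCENT.sub("%25", url)


if __name__ == "__main__":
    # The test string from the report: "%u" is rewritten, "%e2" is kept.
    print(escape_stray_percent("%u20AC test %e20AC"))
```

Under this rule "%e20AC" survives only because 'e' and '2' both happen to be hexadecimal digits, which is the inconsistency the report complains about.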

Boris Zbarsky [:bzbarsky]

Comment 1


25 years ago

From RFC 1738 (the RFC on Uniform Resource Locators):

> The character "%" is unsafe because it is used for encodings of other characters. [...] All unsafe characters must always be encoded within a URL.

So in cases when the % is not obviously being used to encode another character, it itself must be encoded. Ideally, this would be done by the person writing the URL. That is, you would have in your page: "%25u20AC test %25e20AC" if what you want to get passed to the server is "%u20AC test %e20AC".

I would recommend that this be marked invalid, as Mozilla's behavior is correct per the RFC.
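To illustrate the suggestion above, here is a small Python sketch showing that a single round of percent-decoding on the receiving side turns the pre-encoded form back into the literal text. It uses the standard library's `urllib.parse.unquote` as a stand-in for whatever the server actually does, which is an assumption; the real server in this report is unknown.

```python
from urllib.parse import unquote

# What the page author would write in the link, per the suggestion above.
as_written = "%25u20AC test %25e20AC"

# One decoding pass: each "%25" becomes a literal '%'. The result is not
# decoded a second time, so "%e2" stays as text.
as_received = unquote(as_written)

if __name__ == "__main__":
    print(as_received)
```

Because decoding is applied exactly once, "%25" stands for one literal percent sign and "%2525" stands for the literal text "%25".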

Xuân Baldauf

Reporter

Comment 2


25 years ago

> The character "%" is unsafe because it is used for encodings of other characters. [...] All unsafe characters must always be encoded within a URL.

You know that this is a recursive definition. If you follow this thinking exactly, you would have to encode "%" to "%25", which already contains a '%', so you would have to encode it again to "%2525", this to "%252525" and so on, leading to an infinitely long string. This can't be the right interpretation. To get a result of finite length, you have to interpret "%" as an escape character and "%25" as the literal percent sign. If you interpret it this way, you must not escape an escape sign. If you do that, escaping is impossible.

Why do I report this as a bug? Because otherwise there is _no_ clear way in which theory and practice are consistent in the "transport" of a UTF-16 value to the browser and back. When I receive "%da%bf" within a URL, the server _cannot_ know whether this is Latin-1 encoded (as browsers frequently do) or whether this is UTF-8 (as the standard recently says). To solve the problem, at least for the case when I have to pass UTF-16 strings from the server to the browser and back, I encode every UTF-16 value which is not within ASCII as "%uHHHH", where H is one hex digit. This is unambiguous and clear.

So if you think that '%' is an escape character to escape other characters and "%25" is the literal percent sign, you can be happy. My '%' escapes other characters. It's clear that escape characters must not be escaped within the same information representation layer. But this is exactly what Mozilla does.

I would like to learn how I should transport UTF-16 characters in a namespace-safe fashion. UTF-8 over %HH is not namespace-safe, because it could be interpreted as Latin-1. UTF-7 is not namespace-safe either. I do not want my server to do guesswork, because guessing can always be wrong. Give me a namespace-safe, unambiguous way to encode UTF-16 characters within URLs for all browsers.

The %uHHHH way is the only way I know of. Note: all URLs point only back to my server; no other servers are involved with %uHHHH URLs.
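As a rough illustration of the %uHHHH scheme described in this comment, the following Python sketch encodes every non-ASCII UTF-16 code unit as "%uHHHH" and decodes it again. The function names `encode_u` and `decode_u` are hypothetical, and the handling of a literal '%' (written as "%25") is an assumption, since the comment does not say how the reporter's server treats it. The sketch does no other URL escaping.

```python
import re

# Matches a %uHHHH escape, a literal-percent escape, or any single character.
_TOKEN = re.compile(r"%u([0-9A-Fa-f]{4})|%25|(.)", re.DOTALL)


def _utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def encode_u(text: str) -> str:
    out = []
    for unit in _utf16_units(text):
        if unit == 0x25:            # literal '%' (assumed to be written as %25)
            out.append("%25")
        elif unit < 0x80:           # plain ASCII passes through unchanged
            out.append(chr(unit))
        else:                       # everything else becomes %uHHHH
            out.append("%u{:04X}".format(unit))
    return "".join(out)


def decode_u(escaped: str) -> str:
    units = []
    for match in _TOKEN.finditer(escaped):
        if match.group(1) is not None:
            units.append(int(match.group(1), 16))
        elif match.group(2) is not None:
            units.extend(_utf16_units(match.group(2)))
        else:                       # the "%25" alternative matched
            units.append(0x25)
    data = b"".join(unit.to_bytes(2, "big") for unit in units)
    return data.decode("utf-16-be")


if __name__ == "__main__":
    print(encode_u("\u20ac test"))
    print(decode_u("%u20AC test"))
```

Characters outside the Basic Multilingual Plane come out as two %uHHHH escapes, one per surrogate, which matches the comment's description of encoding UTF-16 values rather than code points.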

Xuân Baldauf

Reporter

Comment 3


25 years ago

I also do not know where encoding and escaping the escape character '%' is useful, give me an example. :o)

patch to switch off escaping of the % sign until forced (from Xuan Baldauf's patch), 25 years ago, Andreas Otte, 1.75 KB, patch
patch in the new nsEscape world, 24 years ago, Andreas Otte, 1.34 KB, patch, darin.moz: superreview+
same patch as before, but including a comment referencing the bug, 24 years ago, Andreas Otte, 1.41 KB, patch, dougt: review+, dougt: superreview+