Closed
Bug 909264
Opened 11 years ago
Closed 11 years ago
ASCII control characters stripped from address bar
Categories
(Firefox :: Address Bar, defect)
Tracking
()
RESOLVED
FIXED
Firefox 26
People
(Reporter: firefoxbugreporter, Assigned: jfkthame)
References
Details
(Keywords: regression, Whiteboard: spoof)
Attachments
(5 files, 1 obsolete file)
151.15 KB, image/png
123.94 KB, image/png
55.46 KB, image/png
1.37 KB, patch
1.80 KB, patch (Gavin: review+)
User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1 (Beta/Release)
Build ID: 20120614114901

Steps to reproduce:

I typed a URL like this in the address bar:

http://somedomain.com/%1A/PageName

and I hit enter on it.

Actual results:

My input was changed to:

http://somedomain.com//PageName

The non-printable ASCII control character was removed. This left a URL that is not correct for the requested resource.

Furthermore, for example:

http://somedomain.com/somename%1A/PageName

would turn into:

http://somedomain.com/somename/PageName

which might be a completely different resource on the server and confuse the user.

Expected results:

Non-printable ASCII control characters should remain in the address bar with their URL encoding. This has nothing to do with localization (https://bugzilla.mozilla.org/show_bug.cgi?id=105909), as there is no other standardized representation for them in any locale. Stripping them loses useful information from the address bar, and can even create a conflict where the displayed URL is for a different resource than the displayed page.

I have not tested this with other versions, but this should be easy and quick to test for people using those versions.
Why is your UA displaying Firefox 13?
Component: Untriaged → Location Bar
Flags: needinfo?(firefoxbugreporter)
Reporter | ||
Comment 2•11 years ago
Ok, Firefox is discriminating between control characters. Try the following URL:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F

Firefox will change the address bar to:

https://www.google.com/%09%0A%0B%0C%0D%1C%1D%1E%1F[]

(where [] is a box)

Google will report that the page cannot be found (404) and display the URL of the requested page, containing the typed input.

I see no reason for this discrimination. The stripped control characters, although probably not often used in URLs, are valid in a URL when encoded. For static pages these control characters are probably avoided by most if not all authors, but in web applications they can very much be part of a URL when that URL is rewritten and does not use a query string for arguments.
Reporter | ||
Comment 3•11 years ago
(In reply to Loic from comment #1)
> Why is your UA displaying Firefox 13?

Because that is the version of Firefox on this system. If this issue has been resolved in later versions and version 13 is no longer supported, then that is one thing. If this issue exists in later versions of Firefox, then it is irrelevant that this system runs version 13.
Flags: needinfo?(firefoxbugreporter)
So update to Firefox 23.0.1 and try to repro the issue. Firefox 13 is EOL and not supported anymore.
Reporter | ||
Comment 5•11 years ago
Issues reproduced using Firefox 23.0.1
Reporter | ||
Updated•11 years ago
Version: 13 Branch → 23 Branch
Comment 6•11 years ago
Open the URL from comment #2:
https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F

Regression window (m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/8ae16e346bd0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106015923
Bad:
http://hg.mozilla.org/mozilla-central/rev/fcc32e70c95f
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106042423
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=fcc32e70c95f

Regression window (m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/511078d51f71
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105035122
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/c0b62edd2917
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105041225
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=511078d51f71&tochange=c0b62edd2917

Regressed by: bug 703100
Updated•11 years ago
Blocks: 703100
status-firefox22:
--- → affected
status-firefox23:
--- → affected
status-firefox24:
--- → affected
status-firefox25:
--- → affected
status-firefox26:
--- → affected
status-firefox-esr17:
--- → affected
Keywords: regression
OS: Windows 7 → All
Whiteboard: spoof
Version: 23 Branch → 12 Branch
Updated•11 years ago
Attachment #795410 -
Attachment description: screenshot → screenshot windows7
Comment 7•11 years ago
Updated•11 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Reporter | ||
Comment 8•11 years ago
Thanks Alice0775 White
Assignee | ||
Comment 9•11 years ago
Actually, the old behavior was dependent on the system's installed fonts, as we'd have sent these characters through the normal font-matching process like any other, and rendered them with whatever font we found that included them in its cmap.

In many cases, there'd be no such font, and so we'd draw hexboxes. But if -any- installed font did cover these codes, we'd use that instead - and most likely it would have blank, zero-width glyphs. Or maybe it'd include the characters in the cmap but they'd point to its .notdef glyph - in which case we'd render whatever glyph that is, rather than our own hexbox. So we didn't really control whether or not they'd have any visible representation; it was a lottery, depending on the user's font collection.

As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these characters themselves have no visual or spatial representation", so it really doesn't make sense to even try to font-match and paint them as such. Probably we should %-escape them in the URL bar, just like other "invisible" characters higher up in Unicode.
Assignee | ||
Comment 10•11 years ago
(In reply to firefoxbugreporter from comment #0)
> I typed an URL like this in the address bar:
>
> http://somedomain.com/%1A/PageName
>
> and I hit enter on it.
>
>
> Actual results:
>
> My input was changed to:
>
> http://somedomain.com//PageName
>
> The ASCII control character that is non-printable was removed. It left an
> URL which is not correct for the requested resource.

Actually, the character was not removed; it's still present, as you can tell by using the arrow keys to move through the URL bar one character at a time. It's just that it is an invisible, zero-width character, so you can't see it.

So nothing is being "stripped"; but given that the control characters are expected to be invisible, I do think we should represent them in %-escaped form here.
Reporter | ||
Comment 11•11 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #10)
> (In reply to firefoxbugreporter from comment #0)
>
> > I typed an URL like this in the address bar:
> >
> > http://somedomain.com/%1A/PageName
> >
> > and I hit enter on it.
> >
> >
> > Actual results:
> >
> > My input was changed to:
> >
> > http://somedomain.com//PageName
> >
> > The ASCII control character that is non-printable was removed. It left an
> > URL which is not correct for the requested resource.
>
> Actually, the character was not removed; it's still present, as you can tell
> by moving the arrow keys to move through the URL bar one character at a
> time. It's just that it is an invisible, zero-width character, so you can't
> see it.
>
> So nothing is being "stripped"; but given that the control characters are
> expected to be invisible, I do think we should represent them in %-escaped
> form here.

If you select the entire URL after it has been modified and copy/paste it to a different target (e.g. Notepad on Windows), you get the original back (with %1A). So it's a GUI representation issue; internally, no data is lost from the URL.

When you use the arrow keys and have to hit them twice to move a single position on the screen, that is non-standard behavior on any platform, I think.
Reporter | ||
Comment 12•11 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #9)
> Actually, the old behavior was dependent on the system's installed fonts, as
> we'd have sent these characters through the normal font-matching process
> like any other, and rendered them with whatever font we found that included
> them in its cmap.
>
> In many cases, there'd be no such font, and so we'd draw hexboxes. But if
> -any- installed font did cover these codes, we'd use that instead - and most
> likely it would have blank, zero-width glyphs. Or maybe it'd include the
> characters in the cmap but they'd point to its .notdef glyph - in which case
> we'd render whatever glyph that is, rather than our own hexbox. So we didn't
> really control whether or not they'd have any visible representation; it was
> a lottery, depending on the user's font collection.
>
> As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these
> characters themselves have no visual or spatial representation", so it
> really doesn't make sense to even try to font-match and paint them as such.
> Probably we should %-escape them in the URL bar, just like other "invisible"
> characters higher up in Unicode.

From http://www.ietf.org/rfc/rfc1738.txt

   Octets must be encoded if they have no corresponding graphic
   character within the US-ASCII coded character set, if the use of the
   corresponding character is unsafe, or if the corresponding character
   is reserved for some other interpretation within the particular URL
   scheme.

   No corresponding graphic US-ASCII:

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.

The business of interpreting a URL as Unicode is very weird. When a developer redirects a browser to a URL containing octets in the lower (00-1F) or upper (80-FF) range that are properly escaped as per the specification, Firefox might turn them client-side into some weird Unicode characters, by treating multiple escaped bytes as a single multi-byte Unicode character if the data in the URL happens to match one. Things go really haywire if it matches those bytes to an Asian character on an English-based system, or to a Nordic or French character for someone using an Asian language.

That other browser (which I will not name but call Lord Voldemort) just keeps the URL encoding intact, like it should be.
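
For reference, the rule in the RFC 1738 passage quoted above can be sketched roughly as follows. This is only an illustration of the quoted sentence, operating on raw octets; `encodeNonGraphicOctets` is a hypothetical helper name, and it deliberately ignores the RFC's separate "unsafe" and "reserved" character rules.

```javascript
// Hypothetical helper illustrating the quoted RFC 1738 passage: octets
// 00-1F and 7F (control characters) and 80-FF (not used in US-ASCII)
// are percent-encoded; every other octet is passed through unchanged.
// The RFC's "unsafe" and "reserved" rules are NOT handled here.
function encodeNonGraphicOctets(octets) {
  let out = "";
  for (const b of octets) {
    if (b <= 0x1f || b >= 0x7f) {
      // Two upper-case hex digits, e.g. 0x1a -> "%1A"
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    } else {
      out += String.fromCharCode(b);
    }
  }
  return out;
}

console.log(encodeNonGraphicOctets([0x41, 0x1a, 0x7f, 0x80, 0xff]));
```

Note that this works on bytes, whereas the location bar code discussed in this bug works on an already decoded string.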
Comment 13•11 years ago
So, we need to modify the function losslessDecodeURI according to rfc1738.txt.
http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#2217
Comment 14•11 years ago
The octets 80-FF hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
Assignee | ||
Comment 15•11 years ago
(In reply to Alice0775 White from comment #14)
> Created attachment 795631 [details] [diff] [review]
> fix
>
> The octets 80-FF hexadecimal

You mean 80-9F, I think.
Comment 16•11 years ago
The octets 80-9F hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.

(In reply to Jonathan Kew (:jfkthame) from comment #15)
> (In reply to Alice0775 White from comment #14)
> > Created attachment 795631 [details] [diff] [review]
> > fix
> >
> > The octets 80-FF hexadecimal
>
> You mean 80-9F, I think.
Attachment #795631 -
Attachment is obsolete: true
Reporter | ||
Comment 17•11 years ago
Where does the 80-9F come from?

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.

That says 80-FF needs to be encoded.

US-ASCII is "7 bit ASCII"; everything using the most significant bit (byte value 128 through 255 decimal) in the byte is known as "extended ASCII" and is system/code page dependent. That is why there is no standard representation for it and it needs to be encoded.

(In reply to comment #16 and comment #15)
Assignee | ||
Comment 18•11 years ago
(In reply to firefoxbugreporter from comment #17)
> Where does the 80-9F come from?
>
> URLs are written only with the graphic printable characters of the
> US-ASCII coded character set. The octets 80-FF hexadecimal are not
> used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
> control characters; these must be encoded.
>
>
> That says 80-FF needs to be encoded.
>
> US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> is system/code page dependent. That is why there is no standard
> representation for it and it needs to be encoded.

No; that's talking about 8-bit data, but at this level what we're dealing with is Unicode. The location bar displays an IRI (see http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands more!) are perfectly valid and well-defined printable characters, not an ASCII-only URL.

Perhaps the code would be clearer if the regex were rewritten using \u escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than mixing \x and \u notations, though it would be functionally equivalent.
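
The "functionally equivalent" claim can be checked mechanically. The following is just a quick sketch, not part of any patch; `classesAgree` is a hypothetical helper that compares the mixed-notation class from the "fix v2" attachment with the \u-only class over every UTF-16 code unit.

```javascript
// Character class as written in the "fix v2" patch (mixed \x and \u notation).
const mixed = /[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/;
// The same class written with \u escapes only, as suggested above.
const unicodeOnly = /[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/;

// Hypothetical helper: returns true if both classes accept and reject
// exactly the same single code units in the range 0x0000-0xFFFF.
function classesAgree() {
  for (let cu = 0; cu <= 0xffff; cu++) {
    const ch = String.fromCharCode(cu);
    if (mixed.test(ch) !== unicodeOnly.test(ch)) {
      return false;
    }
  }
  return true;
}

console.log(classesAgree());
```

Both classes stop at U+009F, so U+00A0..00FF are left alone by either spelling.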
Comment 19•11 years ago
Comment on attachment 795671 [details] [diff] [review]
fix v2

> +      // object replacement character) (bug 452979 and bug 909264)
> +      value = value.replace(/[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/g,

Please explain the C0 and C1 controls in the comment.
Comment 20•11 years ago
Could someone who knows this area well please take over?
Assignee | ||
Comment 21•11 years ago
This is functionally the same as Alice0775's patch, just with the comment tweaked to include mention of the control-char blocks, and using \u notation for clarity as per comment above.
Attachment #795798 -
Flags: review?(gavin.sharp)
Assignee | ||
Updated•11 years ago
Assignee: nobody → jfkthame
Reporter | ||
Comment 22•11 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #18)
> (In reply to firefoxbugreporter from comment #17)
> > Where does the 80-9F come from?
> >
> > URLs are written only with the graphic printable characters of the
> > US-ASCII coded character set. The octets 80-FF hexadecimal are not
> > used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
> > control characters; these must be encoded.
> >
> >
> > That says 80-FF needs to be encoded.
> >
> > US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> > value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> > is system/code page dependent. That is why there is no standard
> > representation for it and it needs to be encoded.
>
> No; that's talking about 8-bit data, but at this level what we're dealing
> with is Unicode. The location bar displays an IRI (see
> http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands
> more!) are perfectly valid and well-defined printable characters, not an
> ASCII-only URL.
>
> Perhaps the code would be clearer if the regex were rewritten using \u
> escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than
> mixing \x and \u notations, though it would be functionally equivalent.

That RFC points out the issue:

   Some percent-encodings cannot be interpreted as sequences of UTF-8
   octets.

   (Note: The octet patterns of UTF-8 are highly regular.
   Therefore, there is a very high probability, but no guarantee,
   that percent-encodings that can be interpreted as sequences of
   UTF-8 octets actually originated from UTF-8. For a detailed
   discussion, see [Duerst97].)

It seems that in Firefox, as soon as one percent-encoding cannot be converted to Unicode, the entire URL remains encoded. Is that observation correct? (Except for the issue in this bug report, that is.)

Trying to display them as Unicode might work for many URLs when the web developer used Unicode to begin with, but if s/he didn't, the entire URL should be displayed with its proper encoding. Otherwise you might end up with something like a Unicode smiley face in your address bar that has no valid reason to be there and is not very useful at all.
Reporter | ||
Comment 23•11 years ago
Additional test:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80

(added %80 to the end of the test URL from comment #2).

If you test that URL, then all encodings remain in place. Whatever logic keeps the URL encoded because there is a %80 in it should be applied to the %00 - %1F and %7F ranges as well, I think. That would make more sense than just encoding the ASCII control characters using a Unicode scheme.
Assignee | ||
Comment 24•11 years ago
(In reply to firefoxbugreporter from comment #23)
> Additional test:
>
> https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80
>
>
> (added %80) to the end to the test in comment #2.
>
> If you test that URL then ALL encoding remain in place. Whatever logic made
> the URL being kept because there is a %80 in there should be applied to the
> %00 - %1F and %7F ranges as well I think. That would make more sense than
> just encode the ASCII control characters using an Unicode scheme.

No, that's a different case. In the original test, with control chars in the %00-%1f range, the sequence of octets is well-formed UTF-8 and so it is interpreted as such. That's expected; it's exactly the same mechanism that causes http://example.com/%48%65%6c%6c%6f to be displayed as http://example.com/Hello. (And those octets are indeed being interpreted as UTF-8, not US-ASCII, as you can tell if you include non-ASCII characters in %-encoded UTF-8 form: http://example.com/%48%c3%a9%6c%6c%c3%b6.)

However, when you append %80, the sequence of octets represented by the %-encoded values is no longer well-formed UTF-8, and that's why the URL is displayed in its original %-encoded form rather than making any attempt to interpret the octets as characters in any particular encoding.

So that is a different situation, and the logic keeping the URL in its %-encoded form there does not apply to the original example - that case *is* well-formed UTF-8 that happens to contain some C0 control characters. The solution, then, is to add those control characters to the set that we explicitly %-encode rather than displaying literally.
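
To make the two cases concrete, here is a rough standalone sketch of that logic. It is not the actual browser.js code: `displayForm` is a hypothetical name, and it assumes the standard `decodeURI`, which throws a `URIError` when the %-encoded octets are not well-formed UTF-8.

```javascript
// Hypothetical, simplified stand-in for the location bar's decoding step.
function displayForm(url) {
  let decoded;
  try {
    // Succeeds only if the %-encoded octets form well-formed UTF-8.
    decoded = decodeURI(url);
  } catch (e) {
    // Malformed UTF-8 (e.g. a lone %80): keep the original %-encoded form.
    return url;
  }
  // Well-formed case: re-encode characters that would be invisible
  // (C0/C1 controls, DEL, line/paragraph separator, object replacement).
  return decoded.replace(
    /[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
    encodeURIComponent
  );
}

console.log(displayForm("http://example.com/%48%65%6c%6c%6f"));
console.log(displayForm("http://somedomain.com/%1A/PageName"));
console.log(displayForm("https://www.google.com/%1F%7F%80"));
```

With this sketch, the first URL is shown as http://example.com/Hello, while the other two keep their %-escapes: the second because %1A decodes to a control character that is then re-encoded, the third because the trailing %80 makes the whole sequence malformed.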
Comment 25•11 years ago
Comment on attachment 795798 [details] [diff] [review]
control characters in the location bar should be %-encoded for visibility

>diff --git a/browser/base/content/browser.js b/browser/base/content/browser.js

>+      // Encode invisible characters (C0/C1 controls, line and paragraph separator,
>+      // object replacement character) (bug 452979, bug 909264)
>+      value = value.replace(/[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
>                             encodeURIComponent);

A more precise comment would be "C0/C1 control characters + U+007F (DEL)".

In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of "unprintable characters", should we do so here as well?

r=me with those addressed.
Attachment #795798 -
Flags: review?(gavin.sharp) → review+
Assignee | ||
Comment 26•11 years ago
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #25)
> In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of
> "unprintable characters", should we do so here as well?

Seems reasonable to me. It's not "invisible" in the sense that the control characters (usually) are, but it would be visually indistinguishable from a normal space.

https://hg.mozilla.org/integration/mozilla-inbound/rev/47b8ffe6ecc4
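
As an illustration only, this is roughly what including NBSP would look like. The exact form of the landed change is in the changeset linked above and is not reproduced here; the character class and the name `escapeInvisible` below are assumptions, made by extending the reviewed regex's upper bound from U+009F to U+00A0.

```javascript
// Assumption: the reviewed character class with its C1 range extended by
// one code point so that U+00A0 (NBSP) is covered as well.
const INVISIBLE = /[\u0000-\u001f\u007f-\u00a0\u2028\u2029\ufffc]/g;

// Hypothetical helper: %-encode every matched character. String.replace
// passes the matched substring as the first argument, and
// encodeURIComponent ignores the extra arguments it receives.
function escapeInvisible(value) {
  return value.replace(INVISIBLE, encodeURIComponent);
}

console.log(escapeInvisible("a\u00a0b")); // NBSP becomes %C2%A0
console.log(escapeInvisible("a b"));      // an ordinary space is untouched
```

Since encodeURIComponent emits the UTF-8 bytes of each character, code points above U+007F come out as multi-byte escapes (NBSP as %C2%A0) rather than a single %A0.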
Target Milestone: --- → Firefox 26
https://hg.mozilla.org/mozilla-central/rev/47b8ffe6ecc4
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED