Closed Bug 909264 Opened 11 years ago Closed 11 years ago

ASCII control characters stripped from address bar

Categories

(Firefox :: Address Bar, defect)

Version: 12 Branch
Hardware: x86_64
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking


Status: RESOLVED FIXED
Target Milestone: Firefox 26
Tracking Status
firefox22 --- affected
firefox23 --- affected
firefox24 --- affected
firefox25 --- affected
firefox26 --- affected
firefox-esr17 --- affected

People

(Reporter: firefoxbugreporter, Assigned: jfkthame)

References

Details

(Keywords: regression, Whiteboard: spoof)

Attachments

(5 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1 (Beta/Release)
Build ID: 20120614114901

Steps to reproduce:

I typed an URL like this in the address bar:

http://somedomain.com/%1A/PageName

and I hit enter on it.


Actual results:

My input was changed to:

http://somedomain.com//PageName

The ASCII control character that is non-printable was removed. It left an URL which is not correct for the requested resource. Furthermore, for example:

http://somedomain.com/somename%1A/PageName 

would turn to:

http://somedomain.com/somename/PageName 

which might be a completely different resource on the server and confuse the user.




Expected results:

ASCII control characters that are non-printable should remain in the address bar with their URL encoding. This has nothing to do with localization (https://bugzilla.mozilla.org/show_bug.cgi?id=105909) as there is no other standardized representation for them in any locale. 

Useful information is lost from the address bar by stripping them, and possibly even a conflict is created where the displayed URL refers to a different resource than the displayed page. 

I have not tested this with other versions, but this should be easy and quick to test for people using those versions.
Why is your UA displaying Firefox 13?
Component: Untriaged → Location Bar
Flags: needinfo?(firefoxbugreporter)
Ok, Firefox is discriminating between control characters.

Try the following URL:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F

Firefox will change the address bar to:

https://www.google.com/%09%0A%0B%0C%0D%1C%1D%1E%1F[] (where [] is a box)

Google will report that the page cannot be found (404) and display the URL of the requested page containing the typed input.

I see no reason for this discrimination. The stripped control characters, although probably not often used in URLs, are valid in an URL when encoded. 

For static pages these control characters are probably avoided by most if not all, but with web applications they can be very much part of an URL when that URL is rewritten and does not use a query string for arguments.
(In reply to Loic from comment #1)
> Why is your UA displaying Firefox 13?

Because that is the version of Firefox on this system?

If this issue has been resolved in later versions and there is no support for version 13, then that is one thing. If this issue exists in later versions of Firefox, then the fact that this is version 13 is not relevant at all.
Flags: needinfo?(firefoxbugreporter)
So update to Firefox 23.0.1 and try to repro the issue. Firefox 13 is EOL and not supported anymore.
Attached image repro.23.0.1.png
Issue reproduced using Firefox 23.0.1
Version: 13 Branch → 23 Branch
Attached image screenshot windows7
Open the URL ( https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F ) from comment #2

Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/8ae16e346bd0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106015923
Bad:
http://hg.mozilla.org/mozilla-central/rev/fcc32e70c95f
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106042423
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=fcc32e70c95f


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/511078d51f71
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105035122
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/c0b62edd2917
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105041225
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=511078d51f71&tochange=c0b62edd2917

Regressed by: bug 703100
Blocks: 703100
Keywords: regression
OS: Windows 7 → All
Whiteboard: spoof
Version: 23 Branch → 12 Branch
Attachment #795410 - Attachment description: screenshot → screenshot windows7
Attached image screenshot ubuntu
Status: UNCONFIRMED → NEW
Ever confirmed: true
Thanks Alice0775 White
Actually, the old behavior was dependent on the system's installed fonts, as we'd have sent these characters through the normal font-matching process like any other, and rendered them with whatever font we found that included them in its cmap.

In many cases, there'd be no such font, and so we'd draw hexboxes. But if -any- installed font did cover these codes, we'd use that instead - and most likely it would have blank, zero-width glyphs. Or maybe it'd include the characters in the cmap but they'd point to its .notdef glyph - in which case we'd render whatever glyph that is, rather than our own hexbox. So we didn't really control whether or not they'd have any visible representation; it was a lottery, depending on the user's font collection.

As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these characters themselves have no visual or spatial representation", so it really doesn't make sense to even try to font-match and paint them as such. Probably we should %-escape them in the URL bar, just like other "invisible" characters higher up in Unicode.
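
For illustration, JavaScript's built-in encodeURIComponent already produces the %-escaped form for these code points (a minimal console sketch, not the actual browser.js code):

   // Each C0 control character maps to its two-digit hex escape.
   encodeURIComponent("\u001a");   // "%1A"
   encodeURIComponent("\u0000");   // "%00"
   // Printable ASCII is left untouched, so only the otherwise
   // invisible code points gain a visible representation.
   encodeURIComponent("a");        // "a"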
(In reply to firefoxbugreporter from comment #0)

> I typed an URL like this in the address bar:
> 
> http://somedomain.com/%1A/PageName
> 
> and I hit enter on it.
> 
> 
> Actual results:
> 
> My input was changed to:
> 
> http://somedomain.com//PageName
> 
> The ASCII control character that is non-printable was removed. It left an
> URL which is not correct for the requested resource.

Actually, the character was not removed; it's still present, as you can tell by moving the arrow keys to move through the URL bar one character at a time. It's just that it is an invisible, zero-width character, so you can't see it.

So nothing is being "stripped"; but given that the control characters are expected to be invisible, I do think we should represent them in %-escaped form here.
(In reply to Jonathan Kew (:jfkthame) from comment #10)
> (In reply to firefoxbugreporter from comment #0)
> 
> > I typed an URL like this in the address bar:
> > 
> > http://somedomain.com/%1A/PageName
> > 
> > and I hit enter on it.
> > 
> > 
> > Actual results:
> > 
> > My input was changed to:
> > 
> > http://somedomain.com//PageName
> > 
> > The ASCII control character that is non-printable was removed. It left an
> > URL which is not correct for the requested resource.
> 
> Actually, the character was not removed; it's still present, as you can tell
> by moving the arrow keys to move through the URL bar one character at a
> time. It's just that it is an invisible, zero-width character, so you can't
> see it.
> 
> So nothing is being "stripped"; but given that the control characters are
> expected to be invisible, I do think we should represent them in %-escaped
> form here.

If you select the entire URL after it has been modified and copy/paste it to a different target (e.g. Notepad on Windows), you get the original back (with %1A). So it's a GUI representation issue; internally no data is lost from the URL.

When you use the arrow keys and have to hit them twice to move a single position on the screen, that is non-standard behavior on any platform, I think.
(In reply to Jonathan Kew (:jfkthame) from comment #9)
> Actually, the old behavior was dependent on the system's installed fonts, as
> we'd have sent these characters through the normal font-matching process
> like any other, and rendered them with whatever font we found that included
> them in its cmap.
> 
> In many cases, there'd be no such font, and so we'd draw hexboxes. But if
> -any- installed font did cover these codes, we'd use that instead - and most
> likely it would have blank, zero-width glyphs. Or maybe it'd include the
> characters in the cmap but they'd point to its .notdef glyph - in which case
> we'd render whatever glyph that is, rather than our own hexbox. So we didn't
> really control whether or not they'd have any visible representation; it was
> a lottery, depending on the user's font collection.
> 
> As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these
> characters themselves have no visual or spatial representation", so it
> really doesn't make sense to even try to font-match and paint them as such.
> Probably we should %-escape them in the URL bar, just like other "invisible"
> characters higher up in Unicode.

From http://www.ietf.org/rfc/rfc1738.txt

   Octets must be encoded if they have no corresponding graphic
   character within the US-ASCII coded character set, if the use of the
   corresponding character is unsafe, or if the corresponding character
   is reserved for some other interpretation within the particular URL
   scheme.

   No corresponding graphic US-ASCII:

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.

The business of interpreting an URL as Unicode is very weird. When a developer redirects a browser to an URL containing ASCII characters in the lower (00-1F) or upper (80-FF) range that are properly escaped as per the specification, Firefox might turn them client-side into some unexpected Unicode characters by treating multiple ASCII bytes as a single multi-byte Unicode character if the data in the URL happens to match such a sequence. 

Things go really haywire if it matches those ASCII bytes to an Asian character on an English-based system, or to a Nordic or French character for someone using an Asian language. 

That other browser (which I will not name but will call Lord Voldemort) just keeps the URL encoding intact, as it should be.
So, we need to modify the losslessDecodeURI function according to RFC 1738.
http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#2217
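
A minimal sketch of the kind of change being suggested here, assuming losslessDecodeURI keeps its existing structure and simply extends the character class it already uses for other invisible characters (the attached patches below take roughly this shape):

   // Sketch: show C0 controls (U+0000-U+001F), DEL (U+007F) and
   // C1 controls (U+0080-U+009F) in %-escaped form instead of
   // rendering them as invisible, zero-width characters.
   value = value.replace(/[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
                         encodeURIComponent);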
Attached patch fix (obsolete) — Splinter Review
The octets 80-FF hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
(In reply to Alice0775 White from comment #14)
> Created attachment 795631 [details] [diff] [review]
> fix
> 
> The octets 80-FF hexadecimal

You mean 80-9F, I think.
Attached patch fix v2Splinter Review
The octets 80-9F hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.

(In reply to Jonathan Kew (:jfkthame) from comment #15)
> (In reply to Alice0775 White from comment #14)
> > Created attachment 795631 [details] [diff] [review]
> > fix
> > 
> > The octets 80-FF hexadecimal
> 
> You mean 80-9F, I think.
Attachment #795631 - Attachment is obsolete: true
Where does the 80-9F come from?

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.


That says 80-FF needs to be encoded.

US-ASCII is "7 bit ASCII", everything using the most significant bit (byte value 128 through 255 decimal) in the byte is known as "extended ASCII" and is system/code page dependent. That is why there is no standard representation for it and it needs to be encoded.

(In reply to comment #16 and comment #15)
(In reply to firefoxbugreporter from comment #17)
> Where does the 80-9F come from?
> 
>    URLs are written only with the graphic printable characters of the
>    US-ASCII coded character set. The octets 80-FF hexadecimal are not
>    used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
>    control characters; these must be encoded.
> 
> 
> That says 80-FF needs to be encoded.
> 
> US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> is system/code page dependent. That is why there is no standard
> representation for it and it needs to be encoded.

No; that's talking about 8-bit data, but at this level what we're dealing with is Unicode. The location bar displays an IRI (see http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands more!) are perfectly valid and well-defined printable characters, not an ASCII-only URL.

Perhaps the code would be clearer if the regex were rewritten using \u escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than mixing \x and \u notations, though it would be functionally equivalent.
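
The two notations are indeed interchangeable; a quick illustrative console check (assumptions only, not part of any patch):

   // \x and \u ranges cover the same code points here.
   var a = /[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/;
   var b = /[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/;
   a.test("\u009f") === b.test("\u009f");   // true (both match a C1 control)
   a.test("\u00a0") === b.test("\u00a0");   // true (neither matches NBSP)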
Comment on attachment 795671 [details] [diff] [review]
fix v2

> +  // object replacement character) (bug 452979 and bug 909264)
> +  value = value.replace(/[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/g,

Please explain the C0 & C1 controls in the comment.
Please, someone who knows this well, take over.
This is functionally the same as Alice0775's patch, just with the comment tweaked to include mention of the control-char blocks, and using \u notation for clarity as per comment above.
Attachment #795798 - Flags: review?(gavin.sharp)
Assignee: nobody → jfkthame
(In reply to Jonathan Kew (:jfkthame) from comment #18)
> (In reply to firefoxbugreporter from comment #17)
> > Where does the 80-9F come from?
> > 
> >    URLs are written only with the graphic printable characters of the
> >    US-ASCII coded character set. The octets 80-FF hexadecimal are not
> >    used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
> >    control characters; these must be encoded.
> > 
> > 
> > That says 80-FF needs to be encoded.
> > 
> > US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> > value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> > is system/code page dependent. That is why there is no standard
> > representation for it and it needs to be encoded.
> 
> No; that's talking about 8-bit data, but at this level what we're dealing
> with is Unicode. The location bar displays an IRI (see
> http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands
> more!) are perfectly valid and well-defined printable characters, not an
> ASCII-only URL.
> 
> Perhaps the code would be clearer if the regex were rewritten using \u
> escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than
> mixing \x and \u notations, though it would be functionally equivalent.

That RFC points out the issue:        

       Some percent-encodings cannot be interpreted as sequences of
       UTF-8 octets.

       (Note: The octet patterns of UTF-8 are highly regular.
       Therefore, there is a very high probability, but no guarantee,
       that percent-encodings that can be interpreted as sequences of
       UTF-8 octets actually originated from UTF-8.  For a detailed
       discussion, see [Duerst97].)

It seems that in Firefox, as soon as one percent-encoding cannot be converted to Unicode, the entire URL remains encoded? Is that observation correct? (Except for the issue of this bug report, that is.)

Trying to display them in Unicode might work for many when the web developer initially used Unicode, but if s/he didn't, it should display the entire URL with proper encoding. Otherwise you might end up with something like a Unicode smiley face in your address bar that has no valid reason to be there and is not very useful at all.
Additional test:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80


(added %80 to the end of the test in comment #2).

If you test that URL, then ALL encodings remain in place. Whatever logic keeps the URL encoded because there is a %80 in there should be applied to the %00 - %1F and %7F ranges as well, I think. That would make more sense than just encoding the ASCII control characters using a Unicode scheme.
(In reply to firefoxbugreporter from comment #23)
> Additional test:
> 
> https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80
> 
> 
> (added %80) to the end to the test in comment #2.
> 
> If you test that URL then ALL encoding remain in place. Whatever logic made
> the URL being kept because there is a %80 in there should be applied to the
> %00 - %1F and %7F ranges as well I think. That would make more sense than
> just encode the ASCII control characters using an Unicode scheme.

No, that's a different case.

In the original test, with control chars in the %00-%1f range, the sequence of octets is well-formed UTF-8 and so it is interpreted as such. That's expected; it's exactly the same mechanism that causes http://example.com/%48%65%6c%6c%6f to be displayed as http://example.com/Hello. (And those octets are indeed being interpreted as UTF-8, not US-ASCII, as you can tell if you include non-ASCII characters in %-encoded UTF-8 form: http://example.com/%48%c3%a9%6c%6c%c3%b6.)

However, when you append %80, the sequence of octets represented by the %-encoded values is no longer well-formed UTF-8, and that's why the URL is displayed in its original %-encoded form rather than making any attempt to interpret the octets as characters in any particular encoding.

So that is a different situation, and the logic keeping the URL in its %-encoded form there does not apply to the original example - that case *is* well-formed UTF-8 that happens to contain some C0 control characters. The solution, then, is to add those control characters to the set that we explicitly %-encode rather than displaying literally.
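
This is not the location bar's actual code path, but decodeURIComponent demonstrates the same well-formedness rule from a console (illustrative only):

   // Well-formed UTF-8 octet sequences decode to characters.
   decodeURIComponent("%48%65%6c%6c%6f");   // "Hello"
   decodeURIComponent("%48%c3%a9");         // "Hé"
   // A lone %80 is not valid UTF-8, so decoding fails and the URL
   // stays in its %-encoded form instead.
   try { decodeURIComponent("%80"); } catch (e) { e.name; }   // "URIError"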
Comment on attachment 795798 [details] [diff] [review]
control characters in the location bar should be %-encoded for visibility

>diff --git a/browser/base/content/browser.js b/browser/base/content/browser.js

>+  // Encode invisible characters (C0/C1 controls, line and paragraph separator,
>+  // object replacement character) (bug 452979, bug 909264)
>+  value = value.replace(/[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
>                         encodeURIComponent);

A more precise comment would be "C0/C1 control characters + U+007F (DEL)".

In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of "unprintable characters", should we do so here as well?

r=me with those addressed.
Attachment #795798 - Flags: review?(gavin.sharp) → review+
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #25)
> In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of
> "unprintable characters", should we do so here as well?

Seems reasonable to me. It's not "invisible" in the sense that the control characters (usually) are, but it would be visually indistinguishable from a normal space.
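
If NBSP is folded in as suggested, the character class would presumably just absorb U+00A0 into the existing range (a sketch, not necessarily the exact landed code; see the changeset below):

   // C0 controls, DEL, C1 controls and NBSP, all shown %-escaped.
   value = value.replace(/[\u0000-\u001f\u007f-\u00a0\u2028\u2029\ufffc]/g,
                         encodeURIComponent);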

https://hg.mozilla.org/integration/mozilla-inbound/rev/47b8ffe6ecc4
Target Milestone: --- → Firefox 26