Closed Bug 909264 Opened 11 years ago Closed 11 years ago

ASCII control characters stripped from address bar

Categories

(Firefox :: Address Bar, defect)

Version: 12 Branch
Hardware: x86_64
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking


Status: RESOLVED FIXED
Target Milestone: Firefox 26
Tracking Status
firefox22 --- affected
firefox23 --- affected
firefox24 --- affected
firefox25 --- affected
firefox26 --- affected
firefox-esr17 --- affected

People

(Reporter: firefoxbugreporter, Assigned: jfkthame)

References

Details

(Keywords: regression, Whiteboard: spoof)

Attachments

(5 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:13.0) Gecko/20100101 Firefox/13.0.1 (Beta/Release)
Build ID: 20120614114901

Steps to reproduce:

I typed an URL like this in the address bar:

http://somedomain.com/%1A/PageName

and I hit enter on it.


Actual results:

My input was changed to:

http://somedomain.com//PageName

The ASCII control character that is non-printable was removed. It left an URL which is not correct for the requested resource. Furthermore, for example:

http://somedomain.com/somename%1A/PageName 

would turn to:

http://somedomain.com/somename/PageName 

which might be a completely different resource on the server and confuse the user.




Expected results:

ASCII control characters that are non-printable should remain in the address bar with their URL encoding. This has nothing to do with localization (https://bugzilla.mozilla.org/show_bug.cgi?id=105909) as there is no other standardized representation for them in any locale. 

Useful information is lost from the address bar by stripping them, and possibly even a conflict is created where the displayed URL refers to a different resource than the displayed page. 

I have not tested this with other versions, but this should be easy and quick to test for people using those versions.
Why is your UA displaying Firefox 13?
Component: Untriaged → Location Bar
Flags: needinfo?(firefoxbugreporter)
Ok, Firefox is discriminating between control characters.

Try the following URL:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F

Firefox will change the address bar to:

https://www.google.com/%09%0A%0B%0C%0D%1C%1D%1E%1F[] (where [] is a box)

Google will report that the page cannot be found (404) and display the URL of the requested page containing the typed input.

I see no reason for this discrimination. The stripped control characters, although probably not often used in URLs, are valid in an URL when encoded. 

For static pages these control characters are probably avoided by most if not all, but with web applications they can be very much part of an URL when that URL is rewritten and does not use a query string for arguments.
(In reply to Loic from comment #1)
> Why is your UA displaying Firefox 13?

Because that is the version of Firefox on this system?

If this issue has been resolved in later versions and there is no support for version 13, then that is one thing. If this issue exists in later versions of Firefox, then the fact that this is version 13 is not relevant at all.
Flags: needinfo?(firefoxbugreporter)
So update to Firefox 23.0.1 and try to repro the issue. Firefox 13 is EOL and not supported anymore.
Attached image repro.23.0.1.png
Issue reproduced using Firefox 23.0.1
Version: 13 Branch → 23 Branch
Attached image screenshot windows7
Open the URL ( https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F ) from comment #2

Regression window(m-c)
Good:
http://hg.mozilla.org/mozilla-central/rev/8ae16e346bd0
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106015923
Bad:
http://hg.mozilla.org/mozilla-central/rev/fcc32e70c95f
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120106042423
Pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8ae16e346bd0&tochange=fcc32e70c95f


Regression window(m-i)
Good:
http://hg.mozilla.org/integration/mozilla-inbound/rev/511078d51f71
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105035122
Bad:
http://hg.mozilla.org/integration/mozilla-inbound/rev/c0b62edd2917
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a1) Gecko/20120105 Firefox/12.0a1 ID:20120105041225
Pushlog:
http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=511078d51f71&tochange=c0b62edd2917

Regressed by: bug 703100
Blocks: 703100
Keywords: regression
OS: Windows 7 → All
Whiteboard: spoof
Version: 23 Branch → 12 Branch
Attachment #795410 - Attachment description: screenshot → screenshot windows7
Attached image screenshot ubuntu
Status: UNCONFIRMED → NEW
Ever confirmed: true
Thanks Alice0775 White
Actually, the old behavior was dependent on the system's installed fonts, as we'd have sent these characters through the normal font-matching process like any other, and rendered them with whatever font we found that included them in its cmap.

In many cases, there'd be no such font, and so we'd draw hexboxes. But if -any- installed font did cover these codes, we'd use that instead - and most likely it would have blank, zero-width glyphs. Or maybe it'd include the characters in the cmap but they'd point to its .notdef glyph - in which case we'd render whatever glyph that is, rather than our own hexbox. So we didn't really control whether or not they'd have any visible representation; it was a lottery, depending on the user's font collection.

As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these characters themselves have no visual or spatial representation", so it really doesn't make sense to even try to font-match and paint them as such. Probably we should %-escape them in the URL bar, just like other "invisible" characters higher up in Unicode.
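
For illustration, JavaScript's built-in encodeURIComponent already produces the %-escaped form for these code points (a minimal console sketch, not the actual browser.js code):

   // Each C0 control character maps to its two-digit hex escape.
   encodeURIComponent("\u001a");   // "%1A"
   encodeURIComponent("\u0000");   // "%00"
   // Printable ASCII is left untouched, so only the otherwise
   // invisible code points gain a visible representation.
   encodeURIComponent("a");        // "a"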
(In reply to firefoxbugreporter from comment #0)

> I typed an URL like this in the address bar:
> 
> http://somedomain.com/%1A/PageName
> 
> and I hit enter on it.
> 
> 
> Actual results:
> 
> My input was changed to:
> 
> http://somedomain.com//PageName
> 
> The ASCII control character that is non-printable was removed. It left an
> URL which is not correct for the requested resource.

Actually, the character was not removed; it's still present, as you can tell by moving the arrow keys to move through the URL bar one character at a time. It's just that it is an invisible, zero-width character, so you can't see it.

So nothing is being "stripped"; but given that the control characters are expected to be invisible, I do think we should represent them in %-escaped form here.
(In reply to Jonathan Kew (:jfkthame) from comment #10)
> (In reply to firefoxbugreporter from comment #0)
> 
> > I typed an URL like this in the address bar:
> > 
> > http://somedomain.com/%1A/PageName
> > 
> > and I hit enter on it.
> > 
> > 
> > Actual results:
> > 
> > My input was changed to:
> > 
> > http://somedomain.com//PageName
> > 
> > The ASCII control character that is non-printable was removed. It left an
> > URL which is not correct for the requested resource.
> 
> Actually, the character was not removed; it's still present, as you can tell
> by moving the arrow keys to move through the URL bar one character at a
> time. It's just that it is an invisible, zero-width character, so you can't
> see it.
> 
> So nothing is being "stripped"; but given that the control characters are
> expected to be invisible, I do think we should represent them in %-escaped
> form here.

If you select the entire URL after it has been modified and copy/paste it to a different target (e.g. Notepad on Windows), you get the original back (with %1A). So it's a GUI representation issue; internally no data is lost from the URL.

When you use the arrow keys and have to hit them twice to move a single position on the screen, that is non-standard behavior on any platform, I think.
(In reply to Jonathan Kew (:jfkthame) from comment #9)
> Actually, the old behavior was dependent on the system's installed fonts, as
> we'd have sent these characters through the normal font-matching process
> like any other, and rendered them with whatever font we found that included
> them in its cmap.
> 
> In many cases, there'd be no such font, and so we'd draw hexboxes. But if
> -any- installed font did cover these codes, we'd use that instead - and most
> likely it would have blank, zero-width glyphs. Or maybe it'd include the
> characters in the cmap but they'd point to its .notdef glyph - in which case
> we'd render whatever glyph that is, rather than our own hexbox. So we didn't
> really control whether or not they'd have any visible representation; it was
> a lottery, depending on the user's font collection.
> 
> As stated in http://en.wikipedia.org/wiki/Unicode_control_characters, "these
> characters themselves have no visual or spatial representation", so it
> really doesn't make sense to even try to font-match and paint them as such.
> Probably we should %-escape them in the URL bar, just like other "invisible"
> characters higher up in Unicode.

From http://www.ietf.org/rfc/rfc1738.txt

   Octets must be encoded if they have no corresponding graphic
   character within the US-ASCII coded character set, if the use of the
   corresponding character is unsafe, or if the corresponding character
   is reserved for some other interpretation within the particular URL
   scheme.

   No corresponding graphic US-ASCII:

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.

The business of interpreting an URL as Unicode is very weird. When a developer redirects a browser to an URL containing ASCII characters in the lower (00-1F) or upper (80-FF) range that are properly escaped as per the specification, Firefox might turn them client-side into some unexpected Unicode characters by treating multiple ASCII bytes as a single multi-byte Unicode character if the data in the URL happens to match such a sequence. 

Things go really haywire if it matches those ASCII bytes to an Asian character on an English-based system, or to a Nordic or French character for someone using an Asian language. 

That other browser (which I will not name but will call Lord Voldemort) just keeps the URL encoding intact, as it should be.
So, we need to modify the losslessDecodeURI function according to RFC 1738.
http://mxr.mozilla.org/mozilla-central/source/browser/base/content/browser.js#2217
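
A minimal sketch of the kind of change being suggested here, assuming losslessDecodeURI keeps its existing structure and simply extends the character class it already uses for other invisible characters (the attached patches below take roughly this shape):

   // Sketch: show C0 controls (U+0000-U+001F), DEL (U+007F) and
   // C1 controls (U+0080-U+009F) in %-escaped form instead of
   // rendering them as invisible, zero-width characters.
   value = value.replace(/[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
                         encodeURIComponent);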
Attached patch fix (obsolete) — Splinter Review
The octets 80-FF hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.
(In reply to Alice0775 White from comment #14)
> Created attachment 795631 [details] [diff] [review]
> fix
> 
> The octets 80-FF hexadecimal

You mean 80-9F, I think.
Attached patch fix v2Splinter Review
The octets 80-9F hexadecimal and the octets 00-1F and 7F hexadecimal represent control characters; these must be encoded.

(In reply to Jonathan Kew (:jfkthame) from comment #15)
> (In reply to Alice0775 White from comment #14)
> > Created attachment 795631 [details] [diff] [review]
> > fix
> > 
> > The octets 80-FF hexadecimal
> 
> You mean 80-9F, I think.
Attachment #795631 - Attachment is obsolete: true
Where does the 80-9F come from?

   URLs are written only with the graphic printable characters of the
   US-ASCII coded character set. The octets 80-FF hexadecimal are not
   used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
   control characters; these must be encoded.


That says 80-FF needs to be encoded.

US-ASCII is "7 bit ASCII", everything using the most significant bit (byte value 128 through 255 decimal) in the byte is known as "extended ASCII" and is system/code page dependent. That is why there is no standard representation for it and it needs to be encoded.

(In reply to comment #16 and comment #15)
(In reply to firefoxbugreporter from comment #17)
> Where does the 80-9F come from?
> 
>    URLs are written only with the graphic printable characters of the
>    US-ASCII coded character set. The octets 80-FF hexadecimal are not
>    used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
>    control characters; these must be encoded.
> 
> 
> That says 80-FF needs to be encoded.
> 
> US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> is system/code page dependent. That is why there is no standard
> representation for it and it needs to be encoded.

No; that's talking about 8-bit data, but at this level what we're dealing with is Unicode. The location bar displays an IRI (see http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands more!) are perfectly valid and well-defined printable characters, not an ASCII-only URL.

Perhaps the code would be clearer if the regex were rewritten using \u escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than mixing \x and \u notations, though it would be functionally equivalent.
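
The two notations are indeed interchangeable; a quick illustrative console check (assumptions only, not part of any patch):

   // \x and \u ranges cover the same code points here.
   var a = /[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/;
   var b = /[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/;
   a.test("\u009f") === b.test("\u009f");   // true (both match a C1 control)
   a.test("\u00a0") === b.test("\u00a0");   // true (neither matches NBSP)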
Comment on attachment 795671 [details] [diff] [review]
fix v2

> +  // object replacement character) (bug 452979 and bug 909264)
> +  value = value.replace(/[\x00-\x1f\x7f\x80-\x9f\u2028\u2029\ufffc]/g,

Please explain the C0 & C1 controls in the comment.
Please, someone who knows this well, take over.
This is functionally the same as Alice0775's patch, just with the comment tweaked to include mention of the control-char blocks, and using \u notation for clarity as per comment above.
Attachment #795798 - Flags: review?(gavin.sharp)
Assignee: nobody → jfkthame
(In reply to Jonathan Kew (:jfkthame) from comment #18)
> (In reply to firefoxbugreporter from comment #17)
> > Where does the 80-9F come from?
> > 
> >    URLs are written only with the graphic printable characters of the
> >    US-ASCII coded character set. The octets 80-FF hexadecimal are not
> >    used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
> >    control characters; these must be encoded.
> > 
> > 
> > That says 80-FF needs to be encoded.
> > 
> > US-ASCII is "7 bit ASCII", everything using the most significant bit (byte
> > value 128 through 255 decimal) in the byte is known as "extended ASCII" and
> > is system/code page dependent. That is why there is no standard
> > representation for it and it needs to be encoded.
> 
> No; that's talking about 8-bit data, but at this level what we're dealing
> with is Unicode. The location bar displays an IRI (see
> http://www.ietf.org/rfc/rfc3987.txt), where U+00A0..00FF (and thousands
> more!) are perfectly valid and well-defined printable characters, not an
> ASCII-only URL.
> 
> Perhaps the code would be clearer if the regex were rewritten using \u
> escapes, e.g. [\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc], rather than
> mixing \x and \u notations, though it would be functionally equivalent.

That RFC points out the issue:        

       Some percent-encodings cannot be interpreted as sequences of
       UTF-8 octets.

       (Note: The octet patterns of UTF-8 are highly regular.
       Therefore, there is a very high probability, but no guarantee,
       that percent-encodings that can be interpreted as sequences of
       UTF-8 octets actually originated from UTF-8.  For a detailed
       discussion, see [Duerst97].)

It seems that in Firefox, as soon as one percent-encoding cannot be converted to Unicode, the entire URL remains encoded? Is that observation correct? (Except for the issue of this bug report, that is.)

Trying to display them in Unicode might work for many when the web developer initially used Unicode, but if s/he didn't, it should display the entire URL with proper encoding. Otherwise you might end up with something like a Unicode smiley face in your address bar that has no valid reason to be there and is not very useful at all.
Additional test:

https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80


(added %80 to the end of the test in comment #2).

If you test that URL, then ALL encodings remain in place. Whatever logic keeps the URL encoded because there is a %80 in there should be applied to the %00 - %1F and %7F ranges as well, I think. That would make more sense than just encoding the ASCII control characters using a Unicode scheme.
(In reply to firefoxbugreporter from comment #23)
> Additional test:
> 
> https://www.google.com/%00%01%02%03%04%05%06%07%08%09%0A%0B%0C%0D%0E%0F%10%11%12%13%14%15%16%17%18%19%1A%1B%1C%1D%1E%1F%7F%80
> 
> 
> (added %80) to the end to the test in comment #2.
> 
> If you test that URL then ALL encoding remain in place. Whatever logic made
> the URL being kept because there is a %80 in there should be applied to the
> %00 - %1F and %7F ranges as well I think. That would make more sense than
> just encode the ASCII control characters using an Unicode scheme.

No, that's a different case.

In the original test, with control chars in the %00-%1f range, the sequence of octets is well-formed UTF-8 and so it is interpreted as such. That's expected; it's exactly the same mechanism that causes http://example.com/%48%65%6c%6c%6f to be displayed as http://example.com/Hello. (And those octets are indeed being interpreted as UTF-8, not US-ASCII, as you can tell if you include non-ASCII characters in %-encoded UTF-8 form: http://example.com/%48%c3%a9%6c%6c%c3%b6.)

However, when you append %80, the sequence of octets represented by the %-encoded values is no longer well-formed UTF-8, and that's why the URL is displayed in its original %-encoded form rather than making any attempt to interpret the octets as characters in any particular encoding.

So that is a different situation, and the logic keeping the URL in its %-encoded form there does not apply to the original example - that case *is* well-formed UTF-8 that happens to contain some C0 control characters. The solution, then, is to add those control characters to the set that we explicitly %-encode rather than displaying literally.
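
This is not the location bar's actual code path, but decodeURIComponent demonstrates the same well-formedness rule from a console (illustrative only):

   // Well-formed UTF-8 octet sequences decode to characters.
   decodeURIComponent("%48%65%6c%6c%6f");   // "Hello"
   decodeURIComponent("%48%c3%a9");         // "Hé"
   // A lone %80 is not valid UTF-8, so decoding fails and the URL
   // stays in its %-encoded form instead.
   try { decodeURIComponent("%80"); } catch (e) { e.name; }   // "URIError"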
Comment on attachment 795798 [details] [diff] [review]
control characters in the location bar should be %-encoded for visibility

>diff --git a/browser/base/content/browser.js b/browser/base/content/browser.js

>+  // Encode invisible characters (C0/C1 controls, line and paragraph separator,
>+  // object replacement character) (bug 452979, bug 909264)
>+  value = value.replace(/[\u0000-\u001f\u007f-\u009f\u2028\u2029\ufffc]/g,
>                         encodeURIComponent);

A more precise comment would be "C0/C1 control characters + U+007F (DEL)".

In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of "unprintable characters", should we do so here as well?

r=me with those addressed.
Attachment #795798 - Flags: review?(gavin.sharp) → review+
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #25)
> In bug 598357 comment 31 we also included U+00A0 (NBSP) in our definition of
> "unprintable characters", should we do so here as well?

Seems reasonable to me. It's not "invisible" in the sense that the control characters (usually) are, but it would be visually indistinguishable from a normal space.
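
If NBSP is folded in as suggested, the character class would presumably just absorb U+00A0 into the existing range (a sketch, not necessarily the exact landed code; see the changeset below):

   // C0 controls, DEL, C1 controls and NBSP, all shown %-escaped.
   value = value.replace(/[\u0000-\u001f\u007f-\u00a0\u2028\u2029\ufffc]/g,
                         encodeURIComponent);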

https://hg.mozilla.org/integration/mozilla-inbound/rev/47b8ffe6ecc4
Target Milestone: --- → Firefox 26