Closed Bug 105909 Opened 23 years ago Closed 17 years ago

Address bar should show path/query %-unescaped so non-ASCII URLs are readable

Categories

(Core :: Internationalization, defect, P1)

RESOLVED FIXED
mozilla1.9alpha6

People

(Reporter: Dan.Oscarsson, Assigned: smontagu)

Details

(Keywords: intl)

In all places where a URL is displayed to the user, Mozilla should
display the URL using the local character set, if possible.
Only when a character cannot be displayed using the local character set,
should it be displayed using %-encoding.

Two of the most common places where a URL is displayed are
the Location field and the bottom of the window when the mouse is moved over
a link. URLs are also used in dialog windows.

In current Mozilla I have seen URLs displayed using the local character
set, in %-encoding and in UTF-8.
I can enter a URL in the Location field using the local character set
and I can click on a link containing a URL in the character set
of the HTML document, but when I go to the page the URL is
displayed using %-encoding in the Location field.
This is not acceptable for a user. A URL must always be
displayed using the local character set.

To get things to work well, Mozilla should do the following:

All URLs should, if possible, be converted into UTF-8 internally.
The UTF-8 should be normalised using Unicode normalisation form C.
All comparing and handling should be done using this standard
format so that for example cookies work.
If a URL cannot be converted to UTF-8, for example if the
character encoding cannot be recognized and the user has
given no information about which default character set to assume, it will
have to be retained in its original form.

When a URL is displayed to the user (in all places), all of its
characters must be displayed using the local character set
if the character can be displayed in it. If not, %-encoding is
used. If a URL is not in UTF-8 format, the characters should be
displayed using the local character set, if the byte value can
be displayed as a character. For example, if the local character
set is ISO 8859-1 and a non-UTF-8 URL contains character values in the
range 0xa0-0xff, those values should be displayed as ISO 8859-1.
Characters in the range 0x7f-0x9f should be displayed using
%-encoding.
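
As a rough illustration of the display rule above for a URL that is not UTF-8, the sketch below assumes the local character set is ISO 8859-1 and that the URL arrives with its non-ASCII bytes %-encoded. The function name `displayLegacyBytesAsLatin1` is hypothetical, and leaving ASCII escapes such as %20 untouched is an assumption of this sketch, not something the proposal spells out.

```javascript
// Hypothetical helper: show %XX escapes in the range 0xA0-0xFF as ISO 8859-1
// characters; keep every other escape (ASCII and 0x7F-0x9F) %-encoded.
function displayLegacyBytesAsLatin1(url) {
  return url.replace(/%([0-9A-Fa-f]{2})/g, (escape, hex) => {
    const byte = parseInt(hex, 16);
    // ISO 8859-1 byte values map one-to-one onto U+00A0..U+00FF.
    return byte >= 0xA0 ? String.fromCharCode(byte) : escape;
  });
}

console.log(displayLegacyBytesAsLatin1("/Aff%E4rsenheter/")); // /Affärsenheter/
console.log(displayLegacyBytesAsLatin1("/a%85b%7Fc%20d"));    // /a%85b%7Fc%20d
```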

When transmitting URLs to web servers, Mozilla should send them
in a character set supported by the server, if known.
If unknown, the default character set defined by the
user should be used. The URLs need not be transmitted using
%-encoding; if the user prefers, the pure binary format may
be used.
A MacOS user entering a URL should not have the URL sent
using a MacOS character set unless the user has defined this
as the default.

Summary: 
Internally Mozilla should use one format for all URLs so they
can be compared. UCS normalised using Unicode form C and then
encoded using UTF-8 is a good one.

In all places where a user enters or is shown a URL, the URL
must be displayed using the local (client) character set, with
%-encoding used only for those characters that the local character
set does not support.
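
The internal format proposed in this summary (Unicode normalisation form C, then UTF-8) could be sketched roughly as follows. This is only an illustration, not Mozilla code: `canonicalizePath` is a hypothetical name, and the sketch assumes that escapes which are not valid UTF-8 are simply left as they are.

```javascript
// Hypothetical helper: bring a URL path into one comparable form,
// NFC-normalised and %-encoded as UTF-8.
function canonicalizePath(path) {
  // Step 1: decode runs of escaped non-ASCII bytes, but only if the run is
  // valid UTF-8. ASCII escapes such as %2F are never touched.
  const text = path.replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, (run) => {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      return run; // not UTF-8: retain the original form
    }
  });
  // Step 2: normalise to form C, then re-escape non-ASCII text as UTF-8.
  return text
    .normalize("NFC")
    .replace(/[^\x00-\x7F]+/g, (chars) => encodeURIComponent(chars));
}

// "a" followed by a combining diaeresis and a precomposed "ä" compare equal:
console.log(canonicalizePath("/Affa%CC%88rs") === canonicalizePath("/Affärs")); // true
```
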
->ftang
Assignee: yokoyama → ftang
Depends on: 105734
>Mozilla should display the URL using the local character set, if possible.
Why?
Here are the reasons why not:
1. URLs come from the net, not from the "local" system.
2. What is the "local character set"? We don't have such a thing.
3. The IETF is moving to UTF-8 for URLs.

>All URLs should, if possible, be converted into UTF-8 internally.
The problem is how to do so if you don't know the source charset of the URL.
>If a URL cannot be converted to UTF-8, for example if the
>character encoding cannot be recognized and the user have
>given no information of default character set to assume, it will
>have to be retained in original form.
Tell us any place where we COULD convert to UTF-8 but currently don't.

>When a URL is displayed to the user (in all places), a URL
>must have all character displayed using the local character set
>if the character can be displayed in it. 
Why? If the whole URL is UTF-8, why should we display only those characters
that are in the local character set?

>When transmitting URLs to web servers, they should send them
>in a character set supported by the server, if known.
How do you know what charset is supported by the server? In the HTTP protocol?
In the IMAP protocol? In the POP3 protocol? In the FTP protocol? What should we
do if they support more than one charset? Which one should we choose?


>If unknown the default character set defined by the
>user should be used.
Why? How could the user know what charset should be the default?

>A MacOS user entering a URL should not have the URL sent
>using a MacOS character set unless the user have defined this
>as the default.
I don't think we ever did so, not even in Netscape 1.0.

>Internally Mozilla should use one format on all URLs so they
>can be compared - UCS normalised using Unicode form C and the
>encoded using UTF-8 is a good one.
In theory, we already use UTF-8 as much as we can. We cannot do so in file
URLs for backward compatibility reasons.


Keywords: intl
QA Contact: teruko → ruixu
>>Mozilla should display the URL using the local character set, if possible.
>Why?
>The following reason for Why Not?
>1. the URL are from the net, not from the "local"
Some come in through the net, others are entered locally in a client.

But the important thing here is that all characters in a URL that can
be displayed are displayed for the user instead of being shown
as %-encoded ASCII.
For example, when I go to this URL on my web site: /Affärsenheter/
Mozilla displays: /Aff%E4rsenheter/ in the Location field.

This is not right.

>2. What is "local character set", we don't have such thing.
The local character set is the character set the user is using on
the system the client (Mozilla) is run on.
For example, on my Solaris system the local character set
is ISO 8859-1 (all documents and web pages are in that character set) and
on a MacOS 9 system the local character set is an Apple-invented
character set.
(By character set I mean the mapping between characters and code points;
it determines which characters can be displayed as a character.)
On a Unix system the local character set is defined by the current locale.

>3. IETF is moving to use UTF8 for URL

Yes, UTF-8 is a good choice for interoperability, but as UTF-8 is not
always the local character set, you have to convert between UTF-8 and the local
character set to be able to display the characters for the user.


>>All URLs should, if possible, be converted into UTF-8 internally.
>The problem is how to do so if you don't know the source charset of the URL.

Agreed. You have to guess or let the user tell you.

>>When a URL is displayed to the user (in all places), a URL
>>must have all character displayed using the local character set
>>if the character can be displayed in it. 
>Why ? If all the URL is UTF-8 then why we only display those characters in local
>character set ?
Maybe I was unclear. See the example above. If that URL had also included
a Korean character which cannot be displayed by my local character
set ISO 8859-1, that character should be displayed as %-encoded. But all
characters in the URL that can be displayed by ISO 8859-1, must be
displayed without %-encoding.
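
A minimal sketch of that mixed case, assuming the URL is %-encoded UTF-8 and the local character set is ISO 8859-1. The name `displayInLatin1` is hypothetical, and the test for "can be displayed" is reduced here to the code point range U+00A0..U+00FF.

```javascript
// Hypothetical helper: decode UTF-8 escapes, show what ISO 8859-1 can
// represent, and put everything else back into %-encoding.
function displayInLatin1(url) {
  return url.replace(/(?:%[89A-Fa-f][0-9A-Fa-f])+/g, (run) => {
    let text;
    try {
      text = decodeURIComponent(run);
    } catch (e) {
      return run; // not valid UTF-8: leave the escapes alone
    }
    return Array.from(text, (ch) => {
      const codePoint = ch.codePointAt(0);
      const displayable = codePoint >= 0xA0 && codePoint <= 0xFF;
      return displayable ? ch : encodeURIComponent(ch);
    }).join("");
  });
}

// "ä" is shown, the Korean syllable U+D55C stays escaped:
console.log(displayInLatin1("/Aff%C3%A4rs/%ED%95%9C")); // /Affärs/%ED%95%9C
```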

>>When transmitting URLs to web servers, they should send them
>>in a character set supported by the server, if known.
>How do you know what charset is supported by the server? in HTTP protocol? in
>IMAP protocol? in POP3 protocol? in FTP protocol? what should we do if they
>support more than one charset, which one we should choose?
Yes, that is a problem. I have no solution.
HTTP should probably use UTF-8 or ISO 8859-1 unless the user has said
otherwise. IMAP has a protocol for sending non-ASCII. FTP has UTF-8
and the local character set defined to be used. I do not think POP3 has
any support for it.


>>If unknown the default character set defined by the
>>user should be used.
>Why? how could user know what charset should be default ?
Well, for example, I have a MacOS user. As he knows that most servers
he visits use ISO 8859-1, it is much better if the browser sends
ISO 8859-1 instead of the MacOS encoding.


>>A MacOS user entering a URL should not have the URL sent
>>using a MacOS character set unless the user have defined this
>>as the default.
>I don't think we ever do so, not even in Netscape 1.0
Oh yes it does. My web server has a special fix to recognise
URLs sent in MacOS character set just because of this.



so... basically what you are saying is that we should try to display the URL
without % there, right?
The problem is that the URL bar is a text field. If we convert the URL into a
form that can be displayed without %, we may LOSE information when the user
presses RETURN.
>so... basically what you said is we should try to display the URL without % 
>there, right ?

Yes. Would you like an A to be displayed as %41 in a URL?
To have my non-ASCII letters displayed as %-encoded characters is as
unacceptable for me as %-encoding all ASCII letters would be to you.

>the problem is the URL bar is a text field, if we convert it into a for that can
>be display without %, we may LOST information when user click RETURN

Then you have to store that information somewhere else. The user wants to
see URLs in the native character set and to enter them using native characters.
If Mozilla handles all URLs in UTF-8 internally, that is fine, but not when
displaying them to the user.
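
One way to "store that information somewhere else" might look like the sketch below: the field keeps the original escaped URL next to the readable text and only re-encodes when the user has actually edited the text. `LocationField` and its methods are hypothetical names used for illustration; this is not how the Mozilla URL bar is implemented, and an edited URL is still re-encoded lossily here.

```javascript
// Hypothetical model of a location field that remembers the escaped URL.
class LocationField {
  constructor(escapedUrl) {
    this.escapedUrl = escapedUrl;                       // exact original
    this.displayText = LocationField.toDisplay(escapedUrl);
  }

  // Readable form; falls back to the escaped form if it is not valid UTF-8.
  static toDisplay(escapedUrl) {
    try {
      return decodeURI(escapedUrl);
    } catch (e) {
      return escapedUrl;
    }
  }

  // Called when the user presses RETURN with the current text of the field.
  urlToLoad(currentText) {
    return currentText === this.displayText
      ? this.escapedUrl         // untouched: nothing is lost
      : encodeURI(currentText); // edited: best-effort re-encoding
  }
}

const field = new LocationField("http://example.org/Aff%c3%a4rs");
console.log(field.displayText);                  // http://example.org/Affärs
console.log(field.urlToLoad(field.displayText)); // http://example.org/Aff%c3%a4rs
```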

Today my cookies do not work in some places. This is probably because
the path for the cookie is in a different format than the %-encoded
format Mozilla puts into the Location bar. Apparently Mozilla does not
convert all URLs to one standard format when comparing paths.
Reassign this to nhotta. nhotta, should we future this one ?
Assignee: ftang → nhotta
Whenever a URL is decoded without knowing its charset, it may not work.
I think the current behavior of percent-encoding should be the default.
If we decode using the locale charset or UTF-8, the user has to be notified about
possible data loss.
We may change the backend code first, with a backend pref (e.g. encode, decode as
UTF-8, decode as system charset), to experiment.
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.0
>I think the current behavior of percent encoded should be the default.

No, absolutely not.

In old Netscape, when I type a URL: .../Affärs it remains that way.
In Mozilla it is immediately converted into: .../Aff%C3%A4rs.

This is very ugly.
Every time I want to copy and paste the URL into some other place
I have to manually decode the %-encoding to get it right.
I can't give other people such an ugly and unreadable URL.
And what are all those who do not understand %-encoding and hex values
going to do? They will just be irritated and complain.
Change Mozilla so that the characters ABC are also translated into
%-encodings and see how ugly it looks and works.

And cookies do not work! It looks like Mozilla compares the
%-encoded (or UTF-8 encoded) URL with the cookie's path in ISO 8859-1.
There clearly is much more work to be done to get URLs handled
correctly (as a user sees it) in Mozilla.

I think the best possible way of handling URLs is to convert them
into UTF-8 everywhere internally (and leave them as invalid UTF-8
if they cannot be converted into UTF-8), display them as well as
possible when displayed, and send them to servers either as UTF-8
or in a user-defined character encoding.
>In old Netscape, when I type a URL: .../Affärs in remains that way.
I think this only worked for ISO-8859-1.

There is a bug about internal UTF-8 URLs, bug 84032.
Even with internal UTF-8 URLs, we need information about a charset, because the
server's charset may not be UTF-8.

>(and leave them as invalid UTF-8 if it cannot be converted into UTF-8
This does not work for user-typed URLs.

Keywords: mozilla1.0
*** Bug 29698 has been marked as a duplicate of this bug. ***
Since bug 29698, which has been marked as a duplicate of this one, occurs on
all platforms, I'm changing the platform to All.
Severity: normal → major
OS: Solaris → All
Hardware: Sun → All
OK, what if I have a URL with Cyrillic CP-1251 (Windows) chars, but my
local charset is KOI8-R (Unix)? What about CP866 (DOS encoding)?
Huh, they are all 8-bit single-byte charsets; any
string in one encoding can be displayed in another, but you'll get garbage.

> I can't give other people a such ugly and unreadable URL.
Why not? This is just a URL, not Shakespeare's poetry ;-))
What's the use of just reading URLs? Paste 'em into the browser URL bar or
bookmark them, that's it!

> And what are all those who do not understand %-encoding and hexvalues
> going to do? They will just be irritated and complain.
There are URLs containing thousands of characters. Do you read them all? ;-)

>OK, what if I have URL with cyrillic CP-1251 (windows) chars, but my 
>local charset is KOI8-R (unix)? What about CP866 (dos encoding)?
>Huh, they all 8-bit single byte charsets, any
>string in one encoding can be displayed in another, but you'll get garbage.
Yes, you cannot just take a character encoding and display it using some
other encoding without translation. When an HTML page is displayed, it is
translated from its character set into the one used in the web browser.
You will have to do the same with URLs.
If you have a URL in Cyrillic CP-1251 and want to display it on a system
using KOI8-R, you have to convert from Cyrillic CP-1251 to KOI8-R before
displaying it. You have the same problems when comparing URLs.
Today Mozilla fails to handle cookies for me, because when I type in
a URL like: /Tjänster/xxxx, Mozilla displays it as /Tj%E4nster/xxxx,
but the cookie set by the program sets the path of the cookie to: /Tjänster
and the matching routines in Mozilla do not do the comparison correctly.
MS IE makes the same mistake when set to send all URLs as UTF-8.
The best way to do things would be to convert all URLs from their character set
into UTF-8 internally. All comparing should be done on the UTF-8 form.
When displaying, the UTF-8 encoded URL is decoded and all displayable
characters are displayed as themselves, and others using %-encoding.
When sending to the server you either send them as UTF-8 or in the format
the URL had when entered into Mozilla.
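
To make the cookie case concrete, here is a hedged sketch of comparing paths in one common form. The names `pathToText` and `cookiePathMatches` are hypothetical. The sketch assumes the standard `TextEncoder`/`TextDecoder` globals are available and guesses the charset: escaped bytes are read as UTF-8 when they are valid UTF-8, otherwise as ISO 8859-1. A real implementation would need a better source for the charset than this guess.

```javascript
// Hypothetical helper: turn a path into text, decoding only non-ASCII escapes.
function pathToText(path) {
  const encoder = new TextEncoder();
  const bytes = [];
  for (let i = 0; i < path.length; ) {
    const match = /^%([89A-Fa-f][0-9A-Fa-f])/.exec(path.slice(i, i + 3));
    if (match) {
      bytes.push(parseInt(match[1], 16)); // escaped non-ASCII byte
      i += 3;
    } else {
      const ch = String.fromCodePoint(path.codePointAt(i));
      bytes.push(...encoder.encode(ch)); // raw character, as UTF-8
      i += ch.length;
    }
  }
  const buffer = Uint8Array.from(bytes);
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(buffer);
  } catch (e) {
    // Not valid UTF-8: assume ISO 8859-1 (an assumption of this sketch).
    return Array.from(buffer, (b) => String.fromCharCode(b)).join("");
  }
}

// Prefix match on the common (NFC) form, at a "/" boundary.
function cookiePathMatches(cookiePath, requestPath) {
  const c = pathToText(cookiePath).normalize("NFC");
  const r = pathToText(requestPath).normalize("NFC");
  return r === c || (r.startsWith(c) && (c.endsWith("/") || r[c.length] === "/"));
}

console.log(cookiePathMatches("/Tjänster", "/Tj%E4nster/xxxx"));    // true
console.log(cookiePathMatches("/Tjänster", "/Tj%C3%A4nster/xxxx")); // true
```
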
There are drafts from W3C recommending that URLs containing non-ASCII use
UTF-8, but just like HTML pages will be in different character sets, most
URLs will be written in the same character set as their surroundings.
In an HTML page it will be the same character set as the rest of the
text, in a dialog it will be the local character set used to display and enter
text. Mozilla will have to do the conversion between character sets
when URLs are displayed or used in different contexts. Just changing everything
that is not English letters into %-encodings is not right; it will just
make non-English people using Mozilla dislike it and select a browser
that works for international users.

>> I can't give other people a such ugly and unreadable URL.
>Why not? This is just URL, not Shakespeare's poetry ;-))
>What the use of just reading URLs? Paste 'em into browser url bar or bookmark
>them, that's it!
Many URLs will only be used in a context where they are pasted or just
called from a web page. But there are many URLs that are meant to be
easy for people to remember and use.
Should we tell Mozilla users to go to: www.mozilla.org/r%65l%65%69s%65s
to look for releases? Not easy to remember, not easy to type.
No, companies do not want to have to write %-encodings when
advertising.
URLs are not just binary strings; many of them are used to give information
on what they relate to and to make them easy to remember.
Try to turn on %-encoding on all letters in URLs and see how easy
it is to use. Then you will get some feeling for what non-English speaking
people will see if non-ASCII is displayed as %-encoding.

Target Milestone: mozilla1.0 → mozilla1.2
This is one of a series of problems caused by darin's patch for bug 124042. Before
this patch, we used to just escape the local string, and unescape it before
display. After his patch, nsStandardURL::GetFileName,
nsStandardURL::GetFileBaseName (and probably more) are assumed to return a string
in UTF-8 encoding. Now 2 problems arise:
1. When nsStandardURL (or its subclass/superclass) is initialized with a string,
that string is not always converted to UTF-8.
2. When the string is used for display, unescaping is not done.

We must fix both issues. There might be many places which need to be
fixed. Bugs 135854, 136221, and possibly 127476 are just some examples.




This problem causes another one when dragging and dropping the icon from the
URL bar to somewhere such as the desktop.
The created icon doesn't work and displays an error message, because the
file name/folder name it links to can't be understood.
Severity: major → critical
Priority: -- → P1
Comment #16 has been split off into bug 160236.
See also bug 102984
Target Milestone: mozilla1.2alpha → ---
*** Bug 187445 has been marked as a duplicate of this bug. ***
*** Bug 138951 has been marked as a duplicate of this bug. ***
*** Bug 137597 has been marked as a duplicate of this bug. ***
*** Bug 230421 has been marked as a duplicate of this bug. ***
*** Bug 234154 has been marked as a duplicate of this bug. ***
I'm not sure if this is the same bug or not.

If we put this Wikipedia URL into Mozilla

http://th.wikipedia.org/wiki/ปรีดี_พนมยงค์

it will point to the wrong (empty) page
http://th.wikipedia.org/wiki/%C2%BB%C3%83%C3%95%C2%B4%C3%95_%C2%BE%C2%B9%C3%81%C3%82%C2%A7%C2%A4%C3%AC

but Internet Explorer handles it correctly and leads us to
http://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B5%E0%B8%94%E0%B8%B5_%E0%B8%9E%E0%B8%99%E0%B8%A1%E0%B8%A2%E0%B8%87%E0%B8%84%E0%B9%8C

I think that other non-European languages also have this problem.
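
The two URLs above can be reproduced with a short sketch. It rests on an assumption that is not confirmed in this report: that the Thai title reached the browser as TIS-620 bytes. If each byte is taken as an ISO 8859-1 character and then escaped as UTF-8, the result is the wrong URL; if the bytes are first mapped to Thai code points, the result is the URL Internet Explorer loads. The helper names are hypothetical.

```javascript
// TIS-620 bytes of the article title (assumed input; "_" is 0x5F).
const tis620Bytes = [0xBB, 0xC3, 0xD5, 0xB4, 0xD5, 0x5F,
                     0xBE, 0xB9, 0xC1, 0xC2, 0xA7, 0xA4, 0xEC];

// Misreading: every byte is taken as the ISO 8859-1 character of that value.
function readAsLatin1(bytes) {
  return String.fromCharCode(...bytes);
}

// Correct reading: TIS-620 bytes 0xA1..0xFB map to U+0E01..U+0E5B.
function readAsTis620(bytes) {
  return String.fromCharCode(
    ...bytes.map((b) => (b >= 0xA1 ? 0x0E00 + (b - 0xA0) : b))
  );
}

console.log(encodeURIComponent(readAsLatin1(tis620Bytes)));
// %C2%BB%C3%83%C3%95%C2%B4%C3%95_%C2%BE%C2%B9%C3%81%C3%82%C2%A7%C2%A4%C3%AC
console.log(encodeURIComponent(readAsTis620(tis620Bytes)));
// %E0%B8%9B%E0%B8%A3%E0%B8%B5%E0%B8%94%E0%B8%B5_%E0%B8%9E%E0%B8%99%E0%B8%A1%E0%B8%A2%E0%B8%87%E0%B8%84%E0%B9%8C
```
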
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
*** Bug 257198 has been marked as a duplicate of this bug. ***
Assignee: nobody → smontagu
QA Contact: ruixu → amyy
>the problem is the URL bar is a text field, if we convert it into 
>a for that can be display without %, we may LOST information when 
>user click RETURN

This problem is not new. Every word processor and editor has to deal with it. If
the character is in the user's character set, display it properly; otherwise
keep it encoded.
Yes, and other languages have the same problem, for example those using the
Cyrillic alphabet. You can see in this image that only Opera shows the URL in
Cyrillic. I tried this with Opera, Firefox, Netscape and IE. The image is here:
http://djevrek.on.panonnet.net/slike/LOOK.PNG
See http://www.ietf.org/rfc/rfc3987.txt for the Internationalized Resource
Identifier (IRI) recommendations.

Konqueror and Safari currently behave as Sasa describes for URLs that are
encoded in UTF-8 (and thus can be considered IRIs): they get displayed as
unencoded text, while other encodings, e.g. ISO-8859-1, will still show hex codes.

This generally provides a nice UI, although copy-and-paste from the URL bar can 
be annoying when you really wanted the encoded form.

For non-ASCII characters the URL encoding is really, really ugly: 6 bytes per 
character for Cyrillic, 9 bytes per character for CJK expands to long illegible, 
non-memorable URIs. IRIs should if possible be presented to the user in a 
legible way, though this carries with it similar problems (and benefits!) as IDN 
hostnames.
I think we should at least unescape the URL when it's escaping URL-safe low-bit
ASCII characters; i.e. if we wouldn't escape the character if you typed it in,
we should unescape it if it occurs escaped. For example, if you paste
http://www.mozilla.org/products/%66%69%72%65%66%6f%78/ into the address bar it's
really the same as http://www.mozilla.org/products/firefox/ but you can't tell
that. You can use this for example to obscure the actual address of websites
that you send people via IRC/email/etc even after they've loaded the page.

This of course doesn't apply to the hostname portion of the URL where an escaped
character can mean something different from an unescaped one.
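
A small sketch of that minimal rule, applied to the path only. The set of "URL-safe" characters is taken here to be the unreserved set of RFC 3986 (letters, digits, "-", ".", "_", "~"), which is an assumption of this example, and `unescapeUnreserved` is a hypothetical name.

```javascript
// Hypothetical helper: unescape only %XX escapes of unreserved ASCII
// characters; everything else (e.g. %2F, %20, non-ASCII) stays escaped.
function unescapeUnreserved(path) {
  return path.replace(/%([0-9A-Fa-f]{2})/g, (escape, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /^[A-Za-z0-9\-._~]$/.test(ch) ? ch : escape;
  });
}

console.log(unescapeUnreserved("/products/%66%69%72%65%66%6f%78/")); // /products/firefox/
console.log(unescapeUnreserved("/a%2Fb%20c"));                       // /a%2Fb%20c
```
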
*** Bug 309025 has been marked as a duplicate of this bug. ***
Summary: URLs should be displayd using local character set → URLs should be displayed using local character set
Is there any news about fixing or trying to fix this bug? I think this option was badly needed in FF 2.0, but it seems no one shares my opinion. Please just try to add an option for this in about:config for FF 3.0.

Sasa Stefanovic
Workaround: The extension Locationbar² (see http://en.design-noir.de/mozilla/locationbar2/ ) decodes the URL.
Summary: URLs should be displayed using local character set → Address bar should show path/query %-unescaped so non-ASCII URLs are readable
Blocks: 366797
In the patch for bug 366797, URLs are decoded this way:

+            try {
+              val = decodeURIComponent(urlParts[4]);
+            } catch(e) {
+              val = unescape(urlParts[4]);
+            }

Please let me know if that is sane and if it would fix this bug.
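
For anyone weighing the question, this is how the two built-in functions in that snippet behave, wrapped in a hypothetical `decodeForDisplay` helper (the name is not from the patch). `decodeURIComponent` throws a `URIError` for escapes that are not valid UTF-8, and the fallback `unescape` then reads each byte as ISO 8859-1. Note that both also decode reserved characters such as %2F.

```javascript
// Same try/catch pattern as the snippet above, as a standalone function.
function decodeForDisplay(part) {
  try {
    return decodeURIComponent(part); // valid UTF-8 escapes
  } catch (e) {
    return unescape(part);           // fallback: bytes read as ISO 8859-1
  }
}

console.log(decodeForDisplay("%E0%B8%9B")); // ป
console.log(decodeForDisplay("Aff%E4rs"));  // Affärs
console.log(decodeForDisplay("a%2Fb"));     // a/b (a different path than a%2Fb)
```
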
Apologies for duplicating this in Bugzilla, which was kindly fixed by Magnus.  It seems there has not been much discussion in this thread as of late.

Doesn't Firefox's "network.standard-url.encode-utf8" fix this problem, and if so, why is it not set as the default?

http://www.w3.org/International/O-URL-code.html says everything should be using UTF-8.
Oops, I apologize for double posting. I also forgot to mention the setting "network.standard-url.escape-utf8", which should be false by default.
 (In reply to comment #35)
> Doesn't firefox's "network.standard-url.encode-utf8" not fix this problem

No, see http://kb.mozillazine.org/Network.standard-url.encode-utf8

(In reply to comment #36)
> "network.standard-url.escape-utf8", which should be false by default.

That would be an improvement, but not the full solution, imho.
I.e. set it to false and go to http://th.wikipedia.org/wiki/%E0%B8%9B%E0%B8%A3%E0%B8%B5%E0%B8%94%E0%B8%B5_%E0%B8%9E%E0%B8%99%E0%B8%A1%E0%B8%A2%E0%B8%87%E0%B8%84%E0%B9%8C
What you see is that this pref won't cause already-escaped URLs to be unescaped.
I'm not sure if this is the same bug or if it should be filed separately, but Firefox's History records escaped and unescaped URLs as two separate entries. For example, http://en.wikipedia.org/wiki/Zombie_(disambiguation) and http://en.wikipedia.org/wiki/Zombie_%28disambiguation%29 are both listed. Opera doesn't do this and always unescapes characters in both the location bar and history.
No longer blocks: 366797
Depends on: 366797
QA Contact: amyy → i18n
Target Milestone: --- → mozilla1.9alpha6
Fixed as part of bug 366797.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
The fix for bug 366797 was backed out. Please reopen this bug, too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Just to clarify why decoding URLs is important: most wikis and many blogging platforms generate URLs based on the article title, which means there is an increasing number of URLs with non-ASCII characters. And in the case of wikis, one needs to edit the URL from time to time.

As mentioned here before, Konqueror decodes UTF-8 URLs.
<SphinX> http://www.degadget.com/laptops/iphone’s-multi-touch-technology-on-macbooks-in-october
<SphinX> sorry... that apostrof should be like that: http://www.degadget.com/laptops/iphone%E2%80%99s-multi-touch-technology-on-macbooks-in-october
<jruderman> interesting, safari's "feature" of decoding %-escaped URLs breaks pasting into irc [because irc clients don't consider the "smart apostrophe" to be a URL character]
<jruderman> how do you get the escaped url back?
<SphinX> c/p-ed that from the guy who gave me the link in the first place ;-)

We should make sure power users have access to the escaped URL.  Perhaps copying from the address bar should automatically use the escaped version.
(In reply to comment #43)
> Perhaps copying from the address bar should automatically use the escaped
> version.

The patch in bug 366797 does that (or at least tries to), if you select the entire URL. Which build was SphinX testing? Presumably one of the builds that included the fix before it was backed out?
(In reply to comment #44)
> (In reply to comment #43)
> > Perhaps copying from the address bar should automatically use the escaped
> > version.
> 
> The patch in bug 366797 does that (or at least tries to), if you select the
> entire URL.

Indeed, and it always worked for me.

> Which build was SphinX testing? Presumably one of the builds that
> included the fix before it was backed out?

Looks like he wasn't using Firefox at all?! (Jesse mentions Safari.)
(In reply to comment #45)
> > Which build was SphinX testing? Presumably one of the builds that
> > included the fix before it was backed out?
> 
> Looks like he wasn't using Firefox at all?! (Jesse mentions Safari.)

Oh, that makes sense. Glad you already have that case covered :)
Almost 6 years later, still discussing this bug... *sigh*
I guess that, with the exception of text-mode terminals, there are no longer such things as a "display character set" and an "input character set".  Any character can be displayed (at least as a box with a number inside) and entered (at least with some Alt-numpad or XKB compose and dead keys).  If I understand correctly, an application such as Firefox does not need to perform extra encoding of output and input characters.

This means utf8/hex-encoded links can (a) always be decoded when showing them in the URL bar.  On entering links, they (b) can always be encoded into utf8-hex.

My Firefox at work is "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.4) Gecko/20061201 Firefox/2.0.0.4 (Ubuntu-feisty)".  I have not seen (a) utf8-hex decoding in Firefox's location bar yet.

On input (b) in Linux, the location bar seems to unnecessarily convert Unicode input into the LANG charset and then hex-encode the result.  This should be changed to "convert Unicode input to utf8, then hex encode the result".
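
The difference between the two input paths in (b) can be shown with a short sketch. It assumes a LANG charset of ISO 8859-1 for the "current" behaviour, uses the standard `TextEncoder` for UTF-8, and for brevity passes all ASCII bytes through unescaped. The function names are hypothetical.

```javascript
// %-encode every byte >= 0x80; ASCII is passed through (a simplification).
function percentEncode(bytes) {
  return Array.from(bytes, (b) =>
    b < 0x80
      ? String.fromCharCode(b)
      : "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

// Proposed: Unicode input -> UTF-8 bytes -> hex encoding.
function encodeViaUtf8(text) {
  return percentEncode(new TextEncoder().encode(text));
}

// Described behaviour: Unicode input -> LANG charset bytes -> hex encoding.
// Assumes every character is at most U+00FF (ISO 8859-1).
function encodeViaLatin1(text) {
  return percentEncode(Array.from(text, (ch) => ch.charCodeAt(0)));
}

console.log(encodeViaUtf8("/Affärs"));   // /Aff%C3%A4rs
console.log(encodeViaLatin1("/Affärs")); // /Aff%E4rs
```
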
The patch in bug 366797 has been checked in again, so this is FIXED again.
Status: REOPENED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Depends on: 387268
I'm getting %-encoding in the address bar again.
Only UTF-8 addresses are decoded. (Assuming nothing has been backed out silently.)
(In reply to comment #53)
> Only UTF-8 addresses are decoded. (Assuming nothing has been backed out
> silently.)
> 

Because that's the standard set down by w3c.
It appears that decoding every percent-encoded character may result in a non-equivalent URI.  See example 2, on encoding the slash "/" character, in

http://www.w3.org/Addressing/URL/4_URI_Recommentations.html

I guess the existing URI decoding algorithm in the Firefox 3 URL bar is already cautious about this issue.  Judging from some Wikipedia URIs in Firefox 3, it appears to leave even the comma character undecoded.

Other than that extra precaution in the existing algorithm, I don't see any issue in the Firefox 3 URL bar.  I don't know about invisible or similar-looking Unicode characters, though.
How about readable non-ASCII URLs in the visited links drop-down list?

Is a new TT needed, or can this issue be fixed within this TT?
I also think the visited links should be consistent with what is shown in the location bar. Currently they are not readable, and you don't need compatibility with IRC clients there; it is a purely Firefox thing. I hope they will also be shown in a readable form.
No longer depends on: 105734