Closed Bug 150376 Opened 22 years ago Closed 6 years ago

Handling of non-ASCII in URIs needs fixing

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED WORKSFORME

People

(Reporter: Dan.Oscarsson, Assigned: smontagu)

References

Details

(Keywords: intl)

Attachments

(1 file, 1 obsolete file)

The handling of URIs with non-ASCII characters fails, or does not
live up to users' needs, in many places in Mozilla.

I have previously filed several bugs (138951, 105909) and there are many
others (for example 140472) related to this. I have tried to
understand how the code works to see how it could be fixed. My
analysis below may be wrong in places, since the code is quite
complex and I may be missing something.

First, some important background on URIs:
Key documents are RFC 2396, draft-w3c-i18n-iri and draft-ietf-idn-uri.
Reading RFC 2396 you will find the basis for how
URIs should be handled:

    A URI is a sequence of characters.

From this it follows that when users interact with URIs, they expect
them to be presented using the character set of the context the URI
appears in. The same is true for URIs in protocols.
This means that every place where the user may see a URI must render
it using the character set of the user's locale.
Examples of such places in Mozilla are: the urlbar, urlbar history,
bookmarks, the cookie manager, history, the window title, and the
status line at the bottom of the window where a URL is displayed when
the mouse hovers above a link.
To be clear: by "presented using the character set of the context" I
mean that you may not simply %-encode all non-ASCII characters and
display that. You must display all characters supported by the
character set of the context.

The basic routine to construct a URI to be displayed for the user is
something like this (a sketch follows below):
- Start with the URI in a well-defined format (like UTF-8).
- Convert all characters that are OK to display for the user into the
  local character set. All characters that cannot be displayed are
  converted into %-encoding.
  The characters that cannot be displayed include the reserved URI
  characters that must stay %-encoded, plus all characters not
  supported by the local character set or not printable in the
  current locale.
  Also note that even if my local character set is UTF-8, that does
  not mean everything can or should be displayed. For example, for me
  Latin-based characters would be OK but not Chinese, even though
  Chinese characters could be rendered (I cannot read or enter
  Chinese).
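
A minimal sketch of this display routine (Python; hypothetical helper
name, operating on one already-parsed path so URI delimiters are left
alone, and using "encodable in the locale charset" as the
displayability test):

    from urllib.parse import quote

    def display_form(utf8_path: str, locale_charset: str = "iso-8859-1") -> str:
        out = []
        for ch in utf8_path:
            if ord(ch) < 0x80:
                out.append(ch)              # ASCII (incl. reserved chars) as-is
                continue
            try:
                ch.encode(locale_charset)   # displayable in this locale?
                out.append(ch)
            except UnicodeEncodeError:
                out.append(quote(ch))       # not displayable: %-encode as UTF-8
        return "".join(out)

    print(display_form("/tjänst.html"))  # /tjänst.html  (fine for a Latin-1 user)
    print(display_form("/中.html"))      # /%E4%B8%AD.html (Chinese stays escaped)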

When using a host name from a URI to call DNS, you have to try both
the proposed IDNA standard (if it gets accepted) and the local
character set; otherwise you will break things that worked before.
Note that there is no %-encoding when calling DNS.
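
A sketch of generating the candidate lookup names (Python; "idna" is
the built-in codec for the ACE form the IDNA draft proposes). Note
that no %-encoding appears at this layer:

    def dns_candidates(hostname: str, local_charset: str = "iso-8859-1"):
        candidates = []
        try:
            candidates.append(hostname.encode("idna"))  # ACE form: xn--...
        except UnicodeError:
            pass
        try:
            candidates.append(hostname.encode(local_charset))  # legacy raw bytes
        except UnicodeEncodeError:
            pass
        return candidates

    print(dns_candidates("räksmörgås.example"))
    # [b'xn--rksmrgs-5wao1o.example', b'r\xe4ksm\xf6rg\xe5s.example']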

When using HTTP you may have to try UTF-8, the local character set,
or a user preference before finding the correct encoding (the current
HTTP standard is not sufficient for international users, so servers
have started using different encodings).

-
To be able to do the above you MUST have a single common way to
encode URIs internally in Mozilla!
This is not so today.

When you have a common encoding (for example UTF-8, with only those
reserved characters that cannot appear unescaped kept %-encoded), you
can easily get everything else to work.

Each component using a URI must itself convert the URI from the
standard internal format into whatever is suitable for the context
the URI is going to be used in.

-
Today there is no common internal format. For example, when a URI is
entered via the urlbar, it looks to me like nsDefaultURIFixup.cpp
converts the URI into the local character set and stores that in the
URI object, while other code paths insert it as UTF-8. You can even
get mixed character sets in one URI.
For example, if the web server does a redirect giving a URI with
UTF-8-encoded characters, that URI is loaded as %-encoded UTF-8 into
the internal URI. When a user whose local character set is ISO 8859-1
then adds a path segment to it in the urlbar, that segment ends up as
%-encoded ISO 8859-1. So now the URI contains both UTF-8- and
ISO 8859-1-encoded characters.
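
The mixed-encoding failure is easy to reproduce outside Mozilla (a
Python illustration of the scenario above):

    from urllib.parse import quote

    # Redirect target arrives %-encoded as UTF-8:
    base = "http://host/" + quote("tjänst", encoding="utf-8")  # .../tj%C3%A4nst
    # The user appends a segment, which gets %-encoded in the local charset:
    mixed = base + "/" + quote("på", encoding="iso-8859-1")    # .../p%E5
    print(mixed)  # http://host/tj%C3%A4nst/p%E5 : two charsets in one URI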

I am aware that in some contexts (like a web server redirect) it is
difficult to know what character set was used. In this case you have
to identify it as far as possible. At a minimum you can assume UTF-8
and, if the string cannot be interpreted as that, assume the local
character set and try again.
When getting a redirect you must try to decode any %-encoded
characters (of those allowed) and do your best to convert the result
into the internal standard form. Otherwise things will go wrong, the
user will often get unacceptable URIs displayed, and as
internationalisation comes into common use, things will break.
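
A sketch of that identification step (Python; assumes ISO 8859-1 as
the local charset, with strict UTF-8 decoding as the validity test):

    from urllib.parse import unquote_to_bytes

    def to_internal(escaped_uri: str, local_charset: str = "iso-8859-1") -> str:
        raw = unquote_to_bytes(escaped_uri)
        try:
            return raw.decode("utf-8")            # valid UTF-8: accept it
        except UnicodeDecodeError:
            try:
                return raw.decode(local_charset)  # else assume local charset
            except UnicodeDecodeError:
                return escaped_uri                # give up: keep it escaped

    print(to_internal("tj%C3%A4nst"))  # 'tjänst' (was UTF-8)
    print(to_internal("tj%E4nst"))     # 'tjänst' (was ISO 8859-1)

(For ISO 8859-1 the inner decode never fails, since every byte value
is defined; the fallback matters for other local charsets.)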

It will take some time before all standards are updated to handle
non-ASCII in URIs, so Mozilla must be prepared to deal with input
that strictly conflicts with a standard or is badly defined by one.
For example, when writing a URI in an HREF in an HTML document,
people will not %-encode non-ASCII characters; they will use the
character set of the document. And when entering host names, they
will not enter ACE-encoded names as in IDNA, nor want to see them as
ACE.


Many of the bugs related to handling of non-ASCII in URIs will be
fixed if you change the handling to the approach I have given above.
I expect it will result in code changes in many places, but I think
it will be worth it. Finally Mozilla will work much more like people
expect when using non-ASCII characters in URIs.
-> nhotta
Assignee: yokoyama → nhotta
The mixed charset situation can happen when the user edits an existing URL.
There are some issues with applying the auto-detection module used for HTML
documents (performance, size, accuracy, UI). But we might be able to check
whether the escaped string is valid UTF-8. The other option is to add a pref
to switch between UTF-8 and the OS charset (bug 129726).
Status: NEW → ASSIGNED
Keywords: intl
Current plan is to address this by bug 129726.
Reporter, would that be acceptable? Or do you have other proposals? If so,
please itemize them.
Bug 129726 does NOT fix the problem.
In my initial description of the problem I also gave the solution.

The basic problem is that a URI needs to be displayed or encoded using
different formats depending on context.

In all cases where a user is shown a URI, it must be shown using the
user's character set. It may only contain %-encoding for those characters
that cannot be displayed in the user's character set. Any ACE-encoded
hostnames in the domain name part of the URI must be decoded and displayed
using the user's character set.

When a URI is sent over a protocol, it needs to be converted into the format
the protocol needs. For example, the domain name in the URI cannot contain
%-encoding when used in DNS; instead it has to be tried using the user's
character set and, once a standard is settled, using that standard.

How to fix it in Mozilla:
All URIs MUST internally be in one single format. Most suitable is
UTF-8.
All parts of Mozilla must, when storing a URI, ensure it is stored in
this single format.
And all parts must, when using a URI, convert it from the single format
into the format needed for their use.

That is what is needed. Today Mozilla internally stores URIs in many
formats, making it difficult for every part of Mozilla to display or
handle them in the correct format for its context.
A pref for sending URIs in UTF-8 does not fix the problem; for example, it
does not fix showing URIs in the correct form for users.
I can see no solution other than the one above that makes adapting a URI
to its context possible (unless you want very complex code repeated in
each Mozilla component).
>How to fix it in Mozilla:
>All URIs MUST internally be in one single format. Most suitable is
>UTF-8.
UTF-8 is used for Mozilla's internal format.

>All parts of Mozilla must, when storing a URI, ensure it is stored in
>this single format.
The format is either raw UTF-8 or percent-escaped (in various charset
encodings). Unescaping URIs is not good, since we may not know their
original charset.

>And all parts must, when using a URI, convert it from the single format
>into the format needed for their use.
I think that is the current implementation. Libnet does not do the
conversion and lets each protocol handle the URI (e.g. convert from UTF-8
to a charset which the server can understand).

Bug 129726 is about forcing the original charset to UTF-8, so any unescaped
URI (like the user's input in the URL location bar) will be treated as UTF-8.

>>All parts of Mozilla must, when storing a URI, ensure it is stored in
>>this single format.
>The format is either raw UTF-8 or percent-escaped (in various charset
>encodings). Unescaping URIs is not good, since we may not know their
>original charset.

But unescaping should be done in all cases where the original character set
can be identified. I looked at some of the code handling the URL location
bar, and instead of storing the URL as UTF-8 it was converted to the
system's local character set (in my case ISO 8859-1), %-encoded, and stored
in the URI object. It should have been stored as raw UTF-8 instead. I have
not checked other code, but for example if a URL is found in an HTML
document using ISO 8859-1, the URL should be converted into UTF-8 before
being stored in the URI object.

Only when the URL is retrieved from the URI object, to be displayed or sent
through a protocol, should it be converted as needed.
Also, especially when displaying a URL to the user, unescaping must be done
as far as possible. An incoming %-escaped UTF-8 URL or ISO 8859-1 URL
needs to be unescaped as far as possible before being stored internally as
raw UTF-8.


>Bug 129726 is about forcing the original charset to UTF-8, so any unescaped
>URI (like the user's input in the URL location bar) will be treated as UTF-8.
So it is not just about sending the URL. OK, that will probably fix some of
the problems I have.
You might add a way to register the preferred character set of an HTTP
server, so the user can define it per server in prefs (or somewhere), just
like you can define images to block. Then, if a URL arriving from such a
server does not look like UTF-8, it could be assumed to be in the defined
character set and converted into UTF-8 internally. And when sending requests
to that server, the URL can be converted from UTF-8 to the character set
that server wants. That would help in handling many servers during the slow
transition to UTF-8-encoded URLs.
> but for example if a URL is found in an HTML document using ISO 8859-1,
> the URL should be converted into UTF-8 before being stored in the URI
> object.
I agree. This is not currently done. We could match the document's URI
against the string in the URL location bar and then assume its charset. That
is useful when the user edits and modifies the string in the location bar.
The user can also just type a new URL (I assume this is more common). In
that case, the current behavior of using the system default is reasonable.


> Also, especially when displaying a URL to the user, unescaping must be done
> as far as possible.
I agree. I think there are cases where this is not done properly.

> An incoming %-escaped UTF-8 URL or ISO 8859-1 URL
> needs to be unescaped as far as possible before being stored internally as
> raw UTF-8.
I don't think there is a safe way to unescape an already-escaped URL, except
in cases where the protocol defines the charset (e.g. to be always UTF-8).
An HTML document may contain escaped URLs whose charset differs from the
charset of the document.


> You might add a way to register the preferred character set of an HTTP
> server
There are cases where there is more than one possible charset for a language
(e.g. Japanese). In that case, the user cannot really choose one charset.
In the other case (one charset per user's locale), the preferred charset is
usually the system default; if not, then I think it is UTF-8. Toggling
between UTF-8 and the system default is what bug 129726 is about.



>The user can also just type a new URL (I assume this is more common). In
>that case, the current behavior of using the system default is reasonable.
If all URLs are to be stored using UTF-8 internally, then all URLs entered
via the URL bar should be stored as UTF-8. But when sent to an HTTP server,
a URL could be converted to the system default (unless the user's preference
says otherwise).


>> An incoming %-escaped UTF-8 URL or ISO 8859-1 URL
>> needs to be unescaped as far as possible before being stored internally as
>> raw UTF-8.
>I don't think there is a safe way to unescape an already-escaped URL, except
>in cases where the protocol defines the charset (e.g. to be always UTF-8).
>An HTML document may contain escaped URLs whose charset differs from the
>charset of the document.
Yes, it is not always easy. There are some comments on this in
draft-duerst-iri-01.txt. When unescaping a URL you should try UTF-8 and, if
that fails, try the system default; if that fails too, leave it escaped.


>> You might add a way to register the preferred character set of an HTTP
>> server
>There are cases where there is more than one possible charset for a language
>(e.g. Japanese). In that case, the user cannot really choose one charset.
>In the other case (one charset per user's locale), the preferred charset is
>usually the system default; if not, then I think it is UTF-8. Toggling
>between UTF-8 and the system default is what bug 129726 is about.
I guess most HTTP servers use one character set when handling URLs. Some more
advanced ones will accept both UTF-8 and one local character set.
The recommendation in the IETF/W3C drafts is UTF-8, but it will take some
time before that is fixed everywhere. To help the transition, I think a way
to say "for server xxx.yy.zz use character set zzzz in URLs" would ease the
way. The code that does HTTP queries could just look up this database and,
if the hostname matches, convert the URL from UTF-8 to the defined character
set. This way you would not have just the two cases of UTF-8 or the system
default; the user could define special cases in addition to the two default
cases.
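
A sketch of that lookup (Python; hypothetical table name, in real code
this would live in the preferences system):

    from urllib.parse import quote

    SERVER_CHARSETS = {"legacy.example.se": "iso-8859-1"}  # user-defined entries

    def wire_path(host: str, utf8_path: str) -> str:
        charset = SERVER_CHARSETS.get(host, "utf-8")  # default: UTF-8
        return quote(utf8_path, safe="/", encoding=charset)

    print(wire_path("legacy.example.se", "/tjänst"))  # /tj%E4nst
    print(wire_path("modern.example", "/tjänst"))     # /tj%C3%A4nst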
> If all URLs are to be stored using UTF-8 internally, then all URLs entered
> via the URL bar should be stored as UTF-8.
That is the current implementation. There is a field to store a charset,
which is used for the later conversion.

> When unescaping a URL you should try UTF-8 and, if that fails, try the
> system default; if that fails too, leave it escaped.
Is that really safe?
For example, how can we determine that 0xC2 0xA0 is one character instead of
two (0xC2, 0xA0)?
I think we could do that for the UI only, but would not want to unescape and
possibly lose data.


About storing charsets per server: I think that is good, and it would not
conflict with having the option to send as UTF-8 (we can implement the UTF-8
option first).
The server-based charset idea is nice. Would it have UI, or would it be done
automatically in the backend? The other question is when the user needs it.
Is it when the user types a URL into the location bar? If the URL is new,
then the default (system charset or UTF-8) has to be used anyway. Note that
clicking links in a document automatically uses the document's charset.
>> If all URLs are to be stored using UTF-8 internally, then all URLs entered
>> via the URL bar should be stored as UTF-8.
>That is the current implementation. There is a field to store a charset,
>which is used for the later conversion.
Yes, I have seen that, but when I looked at the code (for 1.0) there was code
that converted URLs entered through the URL bar into the system default
before storing them in the URI object (instead of storing them as UTF-8 and
just recording the system default in the charset field).

>> When unescaping a URL you should try UTF-8 and, if that fails, try the
>> system default; if that fails too, leave it escaped.
>Is that really safe?
>For example, how can we determine that 0xC2 0xA0 is one character instead of
>two (0xC2, 0xA0)?
If you look in the IRI draft there are some comments on that. For most
character sets and most text, a sequence of characters will not look like
UTF-8. So first try UTF-8 and, if that fails, try the system default (OK if
the codes are displayable).
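
A concrete illustration of both the ambiguity and why the UTF-8-first
test works in practice (Python):

    raw = bytes([0xC2, 0xA0])
    print(raw.decode("utf-8"))       # '\xa0': ONE char (NO-BREAK SPACE)
    print(raw.decode("iso-8859-1"))  # 'Â\xa0': TWO chars
    # Both decodings succeed, so 0xC2 0xA0 really is ambiguous. But typical
    # legacy-charset text is not valid UTF-8 at all, so the test rarely lies:
    try:
        "tjänst".encode("iso-8859-1").decode("utf-8")
    except UnicodeDecodeError:
        print("Latin-1 'tjänst' is not valid UTF-8")  # the common case

As the draft puts it: a very high probability, but no guarantee.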

>I think we could do that for the UI only, but would not want to unescape and
>possibly lose data.
Yes, the most important places are all those where a user may see a URL.
Though you should probably always try assuming UTF-8 and see if that works,
especially since, while we transition to UTF-8, some servers will respond
with raw UTF-8 and others with %-encoded UTF-8, and the %-encoded UTF-8 URLs
should be unescaped so they compare equal to the raw UTF-8 URLs.
I can give you an example where "send UTF-8" fails. MS IE has the option
"always send URLs as UTF-8". My HTTP server uses ISO 8859-1 as its default
character set but recognises UTF-8. So when MS IE sends a URL as UTF-8 my
server understands it, but when pages then contain ISO 8859-1-encoded URLs,
or redirects use ISO 8859-1, MS IE can follow them with no problem but fails
to send cookies. I expect this is because MS IE fails to match its
UTF-8-converted URLs (from the system default ISO 8859-1) against my
server's ISO 8859-1 URLs. Had MS IE converted all URLs (even those received
from the HTTP server) into UTF-8, matching of URLs would have worked.
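
A sketch of the matching rule that would have saved the cookies:
normalize both URLs to one internal form before comparing (the
UTF-8-first heuristic again, assuming ISO 8859-1 as the local charset):

    from urllib.parse import unquote_to_bytes

    def canonical(url: str) -> str:
        raw = unquote_to_bytes(url)
        try:
            return raw.decode("utf-8")        # valid UTF-8 wins
        except UnicodeDecodeError:
            return raw.decode("iso-8859-1")   # assumed local charset

    a = "http://host/tj%C3%A4nst"  # UTF-8, %-escaped
    b = "http://host/tj%E4nst"     # ISO 8859-1, %-escaped
    print(canonical(a) == canonical(b))  # True: both mean /tjänst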


>About storing charsets per server: I think that is good, and it would not
>conflict with having the option to send as UTF-8 (we can implement the
>UTF-8 option first).
>The server-based charset idea is nice. Would it have UI, or would it be
>done automatically in the backend? The other question is when the user
>needs it. Is it when the user types a URL into the location bar? If the URL
>is new, then the default (system charset or UTF-8) has to be used anyway.
>Note that clicking links in a document automatically uses the document's
>charset.
A new URL entered in the location bar should be stored internally as UTF-8
(possibly with a field saying "use system default when transmitting"), and
a link in a document must be converted from the document's charset into
UTF-8 before it is stored internally.
Both kinds of URL should, when transmitted to an HTTP server, be converted
to the charset that server wants. Just because you get a document using
ISO 8859-1 does not mean that the server wants that charset.
If you have a stored charset for the HTTP server, convert to it. If it is
unknown, I would recommend trying UTF-8 first and falling back on the
system/document charset if UTF-8 fails, before giving up.
It is not the user who needs it, it is the HTTP software: it can be used
both when sending and receiving URLs to know what that server's default
charset is. The charset per server needs to be definable through a UI by
the user. Mozilla might also guess the charset used by a server by
identifying the charset of URLs received from that server, and use that as
a fallback.
A stored server charset is a mechanism to ease the transition from local
charsets to UTF-8. I am sure that as IRIs become popular we will have many
more servers that, just like browsers, will have to cope with some clients
using UTF-8 and some using a local charset.
This is the code that sets a charset for the URL location bar. It does not
perform conversion, though.
http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDefaultURIFixup.cpp#120

"Send as UTF-8" forces the use of UTF-8, so it is expected to break raw
8-bit non-UTF-8 URLs.

> Just because you get a document using ISO 8859-1
> does not mean that the server wants that charset.
That is why we use the document charset only when the URL in the document is
not escaped.

As for the per-server charset UI, the issue is that the user does not know
what to set. This means the default would be used anyway, and I am not sure
about the benefit of storing the info per server when the default is used
anyway.
If this is mostly to support non-ASCII URIs in the location bar, then I
think it would be for matching. You can file a separate bug for this feature
and keep this bug for the internal URI format issue.


>This is the code that sets a charset for the URL location bar. It does not
>perform conversion, though.
>http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDefaultURIFixup.cpp#120

At least previously, calling NS_NewURI with a defined charset resulted in
NS_NewURI converting the UTF-8 string into the defined charset, so the URI
internally had the system charset instead of UTF-8. I have not checked
whether the URI code now stores only the charset name, without converting
to it.

>> Just because you get a document using ISO 8859-1
>> does not mean that the server wants that charset.
>That is why we use the document charset only when the URL in the document
>is not escaped.
I wonder what you mean by this. If you read draft-duerst-iri-01.txt on this,
you see that a URL shall be written using the document charset, but Mozilla
must convert it into UTF-8 when sending the URL to the HTTP server. You
cannot send it in the document charset (unless the HTTP server wants that
charset).

>As for the per-server charset UI, the issue is that the user does not know
>what to set. This means the default would be used anyway, and I am not sure
>about the benefit of storing the info per server when the default is used
>anyway.
Yes, you are probably right that few users will use it (at least in the
beginning). We can wait and see how many problems people get when UTF-8 URLs
start being used, and introduce it later.
> I wonder what you mean by this. If you read draft-duerst-iri-01.txt on
> this, you see that a URL shall be written using the document charset, but
> Mozilla must convert it into UTF-8 when sending the URL to the HTTP server.
> You cannot send it in the document charset (unless the HTTP server wants
> that charset).
I think most existing documents' URIs are encoded using a document charset
(escaped or unescaped).
Sending all of those URIs as UTF-8 would break links in those documents.

About unescaping: the draft does not guarantee the correctness of assuming
UTF-8 when converting a URI to an IRI (section 3.2).
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
I think the client application can choose not to unescape already-escaped
URIs, to avoid possible data loss. And note that most existing URIs are
non-UTF-8. That is why I think we should unescape URIs only for the UI.

>        b.  Some escape sequences cannot be interpreted as sequences of
>           UTF-8 octets.
> 
>           (Note: Due to the regularities in the octet patterns of UTF-8,
>           there is a very high probability, but no guarantee, that escape
>           sequences that can be interpreted as sequences of UTF-8 octets
>           actually originated from UTF-8.  For a detailed discussion, see
>           [Duer97].)



>I think most existing documents' URIs are encoded using a document charset
>(escaped or unescaped).
>Sending all of those URIs as UTF-8 would break links in those documents.
Escaped links should probably be sent as they are, but unescaped ones should
internally be converted to UTF-8, and should be sent to the HTTP server as
UTF-8 (retried in the document charset on failure, to avoid breaking links
to servers not yet upgraded), if you are going to follow the IRI draft.

Do you think the "send URLs as UTF-8" setting should mean only URLs entered
in the URL bar? It should mean all URLs. As a middle way you could say:
those entered through the UI, plus those in documents whose charset is the
same as the system default. Though to best promote UTF-8 as the standard, I
think you should send UTF-8 first and, if that fails, fall back to retrying
with the system or document charset.


>About unescaping: the draft does not guarantee the correctness of assuming
>UTF-8 when converting a URI to an IRI (section 3.2).
>http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
>I think the client application can choose not to unescape already-escaped
>URIs, to avoid possible data loss. And note that most existing URIs are
>non-UTF-8. That is why I think we should unescape URIs only for the UI.
The most important step is to unescape for the UI. If you can get that in,
it will be great and satisfy most people. By UI I mean all places where the
user sees a URL, for example the URL bar, the status bar at the bottom of
the window, history, and bookmarks.
It seems to me that it's not just |spec| of nsIURI but also |originCharset|
(bug 127282) that is handled a bit inconsistently. Alternatively, I could
say that it's not always set, per my interpretation of the nsIURI spec.

http://lxr.mozilla.org/mozilla/source/netwerk/base/public/nsIURI.idl#222

According to my interpretation, it should _always_ (i.e. even after it's
converted to UTF-8 and url-escaped) be the charset of the document the
nsIURI comes from. However, oftentimes it's just left empty. This happens
when NS_NewURI is invoked without originCharset, or when
nsIURIFixup->createFixupURI is used to create an nsIURI.

http://lxr.mozilla.org/seamonkey/source/docshell/base/nsIURIFixup.idl#69

nsIURIFixup may need a new method, createFixupURIwithCharset. One of my
patches (not uploaded) to bug 199237 does that, but I didn't take that path
for bug 199237 because I found a simpler way to fix the bug. In addition, we
may also have to add a new method, loadURIwithCharset, to nsIWebNavigation
(or modify loadURI to take a new parameter) because that's the primary path
taken in xpfe to load a URI (see my obsolete patch to bug 199237, attachment
126982 [details] [diff] [review]).

Related to this is bug 205682.

nsIURI is now frozen, so it is probably too late... It might have been
better if we had two additional attributes in the nsIURI interface: rawSpec
(nsACString) and converted (boolean).

|rawSpec| stores URI specs (const nsACString of unknown/unspecified charset)
that we get from remote documents, to which no further processing is applied
(i.e. neither url-escaping nor charset conversion). When we make a request
to a remote server (especially the same server this URI was obtained from in
the first place: see bug 205682, bug 127282 comment #33 and bug 127282
comment #73), we try this first, before |spec|. The assumption is that
|spec| is lazily filled in, after converting to UTF-8 and url-escaping (if
necessary), on demand, at which point |converted| is also set to true.

We could do without |converted| by making an empty |spec| mean 'not yet
converted', but overloading an attribute that way does not always seem to
work.
OS: SunOS → All
Hardware: Sun → All
> This happens when NS_NewURI is invoked without originCharset 

Any time this happens when the resulting nsIURI is expected to be used anywhere
outside mozilla internals, that's a bug.
jshin:

>nsIURI is now frozen, so it is probably too late... It might have been
>better if we had two additional attributes in the nsIURI interface: rawSpec
>(nsACString) and converted (boolean).
>
>|rawSpec| stores URI specs (const nsACString of unknown/unspecified
>charset) that we get from remote documents, to which no further processing
>is applied (i.e. neither url-escaping nor charset conversion). When we make
>a request to a remote server (especially the same server this URI was
>obtained from in the first place: see bug 205682, bug 127282 comment #33
>and bug 127282 comment #73), we try this first, before |spec|. The
>assumption is that |spec| is lazily filled in, after converting to UTF-8
>and url-escaping (if necessary), on demand, at which point |converted| is
>also set to true.

remember, the URI string as extracted from a remote document might contain
straight Unicode characters.  the point of nsIURI::spec is to preserve the
original URI string with Unicode code points left intact (no escaping).
nsIURI::asciiSpec is meant to be the converted URI string that is suitable
for sending to a server (i.e., a URI string that is RFC 2396 compliant).
currently, however, nsIURI::spec is nearly equivalent to nsIURI::asciiSpec.
this is mainly for historical reasons, because many consumers of nsIURI are
not able to deal with non-ASCII URI strings.  eventually, we should be able
to change the behavior of nsStandardURL such that nsIURI::spec returns the
raw URI string.

in other words, i don't think we need any new nsIURI attributes ;-)
> This happens when NS_NewURI is invoked without originCharset

> Any time this happens when the resulting nsIURI is expected to be
> used anywhere outside mozilla internals, that's a bug.

The primary source of this 'bug' is xpfe code. For instance, see
browser.xml:

http://lxr.mozilla.org/seamonkey/source/xpfe/global/resources/content/bindings/browser.xml#117

or

http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/contentAreaUtils.js#102

In some cases, |aReferrerURI| or |referrer| may be consulted to extract
|originCharset| for a new |nsIURI|. Judging by my attempt to make use of it
in bug 199237, neither |aReferrerURI| nor |referrer| has anything useful as
far as |originCharset| is concerned.

Other possible places are wherever |loadURI| of |nsIWebNavigation| is
invoked.

Other possible places are whereever |loadURI| of |nsIWebNavigator| is invoked. 


> remember, the URI string as extracted from a remote document might contain
> straight Unicode characters.  the point of nsIURI::spec is to preserve the
> original URI string with Unicode code points left intact (no escaping).
> nsIURI::asciiSpec is meant to be the converted URI string that is suitable
> for sending to a server (i.e., a URI string that is RFC 2396 compliant).

Thank you for the reply. I overlooked |asciiSpec| (I saw it, but apparently
I misunderstood it). In what follows, I'm trying to understand you and the
nsIURI spec.

Even these days, most web pages include URLs in legacy MIME charsets
(ISO-8859-1, EUC-JP, KOI8-R, etc.), as you know too well. There are a few
different ways to store these URLs (a small illustration of each follows
the list):

  1. raw 8-bit strings, as the author of the page entered them
  2. url-escaped but without charset conversion (i.e. still in the legacy
     MIME charset)
  3. converted to UTF-8 (assuming that raw 8-bit URLs are in the document
     charset) but without url-escaping
  4. converted to UTF-8 and url-escaped
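
To make the four forms concrete, here is the same Latin-1 path segment
in each (a Python illustration; #1 and #3 are byte strings, not escaped
text):

    from urllib.parse import quote

    path = "tjänst"                             # as authored; doc charset Latin-1
    form1 = path.encode("iso-8859-1")           # 1. raw 8-bit: b'tj\xe4nst'
    form2 = quote(path, encoding="iso-8859-1")  # 2. escaped, unconverted: 'tj%E4nst'
    form3 = path.encode("utf-8")                # 3. UTF-8, unescaped: b'tj\xc3\xa4nst'
    form4 = quote(path, encoding="utf-8")       # 4. UTF-8 + escaped: 'tj%C3%A4nst'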

Currently, in some places in Mozilla #2 is stored in |spec|, while in other
places #3 or #4 is stored in |spec|. (Actually, the distinction between #3
and #4 is not very important.) Moreover, |originCharset| is sometimes set to
the document charset, but at other times it is not (it is set to UTF-8 even
if what |spec| contains is #2).

Which of the above are |spec| and |asciiSpec| supposed to contain,
respectively? |spec| is AUTF8String, so it's definitely not for #1. Neither
is |asciiSpec|. However, |asciiSpec| can hold #2. If what you meant is that
|asciiSpec| can (and is supposed to) do what I proposed |rawSpec| should do,
there is indeed no need for a new attribute (|rawSpec|). The difference
between #1 (|rawSpec|) and #2 (|asciiSpec|) is just url-escaping, so the
information is preserved either way (except for unlikely cases where the
distinction can make a difference).

If my understanding is correct (reading your comment one more time, I'm less
certain), wouldn't we have to modify the specification of |nsIURI| to make
crystal-clear what |asciiSpec| is?
>Which of the above are |spec| and |asciiSpec| supposed to contain,
>respectively?

let me try to explain this more clearly...

for starters, |spec| may contain UTF-8 character sequences. |asciiSpec| may
not.  |spec| is meant to be used in most places by the application.
|asciiSpec| is meant to be used when transferring the URI over the network,
or when an RFC 2396 compatible URI string is required.  |spec| is meant to
be compatible with the emerging IRI specification.

typically, necko does not get access to the raw URI string as it was found
in, say, an HTML document.  instead, the HTML parser extracts URI strings
from an HTML document and stores them in a UTF-16 buffer.  the HTML parser
converts non-ASCII characters to unicode and expands HTML numeric character
references (e.g., &#1048; for И).  finally, when the URI string is given to
necko, the URI string has been converted to UTF-8.

as a result, necko does not have the original URI string, but given the
charset of the document, necko can convert the URI string back to the
document charset (possibly converting characters that were not originally
encoded in the document charset, but rather as numeric character
references).  if no origin charset parameter (the document charset from
which the URI string originated) is given to necko, then necko assumes an
origin charset of UTF-8 (assuming IRI).

now, back to |asciiSpec| and |spec|... |asciiSpec| is the fully converted
and fully escaped URI string.  the hostname will also be ACE encoded if
necessary (for i18n domain names).  |spec| on the other hand "may" differ
from |asciiSpec|.  i say "may" because the interface allows them to be the
same or different depending on the implementation of nsIURI.  this is where
things get a little messy.  because most of mozilla does not know how to
deal with non-ASCII URI strings (at least back in the days of mozilla 1.0
this was certainly true), nsIURI::spec is almost always equivalent to
asciiSpec, with the exception of IDN.  in other words, today it turns out
that |spec| is equivalent to |asciiSpec| except that the hostname of |spec|
will be given out as unescaped UTF-8 (because there is no escaping other
than ACE encoding that works for the hostname portion of a URL...
%-escaping is not valid in the hostname field).  there is a preference,
however, to control this behavior:

   "network.standard-url.escape-utf8"

this pref is currently enabled by default, but if it were disabled,
|asciiSpec| would return the escaped version of |spec|, and |spec| would
preserve any UTF-8 characters (without escaping them) provided the origin
charset is UTF-8.  when the origin charset is not UTF-8, the pref will be
ignored... and the behavior will be as it is today.  this could probably be
changed, but a lot of callers of GetPath and GetDirectory assume the charset
of the string (sans escaping) is appropriate for their use.

so, i think that |spec| and |asciiSpec| are both basically #2, with the only
difference being how IDN is handled.
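
A sketch of the |spec| → |asciiSpec| relationship described above
(Python; hypothetical helper that ignores ports, userinfo and query
escaping): the hostname gets ACE encoding, everything else gets
%-escaped as UTF-8.

    from urllib.parse import urlsplit, urlunsplit, quote

    def ascii_spec(spec: str) -> str:
        # spec: IRI-style URI string that may contain raw Unicode characters
        parts = urlsplit(spec)
        host = parts.hostname.encode("idna").decode("ascii")   # ACE, never %-escaped
        path = quote(parts.path, safe="/%", encoding="utf-8")  # keep existing escapes
        return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))

    print(ascii_spec("http://bücher.example/tjänst"))
    # http://xn--bcher-kva.example/tj%C3%A4nst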
Darin, thanks a ton for your very detailed explanation. So, at 'the
entrance of necko' (i.e. by the time URI strings reach the NS_NewURI*'s),
they're all in UTF-8. Then inside necko (by protocolHandler->NewURI), a
string is converted back to |originCharset| and url-escaped, if
|originCharset| is specified/known.

I was confused by some code that stores a UTF-8 spec (e.g.
http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#321
). Is this a bug? Instead of converting to UTF-8 (#3 or #4), should it store
|spec| as it is (plus url-escaping: #2), along with setting |originCharset|?
For mailtoURL it doesn't seem to matter (as long as it's kept inside
mailnews), although it's not consistent with your explanation.

Now getting back to where it matters... there are some places where #4 is
stored in |spec|, and I guess we have to track down where that's done and
fix it (e.g. bug 205682).
 
>Darin, thanks a ton for your very detailed explanation. So, at 'the
>entrance of necko' (i.e. by the time URI strings reach the NS_NewURI*'s),
>they're all in UTF-8. Then inside necko (by protocolHandler->NewURI), a
>string is converted back to |originCharset| and url-escaped, if
>|originCharset| is specified/known.

you're welcome!  actually, |spec| is url-escaped in all cases when that
magic pref i mentioned is set to true.  the default setting is "true."


>I was confused by some code that stores a UTF-8 spec (e.g.
>http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#321
>). Is this a bug? Instead of converting to UTF-8 (#3 or #4), should it store
>|spec| as it is (plus url-escaping: #2), along with setting |originCharset|?
>For mailtoURL it doesn't seem to matter (as long as it's kept inside
>mailnews), although it's not consistent with your explanation.

yeah, i was really only speaking about nsStandardURL.  mailnews is a whole
other can of worms.  we should strive for consistency with the interface.
it may be appropriate for mailnews to ignore originCharset in some cases.
i don't really know enough about the mailnews protocols to say for sure.


>Now getting back to where it matters... there are some places where #4 is
>stored in |spec|, and I guess we have to track down where that's done and
>fix it (e.g. bug 205682).

i think these are just cases where originCharset is not specified.  because
#4 is just escaped UTF-8, which as i said is the default for originCharset.
so, we probably just need to go through the tree and ensure that
originCharset is always specified when NewURI is called.  perhaps adding an
NS_ASSERTION in nsStandardURL::Init would be wise ;-)
Thanks again for the clarification. I realized that mailtoUrl inherits from
nsSimpleURL instead of nsStandardURL.

As for #4 being stored instead of #2, I'll try your diagnostic method. As
noted earlier, other places where information loss occurs (originCharset is
not passed to NewURI) are a few spots in xpfe (comment #18). We might have
to revise nsIWebNavigation(?), nsIDOMJSWindow (or nsIDOMWindow) and
nsIURIFixup.
It would be nice if this could be fixed soon. I have run a few tests to see
how the current Mozilla (1.4) behaves in the cases that are important to me.
From my tests I doubt that Mozilla really holds URLs internally as UTF-8.

I have done two tests:
1) fetch a file with a non-ASCII name from the server.
2) have the server do a redirect with a non-ASCII URL in the Location header.

I ran my tests on a Solaris system using a locale with ISO 8859-1 as the
character set.
Here is what happens in 1):
  I type: http://host/tjänst.html in the urlbar
  Mozilla changes it to: http://host/tj%E4nst.html
  Mozilla sends GET /tj%E4nst.html
  my server returns the page

  I would have expected:
     I type http://host/tjänst.html in the urlbar
     Mozilla sends GET /tjänst.html
                or GET /tj%E4nst.html
                or a GET with the UTF-8-encoded URL
     my server returns the page
  Why is my URL in the urlbar changed?
  If you store my URL internally as UTF-8 there is no need to convert it
  back to ISO 8859-1, %-encode it, and then display that in the urlbar.
  The correct way to display it is: http://host/tjänst.html


In case 2) I do:
  I type http://host/tjänster in the urlbar, which is a directory
  my server does a redirect using a UTF-8-encoded URL.
  Mozilla displays: http://host/tj%C3%A4nster/

  If instead I configure my server to do the redirect using ISO 8859-1,
  Mozilla displays in the urlbar: http://host/tj%E4nster/

  Neither of the above is acceptable. Both should be displayed
  as http://host/tjänster/

Mozilla does the same with ISO 8859-1-encoded links in an ISO 8859-1-encoded
HTML document.

If Mozilla really holds all URIs internally as UTF-8, there should be no
problem getting the urlbar to display all characters "displayable" in the
current locale without %-encoding them. Why is this not done? The code is so
complex that I have difficulty finding where the urlbar-handling code is,
but it ought not be that difficult to fix?
The same goes for all other places where a URL is displayed, like the status
line at the bottom of the window when the mouse is over a link.

It is another matter what you should send to a web server. There you may
%-encode, encode to the original character set, or encode to the character
set of the current locale (or UTF-8 if the user wants it).

And for the host part, you need to ACE-encode when doing lookups, but no ACE
encoding should be shown when displaying to the user (if the current locale
allows all the characters to be displayed; otherwise show the ACE-encoded
form).

If you have difficulty testing this, I can test in my Mozilla 1.4 build and
see what happens, if you tell me what code I need to change.
Blocks: iri
*** Bug 279344 has been marked as a duplicate of this bug. ***
*** Bug 306477 has been marked as a duplicate of this bug. ***
*** Bug 284402 has been marked as a duplicate of this bug. ***
Attachment #208421 - Attachment is obsolete: true
IE, Opera, and (I've heard) Safari all show the Unicode characters in the
location bar (as we do in the status bar) rather than the ugly escapes. This
makes people reading sites like http://el.wikipedia.org/ a heck of a lot
happier, since they can read what page they're on and paste readable links
into mail or blogs.

I'm assuming nhotta isn't really working on this; reassigning to "nobody" so
as not to raise false hopes.

Darin or smontagu: either of you want to take a crack at this?

Nominating for releases; this is unacceptably ugly and our competitors seem
to be able to get it right.
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
Flags: blocking1.9a1?
Flags: blocking1.8.1?
If you paste the pretty-looking name into the URL bar we accept it just fine, but then translate it to the escaped version.

http://el.wikipedia.org/wiki/Κύρια_Σελίδα
 vs.
http://el.wikipedia.org/wiki/%CE%9A%CF%8D%CF%81%CE%B9%CE%B1_%CE%A3%CE%B5%CE%BB%CE%AF%CE%B4%CE%B1

If you were Greek, which browser would you use? Extend to most other languages (and start to wonder how we get such high marketshare in Europe).
Sorry about the first garble; my test on landfill indicated it was going to work. It was supposed to read "Κύρια_Σελίδα" (but this might be garbled, too).
Following Daniel Veditz's comments, I have set up a demonstration page at: 

  http://dimitris.glezos.com/box/firefox-bug-demo.php

No matter what characters a link contains (escaped or not), the link copied and displayed in the URL bar should *not* contain escaped characters (this is the default behaviour of the status bar, which, as Daniel said, is good).

Again, IE, Opera and Safari seem to get this right.

Solving this bug will help many international websites that use URLs containing non-Latin characters (page names, variables etc.). This is quite common, for example, on non-English Wikipedia sites (e.g. http://el.wikipedia.org).
For attractive display of URIs, see also bug 105909.
Perhaps some of these bugs should be merged?
(In reply to comment #32)

> No matter what characters a link contains (escaped or not), the link copied and
> displayed in the URL bar should *not* contain escaped characters (this is the
> default behaviour of the status bar, which, as Daniel said, is good).

That's not necessarily good. It's not that simple, either. 
Is this related to bug 261929?

The compat arguments are compelling, if we can get there without too-invasive or too-incompatible changes.  Renominate if that's the case (with explanation)?
Flags: blocking1.8.1?
Flags: blocking1.9a1? → blocking1.9-
*** Bug 243547 has been marked as a duplicate of this bug. ***
Not sure if this is a Thunderbird and/or Firefox issue, but when clicking on an IRI in Thunderbird, not only does the URL not display as the characters sent (as discussed above), it also links to an unreadable URL. I'd suggest this latter bug be of even higher priority, since it is not only a localization display issue (as important as that is). Try sending yourself an email with http://中.com/ for example and click on the link, then compare that to typing the IRI directly into the location bar of Firefox.
Filter on "Nobody_NScomTLD_20080620"
Assignee: nobody → smontagu
QA Contact: ruixu → i18n
I stumbled upon this bug today (I hope we're talking about the same bug). I
had a weird and annoying incident where FF 3.5 (FF 3 does the same) would
trash my well-formed URL while it was in the address bar (it was correct in
the first place). (Chrome 3, IE8 and Opera 9.64 don't show this weird
behavior.)

You can reproduce it the following way (I tried this on Windows, 7 actually):

   1) Go to youtube.com and enter "über den wolken" into the search field.
   2) It will show you the search results, and the URL in the address bar
      looks something like this:
      http://www.youtube.com/results?search_query=%C3%BCber+den+wolken&search_type=&aq=0&oq=%C3%BCber
      with the difference that it doesn't show the escape sequences (i.e.
      %C3%BC) but the "special" characters themselves (in the case of %C3%BC
      it shows ü).
   3) Now bad things happen: click inside the address bar, then hit ENTER on
      your keyboard.
   4) The URL is now malformed:
      http://www.youtube.com/results?search_query=%FCber+den+wolken&search_type=&aq=0&oq=%FCber
      Strangely enough, %C3%BC got substituted with %FC (the Latin-1
      representation of ü).

Thoughts?
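
What the steps above describe looks like an asymmetric round trip: the
address bar unescapes the query as UTF-8 for display, but re-escapes it
in the system charset on ENTER. In Python terms (illustration only;
ISO 8859-1 stands in for the Windows system charset, which encodes ü
the same way):

    from urllib.parse import unquote, quote

    shown = unquote("%C3%BC", encoding="utf-8")   # display step: 'ü'
    resent = quote(shown, encoding="iso-8859-1")  # re-escape in the wrong charset
    print(resent)                                 # '%FC', the malformed form
    print(quote(shown, encoding="utf-8"))         # '%C3%BC', correct round trip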
I figured out that setting network.standard-url.encode-query-utf8 to true makes this bug disappear! It is false by default, which I can't understand.
Isn't it fixed?
The last remnant of non-ASCII-ness in nsStandardURL will be removed in bug 1447190.
All URLs are currently ASCII encoded.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME