Closed Bug 150376 Opened 18 years ago Closed 2 years ago
Handling of non-ASCII in URIs needs fixing
The handling of URIs with non-ASCII characters fails, or does not meet users' needs, in many places in Mozilla. I have previously filed several bugs (138951, 105909), and there are many others (for example 140472) related to this. I have tried to understand how the code works to see how it could be fixed. My analysis below may be wrong, since I may have missed something in the quite complex code.

First, some important things about URIs. Some important documents on URIs are RFC 2396, draft-w3c-i18n-iri and draft-ietf-idn-uri. RFC 2396 gives the basis for how URIs should be handled: a URI is a sequence of characters. From this it follows that when users interact with URIs, they expect them to be presented using the character set of the context the URI is in. The same is true for URIs in protocols. This means that every place where the user may see a URI must present it using characters from the user's locale. Examples of such places in Mozilla are: urlbar, urlbarhistory, bookmarks, cookie manager, history, window title, and the status line at the bottom of the window, where a URL is displayed when the mouse hovers over a link.

To be clear, when I say "presented using the character set of the context", I mean that you may not %-encode every non-ASCII character and display that. You must display all characters supported by the character set of the context.

The basic routine to construct a URI to be displayed to the user is something like this:
- Start with the URI in a well-defined format (like UTF-8).
- Convert all characters that are OK to display to the user into the local character set. All characters that cannot be displayed are converted into %-encoding. The characters that cannot be displayed include the reserved characters in a URI that must be %-encoded, all characters not supported by the local character set, and all characters not printable in the current locale.

Also note that even if my local character set is UTF-8, that does not mean that all characters can or should be displayed.
For example, for me Latin-based characters would be OK, but not Chinese, even though Chinese characters can be displayed (as I cannot read or enter Chinese).

When using a host name from a URI to call DNS, you have to try both the proposed IDNA standard (if it gets accepted) and the local character set. Otherwise you will break things that worked before. Note that there is no %-encoding when calling DNS. When using HTTP, you may have to try UTF-8, the local character set, or a user preference before finding the correct usage (this is because the current HTTP standard is not sufficient for international users, who have started using different encodings).

- To be able to do the above, you MUST have a single common way to encode URIs internally in Mozilla! This is not so today. When you have a common encoding (for example UTF-8, with only those reserved characters that cannot appear unescaped being %-encoded), you can easily get all other things to work. Each component using a URI must itself convert the URI from the standard internal format into what is suitable for the context in which the URI is going to be used.
- Today there is no common internal format. For example, when a URI is entered using the urlbar, it looks to me as if nsDefaultURIFixup.cpp converts the URI into the local character set and stores that in the URI, while other code just inserts it as UTF-8. And you can get mixed character sets in a URI. For example, if the web server does a redirect giving a URI using UTF-8 encoded characters, that URI is loaded as %-encoded UTF-8 into the internal URI. When the user, with a local character set of ISO 8859-1, then adds a path segment to it in the urlbar, that segment ends up as %-encoded ISO 8859-1. So now the URI has both UTF-8 and ISO 8859-1 encoded characters.

I am aware that in some contexts (like a web server redirect) it is difficult to know what character set is used. In this case you have to try, as far as possible, to identify it.
At least you can assume UTF-8 and, if the URI cannot be interpreted as that, assume the local character set and try again. When getting a redirect you must try to decode any %-encoded characters (of those allowed) and do your best to convert it into the internal standard. Otherwise things will go wrong: the user will often get unacceptable URIs displayed to them and, as internationalisation comes into common use, things will break.

It will take some time before all standards are updated to handle non-ASCII in URIs, so Mozilla must be prepared to deal with input that is in strict conflict with a standard, or badly defined in a standard. For example, when writing a URI in an HREF in an HTML document, people will not %-encode non-ASCII characters; they will use the character set of the document. And when entering host names, they will not enter ACE-encoded names as in IDNA, or want to see them as ACE.

Many of the bugs related to handling of non-ASCII in URIs will be fixed if you change the handling to the way I have given above. I expect it will result in code changes in many places, but I think it will be worth it. Mozilla will finally work much more like people expect when using non-ASCII characters in URIs.
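The display routine described in this comment could be sketched roughly as follows. This is only an illustration in Python, not Mozilla code: the function names `display_form` and `_showable` are made up here, and the rule "show a character only if it is non-ASCII, printable, and encodable in the display charset" is an assumption about what "OK to display" means.

```python
import re
from urllib.parse import quote, unquote_to_bytes

# One or more consecutive %XX escapes.
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _showable(ch: str, charset: str) -> bool:
    """Assumed rule: only non-ASCII, printable characters that the
    display charset can represent are shown unescaped."""
    if ord(ch) < 0x80 or not ch.isprintable():
        return False
    try:
        ch.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


def display_form(uri: str, charset: str) -> str:
    """Turn a %-encoded UTF-8 URI into a form suitable for display."""

    def repl(match: re.Match) -> str:
        try:
            text = unquote_to_bytes(match.group(0)).decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8: leave the escapes exactly as they were.
            return match.group(0)
        # Characters that cannot be shown are %-encoded again.
        return "".join(
            ch if _showable(ch, charset) else quote(ch, safe="")
            for ch in text
        )

    return _ESCAPED_RUN.sub(repl, uri)


print(display_form("http://example.org/caf%C3%A9/%E4%B8%AD%2F", "iso-8859-1"))
```

With an ISO 8859-1 display charset the "é" is shown, while the Chinese character and the escaped "/" stay %-encoded. Passing "utf-8" instead would also show the Chinese character, which is where a user preference would have to come in.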
Assignee: yokoyama → nhotta
The mixed charset situation can happen when the user edits the existing URL. There are some issues with applying the auto-detection module used for HTML documents (performance, size, accuracy, UI). But we might be able to check whether the escaped string is valid UTF-8. The other option is to add a pref to switch between UTF-8 and the OS charset (bug 129726).
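A check of that kind is cheap compared to full charset auto-detection. A minimal sketch in Python (the function name is hypothetical, and this is not the Mozilla implementation):

```python
from urllib.parse import unquote_to_bytes


def escaped_is_valid_utf8(escaped: str) -> bool:
    """Return True if the %-escapes in `escaped` decode to well-formed UTF-8."""
    raw = unquote_to_bytes(escaped)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(escaped_is_valid_utf8("/caf%C3%A9"))         # escaped UTF-8
print(escaped_is_valid_utf8("/caf%E9"))            # escaped ISO 8859-1
print(escaped_is_valid_utf8("/caf%C3%A9/na%EFve"))  # mixed, as after a urlbar edit
```

The third call is the mixed case from this bug: a single non-UTF-8 segment makes the whole string fail the check, so a per-segment check may be needed to locate the offending part.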
Status: NEW → ASSIGNED
The current plan is to address this in bug 129726. Reporter, would that be acceptable? Or do you have other proposals? If so, please itemize them.
Bug 129726 does NOT fix the problem. In my initial description of the problem I also gave the solution.

The basic problem is that a URI needs to be displayed or encoded using different formats depending on context. In all cases where a user is shown a URI, it must be shown using the user's character set. It may only contain %-encoding for those characters that cannot be displayed in the user's character set. Any ACE-encoded hostnames in the domain name part of the URI must be decoded and displayed using the user's character set.

When a URI is sent over a protocol it needs to be converted into the format the protocol needs. For example, the domain name in the URI cannot contain %-encoding when used in DNS; instead it has to be tried using the user's character set and, once a standard is set, using that standard.

How to fix it in Mozilla:
All URIs MUST internally be in one single format. Most suitable is UTF-8.
All parts of Mozilla must, when storing a URI, see to it that it is stored using the single format.
And all parts must, when using a URI, convert it from the single format into the format needed for its use.

That is what is needed. Today Mozilla internally stores URIs in many formats, making it difficult for every part of Mozilla to display or handle them in the correct format for its context. A pref for sending URIs in UTF-8 does not fix the problem; for example, it does not fix showing URIs in the correct form to users. I cannot see any solution other than the one above that will make adaptation of a URI to its context possible (unless you want very complex code repeated in each Mozilla component).
>How to fix it in Mozilla:
>All URIs MUST internally be in one single format. Most suitable is
>UTF-8.

UTF-8 is used for Mozilla internal format.

>All parts of Mozilla must when storing a URI see to that they are stored
>using the single format.

The format is either raw UTF-8 or percent-escaped (in different charset encodings). Unescaping URIs is not good, since we may not know their original charset.

>And all parts must when using a URI convert it from the single format
>into the format needed for its use.

I think that is the current implementation. Libnet does not do the conversion and lets each protocol handle the URI (e.g. convert from UTF-8 to a charset which the server can understand).

Bug 129726 is about forcing the original charset to UTF-8, so any unescaped URI (like the user's input in the URL location bar) will be treated as UTF-8.
>>All parts of Mozilla must when storing a URI see to that they are stored
>>using the single format.
>Format is either raw UTF-8 or percent escaped (in different charset encodings).
>Unescape URIs is not good since we may not know their original charset.

But unescaping should be done in all cases where the original character set can be identified. I looked at some of the code handling the URL location bar, and instead of storing the URL using UTF-8 it converted it to the system's local character set (in my case ISO 8859-1), %-encoded it, and stored it in the URI object. It should have been stored as raw UTF-8 instead. I have not checked other code, but, for example, if a URL is found in an HTML document using ISO 8859-1, the URL should be converted into UTF-8 before being stored in the URI object. Only when the URL is retrieved from the URI object to be displayed or sent through a protocol should it be converted as needed.

Also, especially when displaying a URL to the user, unescaping must be done as far as possible. An incoming %-escaped UTF-8 URL or ISO 8859-1 URL needs to be unescaped as far as possible before being stored internally as raw UTF-8.

>Bug 129726 is about forcing original charset to UTF-8, so any unescaped URI
>(like user's input in URL location bar) will be treated as UTF-8.

So it is not just about sending the URL. OK, that will probably fix some of the problems I have.

You might add a way to register the preferred character set of an HTTP server, so that the user can define it for different servers in prefs (or somewhere), just as you can define images to block. Then a URL arriving from a defined server that does not look like UTF-8 could be assumed to be in the defined character set and converted into UTF-8 internally. And when sending requests to that server, the URL can be converted from UTF-8 to the character set that server wants. That would help in handling many servers during the slow transition to UTF-8 encoded URLs.
> but for example if a URL is found in a HTML document using ISO 8859-1
> the URL should be converted into UTF-8 before stored in the URI object.

I agree. This is not currently done. We could match the document's URI against the string in the URL location bar and then assume its charset. That is useful when the user edits and modifies the string in the location bar. The user can also just type a new URL (I assume this is more common). In that case, the current behavior of using the system's default is reasonable.

> Also, especially when displaying a URL for the user, unescaping must be done
> as far as possible.

I agree. I think there are cases where this is not done properly.

> An incoming %-enscaped UTF-8 URL or ISO 8859-1 URL
> need to be unescaped as far as possible before stored internally as raw
> UTF-8.

I don't think there is a safe way to unescape an already escaped URL, except in cases where the protocol defines the charset (e.g. to be always UTF-8). An HTML document may contain escaped URLs whose charset is different from the charset of the document.

> You might add a way to register the preferred character set of a HTTP server

There are cases where there is more than one possible charset for a language (e.g. Japanese). In that case, the user cannot really choose one charset. In the other case (one charset per user's locale), the preferred charset is usually the system's default. If not, then I think it is UTF-8. The toggling between UTF-8 and the system's default is what bug 129726 is about.
>The user can also just type a new URL (I assume this is more common). That
>case,
>the current behavior to use system's default is reasonable.

If all URLs are to be stored using UTF-8 internally, then all URLs entered using the URL bar should be stored as UTF-8. But when sent to an HTTP server, a URL could be converted to the system's default (unless the user's preference says otherwise).

>> An incoming %-enscaped UTF-8 URL or ISO 8859-1 URL
>> need to be unescaped as far as possible before stored internally as raw
>> UTF-8.
>I don't think there is a safe way to unescape already escaped URL unless some
>cases like the protocol defines the charset (e.g. to be always UTF-8).
>In a HTML document, it may contain escaped URLs which charset is different
>from the charset of the document.

Yes, it is not always easy. You can see some comments on this in draft-duerst-iri-01.txt. When unescaping a URL you should try using UTF-8 and, if that fails, try the system default, and, if that fails too, leave it escaped.

>> You might add a way to register the preferred character set of a HTTP server
>There are cases where more than one possible charsets for a launguage (e.g.
>Japanese). That case, the user cannot really choose one charset.
>The other case (one charset per user's locale), the prefferred charset is
>usually system's default. If not, then I think that is UTF-8. The toggling
>between UTF-8 and system's default is about bug 129726.

I guess most HTTP servers use one character set when handling URLs. Some more advanced ones will accept both UTF-8 and one local character set. The recommendation in the IETF/W3C drafts is UTF-8, but it will take some time before that is fixed everywhere. To help the transition, I think a way to say "for server xxx.yy.zz use character set zzzz in URLs" would ease the way. The code that does HTTP queries could just look up this database and, if the hostname matches, convert the URL from UTF-8 to the defined character set.
This way you would not have just two cases: either use UTF-8 or system's default. Instead the user could define special cases in addition to the two default cases.
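The unescaping order proposed in this comment (UTF-8 first, then the system default, otherwise leave the URL escaped) can be sketched as below. This is a Python illustration with a hypothetical function name; the extra requirement that the decoded text be printable is an assumption added here, because a single-byte charset such as ISO 8859-1 never fails to decode.

```python
from typing import Optional, Tuple
from urllib.parse import unquote_to_bytes


def unescape_with_fallback(escaped: str, system_charset: str) -> Tuple[str, Optional[str]]:
    """Try UTF-8, then the system charset; otherwise keep the escapes.

    Returns the resulting text and the charset that was used
    (None when the input was left escaped).
    """
    raw = unquote_to_bytes(escaped)
    for charset in ("utf-8", system_charset):
        try:
            text = raw.decode(charset)
        except UnicodeDecodeError:
            continue
        # Assumption: reject results containing control or other
        # non-printable characters.
        if text.isprintable():
            return text, charset
    return escaped, None


print(unescape_with_fallback("/caf%C3%A9", "iso-8859-1"))
print(unescape_with_fallback("/caf%E9", "iso-8859-1"))
print(unescape_with_fallback("/caf%E9", "ascii"))
```

The charset that was guessed is returned along with the text, so a caller could remember it and use it again when the URL is sent back to the same server.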
> If all URLs are to be stored using UTF-8 internally, then all URLs entered
> using the URL bar should be stored as UTF-8.

That is the current implementation. There is a field to store a charset, which is used for the later conversion.

> When unescaping a URL you should try using UTF-8 and if
> that fails, try the system default, and if that fails leave it as escaped.

Is that really safe? For example, how can we determine whether 0xc2a0 is one character or two (0xc2, 0xa0)? I think we could do that for UI only, but we would not want to unescape and possibly lose data.

About storing charsets per server: I think that is good, but it would not conflict with having the option to send as UTF-8 (we can implement the UTF-8 option first). The server-based charset idea is nice. Would that have UI, or would it be done automatically in the backend? The other question is: when does the user need it? Is that when the user types a URL into the location bar? If the URL is new, then the default (system charset or UTF-8) has to be used anyway. Note that clicking links in a document automatically uses the document's charset.
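The 0xc2a0 concern can be made concrete with a few lines of Python (purely illustrative; nothing here is Mozilla code). The same two bytes are well-formed in both charsets, so a successful UTF-8 decode does not prove the bytes were UTF-8, and converting on the wrong guess changes the bytes that would later be sent to the server.

```python
raw = bytes([0xC2, 0xA0])

as_utf8 = raw.decode("utf-8")         # one character: U+00A0 (no-break space)
as_latin1 = raw.decode("iso-8859-1")  # two characters: U+00C2, U+00A0

print(len(as_utf8), len(as_latin1))

# If the bytes really were ISO 8859-1 but were unescaped as UTF-8,
# re-encoding for an ISO 8859-1 server no longer yields the original bytes.
resent = as_utf8.encode("iso-8859-1")
print(resent == raw)
```

This is the data loss case: the original request bytes cannot be recovered once the wrong interpretation has been stored.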
>> If all URLs are to be stored using UTF-8 internally, then all URLs entered
>> using the URL bar should be stored as UTF-8.
>That is the current implementation. There is a field to store a charset which
>is used for the later conversion.

Yes, I have seen that, but when I looked at the code (for 1.0) there was code that converted URLs entered through the URL bar into the system default before storing them in the URI object (instead of storing them as UTF-8 and just recording the system default in the field).

>> When unescaping a URL you should try using UTF-8 and if
>> that fails, try the system default, and if that fails leave it as escaped.
>Is that really safe?
>For example, how can we determine 0xc2a0 as one character instead of two as
>0xc2, 0xa0?

If you look in the IRI draft there are some comments on that. For most character sets and text, a sequence of characters will not look like UTF-8. So first try UTF-8 and, if that fails, try the system default (which is OK if the codes are displayable).

>I think we could do that for UI only but would not want to unescape and
>possibly lose the data.

Yes, the most important places are all those where a user may see a URL. Though you should probably always try to assume UTF-8 and see if that works, especially since, while we move towards UTF-8, some servers will respond with raw UTF-8 and others with %-encoded UTF-8. And the %-encoded UTF-8 URLs should be unescaped so that they compare equal to the raw UTF-8 URLs.

I can give you an example where "send UTF-8" fails. In MS IE you have the option "always send URLs as UTF-8". My HTTP server uses ISO 8859-1 as its default character set but recognises UTF-8. So when MS IE sends a URL as UTF-8, my server understands it. But when pages then contain ISO 8859-1 encoded URLs, or redirects use ISO 8859-1, MS IE can follow them with no problem but fails to send cookies. I expect this is because MS IE fails to match its UTF-8 converted URLs (converted from the system default, ISO 8859-1) with my server's ISO 8859-1 URLs.
Had MS IE converted all URLs (even those received from the HTTP server) into UTF-8, matching of URLs would have worked.

>About storing charsets per server, I think that is good but would not conflict
>to have the option to send as UTF-8 (we can implement the UTF-8 option first).
>The serve based charset idea is nice. Would that have UI or would it be done
>automatically at backend? The other question is when does the user need it? Is
>that when the user types URL to the location bar? If the URL is new then the
>default (system charset or UTF-8) has to be used anyway. Note that, clicking
>links in a document automatically use the document's charset.

A new URL entered in the location bar should be stored internally as UTF-8 (possibly with a field saying "use system default when transmitting"), and a link in a document must be converted from the document's charset into UTF-8 before it is stored internally. Both URLs should, when transmitted to an HTTP server, be converted to the charset that server wants. Just because you get a document using ISO 8859-1 does not mean that the server wants that charset. If you have a stored charset to be used for the HTTP server, convert to it. If it is unknown, I would recommend trying UTF-8 first and falling back to the system/document charset if UTF-8 fails, before giving up.

It is not the user who needs it; it is the HTTP software. It can be used both when sending and when receiving URLs, to know what that server's default charset is. The charset per server needs to be defined by the user through a UI. Mozilla might guess the charset used by a server by identifying the charset in URLs received from that server, and use that as a fallback.

Having a stored server charset is a mechanism to ease the transition from local charsets to UTF-8. I am sure that as IRIs get popular we will have many more servers that, just like browsers, will have to cope with some clients using UTF-8 and some using a local charset.
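The per-server charset idea amounts to a lookup keyed by hostname that is consulted when the internally stored UTF-8 URL is serialized for a request. A rough Python sketch follows; the table contents, the host names and the function name are all invented for illustration, and falling back to UTF-8 when the registered charset cannot represent the path is an assumption.

```python
from urllib.parse import quote

# Hypothetical user-defined table: "for server X use charset Y in URLs".
SERVER_CHARSETS = {
    "www.example.se": "iso-8859-1",
}


def path_for_server(host: str, path: str, default: str = "utf-8") -> str:
    """%-encode `path` (held internally as Unicode/UTF-8) in the charset
    registered for `host`, or in `default` if the host is not listed."""
    charset = SERVER_CHARSETS.get(host.lower(), default)
    try:
        encoded = path.encode(charset)
    except UnicodeEncodeError:
        # Assumption: fall back to UTF-8 if the charset cannot express the path.
        encoded = path.encode("utf-8")
    return quote(encoded, safe="/")


print(path_for_server("www.example.se", "/caf\u00e9"))
print(path_for_server("other.example", "/caf\u00e9"))
```

The same table could be consulted in the other direction, to interpret non-UTF-8 URLs received from a listed server before converting them to the internal format.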
This is the code that sets a charset for the URL location bar. It does not perform conversion, though.
http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDefaultURIFixup.cpp#120

"Send as UTF-8" forces the use of UTF-8, so it is expected to break raw 8-bit non-UTF-8 URLs.

> Just because you get a document using ISO 8859-1
> does not mean that the server wants that charset.

That is why we use the document charset only when the URL in the document is not escaped.

As for the per-server charset UI, the issue is that the user does not know what to set. This means the default would be used anyway. I am not sure about the benefit of storing the info per server when the default is used anyway. If this is mostly to support non-ASCII URIs in the location bar, then I think this would be too much. You can file a separate bug for this feature and keep this bug for the internal URI format issue.
>This is the code to set a charset for the URL location bar. It does not
>perform conversion though.
>http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDefaultURIFixup.cpp#120

At least previously, calling NS_NewURI with a defined charset resulted in NS_NewURI converting the UTF-8 string into the defined charset. So the URI internally had the system charset instead of UTF-8. I have not checked whether the URI code now only stores the charset name without converting to it.

>> Just because you get a document using ISO 8859-1
>> does not mean that the server wants that charset.
>That is why we use a document charset only when the URL in the doucment is not
>escaped.

I wonder what you mean by this. If you read draft-duerst-iri-01.txt on this, you will see that a URL shall be written using the document charset, but Mozilla must convert it into UTF-8 when sending the URL to the HTTP server. You cannot send it in the document charset (unless the HTTP server wants that charset).

>The per server charset UI, the issue is that the user does not know what to
>set. This means the default would be used anyway. I am not sure about the
>benefit of storing the info per server when the default is used anyway.

Yes, you are probably right that few users will use it (at least in the beginning). We can wait and see how many problems people get when UTF-8 URLs start being used, and introduce it later.
> I wonder what you mean with this. If you read the draft-duerst-iri-01.txt
> about this you see that a URL shall be written using the document charset,
> but Mozilla must convert it into UTF-8 when sending the URL to the HTTP server.
> You cannot send it in document charset (unless the http server wants that
> charset).

I think most of the existing documents' URIs are encoded using the document charset (escaped or unescaped). Sending all of those URIs as UTF-8 would break links in those documents.

About unescaping: the draft does not guarantee the correctness of assuming UTF-8 when converting a URI to an IRI (section 3.2).
http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
I think the client application can choose not to unescape an already escaped URI, to avoid possible data loss. And note that most of the existing URIs are non-UTF-8. That is why I think we may unescape URIs only for UI.

> b. Some escape sequences cannot be interpreted as sequences of
> UTF-8 octets.
>
> (Note: Due to the regularities in the octet patterns of UTF-8,
> there is a very high probability, but no guarantee, that escape
> sequences that can be interpreted as sequences of UTF-8 octets
> actually originated from UTF-8. For a detailed discussion, see
> [Duer97].)
>I think most of the existing documents' URI are encoded using a document
>charset (escaped or unescaped).
>Sending all of those URI as UTF-8 would break links in those documents.

Escaped links should probably be sent as they are, but unescaped links should be converted to UTF-8 internally and sent to the HTTP server as UTF-8 (and retried in the document charset on failure, to avoid breaking links to servers not yet upgraded), if you are going to follow the IRI draft. Do you think that the "send URLs as UTF-8" setting should only apply to those entered from the URL bar? It should apply to all URLs. As a middle way you could apply it to those entered through the UI and all those in documents using the same charset as the system default. Though to best promote UTF-8 as the standard, I think you should send as UTF-8 first and, if that fails, fall back to retrying with the system or document charset.

>About unescaping, the draft does not guarantee correctness of assuming UTF-8
>for converting URI to IRI (section 3.2).
>http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt
>I think the client application can choose not to unescape already escaped URI
>to avoid possible dataloss. And note that most of the existing URI are non
>UTF-8. That is why I think we may unescape URI only for UI.

The most important step is to unescape for UI. If you can get that in, it will be great and will satisfy most people. By UI I mean all places where the user sees a URL, for example the URL bar, the status bar at the bottom of the window, history, and bookmarks.
It seems to me that it's not just |spec| of nsIURI but also |originCharset| (bug 127282) that is handled a bit inconsistently. Alternatively, I could say that it's not always set per my interpretation of the nsIURI spec.
http://lxr.mozilla.org/mozilla/source/netwerk/base/public/nsIURI.idl#222
According to my interpretation, it should _always_ (i.e. even after it's converted to UTF-8 and url-escaped) be the charset of the document the nsIURI comes from. However, oftentimes it's just left empty. This happens when NS_NewURI is invoked without originCharset or when nsIURIFixup->createFixupURI is used to create an nsIURI.
http://lxr.mozilla.org/seamonkey/source/docshell/base/nsIURIFixup.idl#69

nsIURIFixup may need a new method createFixupURIwithCharset. One of my patches (not uploaded) to bug 199237 does that, but I didn't take that path for bug 199237 because I found a simpler way to fix the bug. In addition, we may also have to add a new method loadURIwithCharset to nsIWebNavigation (or modify loadURI to have a new parameter) because that's the primary path taken in xpfe to load a URI (see my obsolete patch to bug 199237, attachment 126982 [details] [diff] [review]). Related to this is bug 205682.

nsIURI is now frozen, so it is probably too late.... It might have been better if we had two additional attributes in the nsIURI interface, rawSpec (nsACString) and converted (boolean).

|rawSpec| stores URI specs (const nsACString of unknown/unspecified charset) that we get from remote documents and to which no further processing is applied (i.e. neither url-escaping nor charset conversion is applied). When we make a request to a remote server (especially the same server this URI was obtained from in the first place: see bug 205682, bug 127282 comment #33 and bug 127282 comment #73), we try this first, before |spec|. The assumption is that |spec| is lazily filled in, after converting to UTF-8 and url-escaping (if necessary), _on demand_, at which point |converted| is also set to T.
We may do without |converted| by making empty |spec| mean 'not yet converted', but overloading an attribute that way appears not to work always.
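To illustrate the shape of that proposal (not how nsIURI or nsStandardURL actually work), here is a small Python model. The class name `LazyURI` and the choice of characters left unescaped are assumptions made for the sketch; only the attribute names rawSpec, spec, originCharset and converted come from the comment above.

```python
from typing import Optional
from urllib.parse import quote


class LazyURI:
    """Toy model: keep the untouched bytes, derive |spec| on demand."""

    def __init__(self, raw_spec: bytes, origin_charset: str) -> None:
        self.raw_spec = raw_spec          # bytes exactly as found in the document
        self.origin_charset = origin_charset
        self.converted = False            # becomes True once |spec| is filled in
        self._spec: Optional[str] = None

    @property
    def spec(self) -> str:
        if not self.converted:
            text = self.raw_spec.decode(self.origin_charset)
            # Convert to UTF-8 and url-escape only what needs escaping.
            self._spec = quote(text, safe=":/?#[]@!$&'()*+,;=%")
            self.converted = True
        return self._spec


uri = LazyURI(b"http://example.org/caf\xe9", "iso-8859-1")
print(uri.converted)
print(uri.spec)
print(uri.converted)
```

A request to the originating server could then try `raw_spec` first, exactly as received, and only fall back to the converted `spec`.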
OS: SunOS → All
Hardware: Sun → All
> This happens when NS_NewURI is invoked without originCharset

Any time this happens when the resulting nsIURI is expected to be used anywhere outside mozilla internals, that's a bug.
jshin:

>nsIURI is now frozen so that it is probably too late.... It might have been
>better if we had two additional attributes in nsIURI interface, rawSpec
>(nsACString) and converted(boolean).
>
>|rawspec| stores URI specs (const nsACstring of unknown/unspecified charset)
>that we get from remote documents to which no further processing is applied
>(i.e. neither url-escaping and nor charset conversion is applied ). When we
>make a request to a remote server (especially the same server as this uri is
>obtained in the first place : see bug 205682 and bug 127282 comment #33 and bug
>127282 comment #73), we try this first before |spec|. The assumption is that
>|spec| is lazilly filled up after converting to UTF-8 and url-escaping (if
>necessary) _on demand_ when |converted| is also set to T.

remember, the URI string as extracted from a remote document might contain straight Unicode characters. the point of nsIURI::spec is to preserve the original URI string with Unicode code points left intact (no escaping). nsIURI::asciiSpec is meant to be the converted URI string that is suitable for sending to a server (i.e., a URI string that is RFC 2396 compliant).

currently, however, nsIURI::spec is nearly equivalent to nsIURI::asciiSpec. this is mainly the case for historical reasons, because many consumers of nsIURI are not able to deal with non-ASCII URI strings. eventually, we should be able to change the behavior of nsStandardURL such that nsIURI::spec returns the raw URI string.

in other words, i don't think we need any new nsIURI attributes ;-)
> This happens when NS_NewURI is invoked without originCharset
> Any time this happens when the resulting nsIURI is expected to be
> used anywhere outside mozilla internals, that's a bug.

The primary source of this 'bug' is xpfe code. For instance, see browser.xml:
http://lxr.mozilla.org/seamonkey/source/xpfe/global/resources/content/bindings/browser.xml#117
or
http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/contentAreaUtils.js#102
In some cases, |aReferrerURI| or |referrer| may be consulted to extract |originCharset| for a new |nsIURI|. However, judging from my attempt to make use of it (in bug 199237), neither |aReferrerURI| nor |referrer| has anything useful as far as |originCharset| is concerned. Other possible places are wherever |loadURI| of |nsIWebNavigation| is invoked.

> rememeber, the URI string as extracted from a remote document might
> might contain straight Unicode characters. the point of
> nsIURI::spec is to preserve
> the original URI string with Unicode character points left intact
> (no escaping). nsIURI::asciiSpec is meant to be the converted URI
> string that is suitable for
> sending to a server (i.e., a URI string that is RFC 2396 compliant).

Thank you for the reply. I overlooked |asciiSpec| (I saw it, but apparently I misunderstood it). In what follows, I'm trying to understand you and the nsIURI spec.

Even these days, most web pages include URLs in legacy MIME charsets (ISO-8859-1, EUC-JP, KOI8-R, etc.), as you know too well. There are a few different ways to store these URLs:

1. raw 8-bit strings as the author of the page entered them
2. url-escaped but without charset conversion (i.e. still in the legacy MIME charset)
3. converted to UTF-8 (assuming that raw 8-bit URLs are in the doc. charset) but without url-escaping
4. converted to UTF-8 and url-escaped

Currently, in some places in Mozilla, #2 is stored in |spec| while in other places #3 or #4 is stored in |spec|.
Actually, the distinction between #3 and #4 is not very important. Moreover, |originCharset| is sometimes set to the doc. charset, but other times it's not (it is set to UTF-8 even if what |spec| contains is #2).

Which of the above are |spec| and |asciiSpec| supposed to contain, respectively? |spec| is AUTF8String, so it's definitely not for #1. Neither is |asciiSpec|. However, |asciiSpec| can have #2. If what you meant is that |asciiSpec| can (and is supposed to) do what I proposed |rawSpec| should do, there's indeed no need for a new attribute (|rawSpec|). The difference between #1 (|rawSpec|) and #2 (|asciiSpec|) is just url-escaping, so the information is preserved either way (except for unlikely cases where the distinction can make a difference).

If my understanding is correct (reading your comment one more time, I'm less certain), wouldn't we have to modify the specification of |nsIURI| to make crystal clear what |asciiSpec| is?
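For concreteness, the four storage forms listed above look like this for the path "/café" in a page whose charset is ISO-8859-1. This is a Python illustration only; the variable names are arbitrary.

```python
from urllib.parse import quote

text = "/caf\u00e9"                 # what the page author meant

form1 = text.encode("iso-8859-1")   # 1. raw 8-bit string in the doc. charset
form2 = quote(form1, safe="/")      # 2. url-escaped, still the legacy charset
form3 = text.encode("utf-8")        # 3. converted to UTF-8, not escaped
form4 = quote(form3, safe="/")      # 4. converted to UTF-8 and url-escaped

for form in (form1, form2, form3, form4):
    print(form)
```

Forms #2 and #4 are both plain ASCII strings, which is why a consumer cannot tell them apart without |originCharset|.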
>Which of the above are |spec| and |asciiSpec| supposed to contain,
>respectively?

let me try to explain this more clearly...

for starters, |spec| may contain UTF-8 character sequences. |asciiSpec| may not. |spec| is meant to be used in most places by the application. |asciiSpec| is meant to be used when transferring the URI over the network, or when an RFC 2396 compatible URI string is required. |spec| is meant to be compatible with the emerging IRI specification.

typically, necko does not get access to the raw URI string as it was found in, say, an HTML document. instead, the HTML parser extracts URI strings from an HTML document and stores them in a UTF-16 buffer. the HTML parser converts non-ASCII characters to unicode and expands HTML numeric character references (e.g., И). finally, when the URI string is given to necko, the URI string has been converted to UTF-8. as a result, necko does not have the original URI string, but given the charset of the document, necko can convert the URI string back to the document charset (possibly converting characters that were not originally encoded in the document charset, but rather as numeric character references). if no origin charset parameter (the document charset from which the URI string originated) is given to necko, then necko assumes an origin charset of UTF-8 (assuming IRI).

now, back to |asciiSpec| and |spec|...

|asciiSpec| is the fully converted and fully escaped URI string. the hostname will also be ACE encoded if necessary (for i18n domain names). |spec| on the other hand "may" differ from |asciiSpec|. i say "may" because the interface allows them to be the same or different depending on the implementation of nsIURI.

this is where things get a little messy. because most of mozilla does not know how to deal with non-ASCII URI strings (at least back in the days of mozilla 1.0 this was certainly true), nsIURI::spec is almost always equivalent to asciiSpec with the exception of IDN.
in other words, today it turns out that |spec| is equivalent to |asciiSpec| except that the hostname of |spec| will be given out as unescaped UTF-8 (because there is no escaping other than ACE encoding that works for the hostname portion of a URL... %-escaping is not valid in the hostname field).

there is a preference however to control this behavior: "network.standard-url.escape-utf8"

this pref is currently enabled by default, but if it were disabled |asciiSpec| would return the escaped version of |spec|, and |spec| would preserve any UTF-8 characters (without escaping them) provided the origin charset is UTF-8. when the origin charset is not UTF-8, the pref will be ignored... and the behavior will be as it is today. this could probably be changed, but a lot of callers of GetPath and GetDirectory assume the charset of the string (sans-escaping) is appropriate for their use.

so, i think that |spec| and |asciiSpec| are both basically #2 with the only difference being how IDN is handled.
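For readers following the |spec| / |asciiSpec| distinction, here is a rough Python sketch of what "fully escaped, with an ACE-encoded hostname" could look like. It is not necko's code: `to_ascii_form` and its `origin_charset` parameter are hypothetical names made up for illustration, Python's built-in "idna" codec stands in for ACE encoding, and userinfo in the authority is simply ignored.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_form(url: str, origin_charset: str = "utf-8") -> str:
    """Hypothetical helper: return an all-ASCII form of `url`."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # %-escaping is not valid in the hostname, so non-ASCII labels
    # are ACE (Punycode) encoded instead.
    ace_host = host.encode("idna").decode("ascii")
    if parts.port is not None:
        ace_host = f"{ace_host}:{parts.port}"
    # Everything after the hostname is %-escaped from the bytes of the
    # origin charset; existing escapes ("%") are left alone.
    path = quote(parts.path, safe="/%", encoding=origin_charset)
    query = quote(parts.query, safe="=&+%", encoding=origin_charset)
    fragment = quote(parts.fragment, safe="%", encoding=origin_charset)
    return urlunsplit((parts.scheme, ace_host, path, query, fragment))


utf8_form = to_ascii_form("http://bücher.example/tjänst.html")
latin1_form = to_ascii_form("http://host/tjänst.html", "iso-8859-1")
```

With no origin charset given, the sketch falls back to UTF-8, mirroring the default described above; passing "iso-8859-1" produces the single-byte escapes instead.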
Darin, thanks tons for your very detailed explanation. So, at 'the entrance of necko' (i.e. by the time URI strings reach NS_NewURI*'s), they're all in UTF-8. Then inside necko (by protocolHandler->NewURI), they are converted back to |originCharset| and url-escaped if |originCharset| is specified/known.

I was confused by some code that stores a UTF-8 spec (e.g. http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#321 ). Is this a bug? Instead of converting to UTF-8 (#3 or #4), should it store |spec| as it is (plus url-escaping: #2) along with setting |originCharset|? For mailtoURL, it doesn't seem to matter (as long as it's kept inside mailnews), although it's not consistent with your explanation.

Now getting back to where it matters.... there are some places where #4 is stored in |spec|, and I guess we have to track down where that's done and fix them (e.g. bug 205682).
>Darin, thanks tons for your very detailed explanation. So, at 'the entrance of
>necko' (i.e. by the time uri strings reaches NS_NewURI*'s), they're all in
>UTF-8. Then inside necko(by protocolHandler->NewURI), it's converted back to
>|originCharset| and url-escaped if |originCharset| is specified/known.

you're welcome! actually, |spec| is url-escaped in all cases when that magic pref i mentioned is set to true. the default setting is "true."

>I was confused by some code that stores UTF-8 spec (e.g
>http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#321
>). Is this a bug? Instead of converting to UTF-8(#3 or #4), should it store
>|spec| as it is (plus url-escaping : #2) along with setting |originCharset|?
>For mailtoURL, it doesn't seem to matter (as long as it's kept inside
>mailnews) although it's not consistent with your explanation.

yeah, i was really only speaking about nsStandardURL. mailnews is another whole can of worms. we should strive for consistency with the interface. it may be appropriate for mailnews to ignore the originCharset in some cases. i don't really know enough about the mailnews protocols to say for sure.

>Now getting back where it matters.... there are some places #4 is stored in
>|spec| and I guess we have to track down where it's done and fix. (e.g. bug
>205682)

i think this is just cases where originCharset is not specified, because #4 is just escaped UTF-8, which as i said is the default value for originCharset. so, we probably just need to go through the tree and ensure that originCharset is always specified when NewURI is called. perhaps adding an NS_ASSERTION in nsStandardURL::Init would be wise ;-)
Thanks again for the clarification. I realized that mailtoUrl inherits nsSimpleURL instead of nsStandardURL. As for #4 being stored instead of #2, I'll try your diagnostic method. As noted earlier, other places where info. loss (originCharset is not passed to NewURI) occurs are a few spots in xpfe (comment #18). We might have to revise nsIWebNavigation(?), nsIDOMJSWindow (or nsIDOMWindow) and nsIURIFixup.
It would be nice if this could be fixed soon. I have done a few tests to see how the current version (Mozilla 1.4) works in the cases that are important to me. From my tests I doubt that Mozilla really holds the URL internally as UTF-8.

I have done two tests:
1) fetch a file with non-ASCII characters in its name from the server.
2) have the server do a redirect with a non-ASCII URL in the Location header.

I ran my tests on a Solaris system using a locale with ISO 8859-1 as character set.

Here is what happens in 1):
I type: http://host/tjänst.html in the urlbar
Mozilla changes it to: http://host/tj%E4nst.html
Mozilla sends GET /tj%E4nst.html
my server returns the page

I would have expected it to be:
I type http://host/tjänst.html in the urlbar
Mozilla sends GET /tjänst.html or GET /tj%E4nst.html or GET utf-8 coded url
my server returns the page

Why is my URL in the urlbar changed? If you store my URL internally as UTF-8 there is no need to convert it back to ISO 8859-1, %-encode it and then display it in the urlbar. The correct way to display it is: http://host/tjänst.html

In case 2):
I type http://host/tjänster in the urlbar, which is a directory
my server does a redirect using a UTF-8 encoded URL
Mozilla displays it as: http://host/tj%C3%A4nster/

If instead I configure my server to do the redirect using ISO 8859-1, Mozilla displays in the urlbar: http://host/tj%E4nster/

Neither of the above is acceptable. Both should be displayed as http://host/tjänster/

Mozilla does the same with ISO 8859-1 encoded links in an ISO 8859-1 encoded HTML document.

If Mozilla really has all URIs internally as UTF-8, there should not be any problem getting the urlbar to display all characters that are "displayable" (by the current locale) without %-encoding them. Why is this not done? The code is so complex that I have difficulty finding where the code handling the urlbar is, but it ought not be that difficult to fix? The same goes for all other places where a URL is displayed, like the status line at the bottom of the window when the mouse is over a link.
It is another matter what you should send to a web server. There you may do %-encoding, encode to the original character set or to the character set of the current locale (or UTF-8 if wanted by the user). And for the host part you need to ACE encode when doing lookups, but when displaying to the user no ACE encoding should be done (if the current locale allows all characters to be displayed, else ACE encoded).

If you have difficulties testing things I can test in my Mozilla 1.4 code and see what happens, if you can tell me what code I need to change.
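The two escaped forms seen in the tests above are simply the same character escaped from two different byte encodings. A small Python sketch (using only the standard `urllib.parse` module, nothing from Mozilla) reproduces them, and shows that either form can be turned back into the readable path as long as the charset used for escaping is known:

```python
from urllib.parse import quote, unquote

path = "/tjänst.html"

# "ä" is the single byte E4 in ISO 8859-1 and the two bytes C3 A4 in UTF-8.
latin1_escaped = quote(path, encoding="iso-8859-1")
utf8_escaped = quote(path, encoding="utf-8")

# For display, unescape with the charset the escapes were produced from.
shown_from_latin1 = unquote(latin1_escaped, encoding="iso-8859-1")
shown_from_utf8 = unquote(utf8_escaped, encoding="utf-8")

print(latin1_escaped, utf8_escaped, shown_from_latin1, shown_from_utf8)
```

The hard part in practice is the one discussed earlier in this bug: knowing which charset a given escaped URL came from.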
*** Bug 279344 has been marked as a duplicate of this bug. ***
*** Bug 306477 has been marked as a duplicate of this bug. ***
*** Bug 284402 has been marked as a duplicate of this bug. ***
IE, Opera, and (I've heard) Safari all show the Unicode characters in the location bar (as we do in the status bar) rather than the ugly escapes. This makes people reading sites like http://el.wikipedia.org/ a heck of a lot happier when they can read what page they're on and paste readable links into mail or blogs.

I'm assuming nhotta isn't really working on this, so I'm reassigning to "nobody" so as not to raise false hopes. Darin or smontagu: does either of you want to take a crack at this?

Nominating for releases; this is unacceptably ugly and our competitors seem to be able to get it right.
Assignee: nhottanscp → nobody
Status: ASSIGNED → NEW
If you paste the pretty-looking name into the URL bar we accept it just fine, but then translate it to the escaped version. http://el.wikipedia.org/wiki/Κύρια_Σελίδα vs. http://el.wikipedia.org/wiki/%CE%9A%CF%8D%CF%81%CE%B9%CE%B1_%CE%A3%CE%B5%CE%BB%CE%AF%CE%B4%CE%B1 If you were Greek, which browser would you use? Extend to most other languages (and start to wonder how we get such high marketshare in Europe).
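To make the "pretty" versus "escaped" forms concrete, here is a rough Python sketch of display-side unescaping. `for_display` is a hypothetical helper, not anything in the Mozilla tree. It assumes the escapes are UTF-8, turns only valid non-ASCII sequences back into characters, and deliberately leaves ASCII escapes such as %20 or %2F alone so the URL keeps its meaning.

```python
import re

# One or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def for_display(url: str) -> str:
    """Hypothetical helper: unescape valid UTF-8 sequences for display."""

    def decode_run(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 (e.g. a lone %E4): leave the escapes as they are.
            return match.group(0)
        # Show non-ASCII characters, but keep ASCII ones escaped.
        return "".join(
            ch if ord(ch) > 0x7F else f"%{ord(ch):02X}" for ch in text
        )

    return _ESCAPE_RUN.sub(decode_run, url)


escaped = (
    "http://el.wikipedia.org/wiki/"
    "%CE%9A%CF%8D%CF%81%CE%B9%CE%B1_%CE%A3%CE%B5%CE%BB%CE%AF%CE%B4%CE%B1"
)
pretty = for_display(escaped)
```

A real implementation would also need to decide what to do with invisible or confusable characters; this sketch does not attempt that.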
Sorry about the first garble, my test on landfill indicated it was going to work. It was supposed to read "Κύρια_Σελίδα" (but this might be garbled, too)
Following Daniel Veditz's comments, I have set up a demonstration page at: http://dimitris.glezos.com/box/firefox-bug-demo.php

No matter what characters a link contains (escaped or not), the link copied and displayed in the URL bar should *not* contain escaped characters (this is the default behaviour of the status bar, which, as Daniel said, is good). Again, IE, Opera and Safari seem to get it right.

Solving this bug will help many international websites that use URLs containing non-Latin characters (page names, variables etc). This is quite common, for example in non-English Wikipedia sites (e.g. http://el.wikipedia.org).
For attractive display of URIs, see also bug 105909. Perhaps some of these bugs should be merged?
(In reply to comment #32)
> No matter what characters a link contains (escaped or not), the link copied and
> displayed in the URL bar should *not* contain escaped characters (this is the
> default behaviour of the status bar, which, as Daniel said, is good).

That's not necessarily good. It's not that simple, either.
How is this related to bug 261929?
Is this related to 261929? The compat arguments are compelling, if we can get there without too-invasive or too-incompatible changes. Renominate if that's the case (with explanation)?
*** Bug 243547 has been marked as a duplicate of this bug. ***
Not sure if this is a Thunderbird and/or Firefox issue, but when clicking on an IRI in Thunderbird, not only does the URL not display with the characters that were sent (as discussed above), it also links to an unreadable URL. I'd suggest this latter bug be given even higher priority, since it is not only a localization display issue (as important as that is). Try sending yourself an email with http://中.com/ for example and click on the link, and then compare that to typing the IRI directly in the location bar of Firefox.
Filter on "Nobody_NScomTLD_20080620"
Assignee: nobody → smontagu
QA Contact: ruixu → i18n
I stumbled upon this bug today (I hope we're talking about the same bug). I had a weird and annoying incident where FF 3.5 (FF 3 does the same) would trash my well-formed URL while it was in the address bar (it was correct in the first place). Chrome 3, IE8 and Opera 9.64 don't show this weird behavior.

You can reproduce it the following way (I tried this on Windows, 7 actually):
1) Go to youtube.com and enter "über den wolken" into the search field.
2) It will show you the search results, and the URL in the address bar looks something like this:
http://www.youtube.com/results?search_query=%C3%BCber+den+wolken&search_type=&aq=0&oq=%C3%BCber
with the difference that it doesn't show the escape sequences (i.e. %C3%BC) but the actual "special" characters (in the case of %C3%BC it shows ü).
3) Now bad things happen: click inside the address bar, then hit ENTER on your keyboard.
4) The URL is now malformed:
http://www.youtube.com/results?search_query=%FCber+den+wolken&search_type=&aq=0&oq=%FCber

Strangely enough, %C3%BC got substituted by %FC (the extended ASCII representation of ü).

Thoughts?
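The substitution in step 4 is what one would expect if the displayed query is re-escaped from a different charset than the one it was unescaped from. The following Python sketch only models that round trip with `urllib.parse`; it makes no claim about where in Firefox the conversion actually happens.

```python
from urllib.parse import quote, unquote

original = "search_query=%C3%BCber+den+wolken"

# What the address bar shows: the UTF-8 escapes turned into characters.
shown = unquote(original, encoding="utf-8")

# Re-escaping the displayed text from ISO 8859-1 bytes changes the URL ...
resent_latin1 = quote(shown, safe="=+&", encoding="iso-8859-1")

# ... while re-escaping it from UTF-8 bytes gives back the original.
resent_utf8 = quote(shown, safe="=+&", encoding="utf-8")

print(shown, resent_latin1, resent_utf8)
```

The server sees two different byte sequences for "ü", so a site that expects UTF-8 queries will treat the ISO 8859-1 form as malformed.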
I figured setting network.standard-url.encode-query-utf8 to true would make this bug disappear! It is false by default, which I can't understand.
Isn't it fixed?
The last remnant of non-ASCII-ness in nsStandardURL will be removed in bug 1447190. All URLs are currently ASCII encoded.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME