Bug 284474 - Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs
Status: RESOLVED FIXED
Keywords: fixed1.8, intl, regression
Product: Core
Classification: Components
Component: Networking
Version: Trunk
Platform: All All
Importance: -- major with 4 votes
Target Milestone: mozilla1.8beta5
Assigned To: Darin Fisher
QA Contact: benc
Mentors:
URL: http://jshin.net/moztest/p_euckr.html
Duplicates: 271828 284537 306682 308563 308732 308944 310033 311300 311748 313717 314178
Depends on:
Blocks: iri 308159 309126 311387
Reported: 2005-03-02 10:31 PST by Jungshik Shin
Modified: 2010-03-21 11:58 PDT
asa: blocking1.8b5-
asa: blocking1.8rc1+


Attachments
v1 patch (1.11 KB, patch)
2005-09-27 17:17 PDT, Darin Fisher
jshin1987: review+
bzbarsky: superreview+
asa: approval1.8rc1+

Description Jungshik Shin 2005-03-02 10:31:04 PST
On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8'),
thinking (that was mainly me) that IE's 'Always send URL in UTF-8' is equivalent
to full IRI support. It turned out that with that option ON (which is the
default), IE sends the path part in %-encoded UTF-8 (regardless of the encoding
of the referring page) but uses the encoding of the referring page in the query part.

For instance, if you press the submit button in the page given at the URL, IE
asks for the following URL 

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%B0%A1%B3%AA

while Mozilla asks for 

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%EA%B0%80%EB%82%98

The page is in EUC-KR and '%EA%B0%80%EB%82%98' is %-encoded '가나' in UTF-8
while  '%B0%A1%B3%AA' is %-encoded '가나' (U+AC00, U+B098) in EUC-KR. 
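
To make the two encodings concrete, here is a minimal sketch (an illustration, not code from Mozilla or from the test page) that %-encodes the same two characters both ways. It assumes Python's standard urllib.parse.quote and the built-in 'euc-kr' codec.

```python
from urllib.parse import quote

# The two Hangul syllables from the test page: U+AC00 and U+B098.
text = "\uac00\ub098"

# Percent-encode the UTF-8 octets (the form Mozilla requests above).
utf8_escaped = quote(text, encoding="utf-8")

# Percent-encode the EUC-KR octets (the form IE uses in the query part).
euckr_escaped = quote(text, encoding="euc-kr")

print(utf8_escaped)   # %EA%B0%80%EB%82%98
print(euckr_escaped)  # %B0%A1%B3%AA
```

The same characters yield six octets in UTF-8 but four in EUC-KR, so a server-side program that expects one form cannot interpret the other without knowing which encoding was used.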

I'm not sure what to do with this thorny issue. We can just ignore the issue and
hope everyone moves on and writes web pages in UTF-8 (which will not happen very
soon). We might be able to ignore the issue assuming that most web sites use 'post'

Do we want to follow IE's suit and produce a weird 'cross' like that? I
guess not.  We can 'educate' web authors so that they check for UTF-8-ness first
before interpreting the query string in the character encoding of the referring
document.
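
The 'check UTF8-ness first' idea could look roughly like the sketch below on the server side. The helper name decode_query_value and its page_encoding parameter are hypothetical, chosen only for illustration. The check is a heuristic: some legacy-encoded octet sequences also happen to be valid UTF-8 and would be misread.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw_value: str, page_encoding: str) -> str:
    """Hypothetical helper: decode one %-encoded query value.

    Tries strict UTF-8 first and falls back to the encoding of the
    page that contained the link or form.
    """
    octets = unquote_to_bytes(raw_value)
    try:
        # Strict decoding raises if the octets are not well-formed UTF-8.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Assume the legacy page encoding; replace undecodable octets.
        return octets.decode(page_encoding, errors="replace")


# Both request forms from this report decode to the same text.
print(decode_query_value("%EA%B0%80%EB%82%98", "euc-kr"))
print(decode_query_value("%B0%A1%B3%AA", "euc-kr"))
```

With such a fallback a server-side program would accept both the Mozilla and the IE form of the query without any change to the pages that link to it.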
Comment 1 Jungshik Shin 2005-03-03 03:18:17 PST
> On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8')

I should have said we had turned on a 'standard-compliant' conversion of IRI to
URI. (bug 261929)

Martin, how does your fileiri module for Apache 2.x handle URIs generated by MS
IE (path part in %-escaped UTF-8 and query part in %-escaped document encoding)?

I now realize that what MS IE does is not so unreasonable and that we might
have to consider handling the query part separately and differently from the
path part...
 
Comment 2 Jungshik Shin 2005-03-03 03:59:40 PST
If I type an IRI (http://jshin.net/moztest/가나.cgi?f1=가나) directly in the
open-location dialog box, MS IE 6 (under Korean locale) sends the following URL:

/moztest/%EA%B0%80%EB%82%98.cgi?f1=\xb0\xa1\xb3\xaa   (where '\xb0' stands for
the octet '0xB0', etc) 

This is really interesting...




Comment 3 Jungshik Shin 2005-03-03 16:47:27 PST
*** Bug 284537 has been marked as a duplicate of this bug. ***
Comment 4 Jungshik Shin 2005-03-04 18:02:05 PST
*** Bug 271828 has been marked as a duplicate of this bug. ***
Comment 5 Jungshik Shin 2005-09-02 22:51:33 PDT
My comment #0 was wrong. The query part in the second and the third cases at the
url (in the url field above) is encoded in the page encoding before being
URL-escaped whether the pref 'network.standard-url.encode-utf8' is true or false.
The path part is converted to UTF-8 if the pref is set to true.

The real 'issue' is what to do when a literal URL (*without* url-escaping) in an
html page is used (cases 4 and 6 in my test page). Firefox converts the whole
url to UTF-8 (path and query) while MS IE converts only the path part. What FF
does is standard-compliant (which I was reminded of in another bug).


Note that url-escaping the non-ASCII query part (case 5) would enable web
authors to avoid this issue in the first place.
Comment 6 Jungshik Shin 2005-09-02 22:56:38 PDT
*** Bug 306682 has been marked as a duplicate of this bug. ***
Comment 7 Jungshik Shin 2005-09-09 08:22:34 PDT
RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt) is very clear about converting
IRIs to URIs and there's no doubt that what we do is compliant with RFC 3987.
However, it seems like we're on the losing end. Both MS IE and Opera treat the
query part differently from the path part. While they follow RFC 3987 when
converting the path part, they use the character encoding of the page where an
IRI is referred to (instead of UTF-8) when converting the query part.

BTW, Safari behaves the same way as Firefox with 'network.standard-url.encode-utf8'
set to false (i.e. it doesn't honor RFC 3987 at all).
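
The two conversion strategies can be contrasted with a short sketch. This is not Mozilla's, IE's, or Opera's actual code. The function iri_to_uri and its query_uses_page_encoding flag are hypothetical names, and host names (IDN) and fragments are deliberately left untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str, page_encoding: str,
               query_uses_page_encoding: bool) -> str:
    """Hypothetical sketch of the two behaviours discussed here.

    The path is always %-encoded as UTF-8. The query is encoded either
    as UTF-8 (RFC 3987 style) or in the page encoding (the IE/Opera
    style described above). Existing %-escapes are left as they are.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query_encoding = page_encoding if query_uses_page_encoding else "utf-8"
    # Raises UnicodeEncodeError if a character does not exist in the
    # page encoding; a real implementation would need a fallback.
    query = quote(parts.query, safe="=&%", encoding=query_encoding)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment))


iri = "http://jshin.net/moztest/\uac00\ub098.cgi?f1=\uac00\ub098"
print(iri_to_uri(iri, "euc-kr", query_uses_page_encoding=False))
print(iri_to_uri(iri, "euc-kr", query_uses_page_encoding=True))
```

The first call reproduces the URL Mozilla requests in comment #0 and the second the URL IE requests; only the query part differs.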
Comment 8 Phil Ringnalda (:philor) 2005-09-14 20:48:41 PDT
*** Bug 308563 has been marked as a duplicate of this bug. ***
Comment 9 Phil Ringnalda (:philor) 2005-09-15 18:30:26 PDT
*** Bug 308732 has been marked as a duplicate of this bug. ***
Comment 10 Phil Ringnalda (:philor) 2005-09-17 08:39:43 PDT
*** Bug 308944 has been marked as a duplicate of this bug. ***
Comment 11 Dimitrios 2005-09-18 23:18:11 PDT
This breaks most sub-category links on a big Greek online PC & gadget shop
(http://www.e-shop.gr), rendering the site unusable. Funny thing, they are using
FF 1.0 or older on their own machines (I saw it myself).
The bug currently affects the Firefox 1.5 branch too, and I wonder if it would
be proper to suggest that the site's authors modify their code as soon as the final
1.5 is out.
Comment 12 Chris Beard 2005-09-19 15:35:04 PDT
It seems that we either want to take the non-risk path and revert back to 1.0
behaviour, or, depending upon the availability of someone to do the work, and
the complexity of the patch, follow IE and Opera.
Comment 13 Darin Fisher 2005-09-19 18:00:38 PDT
jshin: We need to decide what to do here.  I'm leaning toward backing out the
UTF-8 change, so that we return to a known state.  Thoughts?
Comment 14 Jungshik Shin 2005-09-20 11:15:10 PDT
It's tough to decide what to do here. I want Firefox to comply with the standard
as much as possible, but the reality doesn't seem to let us do that in this
case. There are three options:

1. go back to what we did in 1.0
2. leave our code alone, mark this bug as a 'tech-evangel' bug and try to spread
the word about RFC 3987 (the query part should be %-escaped if they want to use
a non-UTF-8 encoding). In addition, document the configuration option for the
URI/IRI conversion in a prominent place (release notes, help, etc.). We may even
add an advanced option with a real UI.
3. implement what IE/Opera do. 

#3 is not feasible for 1.5, I guess. Even if it's feasible, I'm rather reluctant
to do that now that I have read the full text of RFC 3987. I wish I could argue
strongly for #2, but given the reality in the wild, I can't. So, I have to say
'let's go with #1' for now. 
Comment 15 Darin Fisher 2005-09-23 14:17:42 PDT
OK, #1 it is.  I'm also sad to see this change reverted :(
Comment 16 Phil Ringnalda (:philor) 2005-09-26 09:25:34 PDT
*** Bug 310033 has been marked as a duplicate of this bug. ***
Comment 17 Mike Connor [:mconnor] 2005-09-26 14:50:51 PDT
Need to get a patch ASAP flipping the pref back.
Comment 18 Zvi Devir 2005-09-27 06:31:06 PDT
Similar problems with local file handling (BUG 308159)...
Comment 19 Darin Fisher 2005-09-27 17:17:21 PDT
Created attachment 197637
v1 patch
Comment 20 Boris Zbarsky [:bz] (TPAC) 2005-09-27 21:20:01 PDT
Comment on attachment 197637
v1 patch

We should probably do the IE/Opera thing for 1.9... :(
Comment 21 Jungshik Shin 2005-09-29 20:05:13 PDT
I'm sorry that I haven't gotten back to this earlier. Sometime after I wrote my
comment and read Darin's reply that agreed with me, I began to have second
thoughts. What portion of web pages is affected by this problem? Is it so
widespread a practice to include URLs with a query part that contains 'raw'
(not %-encoded) characters that we have to give in? I doubt it is (we have 7 or
8 dupes, but is that a good indicator of how widely this practice is used?).
Shouldn't the web as a whole be 'better off' in the long run if we, *now having
some sizable market share*, stick to the standard and spread the word that
%-encoding must be used in the query part in non-UTF-8 pages? Sorry again that I
have second thoughts, but it'd be nice if we could deliberate on this issue a bit
longer before making a final decision.
Comment 22 Benoit 2005-09-30 07:27:15 PDT
This breaks at least one major French site: the dictionary on TV5.org doesn't
work reliably on Fx 1.5b2

See:
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1

Enter a search term with accented characters, for example "éléphant". Observe
that it works perfectly. The URL contains "%E9l%E9phant", the page uses
ISO-8859-1 everywhere. Now, click on the title of that definition (the same
word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while
the URL contains "%C3%A9l%C3%A9phant".
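
The failure can be reproduced in a few lines. This sketch assumes, as the symptoms suggest but the site's code does not confirm, that the server decodes every query value as ISO-8859-1.

```python
from urllib.parse import quote, unquote

word = "\u00e9l\u00e9phant"  # "éléphant" with precomposed U+00E9

# What the search form sends: ISO-8859-1 octets, %-encoded.
latin1_value = quote(word, encoding="iso-8859-1")

# What the unescaped link becomes after conversion to UTF-8.
utf8_value = quote(word, encoding="utf-8")

# Assumed server behaviour: always decode as ISO-8859-1.
decoded_ok = unquote(latin1_value, encoding="iso-8859-1")
decoded_garbled = unquote(utf8_value, encoding="iso-8859-1")

print(latin1_value)     # %E9l%E9phant
print(utf8_value)       # %C3%A9l%C3%A9phant
print(decoded_ok)       # the original word
print(decoded_garbled)  # each é shows up as two Latin-1 characters
```

Each é arrives as the two octets C3 A9, which an ISO-8859-1 decoder turns into two separate characters, so the dictionary lookup no longer matches the original word.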

What's more worrying to us is that this very online dictionary is also the
only one in French that does not use frames or session management, which would
prevent us from making a search plugin for it. We are considering including it in the
default search plugins for 1.5 (we currently ship without any dictionary plugin in French).

Of course I'll write to the site owners to ask them to encode their links, but
the bottom line of this is that it is a very common practice on international
sites to not encode URLs, assuming they use the same encoding as the current page.
Comment 23 Asa Dotzler [:asa] 2005-09-30 12:12:23 PDT
One proposal is to turn it off, the other is to leave the status quo and try to
reach out with evangelism. Jshin and Darin, what do you guys think?
Comment 24 Jesse Ruderman 2005-10-05 21:09:23 PDT
*** Bug 311300 has been marked as a duplicate of this bug. ***
Comment 25 Darin Fisher 2005-10-06 13:59:08 PDT
I think we should revert to 1.0 behavior for the 1.5 release until we have a
better solution in place.
Comment 26 Jungshik Shin 2005-10-07 20:56:21 PDT
Comment on attachment 197637
v1 patch

Sorry for the delay. 
Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make
an exception for 'ftp://', which we have to do anyway until we implement RFC
2640 (bug 26767)
Comment 27 Phil Ringnalda (:philor) 2005-10-08 23:56:38 PDT
*** Bug 311748 has been marked as a duplicate of this bug. ***
Comment 28 Asa Dotzler [:asa] 2005-10-10 15:15:48 PDT
Comment on attachment 197637
v1 patch

approving the reverting back to 1.0 defaults.
Comment 29 Darin Fisher 2005-10-10 18:43:01 PDT
fixed-on-trunk, fixed1.8
Comment 30 Martin Dürst 2005-10-11 02:01:43 PDT
I have followed this discussion (mostly just reading, occasionally also thinking
about writing a comment, but never getting around to actually writing one).

My understanding is that for the mid-range future (after Firefox 1.5) it
would be very good to move to what's called the IE/Opera behavior, i.e.
using UTF-8 for the path/file part and the page encoding for the query
part. While this solution is not optimal, it would definitely be a big
step forward.

I'm wondering how this is going to be reflected in bugzilla. This bug
here is marked as fixed, but can it be marked as 'fixed for 1.8 but
open for 1.9' or some such?

Also, how much effort is needed for implementing the IE/Opera solution?

Regards,   Martin.

(In reply to comment #26)
> (From update of attachment 197637)
> Sorry for the delay. 
> Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make
> an exception for 'ftp://', which we have to do anyway until we implement RFC
> 2640 (bug 26767)
> 

Comment 31 Michael Lefevre 2005-10-11 03:23:10 PDT
"I'm wondering how this is going to be reflected in bugzilla. This bug
here is marked as fixed, but can it be marked as 'fixed for 1.8 but
open for 1.9' or some such?"

It could technically (fixed1.8 keyword is for 1.8, but FIXED is for trunk/1.9),
but that hasn't happened here - the change has been made on trunk as well. So
either a new bug should be filed for 1.9, or bug 261929 should be reopened -
maybe Darin could do whatever is appropriate?
Comment 32 Darin Fisher 2005-10-11 18:06:18 PDT
I went ahead and reopened bug 261929 since the key part of the patch in that bug
was backed out.
Comment 33 Jungshik Shin 2005-10-24 17:27:26 PDT
(In reply to comment #30)

Martin, thanks for your comment. I considered asking you to add your comment here, but hadn't managed to. 

> My understanding is that for the mid-range future (after Firefox 1.5) it
> would be very good to move to what's called the IE/Opera behavior, i.e.
> using UTF-8 for the path/file part and the page encoding for the query
> part. While this solution is not optimal, it would definitely be a big
> step forward.

However, that behavior is in violation of RFC 3987, which you wrote, or did I misread it?
Comment 34 Smokey Ardisson (offline for a while; not following bugs - do not email) 2005-10-25 06:34:42 PDT
*** Bug 313717 has been marked as a duplicate of this bug. ***
Comment 35 Phil Ringnalda (:philor) 2005-10-28 22:12:45 PDT
*** Bug 314178 has been marked as a duplicate of this bug. ***
Comment 36 P. Andries 2007-05-20 21:21:12 PDT
(In reply to comment #22)
> This breaks at least one major French site: the dictionary on TV5.org doesn't
> work reliably on Fx 1.5b2
> 
> See:
> http://dictionnaire.tv5.org/dictionnaires.asp?Action=1
> 
> Enter a search term with accentuated characters, for example "éléphant". Observe
> that it works perfectly. The URL contains "%E9l%E9phant", the page uses
> ISO-8859-1 everywhere. Now, click on the title of that definition (the same
> word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while
> the URL contains "%C3%A9l%C3%A9phant".

This seems to work perfectly with MSIE 7 and the "always send UTF-8 URLs" set on (which is default I believe).

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&param=éléphant&che=1
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammifère&Alea=29385

I think the Latin-1 %-encoding works as well as the UTF-8 one, this is the right way of introducing RFC 3987 I believe (through a little alias ?).

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%C3%A8re&Alea=12018
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%E8re&Alea=12018
Comment 37 henryfhchan 2010-03-14 05:29:58 PDT
The default value breaks Wikipedia (with URL masking OFF), Facebook, YouTube and a whole bunch of other common, fully UTF-8 websites.  Shouldn't we be standards-compliant once more?
Comment 38 Martin Dürst 2010-03-15 00:14:41 PDT
Henry, can you be more specific about what works and what fails for wikipedia, facebook, youtube,...?
Comment 39 Nickolay_Ponomarev 2010-03-21 11:58:44 PDT
(Re comments 37,38: Henry filed bug 552273 about this.)
