Closed Bug 284474 Opened 19 years ago Closed 19 years ago

Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs

Categories

(Core :: Networking, defect)

defect
Not set
major

RESOLVED FIXED
mozilla1.8beta5

People

(Reporter: jshin1987, Assigned: darin.moz)

Details

(Keywords: fixed1.8, intl, regression)

Attachments

(1 file)

On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8')
thinking (that was mainly me) that IE's 'Always send URL in UTF-8' is equivalent
to full IRI support. It turned out that with that option ON (which is the
default), IE sends the path part in %-encoded UTF-8 (regardless of the encoding
of the referring page) but uses the encoding of the referring page in the query part.

For instance, if you press the submit button in the page given at the URL, IE
asks for the following URL 

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%B0%A1%B3%AA

while Mozilla asks for 

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%EA%B0%80%EB%82%98

The page is in EUC-KR and '%EA%B0%80%EB%82%98' is %-encoded '가나' in UTF-8
while  '%B0%A1%B3%AA' is %-encoded '가나' (U+AC00, U+B098) in EUC-KR. 
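
The two escaped forms above can be reproduced with a short Python sketch. It is only an illustration of the byte-level difference, not anything from the Mozilla code; the variable names are made up for this example.

```python
from urllib.parse import quote

text = "가나"  # U+AC00, U+B098

# Percent-encode the same two characters under each charset.
as_utf8 = quote(text, encoding="utf-8")
as_euc_kr = quote(text, encoding="euc-kr")

print(as_utf8)    # %EA%B0%80%EB%82%98
print(as_euc_kr)  # %B0%A1%B3%AA
```

The same text yields different octets depending on the charset, so a server that expects one form cannot decode the other.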

I'm not sure what to do with this thorny issue. We can just ignore the issue and
hope everyone moves on and writes web pages in UTF-8 (which will not happen very
soon). We might be able to ignore the issue assuming that most web sites use 'post'

Do we want to follow IE's suit and produce a weird 'cross' like that? I
guess not.  We can 'educate' web authors so that they check UTF8-ness first
before interpreting the query string in the character encoding of the referring
document.
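
A minimal sketch of the 'check UTF8-ness first' idea on the server side, assuming a Python handler. The function name `decode_query_value` and its `page_encoding` parameter are hypothetical and only illustrate the approach.

```python
from urllib.parse import unquote_to_bytes


def decode_query_value(raw: str, page_encoding: str) -> str:
    """Decode a %-escaped query value, preferring UTF-8 when the octets are valid UTF-8."""
    octets = unquote_to_bytes(raw)
    try:
        # Strict decoding: any invalid UTF-8 sequence raises UnicodeDecodeError.
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the encoding of the page that contained the link or form.
        return octets.decode(page_encoding)


from_mozilla = decode_query_value("%EA%B0%80%EB%82%98", "euc-kr")
from_ie = decode_query_value("%B0%A1%B3%AA", "euc-kr")
```

Both calls return '가나' here. This is a heuristic: some byte sequences in a legacy encoding are also valid UTF-8, so the check can misfire on short or unusual input.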
> On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8')

I should have said we had turned on a 'standard-compliant' conversion of IRI to
URI. (bug 261929)

Martin, how does your fileiri module for Apache 2.x handle URIs generated by MS
IE (path part in %-escaped UTF-8 and query part in %-escaped document encoding)?

I now realize that what MS IE does is not so unreasonable and that we might
have to consider handling the query part separately and differently from the
path part...
 
If I type an IRI (http://jshin.net/moztest/가나.cgi?f1=가나) directly in the
open-location dialog box, MS IE 6 (under Korean locale) sends the following URL:

/moztest/%EA%B0%80%EB%82%98.cgi?f1=\xb0\xa1\xb3\xaa   (where '\xb0' stands for
the octet '0xB0', etc) 

This is really interesting...




*** Bug 284537 has been marked as a duplicate of this bug. ***
*** Bug 271828 has been marked as a duplicate of this bug. ***
Blocks: iri
My comment #0 was wrong. The query part in the second and the third cases at the
url (in the url field above) is encoded in the page encoding before being
URL-escaped whether the pref. 'network.standard-url.encode-utf8' is true or false. The
path part is converted to UTF-8 if the pref. is set to true.

The real 'issue' is what to do when a literal URL (*without* url-escape) in an
html page is used (cases 4 and 6 in my test page). Firefox converts the whole
url to UTF-8 (path and query) while MS IE converts only the path part. What FF
does is standard-compliant (which I was reminded of in another bug).


Note that url-escaping the non-ASCII query part (case 5) would enable web
authors to avoid this issue in the first place.
Summary: IRI and parity with IE which has only half-baked IRI support → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs
*** Bug 306682 has been marked as a duplicate of this bug. ***
RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt) is very clear about converting
IRIs to URIs and there's no doubt that what we do is compliant with RFC 3987.
However, it seems like we're on the losing end. Both MS IE and Opera treat the
query part differently from the path part. While they follow RFC 3987 when
converting the path part, they use the character encoding of the page where an
IRI is referred to (instead of UTF-8) when converting the query part.

BTW, Safari behaves the same way as Firefox with 'network.standard-url.encode-utf8' set
to false (i.e. it doesn't honor RFC 3987 at all).
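
The two conversions discussed above can be sketched in Python as follows. This is an illustration only, not how Necko, IE or Opera are implemented; the helper `iri_to_uri` and its `query_encoding` parameter are hypothetical, and the host name is assumed to be ASCII already (IDN handling is left out).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str, query_encoding: str = "utf-8") -> str:
    """Percent-encode an IRI: path always as UTF-8, query in the given encoding."""
    parts = urlsplit(iri)
    # '%' is kept safe so escapes already present in the IRI are not double-encoded.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query = quote(parts.query, safe="=&%", encoding=query_encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


iri = "http://jshin.net/moztest/가나.cgi?f1=가나"

# RFC 3987 style: UTF-8 for both path and query.
rfc3987_style = iri_to_uri(iri)

# IE/Opera style on an EUC-KR page: UTF-8 path, page encoding for the query.
ie_opera_style = iri_to_uri(iri, query_encoding="euc-kr")

print(rfc3987_style)
print(ie_opera_style)
```

With the test URL from comment #0, the first call gives the URL Mozilla requests and the second gives the URL IE requests.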
*** Bug 308563 has been marked as a duplicate of this bug. ***
*** Bug 308732 has been marked as a duplicate of this bug. ***
*** Bug 308944 has been marked as a duplicate of this bug. ***
This breaks most sub-category links on a big Greek online PC & gadget shop
(http://www.e-shop.gr), rendering the site unusable. Funny thing, they are using
FF 1.0 or older on their own machines (I saw it myself).
The bug currently affects the Firefox 1.5 branch too, and I wonder if it would
be proper to suggest that the site's authors modify their code as soon as the final
1.5 is out.
Flags: blocking1.8b5?
It seems that we either want to take the non-risk path and revert back to 1.0
behaviour, or, depending upon the availability of someone to do the work, and
the complexity of the patch, follow IE and Opera.
jshin: We need to decide what to do here.  I'm leaning toward backing out the
UTF-8 change, so that we return to a known state.  Thoughts?
Severity: normal → major
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.8beta5
It's tough to decide what to do here. I want Firefox to comply with the standard
as much as possible, but reality doesn't seem to let us do that in this
case. There are three options:

1. go back to what we did in 1.0
2. leave our code alone, mark this bug as a 'tech-evangel' bug and try to spread
the word about RFC 3987 (the query part should be %-escaped if they want to use
a non-UTF-8 encoding). In addition, document the configuration option for the
URI/IRI conversion in a prominent place (release notes, help, etc.). We may even
add an advanced option with a real UI.
3. implement what IE/Opera do. 

#3 is not feasible for 1.5, I guess. Even if it's feasible, I'm rather reluctant
to do that now that I have read the full text of RFC 3987. I wish I could argue
strongly for #2, but given the reality in the wild, I can't. So, I have to say
'let's go with #1' for now. 
OK, #1 it is.  I'm also sad to see this change reverted :(
Keywords: regression
*** Bug 310033 has been marked as a duplicate of this bug. ***
Need to get a patch ASAP flipping the pref back.
Flags: blocking1.8b5? → blocking1.8b5+
Blocks: 308159
Similar problems with local file handling (BUG 308159)...
Attached patch v1 patch
Attachment #197637 - Flags: review?(jshin1987)
Attachment #197637 - Flags: superreview?(bzbarsky)
Comment on attachment 197637
v1 patch

We should probably do the IE/Opera thing for 1.9... :(
Attachment #197637 - Flags: superreview?(bzbarsky) → superreview+
I'm sorry that I haven't gotten back to this earlier. Sometime after I wrote my
comment and read Darin's reply that agreed with me, I began to have second
thoughts. What portion of web pages is affected by this problem? Is it so
widespread a practice to include URLs whose query part contains 'raw'
characters (not %-encoded) that we have to give in? I doubt it is. (We have 7 or
8 dupes, but is that a good indicator of how widely this practice is used?)
Wouldn't the web as a whole be 'better off' in the long run if we, *now having
some sizable market share*, stick to the standard and spread the word that
%-encoding must be used in the query part in non-UTF-8 pages? Sorry again that I
have second thoughts, but it'd be nice if we could 'deliberate' on this issue a bit
longer before making a final decision.
This breaks at least one major French site: the dictionary on TV5.org doesn't
work reliably on Fx 1.5b2

See:
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1

Enter a search term with accented characters, for example "éléphant". Observe
that it works perfectly. The URL contains "%E9l%E9phant", the page uses
ISO-8859-1 everywhere. Now, click on the title of that definition (the same
word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while
the URL contains "%C3%A9l%C3%A9phant".
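
The failure described above can be reproduced with a small Python sketch. It assumes, hypothetically, that the server decodes every query value as ISO-8859-1; nothing here is taken from the site's actual code.

```python
from urllib.parse import quote, unquote

word = "éléphant"

# What the search form submits from an ISO-8859-1 page.
link_latin1 = quote(word, encoding="iso-8859-1")

# What the browser sends when it converts the raw link to UTF-8 first.
link_utf8 = quote(word, encoding="utf-8")

# A server that assumes ISO-8859-1 misreads the UTF-8 octets.
seen_by_server = unquote(link_utf8, encoding="iso-8859-1")

print(link_latin1)     # %E9l%E9phant
print(link_utf8)       # %C3%A9l%C3%A9phant
print(seen_by_server)  # Ã©lÃ©phant
```

Each "é" arrives as two octets, which the server reads as two separate ISO-8859-1 characters, so the lookup no longer matches the dictionary entry.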

What's more worrying to us is that this very online dictionary is also the
only one in French that does not use frames or session management, which would
prevent us from making a search plugin for it. We are considering including it in the
default search plugins for 1.5 (we currently ship without any dictionary plugin in French).

Of course I'll write to the site owners to ask them to encode their links, but
the bottom line of this is that it is a very common practice on international
sites to not encode URLs, assuming they use the same encoding as the current page.
One proposal is to turn it off, the other is to leave the status quo and try to
reach out with evangelism. Jshin and Darin, what do you guys think?
Flags: blocking1.8b5+ → blocking1.8b5-
Blocks: 309126
*** Bug 311300 has been marked as a duplicate of this bug. ***
I think we should revert to 1.0 behavior for the 1.5 release until we have a
better solution in place.
Flags: blocking1.8rc1?
Blocks: 311387
Comment on attachment 197637
v1 patch

Sorry for the delay. 
Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make
an exception for 'ftp://', which we have to do anyway until we implement RFC
2640 (bug 26767)
Attachment #197637 - Flags: review?(jshin1987)
Attachment #197637 - Flags: review+
Attachment #197637 - Flags: approval1.8rc1?
*** Bug 311748 has been marked as a duplicate of this bug. ***
Comment on attachment 197637
v1 patch

approving the reverting back to 1.0 defaults.
Attachment #197637 - Flags: approval1.8rc1? → approval1.8rc1+
Flags: blocking1.8rc1? → blocking1.8rc1+
fixed-on-trunk, fixed1.8
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Keywords: fixed1.8
Resolution: --- → FIXED
I have followed this discussion (mostly just reading, occasionally also thinking
about writing a comment, but never getting around to actually writing one).

My understanding is that for the mid-range future (after Firefox 1.5) it
would be very good to move to what's called the IE/Opera behavior, i.e.
using UTF-8 for the path/file part and the page encoding for the query
part. While this solution is not optimal, it would definitely be a big
step forward.

I'm wondering how this is going to be reflected in bugzilla. This bug
here is marked as fixed, but can it be marked as 'fixed for 1.8 but
open for 1.9' or some such?

Also, how much effort is needed for implementing the IE/Opera solution?

Regards,   Martin.

(In reply to comment #26)
> (From update of attachment 197637)
> Sorry for the delay. 
> Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make
> an exception for 'ftp://', which we have to do anyway until we implement RFC
> 2640 (bug 26767)
> 

"I'm wondering how this is going to be reflected in bugzilla. This bug
here is marked as fixed, but can it be marked as 'fixed for 1.8 but
open for 1.9' or some such?"

It could technically (fixed1.8 keyword is for 1.8, but FIXED is for trunk/1.9),
but that hasn't happened here - the change has been made on trunk as well. So
either a new bug should be filed for 1.9, or bug 261929 should be reopened -
maybe Darin could do whatever is appropriate?
I went ahead and reopened bug 261929 since the key part of the patch in that bug
was backed out.
(In reply to comment #30)

Martin, thanks for your comment. I considered asking you to add your comment here, but hadn't managed to. 

> My understanding is that for the mid-range future (after Firefox 1.5) it
> would be very good to move to what's called the IE/Opera behavior, i.e.
> using UTF-8 for the path/file part and the page encoding for the query
> part. While this solution is not optimal, it would definitely be a big
> step forward.

However, that behavior is in violation of RFC 3987, which you wrote, or did I misread it?
*** Bug 313717 has been marked as a duplicate of this bug. ***
*** Bug 314178 has been marked as a duplicate of this bug. ***
(In reply to comment #22)
> This breaks at least one major French site: the dictionary on TV5.org doesn't
> work reliably on Fx 1.5b2
> 
> See:
> http://dictionnaire.tv5.org/dictionnaires.asp?Action=1
> 
> Enter a search term with accented characters, for example "éléphant". Observe
> that it works perfectly. The URL contains "%E9l%E9phant", the page uses
> ISO-8859-1 everywhere. Now, click on the title of that definition (the same
> word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while
> the URL contains "%C3%A9l%C3%A9phant".

This seems to work perfectly with MSIE 7 with the "always send UTF-8 URLs" option turned on (which is the default, I believe).

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&param=éléphant&che=1
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammifère&Alea=29385

I think the Latin-1 %-encoding works as well as the UTF-8 one; this is the right way of introducing RFC 3987, I believe (through a little alias?).

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%C3%A8re&Alea=12018
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%E8re&Alea=12018
Summary: Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs
The default value breaks Wikipedia (with URL masking OFF), Facebook, YouTube and a whole bunch of other common, fully UTF-8 websites.  Shouldn't we be standards-compliant once more?
Henry, can you be more specific about what works and what fails for wikipedia, facebook, youtube,...?
(Re comments 37,38: Henry filed bug 552273 about this.)