284474 - Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs

Reporter

Description

•

20 years ago

On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8'), thinking (that was mainly me) that IE's 'Always send URL in UTF-8' is equivalent to full IRI support. It turned out that with that option ON (which is the default), IE sends the path part in %-encoded UTF-8 (regardless of the encoding of the referring page) but uses the encoding of the referring page in the query part.

For instance, if you press the submit button in the page given at the URL, IE asks for the following URL:

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%B0%A1%B3%AA

while Mozilla asks for:

http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%EA%B0%80%EB%82%98

The page is in EUC-KR; '%EA%B0%80%EB%82%98' is %-encoded '가나' (U+AC00, U+B098) in UTF-8, while '%B0%A1%B3%AA' is %-encoded '가나' in EUC-KR.

I'm not sure what to do with this thorny issue. We can just ignore the issue and wish that everyone moves on and writes web pages in UTF-8 (which will not happen very soon). We might be able to ignore the issue, assuming that most web sites use 'post'. Do we want to follow IE's suit and produce a weird 'cross' like that? I guess not. We can 'educate' web authors so that they check UTF8-ness first before interpreting the query string in the character encoding of the referring document.
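The two escaped forms quoted in this comment can be reproduced with a few lines of Python. This is only an illustrative sketch using the standard library; it is not how Mozilla or IE implement the conversion.

```python
from urllib.parse import quote

text = "\uac00\ub098"  # '가나' (U+AC00, U+B098)

# Percent-encode the same two characters under each character encoding.
utf8_form = quote(text, encoding="utf-8")
euckr_form = quote(text, encoding="euc-kr")

print(utf8_form)   # %EA%B0%80%EB%82%98
print(euckr_form)  # %B0%A1%B3%AA
```

A server-side program that expects the EUC-KR form has no way to tell, from the URL alone, which of the two encodings the browser picked.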

Jungshik Shin

Reporter

Comment 1

•

20 years ago

> On the trunk, we turned on IRI by default (network.standard-url.encode-utf8')

I should have said we had turned on a 'standard-compliant' conversion of IRI to URI (bug 261929).

Martin, how does your fileiri module for Apache 2.x handle URIs generated by MS IE (path part in %-escaped UTF-8 and query part in %-escaped document encoding)? I have now realized that what MS IE does is not so unreasonable and that we might have to consider handling the query part separately and differently from the path part...

URL: http://jshin.net/moztest/p_euckr.html → http://jshin.net/moztest/p_euckr.html

Jungshik Shin

Reporter

Comment 2

•

20 years ago

If I type an IRI (http://jshin.net/moztest/가나.cgi?f1=가나) directly in the open-location dialog box, MS IE 6 (under Korean locale) sends the following URL: /moztest/%EA%B0%80%EB%82%98.cgi?f1=\xb0\xa1\xb3\xaa (where '\xb0' stands for the octet '0xB0', etc) This is really interesting...

Jungshik Shin

Reporter

Comment 3

•

20 years ago

*** Bug 284537 has been marked as a duplicate of this bug. ***

Jungshik Shin

Reporter

Comment 4

•

20 years ago

*** Bug 271828 has been marked as a duplicate of this bug. ***

Anne (:annevk)

Updated

•

20 years ago

Blocks: iri

Jungshik Shin

Reporter

Comment 5

•

19 years ago

My comment #0 was wrong. The query part in the second and the third cases at the url (in the url field above) is encoded in the page encoding before being URL-escaped whether the pref. 'standard.url.encode_utf8' is true or false. The path part is turned to UTF-8 if the pref. is set to true. The real 'issue' is what to do when a literal URL (*without* url-escape) in an html page is used (cases 4 and 6 in my test page). Firefox converts the whole url to UTF-8 (path and query) while MS IE converts only the path part. What FF does is standard-compliant (which I was reminded of in another bug). Note that url-escaping the non-ASCII query part (case 5) would enable web authors to avoid this issue in the first place.
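The difference between the two treatments of a literal URL can be sketched as follows. The helper `escape_url` is hypothetical and written only for this illustration. It also percent-escapes the query bytes so that they are printable, which is a simplification: the sketch makes no claim about whether a given browser escapes those bytes or sends them as raw octets.

```python
from urllib.parse import quote, urlsplit

def escape_url(url: str, path_encoding: str, query_encoding: str) -> str:
    """Hypothetical helper: escape path and query with separate encodings."""
    parts = urlsplit(url)
    # '%' is kept safe so escapes already present in the URL are left alone.
    path = quote(parts.path, safe="/%", encoding=path_encoding)
    query = quote(parts.query, safe="=&%", encoding=query_encoding)
    return f"{parts.scheme}://{parts.netloc}{path}?{query}"

url = "http://jshin.net/moztest/가나.cgi?f1=가나"
page_encoding = "euc-kr"  # encoding of the page that contains the link

# Whole URL converted to UTF-8 (path and query).
whole_utf8 = escape_url(url, "utf-8", "utf-8")
# Only the path converted to UTF-8; the query stays in the page encoding.
path_only_utf8 = escape_url(url, "utf-8", page_encoding)

print(whole_utf8)
print(path_only_utf8)
```

The two results differ only after the `?`, which is exactly the part that server-side programs written for the page encoding try to decode.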

Summary: IRI and parity with IE which has only half-baked IRI support → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs

Jungshik Shin

Reporter

Comment 6

•

19 years ago

*** Bug 306682 has been marked as a duplicate of this bug. ***

Jungshik Shin

Reporter

Comment 7

•

19 years ago

RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt) is very clear about converting IRIs to URIs and there's no doubt that what we do is compliant with RFC 3987. However, it seems like we're on the losing end. Both MS IE and Opera treat the query part differently from the path part. While they follow RFC 3987 when converting the path part, they use the character encoding of the page where an IRI is referred to (instead of UTF-8) when converting the query part. BTW, Safari behaves the same way as Firefox with 'standard-url.encode_utf8' set to false (i.e. it doesn't honor RFC 3987 at all).

Phil Ringnalda (:philor)

Comment 8

•

19 years ago

*** Bug 308563 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Comment 9

•

19 years ago

*** Bug 308732 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Comment 10

•

19 years ago

*** Bug 308944 has been marked as a duplicate of this bug. ***

Dimitrios

Comment 11

•

19 years ago

This breaks most sub-category links on a big Greek on-line PC & gadget shop (http://www.e-shop.gr), rendering the site unusable. Funny thing, they are using FF 1.0 or older on their own machines (I saw it myself). The bug currently affects the Firefox 1.5 branch too, and I wonder if it would be proper to suggest that the site's authors modify their code as soon as final 1.5 is out.

Flags: blocking1.8b5?

Chris Beard

Comment 12

•

19 years ago

It seems that we either want to take the non-risk path and revert back to 1.0 behaviour, or, depending upon the availability of someone to do the work, and the complexity of the patch, follow IE and Opera.

Darin Fisher

Assignee

Comment 13

•

19 years ago

jshin: We need to decide what to do here. I'm leaning toward backing out the UTF-8 change, so that we return to a known state. Thoughts?

Darin Fisher

Assignee

Updated

•

19 years ago

Severity: normal → major

Status: NEW → ASSIGNED

Target Milestone: --- → mozilla1.8beta5

Jungshik Shin

Reporter

Comment 14

•

19 years ago

It's tough to decide what to do here. I want Firefox to comply with the standard as much as possible, but the reality doesn't seem to let us do that in this case. There are three options:

1. Go back to what we did in 1.0.
2. Leave our code alone, mark this bug as a 'tech-evangel' bug and try to spread the word about RFC 3987 (the query part should be %-escaped if they want to use a non-UTF-8 encoding). In addition, document the configuration option for the URI/IRI conversion in a prominent place (release notes, help, etc.). We may even add an advanced option with a real UI.
3. Implement what IE/Opera do.

#3 is not feasible for 1.5, I guess. Even if it's feasible, I'm rather reluctant to do that now that I have read the full text of RFC 3987. I wish I could argue strongly for #2, but given the reality in the wild, I can't. So, I have to say 'let's go with #1' for now.
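Option 2 relies on page authors %-escaping the query part themselves in the page's own encoding. A minimal sketch of what that could look like when links are generated by a script follows. The encoding constant, the helper name and the target `search.cgi` are all assumptions made for the example.

```python
from urllib.parse import urlencode

PAGE_ENCODING = "euc-kr"  # assumed: the encoding the page is served in

def escaped_query(params: dict) -> str:
    """Hypothetical helper: %-escape query values in the page encoding."""
    return urlencode(params, encoding=PAGE_ENCODING)

query = escaped_query({"f1": "가나"})
# "search.cgi" is a made-up target used only for illustration.
link = f'<a href="search.cgi?{query}">search</a>'

print(query)  # f1=%B0%A1%B3%AA
print(link)
```

Because the link then contains only ASCII, the request no longer depends on how the browser converts non-ASCII characters in the query.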

Darin Fisher

Assignee

Comment 15

•

19 years ago

OK, #1 it is. I'm also sad to see this change reverted :(

Keywords: regression

Phil Ringnalda (:philor)

Comment 16

•

19 years ago

*** Bug 310033 has been marked as a duplicate of this bug. ***

Mike Connor [:mconnor]

Comment 17

•

19 years ago

Need to get a patch ASAP flipping the pref back.

Flags: blocking1.8b5? → blocking1.8b5+

Zvi Devir

Updated

•

19 years ago

Blocks: 308159

Zvi Devir

Comment 18

•

19 years ago

Similar problems with local file handling (BUG 308159)...

Darin Fisher

Assignee

Comment 19

•

19 years ago

Attached patch v1 patch — Details — Splinter Review

Attachment #197637 - Flags: review?(jshin1987)

Darin Fisher

Assignee

Updated

•

19 years ago

Attachment #197637 - Flags: superreview?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 20

•

19 years ago

Comment on attachment 197637 [details] [diff] [review] v1 patch We should probably do the IE/Opera thing for 1.9... :(

Attachment #197637 - Flags: superreview?(bzbarsky) → superreview+

Jungshik Shin

Reporter

Comment 21

•

19 years ago

I'm sorry that I haven't gotten back to this earlier. Sometime after I wrote my comment and read Darin's reply that agreed with me, I began to have second thoughts. What portion of web pages is affected by this problem? Is it so widespread a practice to include URLs with a query part that contains 'raw' characters (not %-encoded) that we have to give in? I doubt it is. (We have 7 or 8 dupes, but is that a good indicator of how widely this practice is used?) Shouldn't the web as a whole be 'better off' in the long run if we, *now having some sizable market share*, stick to the standard and spread the word that %-encoding must be used in the query part in non-UTF-8 pages? Sorry again that I'm having second thoughts, but it'd be nice if we could deliberate on this issue a bit longer before making a final decision.

Benoit

Comment 22

•

19 years ago

This breaks at least one major French site: the dictionary on TV5.org doesn't work reliably on Fx 1.5b2.

See: http://dictionnaire.tv5.org/dictionnaires.asp?Action=1

Enter a search term with accented characters, for example "éléphant". Observe that it works perfectly. The URL contains "%E9l%E9phant", and the page uses ISO-8859-1 everywhere. Now, click on the title of that definition (the same word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while the URL contains "%C3%A9l%C3%A9phant".

What's more worrying to us is that this very online dictionary is also the only one in French that does not use frames or session management, which would prevent us from making a search plugin for it. We are considering including it in the default search plugins for 1.5 (we currently ship without any dictionary plugin in French).

Of course I'll write to the site owners to ask them to encode their links, but the bottom line is that it is a very common practice on international sites to not encode URLs, assuming they use the same encoding as the current page.
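The garbled search field reported here is what one would expect when a program that assumes ISO-8859-1 receives the UTF-8 form of the URL. The following Python sketch illustrates the effect. It only models the decoding step and assumes nothing about how the site is actually implemented.

```python
from urllib.parse import quote, unquote

word = "\u00e9l\u00e9phant"  # 'éléphant'

latin1_url = quote(word, encoding="iso-8859-1")
utf8_url = quote(word, encoding="utf-8")
print(latin1_url)  # %E9l%E9phant
print(utf8_url)    # %C3%A9l%C3%A9phant

# A program that always decodes the query as ISO-8859-1:
ok = unquote(latin1_url, encoding="iso-8859-1")
garbled = unquote(utf8_url, encoding="iso-8859-1")
print(ok)       # éléphant
print(garbled)  # Ã©lÃ©phant
```

Each two-byte UTF-8 sequence for 'é' is read as two separate ISO-8859-1 characters, which produces the "unencoded UTF-8" seen in the search zone.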

Asa Dotzler [:asa]

Comment 23

•

19 years ago

One proposal is to turn it off, the other is to leave the status quo and try to reach out with evangelism. Jshin and Darin, what do you guys think?

Flags: blocking1.8b5+ → blocking1.8b5-

Martijn Wargers (dead)

Updated

•

19 years ago

Blocks: 309126

Jesse Ruderman

Comment 24

•

19 years ago

*** Bug 311300 has been marked as a duplicate of this bug. ***

Darin Fisher

Assignee

Comment 25

•

19 years ago

I think we should revert to 1.0 behavior for the 1.5 release until we have a better solution in place.

Flags: blocking1.8rc1?

Darin Fisher

Assignee

Updated

•

19 years ago

Blocks: 311387

Jungshik Shin

Reporter

Comment 26

•

19 years ago

Comment on attachment 197637 [details] [diff] [review] v1 patch Sorry for the delay. Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make an exception for 'ftp://', which we have to do anyway until we implement RFC 2640 (bug 26767)

Attachment #197637 - Flags: review?(jshin1987)

Attachment #197637 - Flags: review+

Attachment #197637 - Flags: approval1.8rc1?

Phil Ringnalda (:philor)

Comment 27

•

19 years ago

*** Bug 311748 has been marked as a duplicate of this bug. ***

Asa Dotzler [:asa]

Comment 28

•

19 years ago

Comment on attachment 197637 [details] [diff] [review] v1 patch approving the reverting back to 1.0 defaults.

Attachment #197637 - Flags: approval1.8rc1? → approval1.8rc1+

Asa Dotzler [:asa]

Updated

•

19 years ago

Flags: blocking1.8rc1? → blocking1.8rc1+

Darin Fisher

Assignee

Comment 29

•

19 years ago

fixed-on-trunk, fixed1.8

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Keywords: fixed1.8

Resolution: --- → FIXED

Martin Dürst

Comment 30

•

19 years ago

I have followed this discussion (mostly just reading, occasionally also thinking about writing a comment, but never getting around to actually writing one).

My understanding is that for the mid-range future (after Firefox 1.5) it would be very good to move to what's called the IE/Opera behavior, i.e. using UTF-8 for the path/file part and the page encoding for the query part. While this solution is not optimal, it would definitely be a big step forward.

I'm wondering how this is going to be reflected in bugzilla. This bug here is marked as fixed, but can it be marked as 'fixed for 1.8 but open for 1.9' or some such? Also, how much effort is needed for implementing the IE/Opera solution?

Regards, Martin.

(In reply to comment #26)
> (From update of attachment 197637 [details] [diff] [review] [edit])
> Sorry for the delay.
> Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make
> an exception for 'ftp://', which we have to do anyway until we implement RFC
> 2640 (bug 26767)
>

Michael Lefevre

Comment 31

•

19 years ago

"I'm wondering how this is going to be reflected in bugzilla. This bug here is marked as fixed, but can it be marked as 'fixed for 1.8 but open for 1.9' or some such?"

It could technically (the fixed1.8 keyword is for 1.8, but FIXED is for trunk/1.9), but that hasn't happened here: the change has been made on trunk as well. So either a new bug should be filed for 1.9, or bug 261929 should be reopened. Maybe Darin could do whatever is appropriate?

Darin Fisher

Assignee

Comment 32

•

19 years ago

I went ahead and reopened bug 261929 since the key part of the patch in that bug was backed out.

Jungshik Shin

Reporter

Comment 33

•

19 years ago

(In reply to comment #30)

Martin, thanks for your comment. I considered asking you to add your comment here, but hadn't managed to.

> My understanding is that for the mid-range future (after Firefox 1.5) it
> would be very good to move to what's called the IE/Opera behavior, i.e.
> using UTF-8 for the path/file part and the page encoding for the query
> part. While this solution is not optimal, it would definitely be a big
> step forward.

However, that behavior is in violation of RFC 3987, which you wrote, or did I misread it?

Smokey Ardisson (offline for a while; not following bugs - do not email)

Comment 34

•

19 years ago

*** Bug 313717 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Comment 35

•

19 years ago

*** Bug 314178 has been marked as a duplicate of this bug. ***

P. Andries

Comment 36

•

18 years ago

(In reply to comment #22)
> This breaks at least one major French site: the dictionary on TV5.org doesn't
> work reliably on Fx 1.5b2
>
> See:
> http://dictionnaire.tv5.org/dictionnaires.asp?Action=1
>
> Enter a search term with accentuated characters, for example "éléphant". Observe
> that it works perfectly. The URL contains "%E9l%E9phant", the page uses
> ISO-8859-1 everywhere. Now, click on the title of that definition (the same
> word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while
> the URL contains "%C3%A9l%C3%A9phant".

This seems to work perfectly with MSIE 7 with the "always send UTF-8 URLs" option on (which is the default, I believe):

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&param=éléphant&che=1
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammifère&Alea=29385

I think the Latin-1 %-encoding works as well as the UTF-8 one; this is the right way of introducing RFC 3987, I believe (through a little alias?):

http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%C3%A8re&Alea=12018
http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%E8re&Alea=12018
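A server-side program that accepts both escaped forms could follow the "check UTF8-ness first" idea from comment 0: try strict UTF-8, and fall back to the page encoding only if that fails. The sketch below is an assumption about how such a check might be written, not a description of what this site does. The function name and the default fallback are hypothetical.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(escaped: str, fallback_encoding: str = "iso-8859-1") -> str:
    """Hypothetical helper: decode as UTF-8 if valid, else as the page encoding."""
    raw = unquote_to_bytes(escaped)
    try:
        # Strict decoding raises if the bytes are not well-formed UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding)

from_utf8 = decode_query_value("Mammif%C3%A8re")
from_latin1 = decode_query_value("Mammif%E8re")
print(from_utf8)    # Mammifère
print(from_latin1)  # Mammifère
```

The check is a heuristic: some byte sequences in a legacy encoding also happen to be well-formed UTF-8, so a value can occasionally be decoded with the wrong encoding.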

OstGote!

Updated

•

18 years ago

Summary: Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs

henryfhchan

Comment 37

•

15 years ago

The default value breaks Wikipedia (with URL masking OFF), Facebook, YouTube and a whole bunch of other common, fully UTF-8 websites. Shouldn't we be standards-compliant one more time?

Martin Dürst

Comment 38

•

15 years ago

Henry, can you be more specific about what works and what fails for wikipedia, facebook, youtube,...?