Closed Bug 284474 Opened 20 years ago Closed 19 years ago

Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs

Categories

(Core :: Networking, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.8beta5

People

(Reporter: jshin1987, Assigned: darin.moz)

References

()

Details

(Keywords: fixed1.8, intl, regression)

Attachments

(1 file)

On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8'), thinking (that was mainly me) that IE's 'Always send URL in UTF-8' is equivalent to full IRI support. It turned out that with that option ON (which is the default), IE sends the path part in %-encoded UTF-8 (regardless of the encoding of the referring page) but uses the encoding of the referring page in the query part. For instance, if you press the submit button in the page given at the URL, IE asks for the following URL: http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%B0%A1%B3%AA while Mozilla asks for http://jshin.net/moztest/%EA%B0%80%EB%82%98.cgi?f1=%EA%B0%80%EB%82%98 The page is in EUC-KR, and '%EA%B0%80%EB%82%98' is %-encoded '가나' in UTF-8 while '%B0%A1%B3%AA' is %-encoded '가나' (U+AC00, U+B098) in EUC-KR. I'm not sure what to do with this thorny issue. We can just ignore the issue and wish that everyone moves on and writes web pages in UTF-8 (which will not happen very soon). We might be able to ignore the issue, assuming that most web sites use 'post'. Do we want to follow IE's lead and produce a weird 'cross' like that? I guess not. We can 'educate' web authors so that they check UTF8-ness first before interpreting the query string in the character encoding of the referring document.
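The two request URLs in the comment above differ only in which character encoding is applied to '가나' before %-escaping. A minimal sketch in Python (chosen purely for illustration; it says nothing about how either browser is implemented) reproduces both forms with the standard library's `urllib.parse.quote`:

```python
from urllib.parse import quote

text = "가나"  # U+AC00, U+B098

# %-escape the UTF-8 bytes: what Mozilla sends in both path and query.
utf8_escaped = quote(text, encoding="utf-8")

# %-escape the EUC-KR bytes: what IE sends in the query part
# when the referring page is in EUC-KR.
euckr_escaped = quote(text, encoding="euc-kr")

print(utf8_escaped)   # %EA%B0%80%EB%82%98
print(euckr_escaped)  # %B0%A1%B3%AA
```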
> On the trunk, we turned on IRI by default ('network.standard-url.encode-utf8') I should have said we had turned on a 'standard-compliant' conversion of IRI to URI. (bug 261929) Martin, how does your fileiri module for Apache 2.x handle URIs generated by MS IE (path part in %-escaped UTF-8 and query part in %-escaped document encoding)? I now realize that what MS IE does is not so unreasonable and that we might have to consider handling the query part separately and differently from the path part...
If I type an IRI (http://jshin.net/moztest/가나.cgi?f1=가나) directly in the open-location dialog box, MS IE 6 (under Korean locale) sends the following URL: /moztest/%EA%B0%80%EB%82%98.cgi?f1=\xb0\xa1\xb3\xaa (where '\xb0' stands for the octet '0xB0', etc) This is really interesting...
*** Bug 284537 has been marked as a duplicate of this bug. ***
*** Bug 271828 has been marked as a duplicate of this bug. ***
Blocks: iri
My comment #0 was wrong. The query part in the second and the third cases at the url (in the url field above) is encoded in the page encoding before being URL-escaped, whether the pref 'network.standard-url.encode-utf8' is true or false. The path part is converted to UTF-8 if the pref is set to true. The real 'issue' is what to do when a literal URL (*without* url-escaping) in an html page is used (cases 4 and 6 in my test page). Firefox converts the whole url to UTF-8 (path and query) while MS IE converts only the path part. What FF does is standard-compliant (which I was reminded of in another bug). Note that url-escaping the non-ASCII query part (case 5) would enable web authors to avoid this issue in the first place.
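The author-side workaround mentioned above (url-escaping the query part before it is written into the page) could look roughly like the following Python sketch. `PAGE_ENCODING`, `build_link` and the path `/moztest/test.cgi` are hypothetical names used only for illustration; the test page itself is not reproduced here.

```python
from urllib.parse import urlencode

# Assumption: the page that contains the link is served as EUC-KR.
PAGE_ENCODING = "euc-kr"

def build_link(base, params):
    """Return base + '?' + query, with the query %-escaped in PAGE_ENCODING.

    Because the result is pure ASCII, the browser has nothing left to
    convert, so every browser sends the same octets to the server.
    """
    return base + "?" + urlencode(params, encoding=PAGE_ENCODING)

link = build_link("/moztest/test.cgi", {"f1": "가나"})
print(link)  # /moztest/test.cgi?f1=%B0%A1%B3%AA
```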
Summary: IRI and parity with IE which has only half-baked IRI support → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs
*** Bug 306682 has been marked as a duplicate of this bug. ***
RFC 3987 (http://www.ietf.org/rfc/rfc3987.txt) is very clear about converting IRIs to URIs, and there's no doubt that what we do is compliant with RFC 3987. However, it seems like we're on the losing end. Both MS IE and Opera treat the query part differently from the path part. While they follow RFC 3987 when converting the path part, they use the character encoding of the page in which an IRI is referred to (instead of UTF-8) when converting the query part. BTW, Safari behaves the same way as Firefox with 'standard-url.encode_utf8' set to false (i.e. it doesn't honor RFC 3987 at all).
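To make the difference between the two behaviors concrete, here is a rough Python sketch of both conversions. The function name `iri_to_uri` and its parameters are hypothetical; this is an illustration of the behavior described in this bug, not the Necko code, and it deliberately leaves the host and fragment untouched.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri, page_encoding, query_uses_page_encoding):
    """Convert the path and query of an IRI to %-escaped ASCII.

    query_uses_page_encoding=False: UTF-8 everywhere (the RFC 3987 way).
    query_uses_page_encoding=True:  UTF-8 path, page-encoded query
                                    (the IE/Opera way described above).
    """
    parts = urlsplit(iri)
    # '%' is kept in `safe` so existing %XX escapes are not escaped again.
    path = quote(parts.path, safe="/%", encoding="utf-8")
    query_encoding = page_encoding if query_uses_page_encoding else "utf-8"
    # Raises UnicodeEncodeError if a character is not representable
    # in the page encoding; a real implementation would need a fallback.
    query = quote(parts.query, safe="=&%+", encoding=query_encoding)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

iri = "http://jshin.net/moztest/가나.cgi?f1=가나"
print(iri_to_uri(iri, "euc-kr", query_uses_page_encoding=False))
print(iri_to_uri(iri, "euc-kr", query_uses_page_encoding=True))
```

With an EUC-KR page, the first call yields the all-UTF-8 URL and the second yields the mixed URL quoted in comment #0.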
*** Bug 308563 has been marked as a duplicate of this bug. ***
*** Bug 308732 has been marked as a duplicate of this bug. ***
*** Bug 308944 has been marked as a duplicate of this bug. ***
This breaks most sub-category links on a big Greek on-line PC & gadget shop (http://www.e-shop.gr), rendering the site unusable. Funny thing, they are using FF 1.0 or older on their own machines (I saw it myself). The bug currently affects the Firefox 1.5 branch too, and I wonder if it would be proper to suggest that the site's authors modify their code as soon as final 1.5 is out.
Flags: blocking1.8b5?
It seems that we either want to take the no-risk path and revert to the 1.0 behaviour or, depending on the availability of someone to do the work and the complexity of the patch, follow IE and Opera.
jshin: We need to decide what to do here. I'm leaning toward backing out the UTF-8 change, so that we return to a known state. Thoughts?
Severity: normal → major
Status: NEW → ASSIGNED
Target Milestone: --- → mozilla1.8beta5
It's tough to decide what to do here. I want Firefox to comply with the standard as much as possible, but the reality doesn't seem to let us do that in this case. There are three options: 1. go back to what we did in 1.0 2. leave our code alone, mark this bug as a 'tech-evangel' bug and try to spread the word about RFC 3987 (the query part should be %-escaped if they want to use a non-UTF-8 encoding). In addition, document the configuration option for the URI/IRI conversion in a prominent place (release notes, help, etc.). We may even add an advanced option with a real UI. 3. implement what IE/Opera do. #3 is not feasible for 1.5, I guess. Even if it's feasible, I'm rather reluctant to do that now that I have read the full text of RFC 3987. I wish I could argue strongly for #2, but given the reality in the wild, I can't. So, I have to say 'let's go with #1' for now.
OK, #1 it is. I'm also sad to see this change reverted :(
Keywords: regression
*** Bug 310033 has been marked as a duplicate of this bug. ***
Need to get a patch ASAP flipping the pref back.
Flags: blocking1.8b5? → blocking1.8b5+
Blocks: 308159
Similar problems with local file handling (BUG 308159)...
Attached patch: v1 patch
Attachment #197637 - Flags: review?(jshin1987)
Attachment #197637 - Flags: superreview?(bzbarsky)
Comment on attachment 197637 (v1 patch): We should probably do the IE/Opera thing for 1.9... :(
Attachment #197637 - Flags: superreview?(bzbarsky) → superreview+
I'm sorry that I haven't gotten back to this earlier. Sometime after I wrote my comment and read Darin's reply that agreed with me, I began to have second thoughts. What portion of web pages is affected by this problem? Is it so widespread a practice to include URLs whose query part contains 'raw' (not %-encoded) characters that we have to give in? I doubt it is. (We have 7 or 8 dupes, but is that a good indicator of how widely this practice is used?) Shouldn't the web as a whole be 'better off' in the long run if we, *now having some sizable market share*, stick to the standard and spread the word that %-encoding must be used in the query part in non-UTF-8 pages? Sorry again that I'm having second thoughts, but it'd be nice if we could deliberate on this issue a bit longer before making a final decision.
This breaks at least one major French site: the dictionary on TV5.org doesn't work reliably on Fx 1.5b2. See: http://dictionnaire.tv5.org/dictionnaires.asp?Action=1 Enter a search term with accented characters, for example "éléphant". Observe that it works perfectly. The URL contains "%E9l%E9phant"; the page uses ISO-8859-1 everywhere. Now, click on the title of that definition (the same word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while the URL contains "%C3%A9l%C3%A9phant". What's more worrying to us is that this very online dictionary is also the only one in French that doesn't use frames or session management, which would prevent us from making a search plugin for it. We are considering including it in the default search plugins for 1.5 (we currently ship without any dictionary plugin in French). Of course I'll write to the site owners to ask them to encode their links, but the bottom line is that it is a very common practice on international sites not to encode URLs, on the assumption that they use the same encoding as the current page.
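The garbled search box described above is what one would expect when a server that assumes ISO-8859-1 receives UTF-8 octets. A small Python sketch of that mismatch (the server-side assumption is inferred from the comment, not verified against the site):

```python
from urllib.parse import quote, unquote

term = "éléphant"

latin1_link = quote(term, encoding="iso-8859-1")  # %E9l%E9phant
utf8_link = quote(term, encoding="utf-8")         # %C3%A9l%C3%A9phant

# Assumption: the server decodes every query string as ISO-8859-1.
seen_from_form = unquote(latin1_link, encoding="iso-8859-1")
seen_from_link = unquote(utf8_link, encoding="iso-8859-1")

print(seen_from_form)  # éléphant
print(seen_from_link)  # Ã©lÃ©phant
```

Each accented letter becomes two UTF-8 octets, and each octet is then read as a separate Latin-1 character, so the lookup no longer matches the dictionary entry.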
One proposal is to turn it off, the other is to leave the status quo and try to reach out with evangelism. Jshin and Darin, what do you guys think?
Flags: blocking1.8b5+ → blocking1.8b5-
Blocks: 309126
*** Bug 311300 has been marked as a duplicate of this bug. ***
I think we should revert to 1.0 behavior for the 1.5 release until we have a better solution in place.
Flags: blocking1.8rc1?
Blocks: 311387
Comment on attachment 197637 (v1 patch): Sorry for the delay. OK, I'm giving in :-). Bug 297395 was decisive, although it's possible to make an exception for 'ftp://', which we have to do anyway until we implement RFC 2640 (bug 26767).
Attachment #197637 - Flags: review?(jshin1987)
Attachment #197637 - Flags: review+
Attachment #197637 - Flags: approval1.8rc1?
*** Bug 311748 has been marked as a duplicate of this bug. ***
Comment on attachment 197637 (v1 patch): Approving the revert to the 1.0 defaults.
Attachment #197637 - Flags: approval1.8rc1? → approval1.8rc1+
Flags: blocking1.8rc1? → blocking1.8rc1+
fixed-on-trunk, fixed1.8
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Keywords: fixed1.8
Resolution: --- → FIXED
I have followed this discussion (mostly just reading, occasionally also thinking about writing a comment, but never getting around to actually writing one). My understanding is that for the mid-range future (after Firefox 1.5) it would be very good to move to what's called the IE/Opera behavior, i.e. using UTF-8 for the path/file part and the page encoding for the query part. While this solution is not optimal, it would definitely be a big step forward. I'm wondering how this is going to be reflected in bugzilla. This bug here is marked as fixed, but can it be marked as 'fixed for 1.8 but open for 1.9' or some such? Also, how much effort is needed for implementing the IE/Opera solution? Regards, Martin. (In reply to comment #26) > (From update of attachment 197637) > Sorry for the delay. > Ok. I'm giving in :-). Bug 297395 was decisive although it's possible to make > an exception for 'ftp://', which we have to do anyway until we implement RFC > 2640 (bug 26767)
"I'm wondering how this is going to be reflected in bugzilla. This bug here is marked as fixed, but can it be marked as 'fixed for 1.8 but open for 1.9' or some such?" It could technicall (fixed1.8 keyword is for 1.8, but FIXED is for trunk/1.9), but that hasn't happened here - the change has been made on trunk as well. So either a new bug should be filed for 1.9, or bug 261929 should be reopened - maybe Darin could do whatever is appropriate?
I went ahead and reopened bug 261929 since the key part of the patch in that bug was backed out.
(In reply to comment #30) Martin, thanks for your comment. I considered asking you to add your comment here, but hadn't managed to. > My understanding is that for the mid-range future (after Firefox 1.5) it > would be very good to move to what's called the IE/Opera behavior, i.e. > using UTF-8 for the path/file part and the page encoding for the query > part. While this solution is not optimal, it would definitely be a big > step forward. However, that behavior is in violation of RFC 3987, which you wrote, or did I misread it?
*** Bug 313717 has been marked as a duplicate of this bug. ***
*** Bug 314178 has been marked as a duplicate of this bug. ***
(In reply to comment #22) > This breaks at least one major French site: the dictionary on TVS.org doesn't > work reliably on Fx 1.5b2 > > See: > http://dictionnaire.tv5.org/dictionnaires.asp?Action=1 > > Enter a search term with accentuated characters, for example "éléphant". Observe > that it works perfectly. The URL contains "%E9l%E9phant", the page uses > ISO-8859-1 everywhere. Now, click on the title of that definition (the same > word). It doesn't work anymore: the search zone now shows unencoded UTF-8, while > the URL contains "%C3%A9l%C3%A9phant". This seems to work perfectly with MSIE 7 with the "always send UTF-8 URLs" option turned on (which is the default, I believe). http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&param=éléphant&che=1 http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammifère&Alea=29385 I think the Latin-1 %-encoding works as well as the UTF-8 one; this is the right way of introducing RFC 3987, I believe (through a little alias?). http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%C3%A8re&Alea=12018 http://dictionnaire.tv5.org/dictionnaires.asp?Action=1&Mot=Mammif%E8re&Alea=12018
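How the site accepts both encodings is not stated in this bug. One common server-side approach, and the 'check UTF8-ness first' idea from comment #0, is to try UTF-8 and fall back to the legacy encoding. A hedged Python sketch, where `decode_query_value` and the `legacy_encoding` default are hypothetical:

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw, legacy_encoding="iso-8859-1"):
    """Decode a %-escaped query value, preferring UTF-8.

    This is a heuristic: a few legacy byte sequences also happen to be
    valid UTF-8, so it can occasionally guess wrong.
    """
    data = unquote_to_bytes(raw)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the page's legacy encoding instead.
        return data.decode(legacy_encoding)

print(decode_query_value("Mammif%C3%A8re"))  # Mammifère
print(decode_query_value("Mammif%E8re"))     # Mammifère
```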
Summary: Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompaitbilty with most server-side programs → Converting to UTF-8 a url with an unescaped non-ASCII chars in the query part leads to an incompatibility with most server-side programs
The default value breaks Wikipedia (with URL masking OFF), Facebook, YouTube and a whole bunch of other common, fully UTF-8 websites. Shouldn't we be standard-compliant once more?
Henry, can you be more specific about what works and what fails for wikipedia, facebook, youtube,...?
(Re comments 37,38: Henry filed bug 552273 about this.)