Closed Bug 129726 Opened 23 years ago Closed 21 years ago

option to send URL as UTF-8

Categories

(Core :: Networking, enhancement)

enhancement
Not set
normal

RESOLVED FIXED

People

(Reporter: nhottanscp, Assigned: jshin1987)

Details

(Keywords: intl)

Attachments

(1 file, 4 obsolete files)

This is separated from bug 127282. Windows has an option "Always send URL as UTF-8" in "Internet Options" in the control panel. Mozilla does not honor that pref. Mozilla could offer a similar cross-platform option. Before implementing this, we need to identify the cases in which we really need to send the URL as UTF-8 instead of in the document charset.
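A minimal sketch of the two behaviors being compared, assuming a hypothetical link found in a Shift_JIS page. The helper name `escape_path` and the sample path are illustrative only; this is not Mozilla's code.

```python
from urllib.parse import quote, unquote

def escape_path(path: str, charset: str) -> str:
    """Convert the path to `charset`, then %-escape the resulting bytes."""
    return quote(path.encode(charset), safe="/")

# Hypothetical link taken from a page served as Shift_JIS.
path = "/docs/日本語.html"

# Default behavior: the URL follows the document charset.
as_document_charset = escape_path(path, "shift_jis")

# Proposed option: the URL is always converted to UTF-8 first.
as_utf8 = escape_path(path, "utf-8")

print(as_document_charset)
print(as_utf8)

# Each form only round-trips when the receiver assumes the matching charset.
print(unquote(as_document_charset, encoding="shift_jis") == path)
print(unquote(as_utf8, encoding="utf-8") == path)
```

The same link produces two different byte sequences on the wire, which is why the server has to know (or guess) which charset the client used.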
Keywords: intl
reassign to nhotta, cc to darin
Assignee: new-network-bugs → nhotta
Blocks: 157673
so with this pref enabled, you'd ignore the origincharset hint? that seems troublesome. many websites will break. seems like it'd be better to leave this up to the websites completely. that is, let them provide their documents using UTF-8.. that way we'd have some idea that the webserver expects UTF-8. or am i missing something... what's the real benefit of this pref?
Two purposes.
* Provide behavior similar to IE's "Send as UTF-8" option. One example where I saw it being useful: you can open non-ASCII files on any (Windows) machine in a local network, regardless of each machine's file system charset.
* Once UTF-8 URIs become standard, the user can switch to the strict mode.
It is also possible to supply a fallback as discussed in bug 150376.
maybe i'm just being pessimistic, but i really think that a fallback mechanism isn't going to work very well (i.e., it'll most likely be inefficient). we really want to give the server what it needs the first time instead of assuming it can deal with UTF-8 and then falling back on the document charset. for years now, servers are accustomed to receiving URLs in the document charset... what will make that suddenly become a minority implementation? most people aren't going to rush to adopt UTF-8 as their default because the default of using the document charset just works in most cases. sad, but doesn't it seem most likely?
I agree that the fallback would not work well for the servers. In fact, the fallback mentioned in bug 150376 is about unescaping URLs (try UTF-8 conversion, then the document charset). I shouldn't have mentioned that; sorry about the confusion.
> most people aren't
> going to rush to adopt UTF-8 as their default because the default of using the
> document charset just works in most cases.
I agree. That is why it is proposed as a backend-only pref that is disabled by default.
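The unescaping fallback mentioned for bug 150376 (try UTF-8, then the document charset) can be sketched roughly as follows. The function name `unescape_for_display` is hypothetical, and this only illustrates the idea rather than the code discussed in that bug.

```python
from urllib.parse import quote, unquote_to_bytes

def unescape_for_display(escaped: str, document_charset: str) -> str:
    """Unescape %XX sequences, trying UTF-8 first and the document charset second."""
    raw = unquote_to_bytes(escaped)
    try:
        # Strict decoding: bytes that are not well-formed UTF-8 raise an error.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the charset of the referring document.
        return raw.decode(document_charset, errors="replace")

utf8_url = quote("日本語".encode("utf-8"))
euc_jp_url = quote("日本語".encode("euc_jp"))

print(unescape_for_display(utf8_url, "euc_jp"))
print(unescape_for_display(euc_jp_url, "euc_jp"))
```

This direction is cheap because it happens locally on bytes already in hand. A server-side fallback would instead cost an extra request each time the first guess is wrong.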
Status: NEW → ASSIGNED
i18n triage team: nsbeta1-
Severity: normal → enhancement
Keywords: nsbeta1-
I was sent here from bug 205471 :-) I agree with Darin and Naoki about the current situation of web servers' handling of UTF-8 URLs. Just FYI, there's an apache module that converts incoming URLs in UTF-8 to the file system charset of a server. If every server had something like this, it'd be much easier... Note that the file system charset of the server (say EUC-JP) can be different from the charset of the referring document (Shift_JIS). In an ideal world where every server understands URLs in UTF-8, this kind of cross-site referencing can benefit from this feature. Unfortunately, we don't live in an ideal world... BTW, in one case, Mozilla already sends URLs in UTF-8, which leads to a 'file not found' error. That happens when an inline image (<img src="non-ascii-name.png">) is right-clicked on and 'view image' is selected. Mozilla sends the URL in url-escaped UTF-8 and most servers respond with '50x file not found'...
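A rough sketch of the kind of server-side conversion such a module performs, assuming the server stores file names in EUC-JP. The function name is hypothetical and this is not the logic of any actual Apache module.

```python
from urllib.parse import unquote_to_bytes

def request_path_to_fs_bytes(escaped_path: str, fs_charset: str) -> bytes:
    """Map a %-escaped UTF-8 request path to file-name bytes in `fs_charset`."""
    # Undo the %-escaping, interpret the bytes as UTF-8 ...
    text = unquote_to_bytes(escaped_path).decode("utf-8")
    # ... then re-encode in whatever charset the file system uses.
    return text.encode(fs_charset)

# The request is the same whether the referring page was Shift_JIS or EUC-JP,
# because the client always sent UTF-8.
request = "/%E6%97%A5%E6%9C%AC%E8%AA%9E.png"
fs_name = request_path_to_fs_bytes(request, "euc_jp")
print(fs_name == "/日本語.png".encode("euc_jp"))
```

With this in place the charset of the referring document no longer matters; only the server needs to know its own file system charset.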
jshin: do you have a link to such a site? thx!
I thought I filed a bug for that, but apparently I didn't. Anyway, you can try that at http://jshin.net/moztest/download.html. In the page (in EUC-KR), right-click on the second blue ball (next to 'KO : EUC-KR : raw 8bit') and select 'view image'. Mozilla sends the URL in url-escaped UTF-8. On the other hand, if you do the same on the third blue ball (<img src="url-escaped image name in EUC-KR.png">), Mozilla sends the URL in url-escaped EUC-KR.
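To illustrate why the 'view image' request fails, here is a toy model of a server whose file names are stored as EUC-KR bytes and which does no charset conversion. The file name and the `server_has` helper are made up for the example; they are not taken from the test page.

```python
from urllib.parse import quote, unquote_to_bytes

IMAGE_NAME = "한글.png"  # hypothetical file name

# The server only knows the file by its EUC-KR byte sequence.
SERVER_FILES = {IMAGE_NAME.encode("euc_kr")}

def server_has(escaped_name: str) -> bool:
    """Byte-for-byte lookup after unescaping, as a conversion-unaware server would do."""
    return unquote_to_bytes(escaped_name) in SERVER_FILES

request_euc_kr = quote(IMAGE_NAME.encode("euc_kr"))  # follows the page charset
request_utf8 = quote(IMAGE_NAME.encode("utf-8"))     # what 'view image' sent

print(server_has(request_euc_kr))
print(server_has(request_utf8))
```

The first lookup matches and the second does not, which corresponds to the 'file not found' response described above.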
*** Bug 206525 has been marked as a duplicate of this bug. ***
When will this patch be included in standard release? How can I help?
In bug 224280, I found that a recent version of PHP (PHP 5; I'm not sure about PHP 4.3.x) has no problem with POST data sent in UTF-8. I guess it also works well with GET data (embedded in the URL) in url-escaped UTF-8. The PHP change is probably because in MS IE, the 'always send URL in UTF-8' option is on by default. Actually, in that bug, I found that PHP doesn't work with POST data sent in UTF-16LE/BE (even though I modified Mozilla to specify 'charset=UTF-16LE' in the Content-Type header of POST data). Anyway, considering PHP's popularity, there are gonna be a considerable number of server-side applications that understand UTF-8 URLs. Besides, Martin Duerst (of W3C) has been working on an Apache module (there were a couple of similar modules, but they are not included in the Apache distribution) that translates file paths in url-escaped UTF-8 to the local filesystem encoding. With all these developments, it's time to move this forward. Naoki, if you're too busy with your day job to deal with this, let me know and I'll take care of it.
Attached patch update (obsolete) — Splinter Review
updated to the trunk. include diff for firebird, calendar, thunderbird and sunbird(standalone composer).
Attachment #91553 - Attachment is obsolete: true
Comment on attachment 135554 [details] [diff] [review] update asking for r/sr. it's off by default so that it wouldn't hurt anything. BTW, Martin's apache module for IRI is at http://www.w3.org/2003/Talks/0904-IUC-IRI/ http://dev.w3.org/cvsweb/apache-modules/mod_fileiri/mod_fileiri.c
Attachment #135554 - Flags: superreview?(bz-vacation)
Attachment #135554 - Flags: review?(darin)
+ printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");
make that #ifdef DEBUG or remove it... I really think we should have a necko.js or gre.js that contains all the necko/gecko default preferences (possibly overridden by app specific files)
Comment on attachment 135554 [details] [diff] [review] update
>Index: netwerk/base/src/nsStandardURL.cpp
>+ printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");
That should be a LOG(), right? sr=bzbarsky with that.
Attachment #135554 - Flags: superreview?(bz-vacation) → superreview+
Thanks for sr. Yea, I should've replaced printf with LOG(). I'll do when checking in. I agree that we need a common file shared by all applications so that we don't have to edit several different files. This is the worst case so far (5 of them).
Comment on attachment 135554 [details] [diff] [review] update
>Index: netwerk/base/src/nsStandardURL.cpp
> #define NS_NET_PREF_ESCAPEUTF8 "network.standard-url.escape-utf8"
> #define NS_NET_PREF_ENABLEIDN "network.enableIDN"
>+#define NS_NET_PREF_ALWAYSSENDUTF8 "network.standard-url.always-send-utf8"
do me a favor and line up the "network...." values so they appear to be listed in a column.
>+ printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");
no raw printfs in production code! use LOG() macro instead.
>Index: modules/libpref/src/init/all.js
>+// This preference controls whether or not URLs are always encoded and send as
>+// UTF-8.
>+pref("network.standard-url.always-send-utf8", false);
s/send/sent/
i think this pref name could be better. always-send-utf8 makes it sound like the URL is sending something somewhere. URLs don't send anything. maybe this should be called: pref("network.standard-url.assume-utf8-charset", false); that seems clearer to me.
i may be wrong, but i don't think this preference is going to be that useful. i think we should instead work on a fallback mechanism to start with either UTF-8 and fallback to the origin charset or start with the origin charset and fallback to UTF-8. but, maybe this pref can help with a few websites. i don't know.
Attachment #135554 - Flags: review?(darin) → review-
Attached patch update (obsolete) — Splinter Review
It seems like I should've done a bit more than updating the old patch. assume-utf8-charset is clearer to us, but maybe not so clear to 'end-users'. What about 'network.standard-url.always-send-in-utf8'? (this patch uses that) Or, we can use 'always-use-iri', but that's not friendly to end-users either unless they know what IRI stands for. Let me know which one you like best along with r and I'll make the necessary change when landing. As for the usefulness, I'm not sure, but one area in which this can help would be cross-site references. For instance, Japanese web pages are split about 50:50 into EUC-JP and Shift_JIS. If the apache module (or another server mechanism to support IRI) gets widely deployed, turning this on would be useful. (don't ask me why they'd not switch over to UTF-8 :-)) The same can be the case for Russian web sites (although transcoding proxies seem to be pretty common for Russian web servers), for which KOI8-R and ISO-8859-5(?) are commonly used (and Windows-1251? I forgot which is which).
Attachment #135554 - Attachment is obsolete: true
Attached patch update (obsolete) — Splinter Review
attachment 135804 [details] [diff] [review] still has a few 'send utf8'. I replaced them all with 'send in utf8'.
Attachment #135804 - Attachment is obsolete: true
Comment on attachment 137706 [details] [diff] [review] update asking for r (assuming bz's sr still holds because nothing substantial has changed)
Attachment #137706 - Flags: review?(darin)
*** Bug 205471 has been marked as a duplicate of this bug. ***
Blocks: 185659
darin, can you review? Japanese users want this (as I expected in comment #20). See http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3162 (it's in Japanese.)
Comment on attachment 137706 [details] [diff] [review] update
so my problem with "always-send-in-utf8" is that:
(1) nsStandardURL doesn't "send" anything anywhere as i said, so this doesn't make any sense from that point-of-view, and
(2) the URL is not going to be sent as UTF-8... it'll be sent as %-escaped UTF-8. this is a subtle distinction, but it still makes the text wrong.
as for making this pref name user-friendly, i really don't think that that should be a goal of ours. users are going to find out about this pref by reading some FAQ or document. or maybe we'll have actual UI for this pref just like IE does (iirc). at the code level, i think we should choose technically accurate pref names.
r=me if you change the pref name to be something technically accurate such as "network.standard-url.encode-utf8"
note that this patch will cause problems if a site provides URLs in a non-UTF8 encoding and partially %-escapes the URLs. nsStandardURL does not unescape %-escaped chars since it cannot determine the charset of the %-escaped byte sequences. so there is no way for it to ensure that the UTF-8 version of the URL is correct. it's up to sites to configure things correctly. inheriting the document charset seems best since it gives sites control over the URL charset. without a fallback strategy, this pref is not very useful in my opinion. it might help some users, but they will be constrained in the sites they can visit.
Attachment #137706 - Flags: review?(darin) → review-
(In reply to comment #25)
> (From update of attachment 137706 [details] [diff] [review])
> so my problem with "always-send-in-utf8" is that:
>
> (1) nsStandardURL doesn't "send" anything anywhere as i said, so this doesn't
> make any sense from that point-of-view, and
The difference between you and me arises because I think the subject of 'always send' is 'Mozilla/Firefox' and the object (which is omitted in the proposed pref name) is a URL, while you think that the subject is our implementation of 'nsStandardURL'. Given the ambiguous status of pref entries (are they a part of the UI, or a convenient vehicle for hacking Mozilla's behavior that's not supposed to be touched by most users? Perhaps they're more of the former than of the latter), I guess I just have to do what you want. Later, when we decide to add UI (as MS IE has), we can use user-friendly wording there.
jshin: the problem is that the pref name includes "standard-url" ... that implies that the pref is particular to nsStandardURL. maybe we should have pref names that are not so closely tied to class names. note: it just so happens that nsStandardURL is the only class that honors the originCharset parameter. however, we might one day want to make nsSimpleURI better handle UTF-8 and other charset business. in that case, some of these prefs would be better named in some more generic way. the existing "standard-url" prefs would probably be better off with a more generic name too. if you want to propose better names for these prefs, then i'm all for it. i just think that while we are mentioning "standard-url" in the pref name that it is appropriate to make the name very specific in function. does this sound reasonable?
darin, thanks for the explanation. After writing my comment, I also realized that 'standard-url' in the name makes the pref. closely tied to nsStandardURL, but I had to run. At the moment, I don't have any good idea. I'll file a new bug for the name change (so that we wouldn't forget). In the meantime, I'll go with what you suggested for this bug.
Attached patch patch (updated)Splinter Review
Per darin's suggestion, I changed the pref. name to 'encode-utf8' along with macro constant names in the code.
Assignee: nhottanscp → jshin
Attachment #137706 - Attachment is obsolete: true
Comment on attachment 144154 [details] [diff] [review] patch (updated) asking for r (the patch is identical to the previous one except for the name change)
Attachment #144154 - Flags: review?(darin)
Attachment #144154 - Flags: review?(darin) → review+
Comment on attachment 144154 [details] [diff] [review] patch (updated)
asking for a1.7beta because bz's sr should still be valid (nothing significant in terms of actual code has changed since his sr).
risk: almost none, because it just adds a pref entry that is off by default.
platforms: all
affected users: those who turn on this option via about:config or by editing prefs.js (mostly Japanese and Russian)
Attachment #144154 - Flags: approval1.7b?
Comment on attachment 144154 [details] [diff] [review] patch (updated)
> prefBranch->AddObserver(NS_NET_PREF_ESCAPEUTF8, obs.get(), PR_FALSE);
>+ pbi->AddObserver(NS_NET_PREF_ALWAYSENCODEINUTF8, obs.get(), PR_FALSE);
> prefBranch->AddObserver(NS_NET_PREF_ENABLEIDN, obs.get(), PR_FALSE);
s/pbi/prefBranch/
Attachment #144154 - Flags: approval1.7b? → approval1.7?
Comment on attachment 144154 [details] [diff] [review] patch (updated) a=asa (on behalf of drivers) for checkin to 1.7
Attachment #144154 - Flags: approval1.7? → approval1.7+
fix checked in
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
(In reply to comment #8)
>BTW, in one case, Mozilla already sends URLs in UTF-8, that leads to 'file
>not found' error. That happens when an inline image (<img src="non-ascii-
>name.png">) is right-clicked on and 'view image' is selected. Mozilla sends
>the URL in url-escaped UTF-8 and most servers respons with '50x file not
>found'...
Is there a bug filed on this?
JFYI, this option has become more useful with the proliferation of blogs, which are more likely to be in UTF-8 than in legacy encodings (for RSS feeds and other reasons). When a page in a legacy encoding refers to a blog page with 'raw 8bit characters' in its URL, assuming the encoding of the referring page doesn't work, so turning on this option comes in handy. A Korean blogger complained about this at www.mozilla.or.kr and I told him to turn on this option.
JFYI, how to enable this option (requires Mozilla 1.7; when will this be included in Firefox? When will it be enabled by default?):
1. Enter "about:config" in the address bar.
2. Enter "utf" in the filter input box and find the "network.standard-url.encode-utf8" item. Double-click it, change the value to "true", and hit the "OK" button.
3. Try it yourself!
(In reply to comment #35)
> (In reply to comment #8)
> >BTW, in one case, Mozilla already sends URLs in UTF-8, that leads to 'file
> >not found' error. That happens when an inline image (<img src="non-ascii-
> >name.png">) is right-clicked on and 'view image' is selected. Mozilla sends
> >the URL in url-escaped UTF-8 and most servers respons with '50x file not
> >found'...
> Is there a bug filed on this?
It's fixed (I guess I fixed it).
> when will this be included in Firefox? When it will be enabled by default?
It's fixed in firefox. When will it be enabled by default? I guess it'll be a few years before we can turn it on by default. IRI support by web servers is not yet wide-spread.
(In reply to comment #38)
> I guess it'll be a
> few years before we can turn it on by default. IRI support by web servers is not
> yet wide-spread.
I may have been wrong. MS IIS seems to support it rather well. So does Apache 2.x on Windows 2k/XP/2003. Apache on Linux/Unix lags behind. I hope Martin's module for IRI will be more widely used (see comment #15 and http://www.w3.org/2004/Talks/IUC25iri/ )
*** Bug 285967 has been marked as a duplicate of this bug. ***