Closed Bug 129726 Opened 22 years ago Closed 20 years ago

option to send URL as UTF-8

Categories

(Core :: Networking, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

RESOLVED FIXED

People

(Reporter: nhottanscp, Assigned: jshin1987)

Details

(Keywords: intl)

Attachments

(1 file, 4 obsolete files)

This is separated from bug 127282.
Windows has an option "Always send URL as UTF-8" in "Internet Options" in
control panel. Mozilla does not listen to that pref. 

Mozilla could have a similar cross-platform option.
Before implementing this, we need to identify cases which we really need to send
URL as UTF-8 instead of the document charset.
Keywords: intl
reassign to nhotta, cc to darin
Assignee: new-network-bugs → nhotta
Blocks: 157673
so with this pref enabled, you'd ignore the origincharset hint?  that seems
troublesome.  many websites will break.  seems like it'd be better to leave this
up to the websites completely.  that is, let them provide their documents using
UTF-8.. that way we'd have some idea that the webserver expects UTF-8.  or am i
missing something... what's the real benefit of this pref?
Two purposes.

* Provide behavior similar to IE's "Send as UTF-8" option.
One example where I saw it being useful: you can open files with non-ASCII
names on any (Windows) machine in a local network, regardless of each
machine's file system charset.

* Once UTF-8 URIs become standard, the user can switch to the strict mode.
It is also possible to supply a fallback as discussed in bug 150376.
maybe i'm just being pessimistic, but i really think that a fallback mechanism
isn't going to work very well (i.e., it'll most likely be inefficient).  we
really want to give the server what it needs the first time instead of assuming
it can deal with UTF-8 and then falling back on the document charset.  for years
now, servers are accustomed to receiving URLs in the document charset... what
will make that suddenly become a minority implementation?  most people aren't
going to rush to adopt UTF-8 as their default because the default of using the
document charset just works in most cases.  sad, but doesn't it seem most likely?
I agree that the fallback would not work well for the servers.
In fact, the fallback mentioned in bug 150376 is about unescaping URL (try UTF-8
conversion then a document charset). I shouldn't have mentioned that, sorry
about the confusion.
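
To make the unescaping fallback concrete, here is a minimal Python sketch of the idea (try UTF-8 first, then the document charset). It is not Mozilla's code; the helper name `unescape_for_display` and the sample file name are made up for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def unescape_for_display(escaped: str, document_charset: str) -> str:
    """Hypothetical helper: unescape a %-escaped URL component for display."""
    raw = unquote_to_bytes(escaped)
    try:
        # First assume the escaped bytes are UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise fall back to the charset of the referring document.
        return raw.decode(document_charset, errors="replace")


# The same (made-up) file name, escaped once as UTF-8 and once as EUC-KR.
escaped_utf8 = quote("한글.png", encoding="utf-8")
escaped_legacy = quote("한글.png", encoding="euc-kr")

utf8_name = unescape_for_display(escaped_utf8, "euc-kr")
legacy_name = unescape_for_display(escaped_legacy, "euc-kr")
```

This only works for display purposes because byte sequences in legacy double-byte encodings are rarely valid UTF-8; it says nothing about what a server expects to receive.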


>  most people aren't
> going to rush to adopt UTF-8 as their default because the default of using the
> document charset just works in most cases.
I agree. That is why it is proposed as a backend-only pref that is disabled by default.
Status: NEW → ASSIGNED
i18n triage team: nsbeta1-
Severity: normal → enhancement
Keywords: nsbeta1-
I was sent here from bug 205471 :-)

I agree with Darin and Naoki about the current situation of web servers' 
handling of UTF-8 URLs. 

Just FYI, there's an apache module that converts incoming URLs in UTF-8 to the 
file system charset of a server. If every server had something like this, it'd 
be much easier... Note that the file system charset of the server (say EUC-JP) 
can be different from the charset of the referring document (Shift_JIS). In an 
ideal world where every server understands URLs in UTF-8, this kind of
cross-site referencing can benefit from this feature. Unfortunately, we don't live 
in an ideal world...

BTW, in one case, Mozilla already sends URLs in UTF-8, which leads to a 'file not 
found' error. That happens when an inline image
(<img src="non-ascii-name.png">) is right-clicked and 'view image' is selected.
Mozilla sends the URL in url-escaped UTF-8 and most servers respond with
'50x file not found'... 
jshin: do you have a link to such a site?  thx!
I thought I filed a bug for that, but apparently I didn't. 
Anyway, you can try that at http://jshin.net/moztest/download.html.
In the page (in EUC-KR), right-click on the second blue ball (next to
'KO : EUC-KR : raw 8bit') and select 'view image'. Mozilla sends the URL
in url-escaped UTF-8. On the other hand, if you do the same on the third
blue ball (<img src="url-escaped image name in EUC-KR.png">), Mozilla
sends the URL in url-escaped EUC-KR. 
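
The difference between the two requests can be illustrated with a short Python sketch. The file name is made up (it is not the one used on the test page); only the standard library's `urllib.parse.quote` is used.

```python
from urllib.parse import quote, unquote

# Hypothetical non-ASCII image name on a page served in EUC-KR.
name = "한글.png"

# What gets sent when the name is converted to UTF-8 before escaping.
as_utf8 = quote(name, encoding="utf-8")

# What gets sent when the document charset (EUC-KR) is used instead.
as_euc_kr = quote(name, encoding="euc-kr")

print(as_utf8)
print(as_euc_kr)
```

The two escaped strings differ, so a server that stores the file under its EUC-KR byte name will not find anything when it is asked for the UTF-8 form.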
*** Bug 206525 has been marked as a duplicate of this bug. ***
When will this patch be included in standard release? How can I help?
In bug 224280, I found that a recent version of PHP (PHP 5; I'm not sure about PHP
4.3.x) has no problem with POST data sent in UTF-8. I guess it also works well
with GET data (embedded in the URL) in url-escaped UTF-8. The PHP change is probably
because in MS IE, the 'always send URL in UTF-8' option is on by default. Actually,
in that bug, I found that PHP doesn't work with POST data sent in UTF-16LE/BE
(even though I modified Mozilla to specify 'charset=UTF-16LE' in Content-Type
header of POST data). 

Anyway, considering PHP's popularity, there are gonna be a considerable number
of server side applications that understand UTF-8 URLs. Besides, Martin Duerst
(of W3C) has been working on an Apache module (there were a couple of similar
modules, but not included in Apache distribution) that translates file paths in
url-escaped UTF-8 to the local filesystem encoding. 
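
The kind of translation such a module performs can be sketched in a few lines of Python. This is only an illustration of the idea, not the module's actual code; the function name `utf8_path_to_fs_bytes`, the sample path, and the choice of EUC-JP as the file system encoding are all assumptions.

```python
from urllib.parse import quote, unquote_to_bytes


def utf8_path_to_fs_bytes(escaped_path: str, fs_encoding: str) -> bytes:
    """Hypothetical helper: map a %-escaped UTF-8 path to file system bytes."""
    # Undo the %-escaping and interpret the result as UTF-8.
    text = unquote_to_bytes(escaped_path).decode("utf-8")
    # Re-encode in whatever encoding the server's file system uses.
    return text.encode(fs_encoding)


# A request path as a client would send it when escaping UTF-8.
request_path = quote("/画像/日本語.png", encoding="utf-8")

# The byte name to look up on a server whose file system uses EUC-JP.
fs_bytes = utf8_path_to_fs_bytes(request_path, "euc-jp")
```

With a step like this on the server, the charset of the referring page no longer matters, because the client always sends the same escaped form.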

With all these developments, it's time to move this forward. Naoki, if you're
too busy with your day job to deal with this, let me know and I'll take care of it.

  
Attached patch: update (obsolete)
Updated to the trunk. Includes diffs for firebird, calendar, thunderbird and
sunbird (standalone composer).
Attachment #91553 - Attachment is obsolete: true
Comment on attachment 135554
update 

asking for r/sr.
it's off by default so that it wouldn't hurt anything. 

BTW, Martin's apache module for IRI is at 

http://www.w3.org/2003/Talks/0904-IUC-IRI/

http://dev.w3.org/cvsweb/apache-modules/mod_fileiri/mod_fileiri.c
Attachment #135554 - Flags: superreview?(bz-vacation)
Attachment #135554 - Flags: review?(darin)
+                printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");

make that #ifdef DEBUG or remove it...

I really think we should have a necko.js or gre.js that contains all the
necko/gecko default preferences (possibly overridden by app specific files)
Comment on attachment 135554
update 

>Index: netwerk/base/src/nsStandardURL.cpp

>+                printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");

That should be a LOG(), right?

sr=bzbarsky with that.
Attachment #135554 - Flags: superreview?(bz-vacation) → superreview+
Thanks for sr. Yea, I should've replaced printf with LOG(). I'll do that when
checking in.

I agree that we need a common file shared by all applications so that we don't
have to edit several different files. This is the worst case so far (5 of them).
Comment on attachment 135554
update 

>Index: netwerk/base/src/nsStandardURL.cpp

> #define NS_NET_PREF_ESCAPEUTF8 "network.standard-url.escape-utf8"
> #define NS_NET_PREF_ENABLEIDN  "network.enableIDN"
>+#define NS_NET_PREF_ALWAYSSENDUTF8 "network.standard-url.always-send-utf8"

do me a favor and line up the "network...." values so they appear to
be listed in a column.


>+                printf("escape UTF-8 %s\n", gAlwaysSendUTF8 ? "enabled" : "disabled");

no raw printfs in production code!  use LOG() macro instead.


>Index: modules/libpref/src/init/all.js

>+// This preference controls whether or not URLs are always encoded and send as
>+// UTF-8.
>+pref("network.standard-url.always-send-utf8", false);

s/send/sent/

i think this pref name could be better.  always-send-utf8 makes it
sound like the URL is sending something somewhere.  URLs don't send
anything.  maybe this should be called:

  pref("network.standard-url.assume-utf8-charset", false); 

that seems clearer to me.

i may be wrong, but i don't think this preference is going to be that
useful.  i think we should instead work on a fallback mechanism to 
start with either UTF-8 and fallback to the origin charset or start
with the origin charset and fallback to UTF-8.  but, maybe this pref
can help with a few websites.  i don't know.
Attachment #135554 - Flags: review?(darin) → review-
Attached patch: update (obsolete)
It seems like I should've done a bit more than updating the old patch.
assume-utf8-charset is clearer to us, but maybe not so clear to 'end-users'.
What about 'network.standard-url.always-send-in-utf8'? (this patch uses that)
Or, we can use 'always-use-iri', but that's not friendly to end-users either
unless they know what IRI stands for. Let me know which one you like best along
with r and I'll make the necessary change when landing.

as for the usefulness, I'm not sure, but one area in which this can help would
be cross-site references. For instance, Japanese web pages are split about
50:50 between EUC-JP and Shift_JIS. If the apache module (or another
server mechanism to support IRI) gets widely deployed, turning this on would
be useful. (don't ask me why they'd not switch over to UTF-8 :-))  The same can
be the case for Russian web sites (although transcoding proxies seem to be
pretty common for Russian web servers), for which KOI8-R and ISO-8859-5(?) are
commonly used (and Windows-1251? I forgot which is which).
Attachment #135554 - Attachment is obsolete: true
Attached patch: update (obsolete)
attachment 135804 still has a few 'send utf8'. I replaced them all with 'send
in utf8'.
Attachment #135804 - Attachment is obsolete: true
Comment on attachment 137706
update

asking for r (assuming bz's sr still holds because nothing substantial has
changed)
Attachment #137706 - Flags: review?(darin)
*** Bug 205471 has been marked as a duplicate of this bug. ***
Blocks: 185659
darin, can you review? Japanese users want this (as I expected in comment #20).
See http://bugzilla.mozilla.gr.jp/show_bug.cgi?id=3162 (it's in Japanese.) 
Comment on attachment 137706
update

so my problem with "always-send-in-utf8" is that:

(1) nsStandardURL doesn't "send" anything anywhere as i said, so this doesn't
make any sense from that point-of-view, and

(2) the URL is not going to be sent as UTF-8... it'll be sent as %-escaped
UTF-8.  this is a subtle distinction, but it still makes the text wrong.

as for making this pref name user-friendly, i really don't think that that
should be a goal of ours.  users are going to find out about this pref by
reading some FAQ or document.  or maybe we'll have actual UI for this pref just
like IE does (iirc).

at the code level, i think we should choose technically accurate pref names.

r=me if you change the pref name to be something technically accurate such as
"network.standard-url.encode-utf8"

note that this patch will cause problems if a site provides URLs in a non-UTF8
encoding and partially %-escapes the URLs.  nsStandardURL does not unescape
%-escaped chars since it cannot determine the charset of the %-escaped byte
sequences.  so there is no way for it to ensure that the UTF-8 version of the
URL is correct.  it's up to sites to configure things correctly.  inheriting
the document charset seems best since it gives sites control over the URL
charset.
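
A small Python sketch of the partially %-escaped case, under the assumption that existing %-escapes are passed through untouched while raw characters are converted to UTF-8 and then escaped. The link is made up, and `quote` with `%` in `safe` only approximates that behavior; it is not how nsStandardURL is implemented.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical link on an EUC-KR page: the first character is already
# %-escaped as EUC-KR bytes, the second one is left as a raw character.
pre_escaped = quote("한", encoding="euc-kr")
href = pre_escaped + "글.png"

# Escape only the raw part as UTF-8; keeping "%" safe leaves the
# existing escapes exactly as the site wrote them.
sent = quote(href, safe="/%", encoding="utf-8")

# The resulting byte sequence mixes two encodings, so it is not valid UTF-8.
raw = unquote_to_bytes(sent)
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(sent, is_valid_utf8)
```

The escaped EUC-KR bytes and the escaped UTF-8 bytes end up side by side in one URL, and nothing on the client side can tell which charset the site meant for the first part.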

without a fallback strategy, this pref is not very useful in my opinion.  it
might help some users, but they will be constrained in the sites they can
visit.
Attachment #137706 - Flags: review?(darin) → review-
(In reply to comment #25)
> (From update of attachment 137706)
> so my problem with "always-send-in-utf8" is that:
> 
> (1) nsStandardURL doesn't "send" anything anywhere as i said, so this doesn't
> make any sense from that point-of-view, and
  

  The difference between you and me arises from the fact that I think the subject of
'always send' is 'Mozilla/Firefox' and the object (which is omitted in the
proposed pref. name) is a URL while you think that the subject is our
implementation of 'nsStandardURL'. Given an ambiguous status of pref-entries
(are they a part of UI or a convenient vehicle for hacking Mozilla's behavior
that's not supposed to be touched by most users? Perhaps, they're more of the
former than of the latter), I guess I just have to do what you want. Later when
we decide to add UI (as MS IE has), we can use a user-friendly wording there. 

jshin:

the problem is that the pref name includes "standard-url" ... that implies that
the pref is particular to nsStandardURL.  maybe we should have pref names that
are not so closely tied to class names.

note: it just so happens that nsStandardURL is the only class that honors the
originCharset parameter.  however, we might one day want to make nsSimpleURI
better handle UTF-8 and other charset business.  in that case, some of these
prefs would be better named in some more generic way.  the existing
"standard-url" prefs would probably be better off with a more generic name too.

if you want to propose better names for these prefs, then i'm all for it.  i
just think that while we are mentioning "standard-url" in the pref name that it
is appropriate to make the name very specific in function.

does this sound reasonable?
darin, thanks for the explanation. After writing my comment, I also realized
that 'standard-url' in the name makes the pref. closely tied to nsStandardURL,
but I had to run. At the moment, I don't have any good idea.  I'll file a new
bug for the name change (so that we wouldn't forget). In the meantime, I'll go
with what you suggested for this bug.
Attached patch: patch (updated)
Per darin's suggestion, I changed the pref. name to 'encode-utf8' along with
macro constant names in the code.
Assignee: nhottanscp → jshin
Attachment #137706 - Attachment is obsolete: true
Comment on attachment 144154
patch (updated)

asking for r (the patch is identical to the previous one except for the name
change)
Attachment #144154 - Flags: review?(darin)
Attachment #144154 - Flags: review?(darin) → review+
Comment on attachment 144154
patch (updated)

asking for a1.7beta because bz's sr should stand valid (nothing significant in
terms of actual code has changed since his sr.)

risk: almost none because it's just adding a pref. entry that is off by
default.
platforms : all
affected users: those who turn on this option via about:config or editing
prefs.js (mostly Japanese and Russian)
Attachment #144154 - Flags: approval1.7b?
Comment on attachment 144154
patch (updated)

>         prefBranch->AddObserver(NS_NET_PREF_ESCAPEUTF8, obs.get(), PR_FALSE); 
>+        pbi->AddObserver(NS_NET_PREF_ALWAYSENCODEINUTF8, obs.get(), PR_FALSE);
>         prefBranch->AddObserver(NS_NET_PREF_ENABLEIDN, obs.get(), PR_FALSE); 

  s/pbi/prefBranch/
Attachment #144154 - Flags: approval1.7b? → approval1.7?
Comment on attachment 144154
patch (updated)

a=asa (on behalf of drivers) for checkin to 1.7
Attachment #144154 - Flags: approval1.7? → approval1.7+
fix checked in
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
(In reply to comment #8)
>BTW, in one case, Mozilla already sends URLs in UTF-8, that leads to 'file 
>not found' error. That happens when an inline image (<img src="non-ascii-
>name.png">) is right-clicked on and 'view image' is selected. Mozilla sends
>the URL in url-escaped UTF-8 and most servers respond with '50x file not
>found'...
Is there a bug filed on this?
JFYI, this option has become more useful with the proliferation of blogs, which
are more likely to be in UTF-8 than in legacy encodings (for RSS feed and other
reasons). When a reference is made to a blog page with 'raw 8bit characters' in
the URL from a page in a legacy encoding, assuming the encoding of the referring page
doesn't work, so turning on this option comes in handy. A Korean blogger
complained about this at www.mozilla.or.kr and I told him to turn on this option. 
JFYI, how to enable this option (requires Mozilla 1.7; when will this be
included in Firefox? When will it be enabled by default?):

1. Enter "about:config" in the address bar.
2. Enter "utf" in the filter input box and find the "network.standard-url.encode-utf8"
item. Double-click it, change the item value to "true", and hit the "OK" button.
3. Try it yourself!
(In reply to comment #35)
> (In reply to comment #8)
> >BTW, in one case, Mozilla already sends URLs in UTF-8, that leads to 'file 
> >not found' error. That happens when an inline image (<img src="non-ascii-
> >name.png">) is right-clicked on and 'view image' is selected. Mozilla sends
> >the URL in url-escaped UTF-8 and most servers respond with '50x file not
> >found'...
> Is there a bug filed on this?

  It's fixed (I guess I fixed it).

> when will this be included in Firefox? When will it be enabled by default?

It's fixed in firefox. When will it be enabled by default? I guess it'll be a
few years before we can turn it on by default. IRI support by web servers is not
yet widespread.

(In reply to comment #38)
> I guess it'll be a
> few years before we can turn it on by default. IRI support by web servers is not
> yet widespread.

I may have been wrong. MS IIS seems to support it rather well. So does Apache
2.x on Windows 2k/XP/2003. Apache on Linux/Unix lags behind. I hope Martin's
module for IRI will be more widely used (see comment #15 and
http://www.w3.org/2004/Talks/IUC25iri/ )
*** Bug 285967 has been marked as a duplicate of this bug. ***