Closed Bug 261929 Opened 20 years ago Closed 18 years ago

Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)

Categories

(Core :: Networking, enhancement, P4)

enhancement

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: c00lyman, Assigned: emk)

References

(Depends on 1 open bug, )

Details

(Keywords: intl, jp-critical)

Attachments

(1 file, 6 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910

There are german Umlaute in the image link addresses and it looks like mozilla
can't figure out the right request address for the images. If you try the same
page with IE you will see the images ...

Reproducible: Always
Steps to Reproduce:
1.load the page
2.
3.

Actual Results:  
broken link images - picture is not loadad

Expected Results:  
You should see a picture
Joerg: is "Always send URLs in UTF-8" enabled in your MSIE settings?
Unchecking the "Always send URLs as UTF-8" makes it not work in MSIE6.
Setting network.standard-url.encode-utf8 makes it work in Mozilla.  If this pref
defaults to true in MSIE, we should probably do the same.
Hey, great timing - you guys make a good job!
We have a dillemma here. It seems like MS IIS has a built-in support for
translating url-escaped UTF-8 in URL to whatever is used on the file system
(otherwise, MS IE wouldn't have set 'always send .. UTF-8' on by default, I
guess). Our trouble is that a significant portion of Apache servers don't have a
module (that does the same thing) on and are installed on Unix boxes where the
local file system character encoding is not UTF-8. (this has been changing with
recent Linux distributions using UTF-8 by default).  In addition, the module is
only available for apache 2.x which is lot less widely used than apache 1.x.

Perhaps, we have to turn it on by default. In the past, I saw many web sites
asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These
days, I rarely see it. However, my sample may be skewed. What do others think
about this? 


 
Status: UNCONFIRMED → ASSIGNED
Component: Browser-General → Networking
Ever confirmed: true
Keywords: intl
Summary: You can't see the images on the page due to german ä in the image link You can't see the images on the page due to german ä in the image link → Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed) Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
I still prefer a failover-based solution (though it would be non-trivial to
implement), but maybe the proliferation of IE means that we can assume UTF-8 as
a defacto standard?
I am all for making UTF-8 ubiquitous, especially in URIs.
I also for encoding urls as utf-8 by default.

But we must not change the behavior with unescaped urls and form content on the page. They must 
always use the page charset.
The IRI spec (currently at
http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt)
has been approved as an IETF Proposed Standard by the IESG (see
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html)
and will soon be issued as an RFC.

In short, IRI spec defines how to convert non-ASCII characters in a
Web address to a fully standard URI, using UTF-8 and %HH-encoding.

Given that this is now approved, Mozilla code should move towards
implementing it without delay. In particular, this means that any
URIs that contain non-ASCII characters should be converted to
%HH-form for resolution using UTF-8. Using a legacy encoding is
(and always has been) a clearly totally non-standard way of
resolving Web addresses with non-ASCII characters.

So:
1) 'network.standard-url.encode-utf8' should be on by default
   (and eventually become unnecessary)
2) 'always use the page charset' for 'unescaped uris', as
   Nil Soffer asks for, is clearly against standards.
3) For form content generated on a page by the user
   (as opposed to the query part of an uri), the page
   encoding has to be used.

Opera and Safari already have a good implementation of IRIs.
IE does so, too, with the exception of IDNs. Mozilla has
all the right code, it just has to be activated and deployed.
A very simple test is available at
http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html.

I suggest changing 'OS' to 'All', because this is a general issue,
not a Win XP issue.
Martin,

Yes, we are aware of the IRI specification, but has anyone proposed a migration
strategy that web browsers might implement?  Clearly, much of the web does not
conform to the IRI specification, so it would be a mistake for Mozilla to switch
to IRI encoding by default.  Perhaps servers should indicate IRI expectation
somehow?
OS: Windows XP → All
Hardware: PC → All
Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
least it appears to be true on my system) means that we should be in the clear
to do the same.

I would like to study IE's behavior more before I agree that we can make this
change, but for now I'll put it on my TODO list for Moz 1.8 / FFox 1.1.
Assignee: general → darin
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.8beta
Target Milestone: mozilla1.8beta → ---
Target Milestone: --- → mozilla1.8beta
Attached patch v1 patch (obsolete) — Splinter Review
This patch flips the default value of the pref.  This patch means that the URI
object will discard the information it was given about the encoding of the page
from which it was extracted.  We might want to change that, and instead make
the pref only affect the encoding and not affect the value of mOriginCharset. 
Or, perhaps that's unnecessary.
Attachment #169156 - Flags: review?(smontagu)
Darin Fisher asks about a migration strategy.
If you really want that, then the best idea is
probably to do the following:

1) Request a resource using %-encoding based on UTF-8
2) If you get a 404 (not found), do another request
   based on the legacy encoding of the page.

For that, it would indeed make sense to keep the
page encoding in the URI object. There should
probably be an option to switch part 2 off, because
hopefully in the future it won't be needed anymore.

This approach has been proposed ages ago, but I'm
not sure it has ever been implemented. But it should
not be too difficult.

It has the following advantages:
- The official, standard, thing is done first,
  bugwards compatibility stuff follows only later.
- Servers serving things with legacy encodings
  'get the message', i.e. they'll see all the
  UTF-8 requests in their logs.

It has the following issues:
- double requests for broken links
- Exceptional wrong positive (if the UTF-8 form
  of a path coincides with an existing path in
  legacy encoding; for actual preexisting pages
  (as opposed to query-like things that get
  constructed), the chance of this happening
  is very low.

The 404 is an efficient way (in particular in terms of
implementation!) for the server to indicate that it's
not doing UTF-8 (at least not on the resource requested).

For the server to use some additional thing (e.g. header)
to indicate that it's actually doing the standard thing
(i.e. IRIs) as such doesn't seem like a good idea.

If you're going to upgrade the server anyway, the best
way to do it is probably to use mod_fileiri
(see http://www.w3.org/2003/06/mod_fileiri/), which lets
you use a feature already in the HTTP protocol
(permanent redirect) to tell the client that it's using
UTF-8 now.
(In reply to comment #11)
> Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
> least it appears to be true on my system) means that we should be in the clear
> to do the same.

Darin, see comment #5. IIS apparently has that. However, the majority of Apache
servers in the wild don't have Martin's IRI extension module installed. His
module is only available for Apache 2.x, but the share of Apache 1.x is far
larger than that of Apache 2.x. However, due to IE's behavior, it seems that a
lot of apache 1.x admins have added some ad-hock solutions because nowadays I
rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS
IE (again comment #5).

As for Martin's migration strategy, that has been considered before, but somehow
it wasn't implemented (for perf. reason?).  

> object will discard the information it was given about the encoding of the page

This should be kept, IMO. 


Re. my fileiri module for Apache 1.3, I don't think it would be
difficult to port it; I just haven't done it, nor have I heard
from anybody else doing it. If you need it, please drop me a line.

Re. Jungshik Shin's comment "nowadays I rarely see a note that
asks users to turn off 'always send urls in UTF-8' in MS IE",
if there is any place where non-UTF-8-based URIs are (or where)
in wide use, it's Korea, and Jungshik is extremely familliar
with the situation there.
Attached patch v2 patch (obsolete) — Splinter Review
OK, this version preserves the value of nsIURI::originCharset, and makes the
implementation simply ignore it when the encode-as-utf8 preference is set.
Attachment #169156 - Attachment is obsolete: true
Attachment #169156 - Flags: review?(smontagu)
Attachment #169230 - Flags: review?(jshin)
I've changed access to http://www.w3.org/2003/06/mod_fileiri/.
It should be visible for everybody now.
Sorry, I thought this was already visible.
Comment on attachment 169230 [details] [diff] [review]
v2 patch


This looks fine as long as references to remote files are concerned. I just
came across another complication. See bug 263570 comment #17. I believe that
the problem described in bug 263570 comment #17 only happens when this option
is set true. We may just go ahead here and then have to come up with a way to
special-case(?) local file access.
Attachment #169230 - Flags: review?(jshin) → review+
Comment on attachment 169230 [details] [diff] [review]
v2 patch

If we want to display IRIs unescaped in the address bar, what do we need to do?
I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
the display or also the form in which the URI/IRI is sent to the server.
(In reply to comment #20)

> If we want to display IRIs unescaped in the address bar, what do we need to do?
> I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
> the display or also the form in which the URI/IRI is sent to the server.

It's only for  what's sent to the server. As for the display in the addressbar,
I guess we have to use |nsITextToSubURI| 


yeah, what jshin said.
Attachment #169230 - Flags: superreview?(bzbarsky)
Comment on attachment 169230 [details] [diff] [review]
v2 patch

sr=bzbarsky
Attachment #169230 - Flags: superreview?(bzbarsky) → superreview+
> We may just go ahead here and then have to come up with a way to
> special-case(?) local file access. 

Hmm... yeah, local file access is going to be an interesting issue.  We could
ignore this preference when nsStandardURL::mSupportsFileURL is set the true. 
That may be best.  Thoughts?  (bz: sorry for the possibly premature sr request)
Hmm... yeah, if this breaks us on Windows then we may want to condition on the
file url flag.

But really, it would break us any time the HTML charset doesn't match the native
charset, no?  So it's just a problem no matter what...
if we're redesigning this, i have a consumer who wants the ability to send
unencoded urls :).
> But really, it would break us any time the HTML charset doesn't match the native
> charset, no?  So it's just a problem no matter what...

boris, well... if everything a user does is Shift-JIS (documents on their
harddrive, documents downloaded from websites, etc.) then it would be a problem
to switch suddenly to UTF-8 for local files.  but you make a good point when
considering that the average user probably does encounter documents in a variety
of character encodings.
> if we're redesigning this, i have a consumer who wants the ability to send
> unencoded urls :).

timeless: we're not redesigning this per se.  we're just talking about when to
enable an existing preference (and some details related to it).  you might want
to file a new bug for that feature.
(In reply to comment #25)

> But really, it would break us any time the HTML charset doesn't match the native
> charset, no?  So it's just a problem no matter what...

How about changing |net_GetFileFromURLSpec| so that it works with url-escaped
UTF-8's and call |InitWithPath| with a UTF-16 path instead of
|InitWithNativePath|? If its ripple effect is so huge that we need to keep the
backward compatibility (i.e. have to make so many changes throughout the code),
we may consider checking the UTF8ness of an escaped URLspec and branching
depending on the result.

  http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsURLHelperWin.cpp#99

Making net_GetFileFromURLSpec handle UTF-8 encoded file:// URLs is probably a
good idea, and it will probably be necessary to support UTF-8 (and native
encoding as a fallback) when we start supporting Unicode file paths under windows.
OK, I went ahead and checked this patch in.  I want to make sure we get as much
feedback as possible on this change.  We need to fix the way we handle non-ASCII
file paths anyways, so that need not block this patch.

Marking FIXED.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Thanks. I've just come across a page (actually, a Korean firefox 1.0 user
reported that he couldn't see images in the page)  in EUC-KR (one at one of
major portals in Korea) where I had to turn on this because the server only
understands IRI. 
The page address  is
http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category=
With this option on, four photos of a CD player will show up. Otherwise, they're
replaced with an image for 'not found files'. 
Blocks: iri
*** Bug 283051 has been marked as a duplicate of this bug. ***
This change was reverted because it UTF-8'ed some things that IE and Opera don't
UTF-8, breaking some web sites.  See bug 284474.
reopening this bug since the fix was basically reverted.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Severity: normal → enhancement
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha
Two comments I'm quoting were taken from bug 284474, but I'm adding my comment here because this bug is where we need to discuss what's gotta be done.

(In reply to comment #33)
> (In reply to comment #30)

> > My understanding is that for the mid-range future (after Firefox 1.5) it
> > would be very good to move to what's called the IE/Opera behavior, i.e.
> > using UTF-8 for the path/file part and the page encoding for the query
> > part. While this solution is not optimal, it would definitely be a big
> > step forward.
> 
> However, that behavior is in violation of RFC 3987 you wrote, or did I misread
> it? 

I haven't yet confirmed it myself (I have yet to find a Windows box where I can safely test IE 7 beta preview2??. It doesn't work on my Win 2k box), but it seems like IE 7 beta preview2 behaves exactly the same way as firefox with the backed-out patch in place (judging from a few test reports I got in the Korean mozilla community). I don't know whether that's a just oversight on the part of IE7 team or done on purpose to get IE7 in compliance with the letters of RFC 3987 at the risk of breaking some web sites. The former is more likely than the latter, but who knows....


Priority: -- → P4
Attached patch patch v3 (obsolete) — Splinter Review
This will make Mozilla compatible with IE and Opera.
When network.standard-url.encode-utf8 is true, URLs except query part will be encoded in UTF-8 (similar to IE6 "Always encode in UTF-8").
I added a new pref "network.standard-url.encode-query-utf8". The query part will be encoded in UTF-8 only if it's true (similar to "Send UTF-8 query strings" introduced in IE7).
This patch also contains a fix for bug 284474.
Assignee: darin → VYV03354
Status: REOPENED → ASSIGNED
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
If we are inputting non-ASCII URI to location bar directly, the sending request is not compatible at IE6(default settings).

+ jp-critical
Keywords: jp-critical
Comment on attachment 215656 [details] [diff] [review]
patch v3

>Index: netwerk/base/src/nsStandardURL.cpp

>+        nsCAutoString loweredSpec(spec);
>+        ToLowerCase(loweredSpec);
>+        GET_SEGMENT_ENCODER_FROM_SPEC(encoder, loweredSpec.get());

I don't understand why it is necessary to lowercase the entire URL spec.
Attached patch patch v3.1 (obsolete) — Splinter Review
using PL_strncasecmp instead of stricmp + ToLowerCase.
Attachment #169230 - Attachment is obsolete: true
Attachment #215656 - Attachment is obsolete: true
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
Attached patch patch v3.2 (obsolete) — Splinter Review
Removed file protocol hack since bug 278161 was fixed.
Attachment #215810 - Attachment is obsolete: true
Attachment #216628 - Flags: review?(jshin1987)
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)
Comment on attachment 216628 [details] [diff] [review]
patch v3.2

Looks good, but let biesi take a look.

>Index: netwerk/base/src/nsStandardURL.cpp
>
> #define NS_NET_PREF_ALWAYSENCODEINUTF8 "network.standard-url.encode-utf8"
>+#define NS_NET_PREF_ENCODEQUERYINUTF8 "network.standard-url.encode-query-utf8"

  Can you align the opening quote of a new line with that of the line above it? Even with that, it's only 79 chars. long.




>+#define GET_QUERY_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+         gEncodeQueryInUTF8)

Perhaps, it's better this way:

    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
                                 gEncodeQueryInUTF8)
Attachment #216628 - Flags: review?(jshin1987) → review?(cbiesinger)
Attached patch resolved points of comment #42 (obsolete) — Splinter Review
Attachment #216628 - Attachment is obsolete: true
Attachment #216631 - Flags: review?(cbiesinger)
Attachment #216628 - Flags: review?(cbiesinger)
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

unfortunate that we have to do this, oh well :(

what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't that need to be changed to deal with URIs that are partially in UTF-8 and partially in the origin charset?
Attachment #216631 - Flags: review?(cbiesinger) → review+
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

> what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't
> that need to be changed to deal with URIs that are partially in UTF-8 and
> partially in the origin charset?
That's bug 320807.
Attachment #216631 - Flags: superreview?(darin)
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

>Index: netwerk/base/src/nsStandardURL.cpp

> #define GET_SEGMENT_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8)
>+
>+#define GET_QUERY_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+                                 gEncodeQueryInUTF8)
>+
>+#define GET_SEGMENT_ENCODER_INTERNAL(name, useUTF8) \
>+    nsSegmentEncoder name(useUTF8 ? nsnull : mOriginCharset.get())

It's usually nice to define a macro above where it is first used.


sr=darin
Attachment #216631 - Flags: superreview?(darin) → superreview+
Thank you.
Carring over r+sr.
Attachment #216631 - Attachment is obsolete: true
Attachment #218073 - Flags: superreview+
Attachment #218073 - Flags: review+
checked-in.

Don't we need this for 1.8.1 if there are no regressions?
Status: ASSIGNED → RESOLVED
Closed: 20 years ago18 years ago
Resolution: --- → FIXED
I really want to land this on branch, but we will have to wait for bug 278161.
Depends on: 278161
I've filed bug 333859 to followup the query part issue.
Don't we need the fix for bug 320807 anywhere where we land this?
Depends on: 320807
(In reply to comment #51)
> Don't we need the fix for bug 320807 anywhere where we land this?
At least, we need to handle the "mixed-charset" url (comment #45).
*** Bug 340376 has been marked as a duplicate of this bug. ***
(In reply to comment #49)
> I really want to land this on branch, but we will have to wait for bug 278161.
> 

Bug 278161 did land on the branch on May 1st. Can we check this in too ? (there's no blocking-flag yet)
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment

Bug 278161 is fixed now.
And I didn't hear any regressions.
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
(In reply to comment #55)
> (From update of attachment 218073 [details] [diff] [review] [edit])
> Bug 278161 is fixed now.
> And I didn't hear any regressions.

What about comment 51?

Bug 320807 will not be suitable for landing on the branch because it's large and contains interface change.
Probably we will need a subset patch to handle only mixed-charset case.
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment

Clearing request until bug 320807 is solved
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
*** Bug 349671 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.