Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)

RESOLVED FIXED in mozilla1.9alpha1

Status

()

Core
Networking
P4
enhancement
RESOLVED FIXED
13 years ago
5 years ago

People

(Reporter: Joerg Bartelheimer, Assigned: emk)

Tracking

(Depends on: 1 bug, Blocks: 1 bug, {intl, jp-critical})

Trunk
mozilla1.9alpha1
intl, jp-critical
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(URL)

Attachments

(1 attachment, 6 obsolete attachments)

(Reporter)

Description

13 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910

There are german Umlaute in the image link addresses and it looks like mozilla
can't figure out the right request address for the images. If you try the same
page with IE you will see the images ...

Reproducible: Always
Steps to Reproduce:
1.load the page
2.
3.

Actual Results:  
broken link images - picture is not loadad

Expected Results:  
You should see a picture
Joerg: is "Always send URLs in UTF-8" enabled in your MSIE settings?
Unchecking the "Always send URLs as UTF-8" makes it not work in MSIE6.
Setting network.standard-url.encode-utf8 makes it work in Mozilla.  If this pref
defaults to true in MSIE, we should probably do the same.
(Reporter)

Comment 4

13 years ago
Hey, great timing - you guys make a good job!

Comment 5

13 years ago
We have a dillemma here. It seems like MS IIS has a built-in support for
translating url-escaped UTF-8 in URL to whatever is used on the file system
(otherwise, MS IE wouldn't have set 'always send .. UTF-8' on by default, I
guess). Our trouble is that a significant portion of Apache servers don't have a
module (that does the same thing) on and are installed on Unix boxes where the
local file system character encoding is not UTF-8. (this has been changing with
recent Linux distributions using UTF-8 by default).  In addition, the module is
only available for apache 2.x which is lot less widely used than apache 1.x.

Perhaps, we have to turn it on by default. In the past, I saw many web sites
asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These
days, I rarely see it. However, my sample may be skewed. What do others think
about this? 


 
Status: UNCONFIRMED → ASSIGNED
Component: Browser-General → Networking
Ever confirmed: true
Keywords: intl
Summary: You can't see the images on the page due to german ä in the image link You can't see the images on the page due to german ä in the image link → Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed) Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)

Comment 6

13 years ago
I still prefer a failover-based solution (though it would be non-trivial to
implement), but maybe the proliferation of IE means that we can assume UTF-8 as
a defacto standard?
I am all for making UTF-8 ubiquitous, especially in URIs.

Comment 8

13 years ago
I also for encoding urls as utf-8 by default.

But we must not change the behavior with unescaped urls and form content on the page. They must 
always use the page charset.

Comment 9

13 years ago
The IRI spec (currently at
http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt)
has been approved as an IETF Proposed Standard by the IESG (see
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html)
and will soon be issued as an RFC.

In short, IRI spec defines how to convert non-ASCII characters in a
Web address to a fully standard URI, using UTF-8 and %HH-encoding.

Given that this is now approved, Mozilla code should move towards
implementing it without delay. In particular, this means that any
URIs that contain non-ASCII characters should be converted to
%HH-form for resolution using UTF-8. Using a legacy encoding is
(and always has been) a clearly totally non-standard way of
resolving Web addresses with non-ASCII characters.

So:
1) 'network.standard-url.encode-utf8' should be on by default
   (and eventually become unnecessary)
2) 'always use the page charset' for 'unescaped uris', as
   Nil Soffer asks for, is clearly against standards.
3) For form content generated on a page by the user
   (as opposed to the query part of an uri), the page
   encoding has to be used.

Opera and Safari already have a good implementation of IRIs.
IE does so, too, with the exception of IDNs. Mozilla has
all the right code, it just has to be activated and deployed.
A very simple test is available at
http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html.

I suggest changing 'OS' to 'All', because this is a general issue,
not a Win XP issue.

Comment 10

13 years ago
Martin,

Yes, we are aware of the IRI specification, but has anyone proposed a migration
strategy that web browsers might implement?  Clearly, much of the web does not
conform to the IRI specification, so it would be a mistake for Mozilla to switch
to IRI encoding by default.  Perhaps servers should indicate IRI expectation
somehow?
OS: Windows XP → All
Hardware: PC → All

Comment 11

13 years ago
Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
least it appears to be true on my system) means that we should be in the clear
to do the same.

I would like to study IE's behavior more before I agree that we can make this
change, but for now I'll put it on my TODO list for Moz 1.8 / FFox 1.1.
Assignee: general → darin
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.8beta
Target Milestone: mozilla1.8beta → ---

Updated

13 years ago
Target Milestone: --- → mozilla1.8beta

Comment 12

13 years ago
Created attachment 169156 [details] [diff] [review]
v1 patch

This patch flips the default value of the pref.  This patch means that the URI
object will discard the information it was given about the encoding of the page
from which it was extracted.  We might want to change that, and instead make
the pref only affect the encoding and not affect the value of mOriginCharset. 
Or, perhaps that's unnecessary.
Attachment #169156 - Flags: review?(smontagu)

Comment 13

13 years ago
Darin Fisher asks about a migration strategy.
If you really want that, then the best idea is
probably to do the following:

1) Request a resource using %-encoding based on UTF-8
2) If you get a 404 (not found), do another request
   based on the legacy encoding of the page.

For that, it would indeed make sense to keep the
page encoding in the URI object. There should
probably be an option to switch part 2 off, because
hopefully in the future it won't be needed anymore.

This approach has been proposed ages ago, but I'm
not sure it has ever been implemented. But it should
not be too difficult.

It has the following advantages:
- The official, standard, thing is done first,
  bugwards compatibility stuff follows only later.
- Servers serving things with legacy encodings
  'get the message', i.e. they'll see all the
  UTF-8 requests in their logs.

It has the following issues:
- double requests for broken links
- Exceptional wrong positive (if the UTF-8 form
  of a path coincides with an existing path in
  legacy encoding; for actual preexisting pages
  (as opposed to query-like things that get
  constructed), the chance of this happening
  is very low.

The 404 is an efficient way (in particular in terms of
implementation!) for the server to indicate that it's
not doing UTF-8 (at least not on the resource requested).

For the server to use some additional thing (e.g. header)
to indicate that it's actually doing the standard thing
(i.e. IRIs) as such doesn't seem like a good idea.

If you're going to upgrade the server anyway, the best
way to do it is probably to use mod_fileiri
(see http://www.w3.org/2003/06/mod_fileiri/), which lets
you use a feature already in the HTTP protocol
(permanent redirect) to tell the client that it's using
UTF-8 now.

Comment 14

13 years ago
(In reply to comment #11)
> Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
> least it appears to be true on my system) means that we should be in the clear
> to do the same.

Darin, see comment #5. IIS apparently has that. However, the majority of Apache
servers in the wild don't have Martin's IRI extension module installed. His
module is only available for Apache 2.x, but the share of Apache 1.x is far
larger than that of Apache 2.x. However, due to IE's behavior, it seems that a
lot of apache 1.x admins have added some ad-hock solutions because nowadays I
rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS
IE (again comment #5).

As for Martin's migration strategy, that has been considered before, but somehow
it wasn't implemented (for perf. reason?).  

> object will discard the information it was given about the encoding of the page

This should be kept, IMO. 


Comment 15

13 years ago
Re. my fileiri module for Apache 1.3, I don't think it would be
difficult to port it; I just haven't done it, nor have I heard
from anybody else doing it. If you need it, please drop me a line.

Re. Jungshik Shin's comment "nowadays I rarely see a note that
asks users to turn off 'always send urls in UTF-8' in MS IE",
if there is any place where non-UTF-8-based URIs are (or where)
in wide use, it's Korea, and Jungshik is extremely familliar
with the situation there.
> (see http://www.w3.org/2003/06/mod_fileiri/)

that seems to require authentication...

Comment 17

13 years ago
Created attachment 169230 [details] [diff] [review]
v2 patch

OK, this version preserves the value of nsIURI::originCharset, and makes the
implementation simply ignore it when the encode-as-utf8 preference is set.
Attachment #169156 - Attachment is obsolete: true

Updated

13 years ago
Attachment #169156 - Flags: review?(smontagu)

Updated

13 years ago
Attachment #169230 - Flags: review?(jshin)

Comment 18

13 years ago
I've changed access to http://www.w3.org/2003/06/mod_fileiri/.
It should be visible for everybody now.
Sorry, I thought this was already visible.

Comment 19

13 years ago
Comment on attachment 169230 [details] [diff] [review]
v2 patch


This looks fine as long as references to remote files are concerned. I just
came across another complication. See bug 263570 comment #17. I believe that
the problem described in bug 263570 comment #17 only happens when this option
is set true. We may just go ahead here and then have to come up with a way to
special-case(?) local file access.
Attachment #169230 - Flags: review?(jshin) → review+
Comment on attachment 169230 [details] [diff] [review]
v2 patch

If we want to display IRIs unescaped in the address bar, what do we need to do?
I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
the display or also the form in which the URI/IRI is sent to the server.

Comment 21

13 years ago
(In reply to comment #20)

> If we want to display IRIs unescaped in the address bar, what do we need to do?
> I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
> the display or also the form in which the URI/IRI is sent to the server.

It's only for  what's sent to the server. As for the display in the addressbar,
I guess we have to use |nsITextToSubURI| 


Comment 22

13 years ago
yeah, what jshin said.

Updated

13 years ago
Attachment #169230 - Flags: superreview?(bzbarsky)
Comment on attachment 169230 [details] [diff] [review]
v2 patch

sr=bzbarsky
Attachment #169230 - Flags: superreview?(bzbarsky) → superreview+

Comment 24

13 years ago
> We may just go ahead here and then have to come up with a way to
> special-case(?) local file access. 

Hmm... yeah, local file access is going to be an interesting issue.  We could
ignore this preference when nsStandardURL::mSupportsFileURL is set the true. 
That may be best.  Thoughts?  (bz: sorry for the possibly premature sr request)
Hmm... yeah, if this breaks us on Windows then we may want to condition on the
file url flag.

But really, it would break us any time the HTML charset doesn't match the native
charset, no?  So it's just a problem no matter what...

Comment 26

13 years ago
if we're redesigning this, i have a consumer who wants the ability to send
unencoded urls :).

Comment 27

13 years ago
> But really, it would break us any time the HTML charset doesn't match the native
> charset, no?  So it's just a problem no matter what...

boris, well... if everything a user does is Shift-JIS (documents on their
harddrive, documents downloaded from websites, etc.) then it would be a problem
to switch suddenly to UTF-8 for local files.  but you make a good point when
considering that the average user probably does encounter documents in a variety
of character encodings.

Comment 28

13 years ago
> if we're redesigning this, i have a consumer who wants the ability to send
> unencoded urls :).

timeless: we're not redesigning this per se.  we're just talking about when to
enable an existing preference (and some details related to it).  you might want
to file a new bug for that feature.

Comment 29

13 years ago
(In reply to comment #25)

> But really, it would break us any time the HTML charset doesn't match the native
> charset, no?  So it's just a problem no matter what...

How about changing |net_GetFileFromURLSpec| so that it works with url-escaped
UTF-8's and call |InitWithPath| with a UTF-16 path instead of
|InitWithNativePath|? If its ripple effect is so huge that we need to keep the
backward compatibility (i.e. have to make so many changes throughout the code),
we may consider checking the UTF8ness of an escaped URLspec and branching
depending on the result.

  http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsURLHelperWin.cpp#99

Comment 30

13 years ago
Making net_GetFileFromURLSpec handle UTF-8 encoded file:// URLs is probably a
good idea, and it will probably be necessary to support UTF-8 (and native
encoding as a fallback) when we start supporting Unicode file paths under windows.

Comment 31

13 years ago
OK, I went ahead and checked this patch in.  I want to make sure we get as much
feedback as possible on this change.  We need to fix the way we handle non-ASCII
file paths anyways, so that need not block this patch.

Marking FIXED.
Status: NEW → RESOLVED
Last Resolved: 13 years ago
Resolution: --- → FIXED

Comment 32

13 years ago
Thanks. I've just come across a page (actually, a Korean firefox 1.0 user
reported that he couldn't see images in the page)  in EUC-KR (one at one of
major portals in Korea) where I had to turn on this because the server only
understands IRI. 
The page address  is
http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category=
With this option on, four photos of a CD player will show up. Otherwise, they're
replaced with an image for 'not found files'. 

Updated

13 years ago
Blocks: 42899
*** Bug 283051 has been marked as a duplicate of this bug. ***

Comment 34

12 years ago
This change was reverted because it UTF-8'ed some things that IE and Opera don't
UTF-8, breaking some web sites.  See bug 284474.

Comment 35

12 years ago
reopening this bug since the fix was basically reverted.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Updated

12 years ago
Severity: normal → enhancement
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha

Comment 36

11 years ago
Two comments I'm quoting were taken from bug 284474, but I'm adding my comment here because this bug is where we need to discuss what's gotta be done.

(In reply to comment #33)
> (In reply to comment #30)

> > My understanding is that for the mid-range future (after Firefox 1.5) it
> > would be very good to move to what's called the IE/Opera behavior, i.e.
> > using UTF-8 for the path/file part and the page encoding for the query
> > part. While this solution is not optimal, it would definitely be a big
> > step forward.
> 
> However, that behavior is in violation of RFC 3987 you wrote, or did I misread
> it? 

I haven't yet confirmed it myself (I have yet to find a Windows box where I can safely test IE 7 beta preview2??. It doesn't work on my Win 2k box), but it seems like IE 7 beta preview2 behaves exactly the same way as firefox with the backed-out patch in place (judging from a few test reports I got in the Korean mozilla community). I don't know whether that's a just oversight on the part of IE7 team or done on purpose to get IE7 in compliance with the letters of RFC 3987 at the risk of breaking some web sites. The former is more likely than the latter, but who knows....


Updated

11 years ago
Priority: -- → P4
(Assignee)

Comment 37

11 years ago
Created attachment 215656 [details] [diff] [review]
patch v3

This will make Mozilla compatible with IE and Opera.
When network.standard-url.encode-utf8 is true, URLs except query part will be encoded in UTF-8 (similar to IE6 "Always encode in UTF-8").
I added a new pref "network.standard-url.encode-query-utf8". The query part will be encoded in UTF-8 only if it's true (similar to "Send UTF-8 query strings" introduced in IE7).
This patch also contains a fix for bug 284474.
Assignee: darin → VYV03354
Status: REOPENED → ASSIGNED
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
If we are inputting non-ASCII URI to location bar directly, the sending request is not compatible at IE6(default settings).

+ jp-critical
Keywords: jp-critical

Comment 39

11 years ago
Comment on attachment 215656 [details] [diff] [review]
patch v3

>Index: netwerk/base/src/nsStandardURL.cpp

>+        nsCAutoString loweredSpec(spec);
>+        ToLowerCase(loweredSpec);
>+        GET_SEGMENT_ENCODER_FROM_SPEC(encoder, loweredSpec.get());

I don't understand why it is necessary to lowercase the entire URL spec.
(Assignee)

Comment 40

11 years ago
Created attachment 215810 [details] [diff] [review]
patch v3.1

using PL_strncasecmp instead of stricmp + ToLowerCase.
Attachment #169230 - Attachment is obsolete: true
Attachment #215656 - Attachment is obsolete: true
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
Blocks: 311387
(Assignee)

Comment 41

11 years ago
Created attachment 216628 [details] [diff] [review]
patch v3.2

Removed file protocol hack since bug 278161 was fixed.
Attachment #215810 - Attachment is obsolete: true
Attachment #216628 - Flags: review?(jshin1987)
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)

Comment 42

11 years ago
Comment on attachment 216628 [details] [diff] [review]
patch v3.2

Looks good, but let biesi take a look.

>Index: netwerk/base/src/nsStandardURL.cpp
>
> #define NS_NET_PREF_ALWAYSENCODEINUTF8 "network.standard-url.encode-utf8"
>+#define NS_NET_PREF_ENCODEQUERYINUTF8 "network.standard-url.encode-query-utf8"

  Can you align the opening quote of a new line with that of the line above it? Even with that, it's only 79 chars. long.




>+#define GET_QUERY_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+         gEncodeQueryInUTF8)

Perhaps, it's better this way:

    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
                                 gEncodeQueryInUTF8)
Attachment #216628 - Flags: review?(jshin1987) → review?(cbiesinger)
(Assignee)

Comment 43

11 years ago
Created attachment 216631 [details] [diff] [review]
resolved points of comment #42
Attachment #216628 - Attachment is obsolete: true
Attachment #216631 - Flags: review?(cbiesinger)
Attachment #216628 - Flags: review?(cbiesinger)
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

unfortunate that we have to do this, oh well :(

what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't that need to be changed to deal with URIs that are partially in UTF-8 and partially in the origin charset?
Attachment #216631 - Flags: review?(cbiesinger) → review+
(Assignee)

Comment 45

11 years ago
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

> what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't
> that need to be changed to deal with URIs that are partially in UTF-8 and
> partially in the origin charset?
That's bug 320807.
Attachment #216631 - Flags: superreview?(darin)

Comment 46

11 years ago
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42

>Index: netwerk/base/src/nsStandardURL.cpp

> #define GET_SEGMENT_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8)
>+
>+#define GET_QUERY_ENCODER(name) \
>+    GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+                                 gEncodeQueryInUTF8)
>+
>+#define GET_SEGMENT_ENCODER_INTERNAL(name, useUTF8) \
>+    nsSegmentEncoder name(useUTF8 ? nsnull : mOriginCharset.get())

It's usually nice to define a macro above where it is first used.


sr=darin
Attachment #216631 - Flags: superreview?(darin) → superreview+
(Assignee)

Comment 47

11 years ago
Created attachment 218073 [details] [diff] [review]
resolved darin's comment

Thank you.
Carring over r+sr.
Attachment #216631 - Attachment is obsolete: true
Attachment #218073 - Flags: superreview+
Attachment #218073 - Flags: review+
checked-in.

Don't we need this for 1.8.1 if there are no regressions?
Status: ASSIGNED → RESOLVED
Last Resolved: 13 years ago11 years ago
Resolution: --- → FIXED
(Assignee)

Comment 49

11 years ago
I really want to land this on branch, but we will have to wait for bug 278161.
Depends on: 278161
(Assignee)

Comment 50

11 years ago
I've filed bug 333859 to followup the query part issue.
Don't we need the fix for bug 320807 anywhere where we land this?
Depends on: 320807
(Assignee)

Comment 52

11 years ago
(In reply to comment #51)
> Don't we need the fix for bug 320807 anywhere where we land this?
At least, we need to handle the "mixed-charset" url (comment #45).
*** Bug 340376 has been marked as a duplicate of this bug. ***

Comment 54

11 years ago
(In reply to comment #49)
> I really want to land this on branch, but we will have to wait for bug 278161.
> 

Bug 278161 did land on the branch on May 1st. Can we check this in too ? (there's no blocking-flag yet)
(Assignee)

Comment 55

11 years ago
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment

Bug 278161 is fixed now.
And I didn't hear any regressions.
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
(In reply to comment #55)
> (From update of attachment 218073 [details] [diff] [review] [edit])
> Bug 278161 is fixed now.
> And I didn't hear any regressions.

What about comment 51?

(Assignee)

Comment 57

11 years ago
Bug 320807 will not be suitable for landing on the branch because it's large and contains interface change.
Probably we will need a subset patch to handle only mixed-charset case.
(Assignee)

Comment 58

11 years ago
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment

Clearing request until bug 320807 is solved
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
*** Bug 349671 has been marked as a duplicate of this bug. ***
Duplicate of this bug: 378267
Duplicate of this bug: 393246
Depends on: 501868
You need to log in before you can comment on or make changes to this bug.