Closed
Bug 261929
Opened 20 years ago
Closed 18 years ago
Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
Categories
(Core :: Networking, enhancement, P4)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla1.9alpha1
People
(Reporter: c00lyman, Assigned: emk)
References
(Depends on 1 open bug, )
Details
(Keywords: intl, jp-critical)
Attachments
(1 file, 6 obsolete files)
9.78 KB,
patch
|
emk
:
review+
emk
:
superreview+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 There are german Umlaute in the image link addresses and it looks like mozilla can't figure out the right request address for the images. If you try the same page with IE you will see the images ... Reproducible: Always Steps to Reproduce: 1.load the page 2. 3. Actual Results: broken link images - picture is not loadad Expected Results: You should see a picture
Comment 1•20 years ago
|
||
Joerg: is "Always send URLs in UTF-8" enabled in your MSIE settings?
Unchecking the "Always send URLs as UTF-8" makes it not work in MSIE6.
Setting network.standard-url.encode-utf8 makes it work in Mozilla. If this pref defaults to true in MSIE, we should probably do the same.
Reporter | ||
Comment 4•20 years ago
|
||
Hey, great timing - you guys make a good job!
Comment 5•20 years ago
|
||
We have a dillemma here. It seems like MS IIS has a built-in support for translating url-escaped UTF-8 in URL to whatever is used on the file system (otherwise, MS IE wouldn't have set 'always send .. UTF-8' on by default, I guess). Our trouble is that a significant portion of Apache servers don't have a module (that does the same thing) on and are installed on Unix boxes where the local file system character encoding is not UTF-8. (this has been changing with recent Linux distributions using UTF-8 by default). In addition, the module is only available for apache 2.x which is lot less widely used than apache 1.x. Perhaps, we have to turn it on by default. In the past, I saw many web sites asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These days, I rarely see it. However, my sample may be skewed. What do others think about this?
Status: UNCONFIRMED → ASSIGNED
Component: Browser-General → Networking
Ever confirmed: true
Keywords: intl
Summary: You can't see the images on the page due to german ä in the image link You can't see the images on the page due to german ä in the image link → Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed) Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
Comment 6•20 years ago
|
||
I still prefer a failover-based solution (though it would be non-trivial to implement), but maybe the proliferation of IE means that we can assume UTF-8 as a defacto standard?
Comment 7•20 years ago
|
||
I am all for making UTF-8 ubiquitous, especially in URIs.
Comment 8•20 years ago
|
||
I also for encoding urls as utf-8 by default. But we must not change the behavior with unescaped urls and form content on the page. They must always use the page charset.
Comment 9•20 years ago
|
||
The IRI spec (currently at http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt) has been approved as an IETF Proposed Standard by the IESG (see http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html) and will soon be issued as an RFC. In short, IRI spec defines how to convert non-ASCII characters in a Web address to a fully standard URI, using UTF-8 and %HH-encoding. Given that this is now approved, Mozilla code should move towards implementing it without delay. In particular, this means that any URIs that contain non-ASCII characters should be converted to %HH-form for resolution using UTF-8. Using a legacy encoding is (and always has been) a clearly totally non-standard way of resolving Web addresses with non-ASCII characters. So: 1) 'network.standard-url.encode-utf8' should be on by default (and eventually become unnecessary) 2) 'always use the page charset' for 'unescaped uris', as Nil Soffer asks for, is clearly against standards. 3) For form content generated on a page by the user (as opposed to the query part of an uri), the page encoding has to be used. Opera and Safari already have a good implementation of IRIs. IE does so, too, with the exception of IDNs. Mozilla has all the right code, it just has to be activated and deployed. A very simple test is available at http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html. I suggest changing 'OS' to 'All', because this is a general issue, not a Win XP issue.
Comment 10•20 years ago
|
||
Martin, Yes, we are aware of the IRI specification, but has anyone proposed a migration strategy that web browsers might implement? Clearly, much of the web does not conform to the IRI specification, so it would be a mistake for Mozilla to switch to IRI encoding by default. Perhaps servers should indicate IRI expectation somehow?
OS: Windows XP → All
Hardware: PC → All
Comment 11•20 years ago
|
||
Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at least it appears to be true on my system) means that we should be in the clear to do the same. I would like to study IE's behavior more before I agree that we can make this change, but for now I'll put it on my TODO list for Moz 1.8 / FFox 1.1.
Assignee: general → darin
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.8beta
Updated•20 years ago
|
Target Milestone: mozilla1.8beta → ---
Updated•20 years ago
|
Target Milestone: --- → mozilla1.8beta
Comment 12•20 years ago
|
||
This patch flips the default value of the pref. This patch means that the URI object will discard the information it was given about the encoding of the page from which it was extracted. We might want to change that, and instead make the pref only affect the encoding and not affect the value of mOriginCharset. Or, perhaps that's unnecessary.
Attachment #169156 -
Flags: review?(smontagu)
Comment 13•20 years ago
|
||
Darin Fisher asks about a migration strategy. If you really want that, then the best idea is probably to do the following: 1) Request a resource using %-encoding based on UTF-8 2) If you get a 404 (not found), do another request based on the legacy encoding of the page. For that, it would indeed make sense to keep the page encoding in the URI object. There should probably be an option to switch part 2 off, because hopefully in the future it won't be needed anymore. This approach has been proposed ages ago, but I'm not sure it has ever been implemented. But it should not be too difficult. It has the following advantages: - The official, standard, thing is done first, bugwards compatibility stuff follows only later. - Servers serving things with legacy encodings 'get the message', i.e. they'll see all the UTF-8 requests in their logs. It has the following issues: - double requests for broken links - Exceptional wrong positive (if the UTF-8 form of a path coincides with an existing path in legacy encoding; for actual preexisting pages (as opposed to query-like things that get constructed), the chance of this happening is very low. The 404 is an efficient way (in particular in terms of implementation!) for the server to indicate that it's not doing UTF-8 (at least not on the resource requested). For the server to use some additional thing (e.g. header) to indicate that it's actually doing the standard thing (i.e. IRIs) as such doesn't seem like a good idea. If you're going to upgrade the server anyway, the best way to do it is probably to use mod_fileiri (see http://www.w3.org/2003/06/mod_fileiri/), which lets you use a feature already in the HTTP protocol (permanent redirect) to tell the client that it's using UTF-8 now.
Comment 14•20 years ago
|
||
(In reply to comment #11) > Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at > least it appears to be true on my system) means that we should be in the clear > to do the same. Darin, see comment #5. IIS apparently has that. However, the majority of Apache servers in the wild don't have Martin's IRI extension module installed. His module is only available for Apache 2.x, but the share of Apache 1.x is far larger than that of Apache 2.x. However, due to IE's behavior, it seems that a lot of apache 1.x admins have added some ad-hock solutions because nowadays I rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS IE (again comment #5). As for Martin's migration strategy, that has been considered before, but somehow it wasn't implemented (for perf. reason?). > object will discard the information it was given about the encoding of the page This should be kept, IMO.
Comment 15•20 years ago
|
||
Re. my fileiri module for Apache 1.3, I don't think it would be difficult to port it; I just haven't done it, nor have I heard from anybody else doing it. If you need it, please drop me a line. Re. Jungshik Shin's comment "nowadays I rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS IE", if there is any place where non-UTF-8-based URIs are (or where) in wide use, it's Korea, and Jungshik is extremely familliar with the situation there.
Comment 16•20 years ago
|
||
> (see http://www.w3.org/2003/06/mod_fileiri/)
that seems to require authentication...
Comment 17•20 years ago
|
||
OK, this version preserves the value of nsIURI::originCharset, and makes the implementation simply ignore it when the encode-as-utf8 preference is set.
Attachment #169156 -
Attachment is obsolete: true
Updated•20 years ago
|
Attachment #169156 -
Flags: review?(smontagu)
Updated•20 years ago
|
Attachment #169230 -
Flags: review?(jshin)
Comment 18•20 years ago
|
||
I've changed access to http://www.w3.org/2003/06/mod_fileiri/. It should be visible for everybody now. Sorry, I thought this was already visible.
Comment 19•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review] v2 patch This looks fine as long as references to remote files are concerned. I just came across another complication. See bug 263570 comment #17. I believe that the problem described in bug 263570 comment #17 only happens when this option is set true. We may just go ahead here and then have to come up with a way to special-case(?) local file access.
Attachment #169230 -
Flags: review?(jshin) → review+
Comment 20•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review] v2 patch If we want to display IRIs unescaped in the address bar, what do we need to do? I'm not clear whether the pref "network.standard-url.escape-utf8" affects only the display or also the form in which the URI/IRI is sent to the server.
Comment 21•20 years ago
|
||
(In reply to comment #20) > If we want to display IRIs unescaped in the address bar, what do we need to do? > I'm not clear whether the pref "network.standard-url.escape-utf8" affects only > the display or also the form in which the URI/IRI is sent to the server. It's only for what's sent to the server. As for the display in the addressbar, I guess we have to use |nsITextToSubURI|
Comment 22•20 years ago
|
||
yeah, what jshin said.
Updated•20 years ago
|
Attachment #169230 -
Flags: superreview?(bzbarsky)
Comment 23•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review] v2 patch sr=bzbarsky
Attachment #169230 -
Flags: superreview?(bzbarsky) → superreview+
Comment 24•20 years ago
|
||
> We may just go ahead here and then have to come up with a way to
> special-case(?) local file access.
Hmm... yeah, local file access is going to be an interesting issue. We could
ignore this preference when nsStandardURL::mSupportsFileURL is set the true.
That may be best. Thoughts? (bz: sorry for the possibly premature sr request)
Comment 25•20 years ago
|
||
Hmm... yeah, if this breaks us on Windows then we may want to condition on the file url flag. But really, it would break us any time the HTML charset doesn't match the native charset, no? So it's just a problem no matter what...
Comment 26•20 years ago
|
||
if we're redesigning this, i have a consumer who wants the ability to send unencoded urls :).
Comment 27•20 years ago
|
||
> But really, it would break us any time the HTML charset doesn't match the native
> charset, no? So it's just a problem no matter what...
boris, well... if everything a user does is Shift-JIS (documents on their
harddrive, documents downloaded from websites, etc.) then it would be a problem
to switch suddenly to UTF-8 for local files. but you make a good point when
considering that the average user probably does encounter documents in a variety
of character encodings.
Comment 28•20 years ago
|
||
> if we're redesigning this, i have a consumer who wants the ability to send
> unencoded urls :).
timeless: we're not redesigning this per se. we're just talking about when to
enable an existing preference (and some details related to it). you might want
to file a new bug for that feature.
Comment 29•20 years ago
|
||
(In reply to comment #25) > But really, it would break us any time the HTML charset doesn't match the native > charset, no? So it's just a problem no matter what... How about changing |net_GetFileFromURLSpec| so that it works with url-escaped UTF-8's and call |InitWithPath| with a UTF-16 path instead of |InitWithNativePath|? If its ripple effect is so huge that we need to keep the backward compatibility (i.e. have to make so many changes throughout the code), we may consider checking the UTF8ness of an escaped URLspec and branching depending on the result. http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsURLHelperWin.cpp#99
Comment 30•20 years ago
|
||
Making net_GetFileFromURLSpec handle UTF-8 encoded file:// URLs is probably a good idea, and it will probably be necessary to support UTF-8 (and native encoding as a fallback) when we start supporting Unicode file paths under windows.
Comment 31•20 years ago
|
||
OK, I went ahead and checked this patch in. I want to make sure we get as much feedback as possible on this change. We need to fix the way we handle non-ASCII file paths anyways, so that need not block this patch. Marking FIXED.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Comment 32•20 years ago
|
||
Thanks. I've just come across a page (actually, a Korean firefox 1.0 user reported that he couldn't see images in the page) in EUC-KR (one at one of major portals in Korea) where I had to turn on this because the server only understands IRI. The page address is http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category= With this option on, four photos of a CD player will show up. Otherwise, they're replaced with an image for 'not found files'.
Comment 33•19 years ago
|
||
*** Bug 283051 has been marked as a duplicate of this bug. ***
Comment 34•19 years ago
|
||
This change was reverted because it UTF-8'ed some things that IE and Opera don't UTF-8, breaking some web sites. See bug 284474.
Comment 35•19 years ago
|
||
reopening this bug since the fix was basically reverted.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•19 years ago
|
Severity: normal → enhancement
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha
Comment 36•19 years ago
|
||
Two comments I'm quoting were taken from bug 284474, but I'm adding my comment here because this bug is where we need to discuss what's gotta be done. (In reply to comment #33) > (In reply to comment #30) > > My understanding is that for the mid-range future (after Firefox 1.5) it > > would be very good to move to what's called the IE/Opera behavior, i.e. > > using UTF-8 for the path/file part and the page encoding for the query > > part. While this solution is not optimal, it would definitely be a big > > step forward. > > However, that behavior is in violation of RFC 3987 you wrote, or did I misread > it? I haven't yet confirmed it myself (I have yet to find a Windows box where I can safely test IE 7 beta preview2??. It doesn't work on my Win 2k box), but it seems like IE 7 beta preview2 behaves exactly the same way as firefox with the backed-out patch in place (judging from a few test reports I got in the Korean mozilla community). I don't know whether that's a just oversight on the part of IE7 team or done on purpose to get IE7 in compliance with the letters of RFC 3987 at the risk of breaking some web sites. The former is more likely than the latter, but who knows....
Updated•18 years ago
|
Priority: -- → P4
Assignee | ||
Comment 37•18 years ago
|
||
This will make Mozilla compatible with IE and Opera. When network.standard-url.encode-utf8 is true, URLs except query part will be encoded in UTF-8 (similar to IE6 "Always encode in UTF-8"). I added a new pref "network.standard-url.encode-query-utf8". The query part will be encoded in UTF-8 only if it's true (similar to "Send UTF-8 query strings" introduced in IE7). This patch also contains a fix for bug 284474.
Assignee: darin → VYV03354
Status: REOPENED → ASSIGNED
Attachment #215656 -
Flags: superreview?(darin)
Attachment #215656 -
Flags: review?(jshin1987)
Comment 38•18 years ago
|
||
If we are inputting non-ASCII URI to location bar directly, the sending request is not compatible at IE6(default settings). + jp-critical
Keywords: jp-critical
Comment 39•18 years ago
|
||
Comment on attachment 215656 [details] [diff] [review] patch v3 >Index: netwerk/base/src/nsStandardURL.cpp >+ nsCAutoString loweredSpec(spec); >+ ToLowerCase(loweredSpec); >+ GET_SEGMENT_ENCODER_FROM_SPEC(encoder, loweredSpec.get()); I don't understand why it is necessary to lowercase the entire URL spec.
Assignee | ||
Comment 40•18 years ago
|
||
using PL_strncasecmp instead of stricmp + ToLowerCase.
Attachment #169230 -
Attachment is obsolete: true
Attachment #215656 -
Attachment is obsolete: true
Attachment #215810 -
Flags: superreview?(darin)
Attachment #215810 -
Flags: review?(jshin1987)
Attachment #215656 -
Flags: superreview?(darin)
Attachment #215656 -
Flags: review?(jshin1987)
Assignee | ||
Comment 41•18 years ago
|
||
Removed file protocol hack since bug 278161 was fixed.
Attachment #215810 -
Attachment is obsolete: true
Attachment #216628 -
Flags: review?(jshin1987)
Attachment #215810 -
Flags: superreview?(darin)
Attachment #215810 -
Flags: review?(jshin1987)
Comment 42•18 years ago
|
||
Comment on attachment 216628 [details] [diff] [review] patch v3.2 Looks good, but let biesi take a look. >Index: netwerk/base/src/nsStandardURL.cpp > > #define NS_NET_PREF_ALWAYSENCODEINUTF8 "network.standard-url.encode-utf8" >+#define NS_NET_PREF_ENCODEQUERYINUTF8 "network.standard-url.encode-query-utf8" Can you align the opening quote of a new line with that of the line above it? Even with that, it's only 79 chars. long. >+#define GET_QUERY_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ >+ gEncodeQueryInUTF8) Perhaps, it's better this way: GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ gEncodeQueryInUTF8)
Attachment #216628 -
Flags: review?(jshin1987) → review?(cbiesinger)
Assignee | ||
Comment 43•18 years ago
|
||
Attachment #216628 -
Attachment is obsolete: true
Attachment #216631 -
Flags: review?(cbiesinger)
Attachment #216628 -
Flags: review?(cbiesinger)
Comment 44•18 years ago
|
||
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 unfortunate that we have to do this, oh well :( what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't that need to be changed to deal with URIs that are partially in UTF-8 and partially in the origin charset?
Attachment #216631 -
Flags: review?(cbiesinger) → review+
Assignee | ||
Comment 45•18 years ago
|
||
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 > what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't > that need to be changed to deal with URIs that are partially in UTF-8 and > partially in the origin charset? That's bug 320807.
Attachment #216631 -
Flags: superreview?(darin)
Comment 46•18 years ago
|
||
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 >Index: netwerk/base/src/nsStandardURL.cpp > #define GET_SEGMENT_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8) >+ >+#define GET_QUERY_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ >+ gEncodeQueryInUTF8) >+ >+#define GET_SEGMENT_ENCODER_INTERNAL(name, useUTF8) \ >+ nsSegmentEncoder name(useUTF8 ? nsnull : mOriginCharset.get()) It's usually nice to define a macro above where it is first used. sr=darin
Attachment #216631 -
Flags: superreview?(darin) → superreview+
Assignee | ||
Comment 47•18 years ago
|
||
Thank you. Carring over r+sr.
Attachment #216631 -
Attachment is obsolete: true
Attachment #218073 -
Flags: superreview+
Attachment #218073 -
Flags: review+
Comment 48•18 years ago
|
||
checked-in. Don't we need this for 1.8.1 if there are no regressions?
Status: ASSIGNED → RESOLVED
Closed: 20 years ago → 18 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 49•18 years ago
|
||
I really want to land this on branch, but we will have to wait for bug 278161.
Depends on: 278161
Assignee | ||
Comment 50•18 years ago
|
||
I've filed bug 333859 to followup the query part issue.
Comment 51•18 years ago
|
||
Don't we need the fix for bug 320807 anywhere where we land this?
Depends on: 320807
Assignee | ||
Comment 52•18 years ago
|
||
(In reply to comment #51) > Don't we need the fix for bug 320807 anywhere where we land this? At least, we need to handle the "mixed-charset" url (comment #45).
Comment 53•18 years ago
|
||
*** Bug 340376 has been marked as a duplicate of this bug. ***
Comment 54•18 years ago
|
||
(In reply to comment #49) > I really want to land this on branch, but we will have to wait for bug 278161. > Bug 278161 did land on the branch on May 1st. Can we check this in too ? (there's no blocking-flag yet)
Assignee | ||
Comment 55•18 years ago
|
||
Comment on attachment 218073 [details] [diff] [review] resolved darin's comment Bug 278161 is fixed now. And I didn't hear any regressions.
Attachment #218073 -
Flags: approval-branch-1.8.1?(cbiesinger)
Comment 56•18 years ago
|
||
(In reply to comment #55) > (From update of attachment 218073 [details] [diff] [review] [edit]) > Bug 278161 is fixed now. > And I didn't hear any regressions. What about comment 51?
Assignee | ||
Comment 57•18 years ago
|
||
Bug 320807 will not be suitable for landing on the branch because it's large and contains interface change. Probably we will need a subset patch to handle only mixed-charset case.
Assignee | ||
Comment 58•18 years ago
|
||
Comment on attachment 218073 [details] [diff] [review] resolved darin's comment Clearing request until bug 320807 is solved
Attachment #218073 -
Flags: approval-branch-1.8.1?(cbiesinger)
Comment 59•18 years ago
|
||
*** Bug 349671 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•