Closed Bug 66515: opened 24 years ago, closed 12 years ago

Mozilla incorrectly rewrites URLs containing ISO characters

Categories: Core :: DOM: Navigation (defect)
Priority: Not set
Severity: normal

Status: RESOLVED WORKSFORME

Reporter: michele; Assignee: nobody

Attachments: 1 file

If I type the URL http://www.dmoz.org/World/Français/, Mozilla rewrites it to
  http://www.dmoz.org/World/Fran%C3%A7ais/
instead of
  http://www.dmoz.org/World/Fran%E7ais/

The same URL is translated correctly if it appears in an <A HREF> tag in an HTML
document.
Confirming with build 2001012420 on NT4. Platform/OS -> All/All,
Component -> Networking, Severity -> Major (can't type in that URL)
Assignee: asa → neeti
Severity: normal → major
Status: UNCONFIRMED → NEW
Component: Browser-General → Networking
Ever confirmed: true
OS: Linux → All
QA Contact: doronr → tever
Hardware: PC → All
Could be a dup of bug 31225, although here the URI works as long as it is not
typed in by hand.
Adding dependency to URI tracking bug.
Blocks: 61999
No, not a dup of bug 31225; this is not related to host resolving. It seems to
be related to character encoding and conversion, not to URL parsing.
Keywords: nsbeta1
Target Milestone: --- → mozilla0.9.1
I think this is being caused by the difference in how we are escaping the URLs
for the location bar vs. the href handler. Someone in docshell should verify this. 
Assignee: neeti → adamlock
Component: Networking → Embedding: Docshell
QA Contact: tever → adamlock
->Chak per Jud

Assignee: adamlock → chak
I think the issue here is the usage of ToNewUTF8String() in NS_NewURI() at

http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86

Piping a URL with non-ASCII chars through ToNewUTF8String() to force a
conversion to a single-byte char* seems incorrect. This function converts the
'ç' to UTF-8, which results in two bytes.
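
To illustrate at the byte level (a standalone demo, not Mozilla code): 'ç' is
U+00E7, which is the single byte 0xE7 in ISO-8859-1 but the two bytes 0xC3 0xA7
in UTF-8, so percent-escaping the two forms yields %E7 vs. %C3%A7:

#include <cstddef>
#include <cstdio>

// Percent-escape a raw byte string; assumes the bytes are already in the
// desired charset. Demo only: escapes every non-ASCII byte.
static void dumpEscaped(const unsigned char* bytes, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (bytes[i] < 0x80)
            std::putchar(bytes[i]);
        else
            std::printf("%%%02X", bytes[i]);
    }
    std::putchar('\n');
}

int main() {
    const unsigned char latin1[] = {'F','r','a','n',0xE7,'a','i','s'};
    const unsigned char utf8[]   = {'F','r','a','n',0xC3,0xA7,'a','i','s'};
    dumpEscaped(latin1, sizeof latin1);  // prints Fran%E7ais
    dumpEscaped(utf8,   sizeof utf8);    // prints Fran%C3%A7ais
    return 0;
}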

Finally, when the HTTP request string is built inside of
nsHTTPRequest::formBuffer() (at
http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/http/src/nsHTTPRequest.cpp#410)
a call to GetPath() is made, which results in the escaped string
"World/Fran%C3%A7ais/" being added to the request; hence the server responds
with a page not found.

I think the way to fix this would be to call ToNewCString() instead of
ToNewUTF8String() (at
http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsNetUtil.h#86).
(I'll submit the patch.)
I tested with that change and it seems to work fine. I'll let the experts in
this area tell me if this breaks anything else and/or if there's a better way
to fix this.

PS: If you want to try out this change yourself: since the functions in
nsNetUtil.h are inlined, you may have to do a clean build of at least netwerk
and docshell to test this change, i.e. just changing nsNetUtil.h and doing a
make won't help.


Is there anyone else who can r= this one? ...Thanks
Seems like ftang added the UTF-8 conversion. Any reason why it should not be
changed back, Frank?
No, please do not change it to ToNewCString().
This will break other stuff.
Please read http://www.ietf.org/rfc/rfc2718.txt:
the URI should be UTF-8.
Also, please read
ftp://ftp.isi.edu/in-notes/rfc2396.txt
The right thing to do is to %-encode the text in the upper level, while we
still know the encoding information, and pass it down in % form. ToNewCString()
will break the I-DNS work, Internet Keywords, and so on.
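
A minimal sketch of that "%-encode in the upper level" idea (a hypothetical
helper, not the actual Mozilla API): while the string is still Unicode, encode
each code point as UTF-8 and escape the resulting bytes, so the lower layers
only ever see escaped ASCII:

#include <cstdio>
#include <string>

// Hypothetical upper-level helper: encode a Unicode (BMP-only) string as
// UTF-8 and %-escape the non-ASCII bytes. Demo only: ignores surrogate
// pairs and does not escape reserved ASCII characters.
static std::string EscapeAsUTF8(const std::u16string& in) {
    std::string out;
    char buf[16];
    for (char16_t ch : in) {
        unsigned cp = ch;                       // BMP code point
        if (cp < 0x80) {
            out += static_cast<char>(cp);       // ASCII passes through
        } else if (cp < 0x800) {                // two-byte UTF-8 sequence
            std::snprintf(buf, sizeof buf, "%%%02X%%%02X",
                          0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
            out += buf;
        } else {                                // three-byte UTF-8 sequence
            std::snprintf(buf, sizeof buf, "%%%02X%%%02X%%%02X",
                          0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F));
            out += buf;
        }
    }
    return out;
}

// EscapeAsUTF8(u"Français") == "Fran%C3%A7ais"
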
Frank: assuming that we do not call ToNewCString(), as you're suggesting,
how/where do we fix this current issue?
Domain names have to be UTF-8. For the path part, it can try a document charset
after failing with UTF-8.
There is a similar issue for HTML anchors; see the HTML spec:
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

What http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 says is
fine for an href inside a document. But the note does not address the issue of
what should be done when there's no document charset specified or, for example,
when we enter via the URL bar (which is what this bug is about).

[
Also, adhering to the above recommendation can get us into a lot of trouble if
we're not careful.
For example: imagine the user requests a URL (with non-ASCII chars) which does
not really exist on the server. We request the URL first with UTF-8 encoding
and get back an HTTP 404 (doc not found) since the doc does not really exist.
Now we make another request, this time with the URL encoded in the document
charset, which happens to be UTF-8. We get a 404 back again, since the second
request was essentially the same as the first. At this stage we need to keep
track of the request count so as not to get into recursive requests for a
non-existent URL.
]
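
A sketch of the guard described in the aside above (hypothetical names, not
actual Necko code): remember the escaped forms already sent and skip any retry
that would be byte-identical, so a missing document costs a bounded number of
requests:

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical: escape `path` in the given charset (e.g. "UTF-8").
std::string EncodePath(const std::u16string& path, const std::string& charset);
// Hypothetical: issue the request; returns false on HTTP 404.
bool SendRequest(const std::string& escapedPath);

bool FetchWithFallback(const std::u16string& path,
                       const std::vector<std::string>& charsets) {
    std::vector<std::string> sent;  // escaped forms already tried
    for (const std::string& charset : charsets) {
        std::string escaped = EncodePath(path, charset);
        if (std::find(sent.begin(), sent.end(), escaped) != sent.end())
            continue;               // identical to an earlier request; skip
        sent.push_back(escaped);
        if (SendRequest(escaped))
            return true;            // found it
    }
    return false;                   // report the 404 exactly once
}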

I'm also not sure Communicator 4.x implements what's specified in the note
above, and it seems to be working fine with URLs with non-ASCII chars. How is
Communicator handling this issue? Just curious... Thanks



I just mentioned the HTML spec because it describes the fallback method of
trying UTF-8, then a document charset.

I think 4.x converts the URL to the OS default charset. This works for the
limited cases where the server's charset is the same as the client OS charset.
If there is a way to know the server's charset, then that should be used.
If it's not possible to get a server's charset or a document charset, then the
OS charset could be used as a fallback after UTF-8.
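
A sketch of that fallback order (hypothetical helper; where these values come
from is not shown): UTF-8 first, then the most specific charset we actually
know about, with empty strings meaning "unknown":

#include <string>
#include <vector>

// Hypothetical: build the ordered list of charsets to try for the URL path.
std::vector<std::string> PathCharsetCandidates(const std::string& serverCharset,
                                               const std::string& documentCharset,
                                               const std::string& osCharset) {
    std::vector<std::string> order;
    order.push_back("UTF-8");               // per the new URL / IDNS drafts
    if (!serverCharset.empty())
        order.push_back(serverCharset);     // best: the server told us
    else if (!documentCharset.empty())
        order.push_back(documentCharset);   // next: the referring document
    else if (!osCharset.empty())
        order.push_back(osCharset);         // last resort: client OS default
    return order;
}

Combined with the duplicate check sketched earlier, a document charset that is
itself UTF-8 would then produce only one request.
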
We've pretty much got to ignore the specs because we can't code to them anyway.
Backwards compat and real-world scenarios are our masters.

cc'ing darin because we were recently talking about eradicating unicode from
necko altogether :-), and this sort of falls in those lines.

I'm going to try to break this down like so many have before:

1. From the UI standpoint, we want to present URLs in their native character
format. If that means unicode at the UI level, fine, but let's keep that at the
level *above* necko. If we can't do this, someone please explain why we can't.
2. From necko's standpoint, all it should be dealing w/ is raw escaped char*'s.
If I hand necko a url w/ a space in it, it needs to escape (not UTF-8 encode)
that space, and send the escaped request out onto the network (sketched below).

So, can't we remove all the encoding from necko and ensure that all of the
encoding happens *above* necko? Necko can't do anything w/ it anyway, if my
*real-world* understanding is correct. This would mean necko util callsites
would need to encode/decode on their own.
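
A sketch of what point 2 above could look like in isolation (demo code, not the
actual necko escaper): the network layer escapes the raw bytes it is handed and
never reinterprets them in another charset:

#include <string>

// Hypothetical necko-level escaper: operates on raw bytes only. Space,
// controls, and non-ASCII bytes are %-escaped as-is; no charset conversion
// happens at this layer. (A real escaper would also handle '%' itself and
// the other reserved characters.)
std::string EscapeRawBytes(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : in) {
        if (b <= 0x20 || b >= 0x7F) {
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
        } else {
            out += static_cast<char>(b);
        }
    }
    return out;
}

// EscapeRawBytes("a b")                    == "a%20b"
// EscapeRawBytes("World/Fran\xE7" "ais/") == "World/Fran%E7ais/" -- the 0xE7
// byte is escaped as given, never expanded to %C3%A7.
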
The *old* URL spec said everything has to be ASCII or %-escaped, so
http://www.dmoz.org/World/Français/ is illegal in terms of the *old* URL spec.
The *new* URL guideline proposes to use UTF-8 in URLs, so the ISO Latin 1
http://www.dmoz.org/World/Français/ is ALSO illegal in terms of the *new* URL
draft.
In the meantime, we could also see Shift_JIS, EUC-JP, Russian, or Big5 cases of
URL paths similar to http://www.dmoz.org/World/Français/ in the real world. The
data may or may not be encoded in ISO-8859-1.

I think there is no real solution if the user types into the URL bar.
Basically, there is no way we can know what charset it is. It could be
ISO-8859-1 on the server side, it could be ISO-8859-2 on the server side, it
could be anything on the server side. We have no context for it. The next best
thing we can do is to assume it is for the new URL / IDNS spec, which is UTF-8
in the URL. That is why we convert to UTF-8.

The other reasons we convert to UTF-8: 1) IDNS, 2) Internet Keywords, 3)
whatever is related, and 4) ODP accepts UTF-8 URLs; UTF-8 is the only choice
with which we won't lose data.

I think the right thing to do is to %-escape as much as we can in the upper
level (as we do now). And for the edge cases where we cannot, converting to
UTF-8 will at least ensure forward compatibility.
Also, I think I did something special for file:/// URLs: if it is a file URL,
we convert to the filesystem charset and escape it.
For HTTP URLs, I think there is no real solution.
Also, be aware that both LDAP and IMAP URLs are in UTF-8, as defined in
ftp://ftp.isi.edu/in-notes/rfc2253.txt
ftp://ftp.isi.edu/in-notes/rfc2255.txt
ftp://ftp.isi.edu/in-notes/rfc2192.txt

> we've pretty much got to ignore the specs because we can't code to them
> anyway. backwards compat and real-world scenarios are our masters.
I agree with you for the ftp:// and http:// cases. But for IMAP and LDAP URLs,
we have done UTF-8 for a while already. And you have to allow UTF-8 in
nsNetUtil, since nsNetUtil is not only for the http/ftp/file protocols.

We want to make sure nsNetUtil works for IMAP/LDAP also, if it contains UTF-8.
Changing milestone to 0.9.2 since there's going to be some reworking of the
Necko layer wrt handling wide char strings.

This bug depends on those changes and will be revisited when they're in place.
Status: NEW → ASSIGNED
Target Milestone: mozilla0.9.1 → mozilla0.9.2
->0.9.3
Target Milestone: mozilla0.9.2 → ---
Target Milestone: --- → mozilla0.9.3
->0.9.4
Target Milestone: mozilla0.9.3 → mozilla0.9.4
Target Milestone: mozilla0.9.4 → mozilla1.0
Blocks: 104166
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 
(you can query for this string to delete spam or retrieve the list of bugs I've 
moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
Keywords: mozilla1.3, patch, review
Summary: Mozilla incorrectly rewrites URLS containing ISO characters → Mozilla incorrectly rewrites URLs containing ISO characters
I can reproduce this in Linux with FF 20040406 by copying the URL onto the
clipboard and pasting it into the URL bar.
Keywords: mozilla1.3top100
I have no problem with this URL in XP on FF 20040419. Just copy/paste to URL-bar
and I get http://www.dmoz.org/World/Fran%E7ais/.
Assignee: chak → nobody
Status: ASSIGNED → NEW
QA Contact: adamlock → docshell
* dmoz's encoding is UTF-8 these days.
* This is probably WONTFIX, since the Awesomebar deeply depends on UTF-8 URIs.

->wfm
Severity: major → normal
Status: NEW → RESOLVED
Closed: 12 years ago
Keywords: top100
Resolution: --- → WORKSFORME
Target Milestone: mozilla1.0.1 → ---