Non-ASCII url cannot be found

VERIFIED FIXED in M17

Status

Product: Core
Component: Layout
Priority: P2
Severity: blocker
Status: VERIFIED FIXED
Opened: 19 years ago
Last modified: 17 years ago

People

(Reporter: Teruko Kobayashi, Assigned: Chris Waterson)

Tracking

Version: Trunk
Points: ---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [nsbeta2+])

Attachments

(4 attachments)

(Reporter)

Description

19 years ago
The URL above includes a Japanese directory name.

When you go to the page, the DOS console says
"http://babel/tests/browser/http-test/%3f%3f%3f loaded sucessfully",
but Apprunner says "Not Found".

Steps to reproduce:
1. Go to http://babel/tests/browser/http-test
2. Click on "??"

"Not Found" shows.

Tested with the 7-21-16-NECKO and 7-22-08-M9 Win32 builds.
(Reporter)

Updated

19 years ago
Priority: P3 → P2
Target Milestone: M9

Comment 1

19 years ago
The "loaded successfully" message is currently tied to document load status;
it really has no clue what the HTTP status was. I am not sure what the other
problem you are mentioning is... Is it expected to show the document? (I
verified that 4.6 doesn't.)

There is a whole section on URL encoding that is missing right now from the
picture. But I want to confirm that this bug really is about encoding URLs.

Comment 2

19 years ago
As a maintainer of the Babel server cited above, I would like to
offer some additional facts of relevance. (Sorry, teruko, I forgot
to tell you about the Babel server's limitation described below.)

1. The directory name which ends the above URL is actually Japanese
   and contains three 2-byte characters. Unfortunately, the server is
   running on a US Windows NT4 and thus mangles multi-byte directory
   and file names. This is why you see 3 ? marks. Thus, even if you
   properly encode the URL, you will never find the directory.
2. Let me offer a sample on a Unix server which can handle 8-bit
   file names:

   http://kaze:8000/url/

3. In that directory, you will find 3 sub-directories. The 3rd from
   the top (if viewed with 4.6/4.7 or 5.0 under the Japanese (EUC-JP)
   encoding) will show in Japanese.
4. The first 2 directories are named using the escaped URLs,
   and the 2nd one actually represents the escaped URL version of
   the 3rd, Japanese one. If escaping is working correctly, you should
   see the escaped URL matching that of the 2nd one when the cursor
   is perched on the 3rd directory. (Compare the FTP URL below.)

Some issues:

A. Click on the 3rd directory and it fails. We seem to be
   escaping special characters like "%" but not 8-bit characters.
   (4.6/4.7 and 5.0 w/necko are pretty much the same here.)
   We should fix this in 5.0.

B. 4.6/4.7 actually escapes 8-bit path names in FTP URLs.
   For example, try the same 3-directory listing with the ftp
   protocol at the URL below. You can get inside the 3rd directory
   under the Japanese (EUC-JP) encoding:

   ftp://kaze/pub/

   And you can also see the escaped URL on the status bar when
   the cursor is on the 3rd directory.
   5.0 w/necko does not escape 8-bit names and cannot get inside this
   directory. We should fix this in 5.0.

Question:

Are you planning on supporting escaping to both the native encoding
and UTF-8 if a server indicates it can accept Unicode? I believe
there is a recent IETF draft on url-i18n which discusses
the UTF-8 implementation.
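
For reference, a minimal, self-contained C++ sketch (not Mozilla code) of the
escaping being asked for here, assuming the Japanese directory name is the
three kanji of "nihongo", whose EUC-JP bytes are C6 FC CB DC B8 EC:

    #include <cstdio>
    #include <string>

    // Percent-escape every byte outside the unreserved ASCII set, turning an
    // 8-bit (here EUC-JP) path component into a pure-ASCII URL component.
    std::string EscapePathComponent(const std::string& in) {
        static const char kHex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : in) {
            bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                        (c >= '0' && c <= '9') ||
                        c == '-' || c == '.' || c == '_';
            if (safe) {
                out += static_cast<char>(c);
            } else {
                out += '%';
                out += kHex[c >> 4];
                out += kHex[c & 0x0F];
            }
        }
        return out;
    }

    int main() {
        // EUC-JP bytes of the three-kanji directory name.
        const std::string eucjp = "\xC6\xFC\xCB\xDC\xB8\xEC";
        std::printf("%s\n", EscapePathComponent(eucjp).c_str());
        // Prints: %C6%FC%CB%DC%B8%EC
        return 0;
    }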

Comment 3

19 years ago
*** Bug 7399 has been marked as a duplicate of this bug. ***

Comment 4

19 years ago
*** Bug 7847 has been marked as a duplicate of this bug. ***

Comment 5

19 years ago
*** Bug 8333 has been marked as a duplicate of this bug. ***

Comment 6

19 years ago
*** Bug 8337 has been marked as a duplicate of this bug. ***

Comment 7

19 years ago
*** Bug 10429 has been marked as a duplicate of this bug. ***

Updated

19 years ago
Severity: major → blocker

Comment 8

19 years ago
Bug 10429 was going to be marked a blocker; since it was duped to this bug,
this one becomes the blocker.

Updated

19 years ago
Status: NEW → RESOLVED
Last Resolved: 19 years ago
Resolution: --- → FIXED

Comment 9

19 years ago
This should work OK now... Please verify.

Updated

19 years ago
Whiteboard: waiting for new build with fix to verify

Comment 10

19 years ago
Can you international types verify this? I have verified that you can have
spaces in your path, which was bug 10429.

Updated

19 years ago
Whiteboard: waiting for new build with fix to verify → waiting for reporter to verify

Updated

19 years ago
QA Contact: paulmac → teruko

Comment 11

19 years ago
Yes. This should go to teruko now.
(Reporter)

Updated

19 years ago
Status: RESOLVED → REOPENED
(Reporter)

Comment 12

19 years ago
I tested this in the 7-31-09 and 8-02-09 builds.
I used http://kaze/url/ to test this.
When I went to http://kaze/url/ and changed the charset to EUC-JP, the 3rd
directory name, which is in Japanese, was displayed correctly.

Then, when I clicked on the Japanese directory name link,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.

This needs to be reopened.

Updated

19 years ago
Resolution: FIXED → ---

Comment 13

19 years ago
Clearing Fixed resolution due to reopen.

Updated

19 years ago
Status: REOPENED → ASSIGNED
Target Milestone: M9 → M10

Comment 14

19 years ago
This has a lot to do with the fact that nsIURI currently works with char* and
not PRUnichar*. I would like to verify this again once we move the URL
accessors to all-PRUnichar (sometime in M10). Marking it such.

Comment 15

19 years ago
*** Bug 12473 has been marked as a duplicate of this bug. ***

Updated

19 years ago
Blocks: 13449

Updated

19 years ago
Target Milestone: M10 → M12

Comment 16

19 years ago
Correcting target milestone.

Updated

18 years ago
Summary: Non-ASCII url cannot be found → [dogfood] Non-ASCII url cannot be found

Comment 17

18 years ago
PDT would like to know: did the 4.x product allow non-ASCII in URLs? Does a
test case rely on allowing 8-bit?

Comment 18

18 years ago
In 4.x, we are able to resolve FTP URL links containing 8-bit characters, but
not HTTP URL links. Please note that this is not the same as typing 8-bit
characters into the Location window; we didn't support that in 4.x.

Comment 19

18 years ago
Kat's comment is true for Japanese, but not for Latin-1. In 4.x you can type
http://babel/5x_tests/misc/montse/αϊινσϊ.jpg and it will show the file and the
URL will be correct. We need this working for Latin-1.

Comment 20

18 years ago
Sorry about that. We didn't escape multi-byte characters in 4.x, but did so
for single-byte 8-bit characters in HTTP.

Comment 21

18 years ago
It turns out that with the current Mozilla build (10/20/99 Win32), a Latin-1
URL is also resolved to an existing page. It seems that both in 4.x and in
Mozilla we do nothing to a URL which contains single-byte 8-bit characters,
i.e. we just pass them through, and it works. I think we should escape these
single-byte 8-bit characters, however.

It is multi-byte characters which are not supported at this point in HTTP or
FTP URLs in Mozilla (or in 4.x).

Updated

18 years ago
Summary: [dogfood] Non-ASCII url cannot be found → Non-ASCII url cannot be found

Comment 22

18 years ago
And now Kat is right; it's the typing (another bug) which is not working, but if
the file is selected it will show up. Removing dogfood.

Comment 23

18 years ago
Moving Assignee from gagan to warren since he is away.

Updated

18 years ago
Assignee: warren → momoi

Comment 24

18 years ago
What's the deal with this bug now? The link in the URL field above is stale
(in 4.x too): http://babel/tests/browser/http-test/%3f%3f%3f

This one works fine: http://babel/5x_tests/misc/montse/αϊινσϊ.jpg
although I don't think that should be a valid URL. Those characters should
have to be escaped for it to work, shouldn't they (or is that not how things
are in practice)?

Reassigning back to i18n.

Comment 25

18 years ago
What needs to happen on this bug:

1. Use the test case above. (Changed from the one on babel, which
   cannot process multi-byte URLs since it is a US NT4 machine.)
2. There are 2 URL links on this test page. The last one
   is in Japanese. When you view this page under the Japanese
   (EUC-JP) encoding, you should see the status bar at the bottom
   display the escaped URL. Currently it shows 3 dots, indicating
   that Mozilla is not able to escape multi-byte directory names.
   When the 3rd (JPN) link is properly escaped, the escaped part should
   look like the 2nd URL, which shows the escaped EUC-JP name used in the
   3rd link. The 1st escaped example shows the escaped sequence of the
   same 3 characters in Shift_JIS.

3. Contrast this with the ftp protocol under 4.7:

   ftp://kaze/pub

   This page contains the same 3 directory names. Use 4.7 and move
   the cursor over the 3rd link. You see that 4.7 escapes it to
   a string identical to the 2nd URL.

Assigning this to ftang for an assessment as to what needs to be
done. When we isolate the problem, please send it back to warren.

Updated

18 years ago
Assignee: momoi → ftang

Comment 26

18 years ago
I've determined that the URL parsing is pretty screwed up in this regard. The
way I think it should work is:

- SetSpec and SetRelativePath should take escaped strings (like in the
examples), and in the parsing process convert the escapes to their unescaped
representation (held inside the nsStdURL object).

- We should probably assert if we see a non-ASCII character passed to
SetSpec/SetRelativePath. The caller should have converted to UTF-8 before
calling us (I think this is being done by the webshell).

- GetSpec should reconstruct and re-escape the URL string before returning it.
Note that this is tricky, because if you have a slash in a filename (could
happen) it should be careful to turn this into %2F rather than make it look
like a directory delimiter.

- The nsIURI accessors should return the unescaped representation of things.
That way if I say SetSpec("file:/c|/Program%20Files") and then call GetPath, I
should get back "c|/Program Files".

The FTP protocol must be doing escaping by hand. This should be handled by
nsStdURL.

Cc'ing valeski & andreas.
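
To make the GetSpec point concrete, here is a standalone C++ sketch
(hypothetical helper names, not the actual nsStdURL code) of reconstructing a
path from unescaped segments, where a literal '/' inside a segment is forced
to %2F so it cannot be mistaken for a directory delimiter:

    #include <string>
    #include <vector>

    static std::string EscapeSegment(const std::string& seg) {
        static const char kHex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : seg) {
            // '/' is NOT safe here: inside a segment it is data, and must
            // become %2F, per the comment above.
            bool safe = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                        (c >= '0' && c <= '9') ||
                        c == '-' || c == '.' || c == '_';
            if (safe) {
                out += static_cast<char>(c);
            } else {
                out += '%';
                out += kHex[c >> 4];
                out += kHex[c & 0x0F];
            }
        }
        return out;
    }

    // GetSpec-style reconstruction: the URL object holds each path segment
    // unescaped; re-escape each one and join with real '/' delimiters.
    std::string ReconstructPath(const std::vector<std::string>& segments) {
        std::string out;
        for (const std::string& seg : segments) {
            out += '/';
            out += EscapeSegment(seg);
        }
        return out;
    }

    // e.g. segments {"dir", "a/b.txt"} -> "/dir/a%2Fb.txt"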

Comment 27

18 years ago
My two cents on this:

Yes, URL parsing is pretty much screwed up regarding escaping and multibyte
chars.

- There was a task to convert the URL accessors to PRUnichar. See bug 13453. It
is now marked invalid.

- nsStdURL does no escaping currently.

Why not store the URL as we get it (escaped or not) from whoever is calling
the nsStdURL functions? Who cares? Just note it on the URL, because one should
never escape an already-escaped string or unescape an already-unescaped
string. And it is a problem to reliably determine whether a URL is already
escaped or not; I think we need a member variable for that.

The constructors or SetSpec or SetPath (or the others) should have an additional
parameter which tells if the given string is already escaped or not which is
then stored in the member-variable.

The get-accessors (like GetPath or GetSpec) should have an additional
parameter which says whether we want the spec/path escaped or unescaped, and
let the accessors do that on the fly, looking at the escape member variable
and doing the appropriate thing (copy and convert, or just copy).

The webshell would want to see the unescaped version, to present the user his
native view of the URLs; internally (e.g. when sending the request to the
server) we would use the escaped version of the URL. I don't think it's true
that we always want to see the unescaped version.

Updated

18 years ago
Assignee: ftang → bobj

Comment 28

18 years ago
1. Let's not mix URLs for different protocols in the same bug.
2. "FTP protocol data to FTP URL" conversion and "FTP URL to FTP protocol
data" generation are done on the client side, not the server side, so the
client code should do the right thing with them: always URL-escape the FTP
data before concatenating it into an FTP URL, and unescape it from the URL
back into FTP protocol data.
3. HTTP URL generation is done on the server side, so the client code has no
control over bad URL generation (meaning a URL that contains bytes > 0x80).
If an HTTP server does send the client a URL which contains bytes > 0x80, the
client code should URL-escape it. But when the client accesses that URL, it
should not unescape it.

Reassigning to bobj to find the right owner.

Updated

18 years ago
Status: NEW → ASSIGNED
Target Milestone: M12 → M13

Comment 29

18 years ago
Bulk move of all Necko (to-be-deleted component) bugs to the new Networking
component.

Updated

18 years ago
Assignee: bobj → erik
Status: ASSIGNED → NEW
Target Milestone: M13 → M14

Comment 30

18 years ago
Reassigned to erik for M14

Updated

18 years ago
Status: NEW → ASSIGNED

Updated

18 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED

Comment 31

18 years ago
Fixed by Andreas' changes that I just checked in.

Updated

18 years ago
Status: RESOLVED → REOPENED

Comment 32

18 years ago
Reopened because of backout.
(Reporter)

Updated

18 years ago
Resolution: FIXED → ---

Updated

18 years ago
Status: REOPENED → ASSIGNED

Updated

18 years ago
Blocks: 24854

Comment 33

18 years ago
Change platform and OS to ALL
OS: Windows 95 → All
Hardware: PC → All

Updated

18 years ago
Keywords: beta1

Comment 34

18 years ago
Putting on PDT+ radar for beta1.
Whiteboard: waiting for reporter to verify → [PDT+]waiting for reporter to verify

Comment 35

18 years ago
The changes Warren spoke of are in again. Can someone with access to this
server take a look at whether this is fixed?
(Reporter)

Comment 36

18 years ago
I tested this in the 2000020808 Win32 build.

The result was the same as before.

When I went to http://kaze/url/ and changed the charset to EUC-JP, the 3rd
directory name, which is in Japanese, was displayed correctly.

Then, when I clicked on the Japanese directory name link,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.

Comment 37

18 years ago
The Japanese directory under the http test does not work in Nav4 and MSIE, so
it's OK if it doesn't work in Mozilla.

The Japanese directory under the ftp test works in Nav4, MSIE and Mozilla, but
it is displayed wrong in Mozilla and MSIE, while it is displayed OK in Nav4.

So, the only thing in this bug report that needs to be fixed is the display
of FTP directory and file names.

Comment 38

18 years ago
I agree that if FTP folder & file names work OK, then we would at
least have parity with our own earlier version, and that would be
acceptable. ftang's comment on HTTP servers is well-taken.

Comment 39

18 years ago
One more thing. There are 2 files under the Japanese-named directory
(3rd from the top) on the ftp server. The contents of these 2 files should
display in Mozilla. Currently, I don't see the 2 files at all.
It looks like the 2nd and 3rd directories are mixed up right now
and thus sometimes (not always) show the name "testb.html", which
does not exist under the 3rd directory but under the 2nd one.

Comment 40

18 years ago
FTP and non-ASCII file names are relatively minor aspects on the Net today.
I believe we should remove the beta1 keyword and PDT+ approval.

Comment 41

18 years ago
Could someone please describe a little better how the URLs look now and how
they should look?

Comment 42

18 years ago
Removed beta1 and PDT+. Please re-enter beta1 if you would like the PDT team
to re-consider.
Keywords: beta1
Whiteboard: [PDT+]waiting for reporter to verify

Comment 43

18 years ago
Jud, please re-assign this to whoever owns the FTP and/or URL code. I don't
think this is a beta1-stopper.
Assignee: erik → valeski
Status: ASSIGNED → NEW
Target Milestone: M14 → M15

Comment 44

18 years ago
Moving to M16.
Target Milestone: M15 → M16

Updated

18 years ago
Whiteboard: [HELP WANTED]

Comment 45

18 years ago
I'm not following this. Can someone in i18n take this on?

(Reporter)

Comment 46

18 years ago
Nominating to beta2.
Keywords: nsbeta2

Comment 47

18 years ago
On US Win95 on 4.72 if I type http://kaze:8000/url into the location bar
and hit return
 (1a) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays: http://kaze:8000/url/“ú–{Œê/
 (2a) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/
     - clicking on that link DOES work
     - the URL in the location bar displays: http://kaze:8000/url/ÆüËܸì/

On US Win95 running the 2000042109 build and typing http://kaze:8000/url into
the location bar and hitting return

 (1b) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does not work and I get the Not Found error page
     - the URL in the location bar displays:
    http://kaze:8000/url/%C3%A6%C2%97%C2%A5%C3%A6%C2%9C%C2%AC%C3%A8%C2%AA%C2%9E/
 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link DOES work
     - the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/

Additionally, if I paste the resulting URL from (2a) into the location bar:
    http://kaze:8000/url/ÆüËܸì/

Seamonkey also gets the Not Found page and the results in the location bar:
    http://kaze:8000/url/%C3%86%C3%BC%C3%8B%C3%9C%C2%B8%C3%AC/

Teruko, Does 4.72 behave differently on Ja Windows?

Comment 48

18 years ago
whoops.  cut & paste error in previous comment!
The comment for case (2b) fails to find the page and should read:

 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/
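
For what it's worth, those %-sequences decode as a double conversion (my
reading; this is not stated elsewhere in the thread): the link text's
characters appear to be converted to UTF-8 and then each resulting byte
treated as a character and UTF-8-encoded again before escaping. For case (1b):

    %C3%A6 %C2%97 %C2%A5 ...  ->  U+00E6 U+0097 U+00A5 ...  (one UTF-8 decode)
                              ->  the byte values E6 97 A5 ...
                              ->  which is itself the UTF-8 encoding of the
                                  first kanji of the directory name

A single, correct EUC-JP escaping of the name would instead be
%C6%FC%CB%DC%B8%EC. (Compare case (2b), where those same EUC-JP bytes, misread
as Latin-1 characters, come back UTF-8-encoded once more.)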

Comment 49

18 years ago
The ftp problem should be split off into a separate bug.
As noted, the ftp links work, but are not displayed correctly (instead of
Japanese kanji characters you see Latin1 garbage).

Updated

18 years ago
Assignee: valeski → ftang
Target Milestone: M16 → M17

Comment 50

18 years ago
Reassigning this back to ftang. I think we need to change the code in Layout,
not in necko.

Comment 51

18 years ago
Similar problem to bug 30460. Patch available at http://warp/u/ftang/tmp/illurl.txt
Status: NEW → ASSIGNED

Comment 52

18 years ago
One thing I forgot to say is that the patch depends on bug 37395.
Per the ftang/waqar/troy meeting, we should move the URL-fixing code into the
content sink so we don't need to convert/escape every time. Also, we agreed
to reassign this bug to Layout.
Assignee: ftang → troy
Status: ASSIGNED → NEW
Component: Networking → Layout

Comment 53

18 years ago
*** Bug 38133 has been marked as a duplicate of this bug. ***

Comment 54

18 years ago
With Troy's departure, this is at risk for M17. PDT team, is this required
for beta2?
Assignee: troy → buster

Updated

18 years ago
Assignee: buster → waterson
Whiteboard: [HELP WANTED] → [nsbeta2+]

Comment 55

18 years ago
Putting on [nsbeta2+] radar for beta2 fix.   Sending over to waterson.
(Assignee)

Comment 56

18 years ago
It seems like the right thing to do in this case is to convert/escape/mangle
the URL in the anchor tag itself. This would also make sure that the
correct thing happens if someone changes the "href" attribute using L1
DOM APIs.

attinasi and I talked about keeping the resolved version of the URL in the 
anchor tag to deal with some style & performance issues: maybe this could just 
be an extension (or precursor) to that work?

Presumably we'd need to do this for other elements that had "href" properties, 
as well.

Comments?
Status: NEW → ASSIGNED

Comment 57

18 years ago
Some notes from Troy:
1) Frank's patch is expensive in terms of performance, especially because it 
includes dynamic allocation. We should be able to do much better.
2) We should be able to convert the URL to an immutable 8-bit ASCII string one 
time, probably at the time we process the attribute (or maybe lazily the first 
time we actually use the attribute.)  We would cache this converted immutable 
string and hand that to necko.
3) or, necko could just do the conversion on the fly.  but that brings us right 
back to performance problems.
(Assignee)

Comment 58

18 years ago
So I claim (2) is the right thing to do, and it should be done by the anchor
tag. We need to be able to handle L1 DOM updates too, and the tag itself (not
the content sink) is the only one that can do that.
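
A minimal sketch of option (2) as it might look on the anchor element
(hypothetical names, plain C++; the real change is in the attachments below):

    #include <string>

    class AnchorElement {
    public:
        // DOM setter: remember the raw Unicode href and invalidate the
        // cached ASCII form so it is recomputed lazily on next use.
        void SetHref(const std::u16string& value) {
            mHref = value;
            mCachedAscii.clear();
        }

        // Returns the converted, escaped, immutable 8-bit form; the
        // expensive conversion runs at most once per attribute change.
        const std::string& GetAsciiHref() {
            if (mCachedAscii.empty() && !mHref.empty())
                mCachedAscii = ConvertAndEscape(mHref);
            return mCachedAscii;
        }

    private:
        // Placeholder for charset conversion plus URL escaping.
        static std::string ConvertAndEscape(const std::u16string& href);

        std::u16string mHref;
        std::string mCachedAscii;
    };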

Comment 59

18 years ago
Can we canonicalize it when we ASCII-ize it, too? That would help in dealing
with bug 29611 (which reports that we spend too much time determining a
link's state due to conversions). I'm linking this bug to 29611 since it may
help.

Updated

18 years ago
Blocks: 29611
(Assignee)

Comment 60

18 years ago
Yeah, we should absolutely canonicalize it then. (See comments above...)
(Assignee)

Comment 61

18 years ago
Created attachment 9061 [details] [diff] [review]
changes to nsHTMLAnchorElement.cpp; compute & cache canonical URL in .href property
(Assignee)

Comment 62

18 years ago
Created attachment 9062 [details] [diff] [review]
changes to force canonical URI through nsILinkHandler::On[OverLink|LinkClick]()

Comment 63

18 years ago
Since ftang's 11/30/99 comment on the HTTP URL issue is on
target, I would only add an additional server example
to contrast how different servers handle this problem.

The above URL:

http://kaze:8000/url

points to a Netscape Enterprise server 2.01. No Netscape
Enterprise server (up to the current 4.1) supports 8-bit
path or file names, according to the server admin document.

Thus it sends the 8-bit URL portion without escaping.
How to deal with this has already been documented by ftang.

An Apache server, on the other hand, supports 8-bit names
and escapes them. See, for example, a nearly identical
example to the above on an Apache server:

http://mugen/url

There the Japanese name is escaped properly by the server --
do a View Source on the page -- and the directory can be easily
accessed.

On another issue: should we support UTF-8 URLs?
IE has a default option setting for this, which the user
can turn off. After discussing this issue with
Erik, I'm inclined to believe that UTF-8 URLs have problems
and do not need to be supported at this juncture.

Comment 64

18 years ago
I should add that on the Apache server, Comm 4.x, IE5,
and Netscape 6 all work OK with the Japanese path name.
(Assignee)

Updated

18 years ago
Depends on: 40461
(Assignee)

Comment 65

18 years ago
Ok, I talked to ftang yesterday; here's what he thinks the right thing to do
is, as I understand it.

Problem. An anchor tag's "href" attribute may contain non-ASCII characters,
but URLs may not. So how do we properly escape the non-ASCII characters so
that an ASCII string results?

Solution. Use the document's charset to return the href attribute to its
original 8-bit encoding. Then URL-escape (e.g., " " --> "%20") the
non-ASCII printables. Then call nsIURI::Resolve with the document's base URL.

I've implemented it, and sure enough, it seems to make this test case work. It
leaves a bad taste in warren's mouth, but I burned my tongue a while back and
can't taste a thing.

Do we need to do this for *every* relative URI reference? Zoiks! Anyway, I've
implemented a layout-specific version of NS_MakeAbsoluteURI() (maybe I should
call it NS_MakeAbsoluteURIWithHRef or something) that takes a charset ID and
does the heavy lifting. It's in a new file which I'll attach to the bug. I'll
also attach new diffs to nsHTMLAnchorElement.cpp.

Comments on this approach?
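
In outline, the approach looks like this (an interface sketch only, with
hypothetical helper names standing in for the unicode-encoder and escaping
services; the actual code is in the attachments below):

    #include <string>

    // Hypothetical stand-ins: the real code goes through nsIUnicodeEncoder
    // for step 1 and necko's escaping machinery for step 2.
    std::string EncodeToDocCharset(const std::u16string& href,
                                   const std::string& docCharset);
    std::string PercentEscapeUnsafe(const std::string& eightBit);
    std::string ResolveAgainstBase(const std::string& baseSpec,
                                   const std::string& relSpec);

    // The three steps from the comment above:
    // 1. return the href to its original 8-bit encoding,
    // 2. %-escape the non-ASCII (and otherwise unsafe) bytes,
    // 3. resolve the result against the document's base URL.
    std::string MakeAbsoluteHref(const std::u16string& href,
                                 const std::string& docCharset,
                                 const std::string& baseSpec) {
        std::string eightBit = EncodeToDocCharset(href, docCharset);
        std::string escaped  = PercentEscapeUnsafe(eightBit);
        return ResolveAgainstBase(baseSpec, escaped);
    }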
(Assignee)

Comment 66

18 years ago
Created attachment 9172 [details]
nsHTMLUtils.cpp; contains layout-specific NS_MakeAbsoluteURI() implementation
(Assignee)

Comment 67

18 years ago
Created attachment 9175 [details] [diff] [review]
cleaned up changes to nsHTMLAnchorElement.cpp

Comment 68

18 years ago
chris: this looks great. r=buster.

What about other URLs, like <img src=...>? Do those need to be treated
separately?

Comment 69

18 years ago
Looks good. Where do you want to put the "silly extra stuff"? If we put it in
necko, it will introduce a dependency on i18n (although maybe we have one
already). But maybe this is right anyway.

Any suggestions on how to specify the necko APIs (in nsNetUtil.h) so that it's
clear what you're supposed to be passing in? Maybe we just need a comment
saying that the nsString is supposed to already have all this stuff done to it
(charset-encoded, escaped). Or maybe we should eliminate it in favor of the
new thing you wrote.
(Assignee)

Comment 70

18 years ago
I am going to put the charset-specific version of NS_MakeAbsoluteURI() in 
layout; and I am going to change it to NS_MakeAbsoluteURIWithCharset() to avoid 
gratuitous name overloading.

Comment 71

18 years ago
I'd like to make it a method on nsIIOService provided we already depend on 
i18n. (I think we do for string bundles. - Gagan, Jud?)

Comment 72

18 years ago
is bug 40661 related?:
""files" datasource fails to open non-latin named directory"
(Assignee)

Comment 73

18 years ago
No, that bug is a dup of bug 28787.

Comment 74

18 years ago
I had a quick look at the new MakeAbsoluteURIWithCharset method, and it uses
Unicode conversion routines. However, Necko does not appear to use the Unicode
converters. Necko may use string bundles, but they are in a separate DLL
(strres.dll), while the Unicode converters are in uc*.dll. I don't know how
important it is to keep Necko free of uc*.dll dependencies, but this is what I
found after a quick look.

Comment 75

18 years ago
Erik: Thanks for the info. Is there any plan to bundle all (many) of the intl 
dlls into one as we did for necko (to improve startup time, and reduce 
clutter)?

Comment 76

18 years ago
We have discussed that, but I don't know of any concrete plans in that area.
Frank?

Comment 77

18 years ago
I took a look at MakeAbsoluteURIWithCharset, and it includes the following
code at the end:

    static const PRInt32 kEscapeEverything = nsIIOService::url_Forced - 1;

    nsXPIDLCString escaped;
    ioservice->Escape(spec.GetBuffer(), kEscapeEverything,
                      getter_Copies(escaped));

This usage of nsIIOService::url_Forced is certainly not what I wanted it to
do, and I don't believe it does anything useful this way. This method is used
to escape a specific part of a URL, not a whole URL. There are different
rules for every part.
(Assignee)

Comment 78

18 years ago
Could you suggest an alternative?

Comment 79

18 years ago
Do we really need to escape this stuff? What characters that could possibly
damage URL parsing can be expected from the charset conversion? Maybe you
could use the old nsEscape functions in xpcom/io as a replacement.
This issue is also relevant for simple XLink, XPath, etc. "src" attributes.

When I implemented XPath and simple XLink, I asked ftang how to do this but he 
could not convince me what is the right thing to do :) So what I do there is 
just grab the Unicode string, AssignWithConversion() to nsCAutoString and pass 
it to the URI-creating objects. This seems to work in basic cases, like spaces, 
but I do not think it is the correct way.

What bugs me the most is this: suppose a document is in UTF-16 and a URL 
contains an illegal char that does not fit in ASCII. ftang said to 1) get the 
doc charset and use the nsIIOService ConvertAndEscape() function, but to me this 
seems like it cannot work for UTF-16. Or does the convert thing automatically 
convert UTF-16 into UTF-8 (which fits into char*)? How would we then know what 
to send back to server?

I also seem to have trouble understanding how do we escape multibyte UTF-8 URLs. 
If a two-byte character becomes two escaped URL chars, will the system still 
work? Do we send the server back the bits it gave us?
(Assignee)

Comment 81

18 years ago
andreas: how is using the "old" nsEscape different from what I'm doing now?
Are there really different rules for escaping different parts of a URI? That
doesn't seem right.

heikki: nsIUnicodeEncoder takes a UCS-2 string as input and returns the
result in an 8-bit string as output. Presumably it does the right thing on
UTF-16 to round-trip the UTF-16 bytes.

Comment 82

18 years ago
The main reason for escaping is to hide certain special characters from the
URL parser that would mislead it, like a @ in a username or a / in a filename
or something similar. Depending on the position inside the URL, different
characters are special to the parser, and that is what the new escape
functions handle: I have to tell them which part of the URL I want to escape.
Simply giving them every possible mask will not work. The old nsEscape stuff
does not look at a specific part; it can be used to escape whole URLs, but it
may escape too much or too little.
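
A small standalone C++ sketch of what "per-part" escaping means (hypothetical
masks, loosely modeled on the flags being discussed; not the actual necko
rules):

    #include <string>

    // Each URL part has its own set of characters that must be hidden from
    // the parser; 8-bit and control bytes are escaped everywhere.
    enum UrlPart { url_Username, url_Path, url_Query };

    static bool MustEscape(unsigned char c, UrlPart part) {
        if (c < 0x20 || c > 0x7E || c == '%')
            return true;                    // unconditionally unsafe
        switch (part) {
            case url_Username:              // these would end the username
                return c == '@' || c == ':' || c == '/';
            case url_Path:                  // these would start the query or
                return c == '?' || c == '#';  // fragment too soon
            case url_Query:                 // only '#' is special here
                return c == '#';
        }
        return false;
    }

    std::string EscapePart(const std::string& in, UrlPart part) {
        static const char kHex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : in) {
            if (MustEscape(c, part)) {
                out += '%';
                out += kHex[c >> 4];
                out += kHex[c & 0x0F];
            } else {
                out += static_cast<char>(c);
            }
        }
        return out;
    }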

Comment 83

18 years ago
In the case of a UTF-16 document, it is not clear to me that the server is
expecting a URL-encoded (%XX) UTF-16 URL. In fact, some people working on these
issues in the standards bodies seem to be pushing UTF-8 as a convention in URLs.
See ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt. However,
blindly converting *every* part of a URL to UTF-8 has bad consequences in
today's Web, as Microsoft discovered. They do not convert the "query" part (the
part after the question mark) to UTF-8. Also, they have a preference for the
part before the question mark. In most versions, they convert that part to UTF-8
but in the Korean and Taiwanese versions they found that they had to set the
default pref to "OFF" (i.e. no conversion to UTF-8), presumably because people
in those countries were using servers that expect something other than UTF-8
in those parts of the URLs.

For now, converting from PRUnichar to the doc's charset and then %-encoding is
the best thing to do, in my opinion. We should continue to follow the emerging
specs in this area, and possibly modify our implementation accordingly. UTF-16
documents are currently still quite rare, I think, but if we are really
concerned about this, my suggestion is to make a special case for the UTF-16
doc charset and convert to UTF-8 (instead of UTF-16).
(Assignee)

Comment 84

18 years ago
fix checked in.
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED

Comment 85

18 years ago
Not sure how to verify this problem. Could the reporter please check this in
the latest build?
(Reporter)

Comment 86

18 years ago
I verified this in the 2000-06-02-08 Win32, Mac, and Linux builds.
Status: RESOLVED → VERIFIED