Closed Bug 10373 Opened 21 years ago Closed 20 years ago

Non-ASCII url cannot be found


(Core :: Layout, defect, P2, blocker)






(Reporter: teruko, Assigned: waterson)




(Whiteboard: [nsbeta2+])


(4 files)

The URL above includes a Japanese directory name.

When you go to the page, the DOS console says
"http://babel/tests/browser/http-test/%3f%3f%3f loaded sucessfully",
but Apprunner says "Not Found".

Steps to reproduce
1. Go to http://babel/tests/browser/http-test
2. Click on "??"

"Not Found" shows.

Tested 7-21-16-NECKO Win32 build and 7-22-08-M9 Win32 build.
Priority: P3 → P2
Target Milestone: M9
The "loaded successfully" message is currently tied to document load status; it really has
no clue as to what the HTTP status was. I am not sure what the other problem you
are mentioning is... Is it expected to show the document (I verified in 4.6 it

There is a whole section on URL encoding that is missing right now from the
picture. But I want to confirm that the bug really is about encoding URLs.
As a maintainer of the Babel server cited above, I would like to
offer some additional facts of relevance. (Sorry, teruko, I forgot
to tell you about the Babel server's limitation described below.)

1. The directory name which ends the above URL is actually in Japanese
   and contains three 2-byte characters. Unfortunately the server is running
   on a US Windows NT4 and thus mangles the multi-byte directory and
   file names. This is why you see 3 ? marks. Thus, even if you properly
   encode the URL, you will never find the directory.
2. Let me offer a sample on a Unix server which can handle 8-bit
   file names.


3. In the directory, you will find 3 sub-directories. The 3rd from top
   (if viewed with 4.6/4.7 or 5.0 under Japanese (EUC-JP) encoding)
   will show in Japanese.
4. The first 2 directories are named using the escaped URLs
   and the 2nd one actually represents the escaped URL version of
   the 3rd Japanese one. If escaping is working correctly, you should
   see the escaped URL matching that of the 2nd one when the cursor
   is perched on the 3rd directory. (Compare FTP URL below.)

Some issues:

A. Click on the 3rd directory and it fails. We seem to be
   escaping special characters like "%" but not 8-bit characters.
   (4.6/4.7 and 5.0 w/necko are pretty much the same here.)
   We should fix this in 5.0

B. 4.6/4.7 actually escapes 8-bit path names in FTP url.
   For example, try the same 3 directory listing with the ftp
   protocol at the URL below. You can get inside the 3rd directory
   under Japanese (EUC-JP) encoding:


   And you can also see the escaped URL on the status bar when
   the cursor is on the 3rd directory.
   5.0 w/necko does not escape 8-bit names and cannot get inside this
   directory. We should fix this in 5.0
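
A minimal C++ sketch of what issues A and B ask for: escaping 8-bit path bytes in addition to special characters like "%". The helper name EscapeEightBit is hypothetical and is not a Mozilla API; the byte sequence in the usage note is an assumed EUC-JP directory name used only for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Mozilla API): percent-escape control characters,
// space, '%', DEL, and every 8-bit byte; leave other printable ASCII as-is.
std::string EscapeEightBit(const std::string& raw) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : raw) {
        const bool needsEscape = c <= 0x20 || c >= 0x7F || c == '%';
        if (needsEscape) {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With this sketch, EscapeEightBit("\xC6\xFC\xCB\xDC\xB8\xEC/") gives "%C6%FC%CB%DC%B8%EC/", which is the kind of string one would expect to see on the status bar for the 3rd directory.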


Are you planning on supporting escaping to both native encoding
and UTF-8 if a server indicates it can accept Unicode? I believe
there is a recent IETF draft on url-i18n which discusses
the UTF-8 implementation.
*** Bug 7399 has been marked as a duplicate of this bug. ***
*** Bug 7847 has been marked as a duplicate of this bug. ***
*** Bug 8333 has been marked as a duplicate of this bug. ***
*** Bug 8337 has been marked as a duplicate of this bug. ***
*** Bug 10429 has been marked as a duplicate of this bug. ***
Severity: major → blocker
10429 was going to be marked a blocker; since it is a dup of this bug, marking this one a blocker.
Closed: 21 years ago
Resolution: --- → FIXED
This should work OK now... Please verify.
Whiteboard: waiting for new build with fix to verify
Can you international types verify this? I have verified that you can have
spaces in your path, which was bug 10429
Whiteboard: waiting for new build with fix to verify → waiting for reporter to verify
QA Contact: paulmac → teruko
Yes. This should go to teruko now.
I tested this in 7-31-09 and 8-02-09 build.
I used http://kaze/url/ to test this.
When I went to http://kaze/url/ and changed Charset to euc-jp, the 3rd
directory name in Japanese is displayed correctly.

Then, when I click on the link in Japanese directory name,
the location bar displayed http://kaze/url/.../  and "Not Found" showed up.

This needs to be reopened.
Resolution: FIXED → ---
Clearing Fixed resolution due to reopen.
Target Milestone: M9 → M10
This has a lot to do with the fact that nsIURI currently works with char* and
not PRUnichar*. I would like to verify this again once we move the URL accessors
to all PRUnichar (sometime for M10). Marking it as such.
*** Bug 12473 has been marked as a duplicate of this bug. ***
Blocks: 13449
Target Milestone: M10 → M12
Correcting target milestone.
Summary: Non-ASCII url cannot be found → [dogfood] Non-ASCII url cannot be found
PDT would like to know if the 4.x product allows non-ASCII in URLs. Does a test
case rely on allowing 8-bit?
In 4.x, we are able to resolve FTP URL links containing 8-bit characters but not HTTP URL links.
Please note that this is not the same as inputting 8-bit characters into the Location window.
We didn't support that in 4.x.
Kat's comment is true for Japanese, but not for Latin-1. In 4.x you can type
http://babel/5x_tests/misc/montse/αϊινσϊ.jpg and it will show the file and the
URL will be correct. We need this working for Latin-1
Sorry about that. We didn't escape multi-byte characters in 4.x but did so for single-byte 8-bit
characters in HTTP.
It turns out that with the current Mozilla build (10/20/99 Win32), a Latin-1 URL is resolved to an existing page also.
It seems that both in 4.x and Mozilla, we are not doing anything to a URL which contains single-byte
8-bit characters, i.e. just passing them through, and it works. I think we should escape these single-byte
8-bit characters, however.

It is multi-byte characters which are not supported at this point in HTTP or FTP URLs in Mozilla (or in 4.x).
Summary: [dogfood] Non-ASCII url cannot be found → Non-ASCII url cannot be found
And now Kat is right; it's the typing (another bug) which is not working, but if
the file is selected it will show up. Removing dogfood.
Moving Assignee from gagan to warren since he is away.
Assignee: warren → momoi
What's the deal on this bug now? The link in the URL field above is stale (in
4.x too): http://babel/tests/browser/http-test/%3f%3f%3f

This one works fine: http://babel/5x_tests/misc/montse/αϊινσϊ.jpg
although I don't think that should be a valid URL. Those characters should be
escaped to work, shouldn't they (or is that not how things are in practice)?

Reassigning back to i18n.
What needs to happen on this bug is:

1. Use the test case above. (Changed from the one on babel which
   cannot process multi-byte URL as it is US NT4)
2. There are 2 URL links on this test page. The last one
   is in Japanese. When you view this page under the Japanese
   (EUC-JP) encoding, you should see the status bar at the bottom
   display the escaped URL. Currently it shows 3 dots indicating
   that Mozilla is not able to escape multi-byte directory names.
   When the 3rd (JPN) link is properly escaped, the escaped part should
   look like the 2nd URL which shows the escaped EUC-JP name used in the
   3rd link. The 1st escaped example shows the escaped sequence of the
   same 3 characters in Shift_JIS.

3. Contrast this with the ftp protocol under 4.7:


   This page contains the same 3 directory names. Use 4.7 and move
   the cursor over the 3rd link. You see that 4.7 escapes this to
   the identical string as you see for the 2nd URL.

Assigning this to ftang for an assessment as to what needs to be
done. When we isolate the problem, please send it back to warren.
Assignee: momoi → ftang
I've determined that the URL parsing is pretty screwed up in this regard. The
way I think it should work is:

- SetSpec and SetRelativePath should take escaped strings (like in the
examples), and in the parsing process convert the escapes to their unescaped
representation (held inside the nsStdURL object).

- We should probably assert if we see a non-ascii character passed to
SetSpec/SetRelativePath. The caller should have converted to UTF-8 before
calling us (I think this is being done by the webshell).

- GetSpec should reconstruct and re-escape the URL string before returning it.
Note that this is tricky, because if you have a slash in a filename (could
happen) it should be careful to turn this into %2F rather than make it look
like a directory delimiter.

- The nsIURI accessors should return the unescaped representation of things.
That way if I say SetSpec("file:/c|/Program%20Files") and then call GetPath, I
should get back "c|/Program Files".
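
The proposal above can be sketched in C++ roughly as follows. This is only an illustration under stated assumptions: ParseEscapedPath, BuildEscapedPath and the helpers are hypothetical names, not nsStdURL methods, and only the path portion is handled. The point it shows is that splitting on "/" has to happen before unescaping, and that re-escaping has to happen per segment so a slash inside a file name comes back as %2F.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// All names here are hypothetical; this is not the nsStdURL implementation.

int HexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Turn %XX sequences back into raw bytes; malformed sequences are kept verbatim.
std::string Unescape(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size() &&
            HexValue(s[i + 1]) >= 0 && HexValue(s[i + 2]) >= 0) {
            out += static_cast<char>(HexValue(s[i + 1]) * 16 + HexValue(s[i + 2]));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

// Escape one path segment. '/' is escaped too, so a slash inside a file name
// cannot be mistaken for a directory delimiter.
std::string EscapeSegment(const std::string& segment) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : segment) {
        if (c <= 0x20 || c >= 0x7F || c == '%' || c == '/') {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// "SetSpec" direction: split the escaped path on '/', then unescape each segment.
std::vector<std::string> ParseEscapedPath(const std::string& escaped) {
    std::vector<std::string> segments;
    std::string current;
    for (char c : escaped) {
        if (c == '/') {
            segments.push_back(Unescape(current));
            current.clear();
        } else {
            current += c;
        }
    }
    segments.push_back(Unescape(current));
    return segments;
}

// "GetSpec" direction: re-escape each stored segment before joining with '/'.
std::string BuildEscapedPath(const std::vector<std::string>& segments) {
    std::string out;
    for (std::size_t i = 0; i < segments.size(); ++i) {
        if (i > 0) out += '/';
        out += EscapeSegment(segments[i]);
    }
    return out;
}
```

In this sketch, parsing "c|/Program%20Files" stores the segments "c|" and "Program Files", and parsing "dir/a%2Fb" stores "dir" and "a/b"; rebuilding either one returns the original escaped string.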

The FTP protocol must be doing escaping by hand. This should be handled by

Cc'ing valeski & andreas.
My two cents on this:

Yes, URL parsing is pretty much screwed up regarding escaping and multibyte

- There was a task to convert the URL accessors to PRUnichar. See bug 13453. It
is now marked invalid.

- nsStdURL does no escaping currently.

Why not store the URL as we get it (escaped or not) from whoever is calling
nsStdURL-functions? Who cares? Just note it on the URL, because one should never
escape an already escaped string or unescape an already unescaped string. And it
is a problem to definitely find out if a URL is already escaped or not. I think
we need to have a member variable for that.

The constructors or SetSpec or SetPath (or the others) should have an additional
parameter which tells if the given string is already escaped or not which is
then stored in the member-variable.

The get-accessors (like GetPath or GetSpec) should have an additional parameter
which gives the information if we want the spec/path escaped or unescaped and
let that be done by the accessors on the fly looking at the
escape-member-variable and doing the appropriate thing (copy and convert or just

The webshell would want to see the unescaped version to present the user his
native view of the URLs, internally (like sending the request to the server) we
would use the escaped version of the URL. I don't think it's true that we always
want to see the unescaped version.
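
A rough C++ sketch of this alternative, assuming a caller-supplied flag and a getter parameter. FlaggedSpec, GetSpec, Escape and Unescape are hypothetical names for illustration only, and the escaping rule used (control characters, space, "%", and 8-bit bytes) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch of the "remember whether it is escaped" idea;
// none of these names are Mozilla APIs.
struct FlaggedSpec {
    std::string text;  // stored exactly as the caller supplied it
    bool isEscaped;    // supplied by the caller, never guessed from the text
};

int HexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Turn %XX sequences back into raw bytes; malformed sequences are kept verbatim.
std::string Unescape(const std::string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '%' && i + 2 < s.size() &&
            HexValue(s[i + 1]) >= 0 && HexValue(s[i + 2]) >= 0) {
            out += static_cast<char>(HexValue(s[i + 1]) * 16 + HexValue(s[i + 2]));
            i += 2;
        } else {
            out += s[i];
        }
    }
    return out;
}

// Percent-escape control characters, space, '%', DEL, and 8-bit bytes.
std::string Escape(const std::string& s) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (c <= 0x20 || c >= 0x7F || c == '%') {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// The accessor converts on the fly only when the stored form differs from the
// requested one, so a string is never escaped or unescaped twice.
std::string GetSpec(const FlaggedSpec& spec, bool wantEscaped) {
    if (spec.isEscaped == wantEscaped) return spec.text;  // just copy
    return wantEscaped ? Escape(spec.text) : Unescape(spec.text);
}
```

Here the webshell would call GetSpec(spec, false) for display, while the code sending the request would call GetSpec(spec, true).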
Assignee: ftang → bobj
1. Let's not mix URLs for different protocols in the same bug.
2. "FTP protocol data to FTP URL conversion" and "FTP URL to FTP protocol data generation" are done on the client side, not on
the server side. So the client code should do the right thing with it. The code should always URL-escape the FTP data before
concatenating it into an FTP URL, and unescape it when converting from the URL back into FTP protocol data.
3. HTTP URL generation is done on the server side. Therefore, the client code has no control over bad URL generation (which
means the URL contains bytes > 0x80). If an HTTP server does send the client a URL which contains bytes > 0x80, the client code
should URL-escape it. But when the client accesses that URL, it should not unescape it.
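
A small C++ sketch of rule 3, assuming "bad" bytes means any byte with the high bit set. EscapeHighBytesOnly is a hypothetical name, and the EUC-JP bytes in the usage note are illustrative. ASCII, including "%", is passed through untouched, so escapes the server already produced are not escaped a second time.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: escape only bytes with the high bit set. ASCII,
// including '%', passes through, so existing %XX escapes are not doubled.
std::string EscapeHighBytesOnly(const std::string& url) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : url) {
        if (c >= 0x80) {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

For example, "http://kaze/url/\xC6\xFC\xCB\xDC\xB8\xEC/" becomes "http://kaze/url/%C6%FC%CB%DC%B8%EC/", while "/a%20b/" is returned unchanged. The result is sent to the server as-is, with no unescaping step.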

Reassign to bobj to find the right owner.
Target Milestone: M12 → M13
Bulk move of all Necko (to be deleted component) bugs to new Networking

Assignee: bobj → erik
Target Milestone: M13 → M14
Reassigned to erik for M14
Closed: 21 years ago → 20 years ago
Resolution: --- → FIXED
Fixed by Andreas' changes that I just checked in.
reopened because of backout
Resolution: FIXED → ---
Blocks: 24854
Change platform and OS to ALL
OS: Windows 95 → All
Hardware: PC → All
Keywords: beta1
Putting on PDT+ radar for beta1.
Whiteboard: waiting for reporter to verify → [PDT+]waiting for reporter to verify
The changes Warren spoke of are in again. Can someone with access to this server
check whether this is fixed?
I tested this in 2000020808 Win32 build.

The result was the same as before.

When I went to http://kaze/url/ and changed Charset to euc-jp, the 3rd
directory name in Japanese is displayed correctly.

Then, when I click on the link in Japanese directory name,
the location bar displayed http://kaze/url/.../  and "Not Found" showed up.

The Japanese directory under the http test does not work in Nav4 and MSIE, so
it's OK if it doesn't work in Mozilla.

The Japanese directory under the ftp test works in Nav4, MSIE and Mozilla, but
it is displayed wrong in Mozilla and MSIE, while it is displayed OK in Nav4.

So, the only thing in this bug report that needs to be fixed is the display
of FTP directory and file names.
I agree that if FTP folder & file names work OK, then we would at
least have parity with our own earlier version, and that would be
acceptable. ftang's comment on HTTP servers is well-taken.
One more thing. There are 2 files under the Japanese name directory
(3rd from the top) on the ftp server. The content of these 2 files should 
display on Mozilla. Currently, I don't see the 2 files at all.
It looks like the 2nd and 3rd directories are mixed up right now
and thus sometimes (not always) show the name of "testb.html" which
does not exist under the 3rd directory but the 2nd one. 
FTP and non-ASCII file names are relatively minor aspects on the Net today.
I believe we should remove the beta1 keyword and PDT+ approval.
Could someone please describe a little bit better how the urls are looking and
how they should look.
Removed beta1 and PDT+. Please re-enter beta1 if you would like the PDT team
to re-consider.
Keywords: beta1
Whiteboard: [PDT+]waiting for reporter to verify
Jud, please re-assign this to whoever owns the FTP and/or URL code. I don't
think this is a beta1-stopper.
Assignee: erik → valeski
Target Milestone: M14 → M15
Moving to M16.
Target Milestone: M15 → M16
Whiteboard: [HELP WANTED]
I'm not following this. can someone in i18n take this on?

Nominating to beta2.
Keywords: nsbeta2
On US Win95 on 4.72 if I type http://kaze:8000/url into the location bar
and hit return
 (1a) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays: http://kaze:8000/url/“ú–{Œê/
 (2a) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/
     - clicking on that link DOES work
     - the URL in the location bar displays: http://kaze:8000/url/ÆüËܸì/

On US Win95 running the 2000042109 build and typing http://kaze:8000/url into
the location bar and hitting return

 (1b) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does not work and I get the Not Found error page
     - the URL in the location bar displays:
 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link DOES work
     - the URL in the location bar displays:

Additionally, if I paste the resulting URL from (2a) into the location bar:

Seamonkey also gets the Not Found page and the results in the location bar:

Teruko, Does 4.72 behave differently on Ja Windows?
whoops.  cut & paste error in previous comment!
The comment for case (2b) fails to find the page and should read:

 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays:
The ftp problem should be split off into a separate bug.
As noted, the ftp links work, but are not displayed correctly (instead of
Japanese kanji characters you see Latin1 garbage).
Assignee: valeski → ftang
Target Milestone: M16 → M17
Reassigning this back to ftang. I think we need to change the code in Layout,
not in Necko.
Similar problem as 30460. Patch available at http://warp/u/ftang/tmp/illurl.txt
One thing I forgot to say is that the patch depends on bug 37395.
Per the ftang/waqar/troy meeting: we should move the URL fixing code into the content
sink so we don't need to convert/escape every time. Also, we agree to reassign
this bug to Layout.
Assignee: ftang → troy
Component: Networking → Layout
*** Bug 38133 has been marked as a duplicate of this bug. ***
with Troy's departure, this is at risk for M17.  PDT team, is this required for 
Assignee: troy → buster
Assignee: buster → waterson
Whiteboard: [HELP WANTED] → [nsbeta2+]
Putting on [nsbeta2+] radar for beta2 fix.   Sending over to waterson.
It seems like the right thing to do in this case is convert/escape/mangle the 
URL in the anchor tag itself? This would also make sure that the 
correct thing happened if someone changed the "href" attribute using L1 

attinasi and I talked about keeping the resolved version of the URL in the 
anchor tag to deal with some style & performance issues: maybe this could just 
be an extension (or precursor) to that work?

Presumably we'd need to do this for other elements that had "href" properties, 
as well.

Some notes from Troy:
1) Frank's patch is expensive in terms of performance, especially because it 
includes dynamic allocation. We should be able to do much better.
2) We should be able to convert the URL to an immutable 8-bit ASCII string one 
time, probably at the time we process the attribute (or maybe lazily the first 
time we actually use the attribute.)  We would cache this converted immutable 
string and hand that to necko.
3) or, necko could just do the conversion on the fly.  but that brings us right 
back to performance problems.
So I claim (2) is the right thing to do, and should be done by the anchor tag. 
We need to be able to handle L1 DOM updates, too, and the tag itself (not the 
content sink) is the only one that can do that.
Can we canonicalize it when we ASCII-ize it too? That would help in dealing with
bug 29611 (which reports that we spend too much time determining a link's
state due to conversions). I'm linking this bug to 29611 since it may help it.
Blocks: 29611
Yeah, we should absolutely canonicalize it then. (See comments above...)
Since Ftang's 11/30/99 comment on HTTP URL issue is on
target, I would only add an additional server example 
to contrast different servers on this problem.

The above URL: 


points to a Netscape Enterprise server 2.01. No Netscape
Enterprise server supports (up to the current 4.1) 8-bit
path or file names according to the server admin document.

Thus it sends the 8-bit URL portion without escaping.
How to deal with this has been documented by ftang already.

An Apache server on the other hand supports 8-bit names
and escapes them. See for example a nearly identical
example to the above on an Apache server:


There the Japanese name is escaped by the server properly --
do view source on the page -- and the directory can be easily

On another issue: should we support UTF-8 URL? 
IE has a default option setting to this and then the user
can turn off this option. After discussing this issue with
Erik, I'm inclined to believe that UTF-8 URL has problems
and would not need to be supported at this juncture.
I should add that on the Apache server, Comm 4.x, IE5,
and Netscape 6 all work OK with the Japanese path name.
Depends on: 40461
Ok, I talked to ftang yesterday, and here's what he thinks the right thing to 
do is as I understand it.

Problem. An anchor tag's "href" attribute may contain non-ASCII characters.
URLs don't. So how do we properly escape the non-ASCII characters in the URL so
that an ASCII string results?

Solution. Use the document's charset to return the href attribute to its 
original 8-bit encoding. Then URL escape (e.g., " " --> "%20") the 
non-ASCII printables. Then call nsIURI::Resolve with the document's base URL.
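
The three steps can be sketched in C++ as below. This is a simplified illustration, not the code attached to the bug: EncodeLatin1 stands in for a real document-charset encoder and handles only ISO-8859-1, ResolveAgainstBase is a naive stand-in for nsIURI::Resolve, and all function names plus the example.test URL are hypothetical.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Step 1 (stand-in): return the Unicode href to the document's 8-bit encoding.
// Only ISO-8859-1 is handled here; a real implementation would use a charset
// converter chosen from the document's charset.
std::string EncodeLatin1(const std::u16string& text) {
    std::string bytes;
    for (char16_t ch : text) {
        if (ch > 0xFF) throw std::range_error("not representable in ISO-8859-1");
        bytes += static_cast<char>(static_cast<unsigned char>(ch));
    }
    return bytes;
}

// Step 2: percent-escape every byte that is not a printable ASCII character.
// '%' is left alone so escapes already present in the href are not doubled.
std::string EscapeNonAsciiPrintable(const std::string& bytes) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (c <= 0x20 || c >= 0x7F) {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Step 3 (naive stand-in): keep the base up to its last '/' and append the
// relative reference. Real resolution also handles "..", schemes, and so on.
std::string ResolveAgainstBase(const std::string& base, const std::string& relative) {
    const std::string::size_type slash = base.rfind('/');
    return base.substr(0, slash + 1) + relative;
}

std::string MakeAbsoluteSketch(const std::u16string& href, const std::string& base) {
    return ResolveAgainstBase(base, EscapeNonAsciiPrintable(EncodeLatin1(href)));
}
```

For a Latin-1 document at http://example.test/dir/index.html, an href of "café menu.html" comes out as http://example.test/dir/caf%E9%20menu.html.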

I've implemented it, and sure enough, it seems to make this test case work. It 
leaves a bad taste in warren's mouth, but I burned my tongue a while back and
can't taste a thing.

Do we need to do this for *every* relative URI reference? Zoiks! Anyway, I've 
implemented a layout-specific version of NS_MakeAbsoluteURI() (maybe I should 
call it NS_MakeAbsoluteURIWithHRef or something) that takes a charset ID and 
does the heavy lifting. It's in a new file which I'll attach to the bug. I'll 
also attach new diffs to nsHTMLAnchorElement.cpp.

Comments on this approach?
chris: this looks great.  r=buster.

what about other URLs, like <img src=...>?  Do those need to be treated 
Looks good. Where do you want to put the "silly extra stuff"? If we put it in 
necko, it will introduce a dependency on i18n (although maybe we have one 
already). But maybe this is right anyway.

Any suggestions on how to specify the necko APIs (in nsNetUtil.h) so that it is
clear what you're supposed to be passing in? Maybe we just need a comment
saying that the nsString is supposed to already have all this stuff done to it 
(charset encoded, escaped). Or maybe we should eliminate it in favor of the new 
thing you wrote.
I am going to put the charset-specific version of NS_MakeAbsoluteURI() in 
layout; and I am going to change it to NS_MakeAbsoluteURIWithCharset() to avoid 
gratuitous name overloading.
I'd like to make it a method on nsIIOService provided we already depend on 
i18n. (I think we do for string bundles. - Gagan, Jud?)
is bug 40661 related?:
""files" datasource fails to open non-latin named directory"
No. that bug is a dup of 28787
I had a quick look at the new MakeAbsoluteURIWithCharset method, and it uses
Unicode conversion routines. However, Necko does not appear to use the Unicode
converters. Necko may use string bundles, but they are in a separate DLL
(strres.dll), while the Unicode converters are in uc*.dll. I don't know how
important it is to keep Necko free of uc*.dll dependencies, but this is what I
found after a quick look.
Erik: Thanks for the info. Is there any plan to bundle all (many) of the intl 
dlls into one as we did for necko (to improve startup time, and reduce 
We have discussed that, but I don't know of any concrete plans in that area.
I took a look at MakeAbsoluteURIWithCharset and it includes the following code
at the end:

    static const PRInt32 kEscapeEverything = nsIIOService::url_Forced - 1;

    nsXPIDLCString escaped;
    ioservice->Escape(spec.GetBuffer(), kEscapeEverything,

This usage of nsIIOService::url_Forced is certainly not what I wanted it to do
and I don't believe it does anything useful this way. This method is used to
escape a specific part of an URL not a whole URL. There are different rules for
every part.
Could you suggest an alternative?
Do we really need to escape this stuff? What characters that can possibly damage
urlparsing can be expected from the charset conversion? Maybe you could use the
old nsEscape functions in xpcom/io as a replacement.
This issue is also relevant for simple XLinks, XPath, etc. "src" attributes.

When I implemented XPath and simple XLink, I asked ftang how to do this but he 
could not convince me what is the right thing to do :) So what I do there is 
just grab the Unicode string, AssignWithConversion() to nsCAutoString and pass 
it to the URI-creating objects. This seems to work in basic cases, like spaces, 
but I do not think it is the correct way.

What bugs me the most is this: suppose a document is in UTF-16 and a URL 
contains an illegal char that does not fit in ASCII. ftang said to 1) get the 
doc charset and use the nsIIOService ConvertAndEscape() function, but to me this 
seems like it cannot work for UTF-16. Or does the convert thing automatically 
convert UTF-16 into UTF-8 (which fits into char*)? How would we then know what 
to send back to server?

I also seem to have trouble understanding how do we escape multibyte UTF-8 URLs. 
If a two-byte character becomes two escaped URL chars, will the system still 
work? Do we send the server back the bits it gave us?
andreas: how is using the "old" nsEscape different from what I'm doing now? Are 
there really different rules for escaping different parts of a URI? That doesn't
seem right.

heikki: nsIUnicodeEncoder takes a UCS-2 string as input and returns the result 
in an 8-bit string as output. Presumably it does the right thing on UTF-16 to 
round-trip the UTF-16 bytes.
The main reason for escaping is to hide certain special characters from the
urlparser that would mislead the parser, like having a @ in a username or a / in
a filename or something similar. Depending on the position inside the url
different characters are special for the parser and that is what can be done
with the new escape functions. I have to tell it which part of the URL I want to
escape. Simply giving it every possible mask will not work. The old nsEscape
stuff does not look for a special part, it can be used to escape whole urls, but
it may escape too much or not enough.
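
To make the per-part idea concrete, here is a C++ sketch in which the caller names the part being escaped. The UrlPart enum, EscapePart, and the character sets chosen for each part are assumptions made for illustration; they are not the nsIIOService masks or the rules used in Necko.

```cpp
#include <cassert>
#include <string>

// Hypothetical part selector; the real escape masks are not reproduced here.
enum class UrlPart { Username, Directory, FileName };

// Escape one part of a URL. Which ASCII characters count as "special"
// depends on the part; the sets below are illustrative only.
std::string EscapePart(const std::string& text, UrlPart part) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string special;
    switch (part) {
        case UrlPart::Username:  special = "@:/?#"; break;  // '@' would end the userinfo
        case UrlPart::Directory: special = "?#";    break;  // '/' is a real delimiter here
        case UrlPart::FileName:  special = "/?#";   break;  // '/' must not split the name
    }
    std::string out;
    for (unsigned char c : text) {
        const bool needsEscape =
            c <= 0x20 || c >= 0x7F || c == '%' ||
            special.find(static_cast<char>(c)) != std::string::npos;
        if (needsEscape) {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

The same input "a/b" stays "a/b" when escaped as a directory but becomes "a%2Fb" when escaped as a file name, which is why one mask covering every part cannot give the right answer.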
In the case of a UTF-16 document, it is not clear to me that the server is
expecting a URL-encoded (%XX) UTF-16 URL. In fact, some people working on these
issues in the standards bodies seem to be pushing UTF-8 as a convention in URLs.
See However,
blindly converting *every* part of a URL to UTF-8 has bad consequences in
today's Web, as Microsoft discovered. They do not convert the "query" part (the
part after the question mark) to UTF-8. Also, they have a preference for the
part before the question mark. In most versions, they convert that part to UTF-8
but in the Korean and Taiwanese versions they found that they had to set the
default pref to "OFF" (i.e. no conversion to UTF-8), presumably because people
in those countries were using servers that expect something other than UTF-8
in those parts of the URLs.

For now, converting from PRUnichar to the doc's charset and then %-encoding is
the best thing to do, in my opinion. We should continue to follow the emerging
specs in this area, and possibly modify our implementation accordingly. UTF-16
documents are currently still quite rare, I think, but if we are really
concerned about this, my suggestion is to make a special case for the UTF-16
doc charset and convert to UTF-8 (instead of UTF-16).
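
A C++ sketch of that special case, assuming the URL text is already available as Unicode code points (std::u32string is used here to sidestep UTF-16 surrogate handling). EncodeUtf8, PercentEncodeNonAscii and EscapeAsUtf8 are hypothetical names, and the encoder does no validation.

```cpp
#include <cassert>
#include <string>

// Encode Unicode code points as UTF-8 (no validation of surrogates or range).
std::string EncodeUtf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// %-encode every byte with the high bit set; ASCII passes through unchanged.
std::string PercentEncodeNonAscii(const std::string& bytes) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (c >= 0x80) {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Special case for a UTF-16 document: convert to UTF-8, then %-encode.
std::string EscapeAsUtf8(const std::u32string& text) {
    return PercentEncodeNonAscii(EncodeUtf8(text));
}
```

Each multi-byte character simply becomes several %XX triplets: U+65E5 U+672C U+8A9E followed by "/" comes out as "%E6%97%A5%E6%9C%AC%E8%AA%9E/", and the server receives exactly those UTF-8 bytes after it unescapes the path.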
fix checked in.
Closed: 20 years ago → 20 years ago
Resolution: --- → FIXED
Not sure how to verify this problem. Could the reporter please check this in the 
latest build ?
I verified this in 2000-06-02-08 Win32, Mac, and Linux build.