Closed Bug 10373 Opened 25 years ago Closed 24 years ago

Non-ASCII url cannot be found

Categories

(Core :: Layout, defect, P2)


Tracking


VERIFIED FIXED

People

(Reporter: teruko, Assigned: waterson)

References


Details

(Whiteboard: [nsbeta2+])

Attachments

(4 files)

The above URL includes a Japanese directory name.

When you go to the page, the DOS console says
"http://babel/tests/browser/http-test/%3f%3f%3f loaded successfully",
but Apprunner says "Not Found".

Steps to reproduce:
1. Go to http://babel/tests/browser/http-test
2. Click on "??"

"Not Found" is displayed.

Tested 7-21-16-NECKO Win32 build and 7-22-08-M9 Win32 build.
Priority: P3 → P2
Target Milestone: M9
The "loaded successfully" message is currently tied to document load status; it
really has no clue what the HTTP status was. I am not sure what the other
problem you are mentioning is... Is the document expected to be shown? (I
verified that 4.6 does not show it.)

There is a whole section on URL encoding that is missing from the picture
right now. But I want to confirm that this bug really is about encoding URLs.
As a maintainer of the Babel server cited above, I would like to
offer some additional facts of relevance. (Sorry, teruko, I forgot
to tell you about the Babel server's limitation described below.)

1. The directory name which ends the above URL is actually in Japanese
   and contains three 2-byte characters. Unfortunately the server is running
   on a US Windows NT4 and thus mangles the multi-byte directory and
   file names. This is why you see 3 ? marks. Thus, even if you properly
   encode the URL, you will never find the directory.
2. Let me offer a sample on a Unix server which can handle 8-bit
   file names.

   http://kaze:8000/url/

3. In the directory, you will find 3 sub-directories. The 3rd from top
   (if viewed with 4.6/4.7 or 5.0 under Japanese (EUC-JP) encoding)
   will show in Japanese.
4. The first 2 directories are named using the escaped URLs
   and the 2nd one actually represents the escaped URL version of
   the 3rd Japanese one. If escaping is working correctly, you should
   see the escaped URL matching that of the 2nd one when the cursor
   is perched on the 3rd directory. (Compare the FTP URL below.)

Some issues:

A. Click on the 3rd directory and it fails. We seem to be
   escaping special characters like "%" but not 8-bit characters.
   (4.6/4.7 and 5.0 w/necko are pretty much the same here.)
   We should fix this in 5.0.

B. 4.6/4.7 actually escapes 8-bit path names in FTP URLs.
   For example, try the same 3 directory listing with the ftp
   protocol at the URL below. You can get inside the 3rd directory
   under Japanese (EUC-JP) encoding:

   ftp://kaze/pub/

   And you can also see the escaped URL on the status bar when
   the cursor is on the 3rd directory.
   5.0 w/necko does not escape 8-bit names and cannot get inside this
   directory. We should fix this in 5.0.

Question:

Are you planning on supporting escaping to both native encoding
and UTF-8 if a server indicates it can accept Unicode? I believe
there is a recent IETF draft on url-i18n which discusses
the UTF-8 implementation.
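
To make the expected behavior concrete, here is a minimal standalone sketch
(not Mozilla code) of the escaping being asked for: take the raw EUC-JP bytes
of the Japanese directory name and percent-encode them. The byte values below
are the EUC-JP encoding of that name; the helper and output are purely
illustrative.

  #include <cstdio>
  #include <string>

  // Percent-encode controls, space, '%', and every 8-bit (non-ASCII) byte.
  // A real implementation would also escape reserved ASCII characters
  // depending on which part of the URL is being built.
  static std::string EscapeBytes(const std::string& in)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (unsigned char c : in) {
          if (c > 0x20 && c < 0x7F && c != '%') {
              out += static_cast<char>(c);
          } else {
              out += '%';
              out += hex[c >> 4];
              out += hex[c & 0x0F];
          }
      }
      return out;
  }

  int main()
  {
      // EUC-JP bytes of the Japanese directory name in the listing.
      const std::string eucjp = "\xC6\xFC\xCB\xDC\xB8\xEC";
      std::printf("http://kaze:8000/url/%s/\n", EscapeBytes(eucjp).c_str());
      // Prints: http://kaze:8000/url/%C6%FC%CB%DC%B8%EC/
  }

This escaped form is what the status bar should show for the 3rd link, and it
should match the escaped EUC-JP name used for the 2nd directory above.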
*** Bug 7399 has been marked as a duplicate of this bug. ***
*** Bug 7847 has been marked as a duplicate of this bug. ***
*** Bug 8333 has been marked as a duplicate of this bug. ***
*** Bug 8337 has been marked as a duplicate of this bug. ***
*** Bug 10429 has been marked as a duplicate of this bug. ***
Severity: major → blocker
Bug 10429 was going to be marked a blocker; since it was duped to this bug, marking this one a blocker.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
This should work OK now... Please verify.
Whiteboard: waiting for new build with fix to verify
Can you international types verify this? I have verified that you can have
spaces in your path, which was bug 10429.
Whiteboard: waiting for new build with fix to verify → waiting for reporter to verify
QA Contact: paulmac → teruko
Yes. This should go to teruko now.
Status: RESOLVED → REOPENED
I tested this in the 7-31-09 and 8-02-09 builds.
I used http://kaze/url/ to test this.
When I went to http://kaze/url/ and changed the Charset to EUC-JP, the 3rd
directory name in Japanese was displayed correctly.

Then, when I clicked on the link with the Japanese directory name,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.

This needs to be reopened.
Resolution: FIXED → ---
Clearing Fixed resolution due to reopen.
Status: REOPENED → ASSIGNED
Target Milestone: M9 → M10
This has a lot to do with the fact that nsIURI currently works with char* and
not PRUnichar*. I would like to verify this again once we move the URL accessors
to all PRUnichar (sometime during M10). Marking it such.
*** Bug 12473 has been marked as a duplicate of this bug. ***
Blocks: 13449
Target Milestone: M10 → M12
Correcting target milestone.
Summary: Non-ASCII url cannot be found → [dogfood] Non-ASCII url cannot be found
PDT would like to know: does the 4.x product allow non-ASCII in URLs? Does a
test case rely on allowing 8-bit?
In 4.x, we are able to resolve FTP URL links containing 8-bit characters, but
not HTTP URL links. Please note that this is not the same as typing 8-bit
characters into the Location window. We didn't support that in 4.x.
Kat's comment is true for Japanese, but not for Latin-1. In 4.x you can type
http://babel/5x_tests/misc/montse/αϊινσϊ.jpg and it will show the file and the
URL will be correct. We need this working for Latin-1.
Sorry about that. We didn't escape multi-byte characters in 4.x, but we did so
for single-byte 8-bit characters in HTTP.
It turns out that with the current Mozilla build (10/20/99 Win32), a Latin-1
URL is resolved to an existing page as well. It seems that in both 4.x and
Mozilla we are not doing anything to a URL which contains single-byte 8-bit
characters, i.e. we just pass them through and it works. I think we should
escape these single-byte 8-bit characters, however.

It is multi-byte characters which are not supported at this point in HTTP or
FTP URLs in Mozilla (or in 4.x).
Summary: [dogfood] Non-ASCII url cannot be found → Non-ASCII url cannot be found
And now Kat is right; it's the typing (another bug) which is not working, but if
the file is selected it will show up. Removing dogfood.
Moving Assignee from gagan to warren since he is away.
Assignee: warren → momoi
What's the deal on this bug now? The link in the URL field above is stale (in
4.x too): http://babel/tests/browser/http-test/%3f%3f%3f

This one works fine: http://babel/5x_tests/misc/montse/αϊινσϊ.jpg
although I don't think that should be a valid URL. Those characters should be
escaped to work, shouldn't they (or is that not how things are in practice)?

Reassigning back to i18n.
What needs to happen on this bug is:

1. Use the test case above. (Changed from the one on babel, which
   cannot process multi-byte URLs as it is a US NT4 server.)
2. There are 2 URL links on this test page. The last one
   is in Japanese. When you view this page under the Japanese
   (EUC-JP) encoding, you should see the status bar at the bottom
   display the escaped URL. Currently it shows 3 dots indicating
   that Mozilla is not able to escape multi-byte directory names.
   When the 3rd (JPN) link is properly escaped, the escaped part should
   look like the 2nd URL which shows the escaped EUC-JP name used in the
   3rd link. The 1st escaped example shows the escaped sequence of the
   same 3 characters in Shift_JIS.

3. Contrast this with the ftp protocol under 4.7:

   ftp://kaze/pub

   This page contains the same 3 directory names. Use 4.7 and move
   the cursor over the 3rd link. You see that 4.7 escapes this to
   the identical string as you see for the 2nd URL.

Assigning this to ftang for an assessment as to what needs to be
done. When we isolate the problem, please send it back to warren.
Assignee: momoi → ftang
I've determined that the URL parsing is pretty screwed up in this regard. The
way I think it should work is:

- SetSpec and SetRelativePath should take escaped strings (like in the
examples), and in the parsing process convert the escapes to their unescaped
representation (held inside the nsStdURL object).

- We should probably assert if we see a non-ASCII character passed to
SetSpec/SetRelativePath. The caller should have converted to UTF-8 before
calling us (I think this is being done by the webshell).

- GetSpec should reconstruct and re-escape the URL string before returning it.
Note that this is tricky, because if you have a slash in a filename (could
happen) it should be careful to turn this into %2F rather than make it look
like a directory delimiter.

- The nsIURI accessors should return the unescaped representation of things.
That way if I say SetSpec("file:/c|/Program%20Files") and then call GetPath, I
should get back "c|/Program Files".

The FTP protocol must be doing escaping by hand. This should be handled by
nsStdURL.

Cc'ing valeski & andreas.
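
To illustrate the intended round trip, here is a small standalone sketch
against plain std::string (this is not the real nsStdURL; the helpers only show
the mechanism): the setter unescapes into the internal form, and the getter
re-escapes, turning a literal '/' inside a path segment into %2F so it cannot
be mistaken for a directory delimiter.

  #include <cctype>
  #include <cstdio>
  #include <cstdlib>
  #include <string>

  // Decode %XX sequences into the internal, unescaped representation.
  static std::string Unescape(const std::string& in)
  {
      std::string out;
      for (size_t i = 0; i < in.size(); ++i) {
          if (in[i] == '%' && i + 2 < in.size() &&
              std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
              std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
              out += static_cast<char>(
                  std::strtol(in.substr(i + 1, 2).c_str(), nullptr, 16));
              i += 2;
          } else {
              out += in[i];
          }
      }
      return out;
  }

  // Re-escape one path segment for the wire: controls, space, '%', 8-bit
  // bytes, and a literal '/' all become %XX.
  static std::string EscapeSegment(const std::string& in)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (unsigned char c : in) {
          if (c > 0x20 && c < 0x7F && c != '%' && c != '/') {
              out += static_cast<char>(c);
          } else {
              out += '%';
              out += hex[c >> 4];
              out += hex[c & 0x0F];
          }
      }
      return out;
  }

  int main()
  {
      std::printf("%s\n", Unescape("Program%20Files").c_str()); // Program Files
      std::printf("%s\n", EscapeSegment("a/b.txt").c_str());    // a%2Fb.txt
  }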
My two cents on this:

Yes, URL parsing is pretty much screwed up regarding escaping and multibyte
chars.

- There was a task to convert the URL accessors to PRUnichar. See bug 13453. It
is now marked invalid.

- nsStdURL does no escaping currently.

Why not store the URL as we get it (escaped or not) from whoever is calling the
nsStdURL functions? Who cares? Just note it on the URL, because one should never
escape an already escaped string or unescape an already unescaped string. And it
is a problem to definitely find out whether a URL is already escaped or not. I
think we need to have a member variable for that.

The constructors or SetSpec or SetPath (or the others) should have an additional
parameter which tells whether the given string is already escaped or not, which
is then stored in the member variable.

The get accessors (like GetPath or GetSpec) should have an additional parameter
which indicates whether we want the spec/path escaped or unescaped, and let the
accessors do that on the fly, looking at the escape member variable and doing
the appropriate thing (copy and convert, or just copy).

The webshell would want to see the unescaped version to present the user a
native view of the URLs; internally (e.g. when sending the request to the
server) we would use the escaped version of the URL. I don't think it's true
that we always want to see the unescaped version.
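
Here is a bare-bones sketch of this flagged-spec idea (not the real nsStdURL;
the escape/unescape helpers below are deliberate placeholders that only handle
the space character, just to keep the shape of the design visible):

  #include <cstdio>
  #include <string>

  // Placeholder converters: ' ' <-> "%20" only.
  static std::string EscapePlaceholder(std::string s)
  {
      for (std::string::size_type p = 0;
           (p = s.find(' ', p)) != std::string::npos; p += 3)
          s.replace(p, 1, "%20");
      return s;
  }

  static std::string UnescapePlaceholder(std::string s)
  {
      for (std::string::size_type p = 0;
           (p = s.find("%20", p)) != std::string::npos; ++p)
          s.replace(p, 3, " ");
      return s;
  }

  class UrlSpec {
  public:
      void SetSpec(const std::string& spec, bool escaped)
      {
          mSpec = spec;       // store exactly what we were given
          mEscaped = escaped; // so we never escape or unescape twice
      }
      std::string GetSpec(bool wantEscaped) const
      {
          if (wantEscaped == mEscaped)
              return mSpec;                             // plain copy
          return wantEscaped ? EscapePlaceholder(mSpec) // copy and convert
                             : UnescapePlaceholder(mSpec);
      }
  private:
      std::string mSpec;
      bool mEscaped = false;
  };

  int main()
  {
      UrlSpec u;
      u.SetSpec("file:/c|/Program Files", /*escaped=*/false);
      std::printf("%s\n", u.GetSpec(true).c_str());  // file:/c|/Program%20Files
      std::printf("%s\n", u.GetSpec(false).c_str()); // file:/c|/Program Files
  }

The webshell would ask for the unescaped form, while the code that sends the
request to the server would ask for the escaped form.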
Assignee: ftang → bobj
1. Let's not mix URLs for different protocols in the same bug.
2. "FTP protocol data to FTP URL" conversion and "FTP URL to FTP protocol data"
generation are done on the client side, not on the server side. So the client
code should do the right thing with it. The code should always URL-escape the
FTP data before concatenating it into an FTP URL, and unescape it from the URL
back into FTP protocol data.
3. HTTP URL generation is done on the server side. Therefore, the client code
has no control over bad URL generation (which means the URL contains bytes >
0x80). If an HTTP server does send the client a URL which contains bytes >
0x80, the client code should URL-escape it. But when the client accesses that
URL, it should not unescape it.

Reassign to bobj to find the right owner.
Status: NEW → ASSIGNED
Target Milestone: M12 → M13
Bulk move of all Necko (to-be-deleted component) bugs to the new Networking
component.
Assignee: bobj → erik
Status: ASSIGNED → NEW
Target Milestone: M13 → M14
Reassigned to erik for M14
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Fixed by Andreas' changes that I just checked in.
Status: RESOLVED → REOPENED
reopened because of backout
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Blocks: 24854
Change platform and OS to ALL
OS: Windows 95 → All
Hardware: PC → All
Keywords: beta1
Putting on PDT+ radar for beta1.
Whiteboard: waiting for reporter to verify → [PDT+]waiting for reporter to verify
The changes Warren spoke of are in again. Can someone with access to this
server take a look to see if this is fixed?
I tested this in the 2000020808 Win32 build.

The result was the same as before.

When I went to http://kaze/url/ and changed the Charset to EUC-JP, the 3rd
directory name in Japanese was displayed correctly.

Then, when I clicked on the link with the Japanese directory name,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.

The Japanese directory under the http test does not work in Nav4 and MSIE, so
it's OK if it doesn't work in Mozilla.

The Japanese directory under the ftp test works in Nav4, MSIE and Mozilla, but
it is displayed wrong in Mozilla and MSIE, while it is displayed OK in Nav4.

So, the only thing in this bug report that needs to be fixed is the display
of FTP directory and file names.
I agree that if FTP folder & file names work OK, then we would at
least have parity with our own earlier version, and that would be
acceptable. ftang's comment on HTTP servers is well-taken.
One more thing. There are 2 files under the Japanese-named directory
(3rd from the top) on the ftp server. The contents of these 2 files should
display in Mozilla. Currently, I don't see the 2 files at all.
It looks like the 2nd and 3rd directories are mixed up right now
and thus sometimes (not always) show the name "testb.html", which
does not exist under the 3rd directory but under the 2nd one.
FTP and non-ASCII file names are relatively minor aspects on the Net today.
I believe we should remove the beta1 keyword and PDT+ approval.
Could someone please describe a little better how the URLs currently look and
how they should look?
Removed beta1 and PDT+. Please re-enter beta1 if you would like the PDT team
to re-consider.
Keywords: beta1
Whiteboard: [PDT+]waiting for reporter to verify
Jud, please re-assign this to whoever owns the FTP and/or URL code. I don't
think this is a beta1-stopper.
Assignee: erik → valeski
Status: ASSIGNED → NEW
Target Milestone: M14 → M15
Moving to M16.
Target Milestone: M15 → M16
Whiteboard: [HELP WANTED]
I'm not following this. can someone in i18n take this on?

Nominating to beta2.
Keywords: nsbeta2
On US Win95 on 4.72 if I type http://kaze:8000/url into the location bar
and hit return
 (1a) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays: http://kaze:8000/url/“ú–{Œê/
 (2a) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/
     - clicking on that link DOES work
     - the URL in the location bar displays: http://kaze:8000/url/ÆüËܸì/

On US Win95 running the 2000042109 build and typing http://kaze:8000/url into
the location bar and hitting return

 (1b) with the View|Character Set to either Japanese(EUC-JP) or 
     Japanese(Auto-Detect), then
     - the 3rd directory displays the kanji characters for "nihongo" correctly
     - clicking on that link does not work and I get the Not Found error page
     - the URL in the location bar displays:
    http://kaze:8000/url/%C3%A6%C2%97%C2%A5%C3%A6%C2%9C%C2%AC%C3%A8%C2%AA%C2%9E/
 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link DOES work
     - the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B
8%C3%83%C2%AC/

Additionally, if I paste the resulting URL from (2a) into the location bar:
    http://kaze:8000/url/ÆüËܸì/

Seamonkey also gets the Not Found page and the results in the location bar:
    http://kaze:8000/url/%C3%86%C3%BC%C3%8B%C3%9C%C2%B8%C3%AC/

Teruko, Does 4.72 behave differently on Ja Windows?
whoops.  cut & paste error in previous comment!
The comment for case (2b) fails to find the page and should read:

 (2b) with the View|Character Set to Western(ISO-8859-1), then
     - the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
     - clicking on that link does NOT work and I get the Not Found error page
     - the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B
8%C3%83%C2%AC/
The ftp problem should be split off into a separate bug.
As noted, the ftp links work, but are not displayed correctly (instead of
Japanese kanji characters you see Latin1 garbage).
Assignee: valeski → ftang
Target Milestone: M16 → M17
Reassigning this back to ftang. I think we need to change the code in Layout,
but not in Necko.
Similar problem to bug 30460. Patch available at http://warp/u/ftang/tmp/illurl.txt
Status: NEW → ASSIGNED
One thing I forgot to say is that the patch depends on bug 37395.
Per the ftang/waqar/troy meeting: we should move the URL-fixing code into the
content sink so we don't need to convert/escape every time. Also, we agreed to
reassign this bug to Layout.
Assignee: ftang → troy
Status: ASSIGNED → NEW
Component: Networking → Layout
*** Bug 38133 has been marked as a duplicate of this bug. ***
With Troy's departure, this is at risk for M17. PDT team, is this required for
beta2?
Assignee: troy → buster
Assignee: buster → waterson
Whiteboard: [HELP WANTED] → [nsbeta2+]
Putting on [nsbeta2+] radar for beta2 fix.   Sending over to waterson.
It seems like the right thing to do in this case is to convert/escape/mangle the
URL in the anchor tag itself. This would also make sure that the
correct thing happens if someone changes the "href" attribute using Level 1
DOM APIs.

attinasi and I talked about keeping the resolved version of the URL in the 
anchor tag to deal with some style & performance issues: maybe this could just 
be an extension (or precursor) to that work?

Presumably we'd need to do this for other elements that had "href" properties, 
as well.

Comments?
Status: NEW → ASSIGNED
Some notes from Troy:
1) Frank's patch is expensive in terms of performance, especially because it 
includes dynamic allocation. We should be able to do much better.
2) We should be able to convert the URL to an immutable 8-bit ASCII string one 
time, probably at the time we process the attribute (or maybe lazily the first 
time we actually use the attribute.)  We would cache this converted immutable 
string and hand that to necko.
3) Or, necko could just do the conversion on the fly, but that brings us right
back to performance problems.
So I claim (2) is the right thing to do, and should be done by the anchor tag. 
We need to be able to handle L1 DOM updates, too, and the tag itself (not the 
content sink) is the only one that can do that.
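
Here is a minimal sketch of option (2) as described, with the real conversion
stubbed out (the placeholder below only escapes spaces and 8-bit bytes; the
actual code would convert through the document charset first):

  #include <cstdio>
  #include <string>

  class AnchorHref {
  public:
      void SetHref(const std::string& raw)
      {
          mRaw = raw;
          mHaveCached = false;   // invalidate the cache on DOM updates
      }

      const std::string& GetAsciiHref()
      {
          if (!mHaveCached) {
              mCached = ConvertAndEscape(mRaw);  // done at most once per value
              mHaveCached = true;
          }
          return mCached;        // the immutable 8-bit string handed to necko
      }

  private:
      // Placeholder for "convert to the document charset and escape".
      static std::string ConvertAndEscape(const std::string& in)
      {
          static const char hex[] = "0123456789ABCDEF";
          std::string out;
          for (unsigned char c : in) {
              if (c < 0x80 && c != ' ') {
                  out += static_cast<char>(c);
              } else {
                  out += '%';
                  out += hex[c >> 4];
                  out += hex[c & 0x0F];
              }
          }
          return out;
      }

      std::string mRaw, mCached;
      bool mHaveCached = false;
  };

  int main()
  {
      AnchorHref a;
      a.SetHref("url/\xC6\xFC\xCB\xDC\xB8\xEC/");
      std::printf("%s\n", a.GetAsciiHref().c_str()); // url/%C6%FC%CB%DC%B8%EC/
  }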
Can we canonicalize it when we ASCII-ize it too? That would help in dealing with
bug 29611 (which reports that we spend too much time determining a link's
state due to conversions). I'm linking this bug to 29611 since it may help it.
Blocks: 29611
Yeah, we should absolutely canonicalize it then. (See comments above...)
Since ftang's 11/30/99 comment on the HTTP URL issue is on
target, I would only add an additional server example
to contrast how different servers handle this problem.

The above URL: 

http://kaze:8000/url

points to a Netscape Enterprise server 2.01. No Netscape
Enterprise server (up to the current 4.1) supports 8-bit
path or file names, according to the server admin documentation.

Thus it sends the 8-bit URL portion without escaping.
How to deal with this has been documented by ftang already.

An Apache server, on the other hand, supports 8-bit names
and escapes them. See, for example, a nearly identical
example to the above on an Apache server:

http://mugen/url

There the Japanese name is escaped by the server properly --
do view source on the page -- and the directory can be easily
accessed.

On another issue: should we support UTF-8 URLs?
IE has an option for this which is on by default, and the user
can turn the option off. After discussing this issue with
Erik, I'm inclined to believe that UTF-8 URLs have problems
and would not need to be supported at this juncture.
I should add that on the Apache server, Comm 4.x, IE5,
and Netscape 6 all work OK with the Japanese path name.
Depends on: 40461
Ok, I talked to ftang yesterday, and here's what he thinks the right thing to
do is, as I understand it.

Problem: an anchor tag's "href" attribute can contain non-ASCII characters, but
URLs can't. So how do we properly escape the non-ASCII characters in the URL so
that an ASCII string results?

Solution: use the document's charset to return the href attribute to its
original 8-bit encoding. Then URL-escape (e.g., " " --> "%20") the
non-ASCII printables. Then call nsIURI::Resolve with the document's base URL.

I've implemented it, and sure enough, it seems to make this test case work. It
leaves a bad taste in warren's mouth, but I burned my tongue a while back and
can't taste a thing.

Do we need to do this for *every* relative URI reference? Zoiks! Anyway, I've 
implemented a layout-specific version of NS_MakeAbsoluteURI() (maybe I should 
call it NS_MakeAbsoluteURIWithHRef or something) that takes a charset ID and 
does the heavy lifting. It's in a new file which I'll attach to the bug. I'll 
also attach new diffs to nsHTMLAnchorElement.cpp.

Comments on this approach?
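
For reference, here is a rough standalone sketch of that pipeline (convert to
the document charset, escape the non-ASCII bytes, resolve against the base).
It is not the actual patch: it assumes a Latin-1 document so the charset step
is a trivial narrowing (the real code goes through the Unicode converters for
arbitrary charsets), the file name is hypothetical, and nsIURI::Resolve is
faked with a simple concatenation.

  #include <cstdio>
  #include <string>

  // Step 1: "convert to the document charset" -- trivial for Latin-1.
  static std::string ToDocCharset(const std::u16string& href)
  {
      std::string bytes;
      for (char16_t ch : href)
          bytes += static_cast<char>(ch < 0x100 ? ch : '?');
      return bytes;
  }

  // Step 2: escape controls, space, and 8-bit bytes.
  static std::string EscapeNonAscii(const std::string& bytes)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (unsigned char c : bytes) {
          if (c > 0x20 && c < 0x7F) {
              out += static_cast<char>(c);
          } else {
              out += '%';
              out += hex[c >> 4];
              out += hex[c & 0x0F];
          }
      }
      return out;
  }

  // Step 3: stand-in for nsIURI::Resolve (relative path against a base
  // that ends in '/').
  static std::string Resolve(const std::string& base, const std::string& rel)
  {
      return base + rel;
  }

  int main()
  {
      std::u16string href = u"montse/\u00F1and\u00FA.jpg";  // hypothetical name
      std::string abs = Resolve("http://babel/5x_tests/misc/",
                                EscapeNonAscii(ToDocCharset(href)));
      std::printf("%s\n", abs.c_str());
      // Prints: http://babel/5x_tests/misc/montse/%F1and%FA.jpg
  }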
chris: this looks great.  r=buster.

what about other URLs, like <img src=...>?  Do those need to be treated 
separately?
Looks good. Where do you want to put the "silly extra stuff"? If we put it in 
necko, it will introduce a dependency on i18n (although maybe we have one 
already). But maybe this is right anyway.

Any suggestions on how to specify the necko APIs (in nsNetUtil.h) so that it is
clear what you're supposed to be passing in? Maybe we just need a comment
saying that the nsString is supposed to already have all this stuff done to it 
(charset encoded, escaped). Or maybe we should eliminate it in favor of the new 
thing you wrote.
I am going to put the charset-specific version of NS_MakeAbsoluteURI() in 
layout; and I am going to change it to NS_MakeAbsoluteURIWithCharset() to avoid 
gratuitous name overloading.
I'd like to make it a method on nsIIOService provided we already depend on 
i18n. (I think we do for string bundles. - Gagan, Jud?)
is bug 40661 related?:
""files" datasource fails to open non-latin named directory"
No, that bug is a dup of bug 28787.
I had a quick look at the new MakeAbsoluteURIWithCharset method, and it uses
Unicode conversion routines. However, Necko does not appear to use the Unicode
converters. Necko may use string bundles, but they are in a separate DLL
(strres.dll), while the Unicode converters are in uc*.dll. I don't know how
important it is to keep Necko free of uc*.dll dependencies, but this is what I
found after a quick look.
Erik: Thanks for the info. Is there any plan to bundle all (many) of the intl 
dlls into one as we did for necko (to improve startup time, and reduce 
clutter)?
We have discussed that, but I don't know of any concrete plans in that area.
Frank?
I took a look at MakeAbsoluteURIWithCharset and it includes the following code
at the end:

    static const PRInt32 kEscapeEverything = nsIIOService::url_Forced - 1;

    nsXPIDLCString escaped;
    ioservice->Escape(spec.GetBuffer(), kEscapeEverything,
getter_Copies(escaped));

This usage of nsIIOService::url_Forced is certainly not what I wanted it to do,
and I don't believe it does anything useful this way. This method is used to
escape a specific part of a URL, not a whole URL. There are different rules for
every part.
Could you suggest an alternative?
Do we really need to escape this stuff? What characters that could possibly
damage URL parsing can be expected from the charset conversion? Maybe you could
use the old nsEscape functions in xpcom/io as a replacement.
This issue is also relevant for simple XLinks, XPath, etc. "src" attributes.

When I implemented XPath and simple XLink, I asked ftang how to do this but he 
could not convince me what is the right thing to do :) So what I do there is 
just grab the Unicode string, AssignWithConversion() to nsCAutoString and pass 
it to the URI-creating objects. This seems to work in basic cases, like spaces, 
but I do not think it is the correct way.

What bugs me the most is this: suppose a document is in UTF-16 and a URL 
contains an illegal char that does not fit in ASCII. ftang said to 1) get the 
doc charset and use the nsIIOService ConvertAndEscape() function, but to me this 
seems like it cannot work for UTF-16. Or does the convert thing automatically 
convert UTF-16 into UTF-8 (which fits into char*)? How would we then know what 
to send back to server?

I also seem to have trouble understanding how we escape multibyte UTF-8 URLs.
If a two-byte character becomes two escaped URL chars, will the system still
work? Do we send the server back the bits it gave us?
andreas: how is using the "old" nsEscape different from what I'm doing now? Are
there really different rules for escaping different parts of a URI? That doesn't
seem right.

heikki: nsIUnicodeEncoder takes a UCS-2 string as input and returns the result 
in an 8-bit string as output. Presumably it does the right thing on UTF-16 to 
round-trip the UTF-16 bytes.
The main reason for escaping is to hide certain special characters from the
URL parser that would mislead it, like a @ in a username or a / in
a filename or something similar. Depending on the position inside the URL,
different characters are special for the parser, and that is what can be done
with the new escape functions: I have to tell them which part of the URL I want
to escape. Simply giving them every possible mask will not work. The old
nsEscape stuff does not look for a specific part; it can be used to escape
whole URLs, but it may escape too much or not enough.
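
To make the per-part point concrete, here is a small illustrative sketch; the
parts and character sets below are examples, not the actual nsIIOService masks:

  #include <cstdio>
  #include <string>

  enum UrlPart { PathSegment, QueryValue };

  // Which characters must be hidden depends on where in the URL they sit:
  // '/' is special inside a path segment, '&' and '=' inside a query value.
  static bool NeedsEscape(unsigned char c, UrlPart part)
  {
      if (c <= 0x20 || c >= 0x7F || c == '%')
          return true;                              // always hide these
      if (part == PathSegment)
          return c == '/' || c == '?' || c == '#';
      return c == '&' || c == '=' || c == '#';      // QueryValue
  }

  static std::string EscapePart(const std::string& in, UrlPart part)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (unsigned char c : in) {
          if (!NeedsEscape(c, part)) {
              out += static_cast<char>(c);
          } else {
              out += '%';
              out += hex[c >> 4];
              out += hex[c & 0x0F];
          }
      }
      return out;
  }

  int main()
  {
      std::printf("%s\n", EscapePart("a/b c", PathSegment).c_str()); // a%2Fb%20c
      std::printf("%s\n", EscapePart("x=1&y", QueryValue).c_str());  // x%3D1%26y
  }

A single "escape everything" rule either hides too little for one part or
mangles delimiters that another part needs to keep.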
In the case of a UTF-16 document, it is not clear to me that the server is
expecting a URL-encoded (%XX) UTF-16 URL. In fact, some people working on these
issues in the standards bodies seem to be pushing UTF-8 as a convention in URLs.
See ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt. However,
blindly converting *every* part of a URL to UTF-8 has bad consequences in
today's Web, as Microsoft discovered. They do not convert the "query" part (the
part after the question mark) to UTF-8. Also, they have a preference for the
part before the question mark. In most versions, they convert that part to UTF-8
but in the Korean and Taiwanese versions they found that they had to set the
default pref to "OFF" (i.e. no conversion to UTF-8), presumably because people
in those countries were using servers that expect something other than UTF-8
in those parts of the URLs.

For now, converting from PRUnichar to the doc's charset and then %-encoding is
the best thing to do, in my opinion. We should continue to follow the emerging
specs in this area, and possibly modify our implementation accordingly. UTF-16
documents are currently still quite rare, I think, but if we are really
concerned about this, my suggestion is to make a special case for the UTF-16
doc charset and convert to UTF-8 (instead of UTF-16).
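
A small sketch of that UTF-16 special case (BMP code points only, no surrogate
pairs): convert each code point to UTF-8 bytes first and percent-encode those
bytes, rather than escaping the UTF-16 code units themselves.

  #include <cstdio>
  #include <string>

  static std::string Utf8Escape(const std::u16string& in)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (char16_t ch : in) {
          unsigned char bytes[3];
          size_t n;
          if (ch < 0x80) {              // plain ASCII passes through
              out += static_cast<char>(ch);
              continue;
          } else if (ch < 0x800) {      // 2-byte UTF-8 sequence
              bytes[0] = 0xC0 | (ch >> 6);
              bytes[1] = 0x80 | (ch & 0x3F);
              n = 2;
          } else {                      // 3-byte UTF-8 sequence (BMP)
              bytes[0] = 0xE0 | (ch >> 12);
              bytes[1] = 0x80 | ((ch >> 6) & 0x3F);
              bytes[2] = 0x80 | (ch & 0x3F);
              n = 3;
          }
          for (size_t i = 0; i < n; ++i) {
              out += '%';
              out += hex[bytes[i] >> 4];
              out += hex[bytes[i] & 0x0F];
          }
      }
      return out;
  }

  int main()
  {
      // U+65E5 U+672C U+8A9E: the Japanese directory name from the test case.
      std::printf("%s\n", Utf8Escape(u"\u65E5\u672C\u8A9E").c_str());
      // Prints: %E6%97%A5%E6%9C%AC%E8%AA%9E
  }

For what it's worth, the location-bar value quoted in the 2000042109 test above
(%C3%A6%C2%97%C2%A5...) looks like exactly this sequence run through the
converter and escaper a second time.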
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → FIXED
Not sure how to verify this problem. Could the reporter please check this in
the latest build?
I verified this in 2000-06-02-08 Win32, Mac, and Linux build.
Status: RESOLVED → VERIFIED