Closed
Bug 10373
Opened 26 years ago
Closed 25 years ago
Non-ASCII url cannot be found
Categories
(Core :: Layout, defect, P2)
VERIFIED
FIXED
M17
People
(Reporter: teruko, Assigned: waterson)
Details
(Whiteboard: [nsbeta2+])
Attachments
(4 files)
3.98 KB, patch
4.02 KB, patch
4.01 KB, text/plain
3.92 KB, patch
The URL above includes a Japanese directory name.
When you go to the page, the DOS console says
"http://babel/tests/browser/http-test/%3f%3f%3f loaded sucessfully",
but Apprunner says "Not Found".
Steps to reproduce:
1. Go to http://babel/tests/browser/http-test
2. Click on "??"
"Not Found" is shown.
Tested with the 7-21-16-NECKO and 7-22-08-M9 Win32 builds.
Reporter
Updated•26 years ago
Priority: P3 → P2
Target Milestone: M9
The "loaded successfully" message is currently tied to document load status; it
really has no clue what the HTTP status was. I am not sure what the other
problem you are mentioning is... Is it expected to show the document? (I
verified that 4.6 doesn't.)
There is a whole section on URL encoding that is missing from the picture right
now, but I want to confirm that this bug really is about encoding URLs.
Comment 2•26 years ago
As a maintainer of the Babel server cited above, I would like to
offer some additional facts of relevance. (Sorry, teruko, I forgot
to tell you about the Babel server's limitation described below.)
1. The directory name which ends the above URL is actually in Japanese
and contains three 2-byte characters. Unfortunately the server is running
on a US Windows NT4 and thus mangles the multi-byte directory and
file names. This is why you see 3 ? marks. Thus, even if you properly
encode the URL, you will never find the directory.
2. Let me offer a sample on a Unix server which can handle 8-bit
file names.
http://kaze:8000/url/
3. In the directory, you will find 3 sub-directories. The 3rd from top
(if viewed with 4.6/4.7 or 5.0 under Japanese (EUC-JP) encoding)
will show in Japanese.
4. The first 2 directories are named using the escaped URLs,
and the 2nd one actually represents the escaped URL version of
the 3rd Japanese one. If escaping is working correctly, you should
see the escaped URL matching that of the 2nd one when the cursor
is perched on the 3rd directory. (Compare the FTP URL below.)
Some issues:
A. Click on the 3rd directory and it fails. We seem to be
escaping special characters like "%" but not 8-bit characters.
(4.6/4.7 and 5.0 w/necko are pretty much the same here.)
We should fix this in 5.0.
B. 4.6/4.7 actually escapes 8-bit path names in FTP url.
For example, try the same 3 directory listing with the ftp
protocol at the URL below. You can get inside the 3rd directory
under Japanese (EUC-JP) encoding:
ftp://kaze/pub/
And you can also see the escaped URL on the status bar when
the cursor is on the 3rd directory.
5.0 w/necko does not escape 8-bit names and cannot get inside this
directory. We should fix this in 5.0.
Question:
Are you planning on supporting escaping to both native encoding
and UTF-8 if a server indicates it can accept Unicode? I believe
there is a recent IETF draft on url-i18n which discusses
the UTF-8 implementation.
Updated•26 years ago
Severity: major → blocker
Comment 8•26 years ago
Bug 10429 was going to be marked a blocker; since this is the dup, marking this
one a blocker.
Updated•26 years ago
Whiteboard: waiting for new build with fix to verify
Comment 10•26 years ago
Can you international types verify this? I have verified that you can have
spaces in your path, which was bug 10429.
Updated•26 years ago
Whiteboard: waiting for new build with fix to verify → waiting for reporter to verify
Updated•26 years ago
QA Contact: paulmac → teruko
Comment 11•26 years ago
Yes. This should go to teruko now.
Reporter
Updated•26 years ago
Status: RESOLVED → REOPENED
Reporter
Comment 12•26 years ago
I tested this in the 7-31-09 and 8-02-09 builds.
I used http://kaze/url/ to test this.
When I went to http://kaze/url/ and changed the Charset to euc-jp, the 3rd
directory name in Japanese was displayed correctly.
Then, when I clicked on the Japanese directory name link,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.
This needs to be reopened.
Comment 13•26 years ago
Clearing Fixed resolution due to reopen.
Comment 14•26 years ago
This has a lot to do with the fact that nsIURI currently works with char* and
not PRUnichar*. I would like to verify this again once we move the URL accessors
to all-PRUnichar (sometime in M10). Marking it such.
Comment 15•26 years ago
*** Bug 12473 has been marked as a duplicate of this bug. ***
Comment 16•26 years ago
Correcting target milestone.
Summary: Non-ASCII url cannot be found → [dogfood] Non-ASCII url cannot be found
Comment 17•26 years ago
PDT would like to know: does the 4.x product allow non-ASCII in URLs? Does the
test case rely on allowing 8-bit?
Comment 18•26 years ago
In 4.x, we are able to resolve FTP URL links containing 8-bit characters, but
not HTTP URL links.
Please note that this is not the same as typing 8-bit characters into the
Location window. We didn't support that in 4.x.
Comment 19•26 years ago
Kat's comment is true for Japanese, but not for Latin-1. In 4.x you can type
http://babel/5x_tests/misc/montse/αϊινσϊ.jpg and it will show the file, and the
URL will be correct. We need this working for Latin-1.
Comment 20•26 years ago
Sorry about that. We didn't escape multi-byte characters in 4.x, but we did
escape single-byte 8-bit characters in HTTP.
Comment 21•26 years ago
It turns out that with the current Mozilla build (10/20/99 Win32), a Latin-1 URL
also resolves to an existing page.
It seems that in both 4.x and Mozilla, we do nothing to a URL which contains
single-byte 8-bit characters, i.e. we just pass them through, and it works. I
think we should escape these single-byte 8-bit characters, however.
It is multi-byte characters which are not supported at this point in HTTP or FTP
URLs in Mozilla (or in 4.x).
Summary: [dogfood] Non-ASCII url cannot be found → Non-ASCII url cannot be found
Comment 22•26 years ago
And now Kat is right; it's the typing (another bug) which is not working, but if
the file is selected it will show up. Removing dogfood.
Comment 23•26 years ago
Moving Assignee from gagan to warren since he is away.
Updated•26 years ago
Assignee: warren → momoi
Comment 24•26 years ago
What's the deal with this bug now? The link in the URL field above is stale (in
4.x too): http://babel/tests/browser/http-test/%3f%3f%3f
This one works fine: http://babel/5x_tests/misc/montse/αϊινσϊ.jpg
although I don't think that should be a valid URL. Those characters should have
to be escaped to work, shouldn't they (or is that not how things are in
practice)?
Reassigning back to i18n.
Updated•26 years ago
Comment 25•26 years ago
What needs to happen on this bug is:
1. Use the test case above. (Changed from the one on babel, which
cannot process multi-byte URLs as it is a US NT4 machine.)
2. There are 2 URL links on this test page. The last one
is in Japanese. When you view this page under the Japanese
(EUC-JP) encoding, you should see the status bar at the bottom
display the escaped URL. Currently it shows 3 dots indicating
that Mozilla is not able to escape multi-byte directory names.
When the 3rd (JPN) link is properly escaped, the escaped part should
look like the 2nd URL which shows the escaped EUC-JP name used in the
3rd link. The 1st escaped example shows the escaped sequence of the
same 3 characters in Shift_JIS.
3. Contrast this with the ftp protocol under 4.7:
ftp://kaze/pub
This page contains the same 3 directory names. Use 4.7 and move
the cursor over the 3rd link. You see that 4.7 escapes this to
the identical string as you see for the 2nd URL.
Assigning this to ftang for an assessment as to what needs to be
done. When we isolate the problem, please send it back to warren.
Updated•26 years ago
Assignee: momoi → ftang
Comment 26•26 years ago
I've determined that the URL parsing is pretty screwed up in this regard. The
way I think it should work is:
- SetSpec and SetRelativePath should take escaped strings (like in the
examples), and in the parsing process convert the escapes to their unescaped
representation (held inside the nsStdURL object).
- We should probably assert if we see a non-ASCII character passed to
SetSpec/SetRelativePath. The caller should have converted to UTF-8 before
calling us (I think this is being done by the webshell).
- GetSpec should reconstruct and re-escape the URL string before returning it.
Note that this is tricky, because if you have a slash in a filename (could
happen) it should be careful to turn this into %2F rather than make it look
like a directory delimiter.
- The nsIURI accessors should return the unescaped representation of things.
That way if I say SetSpec("file:/c|/Program%20Files") and then call GetPath, I
should get back "c|/Program Files".
The FTP protocol must be doing escaping by hand. This should be handled by
nsStdURL.
Cc'ing valeski & andreas.
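A minimal sketch of the round-trip described above (hypothetical helper names,
not the actual nsStdURL code): the setter unescapes %XX sequences into the
stored form, and the getter re-escapes each path segment, turning a literal '/'
inside a filename into %2F rather than a directory delimiter.

  #include <cctype>
  #include <cstdio>
  #include <cstdlib>
  #include <string>

  // What a SetSpec-style setter would do: decode %XX into raw bytes.
  static std::string Unescape(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
      if (in[i] == '%' && i + 2 < in.size() &&
          isxdigit((unsigned char)in[i + 1]) &&
          isxdigit((unsigned char)in[i + 2])) {
        out += (char)strtol(in.substr(i + 1, 2).c_str(), NULL, 16);
        i += 2;
      } else {
        out += in[i];
      }
    }
    return out;
  }

  // What a GetSpec-style getter would do for one path *segment*: anything
  // unsafe, including a literal '/', comes back as %XX.
  static std::string EscapeSegment(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
      unsigned char c = (unsigned char)in[i];
      if (isalnum(c) || c == '-' || c == '_' || c == '.') {
        out += (char)c;
      } else {
        out += '%';
        out += hex[c >> 4];
        out += hex[c & 0x0F];
      }
    }
    return out;
  }

  int main() {
    std::string stored = Unescape("Program%20Files");  // "Program Files"
    printf("%s\n", EscapeSegment(stored).c_str());     // "Program%20Files"
    printf("%s\n", EscapeSegment("a/b.txt").c_str());  // "a%2Fb.txt"
    return 0;
  }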
Comment 27•26 years ago
My two cents on this:
Yes, URL parsing is pretty much screwed up regarding escaping and multibyte
chars.
- There was a task to convert the URL accessors to PRUnichar. See bug 13453. It
is now marked invalid.
- nsStdURL does no escaping currently.
Why not store the URL as we get it (escaped or not) from whoever is calling the
nsStdURL functions? Who cares? Just note it on the URL, because one should never
escape an already escaped string or unescape an already unescaped string. And it
is a problem to definitely find out whether a URL is already escaped or not. I
think we need to have a member variable for that.
The constructors or SetSpec or SetPath (or the others) should have an additional
parameter which tells if the given string is already escaped or not which is
then stored in the member-variable.
The get-accessors (like GetPath or GetSpec) should have an additional parameter
which says whether we want the spec/path escaped or unescaped, and let the
accessors do that on the fly, looking at the escape member variable and doing
the appropriate thing (copy and convert, or just copy).
The webshell would want to see the unescaped version to present the user his
native view of the URLs; internally (like sending the request to the server) we
would use the escaped version of the URL. I don't think it's true that we always
want to see the unescaped version.
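A rough sketch of this proposal (hypothetical names; the escape helpers are
grossly simplified to spaces only, just so the example runs): the member
variable records which form is stored, and the accessor copies or converts on
the fly.

  #include <cstdio>
  #include <string>

  // Simplified stand-ins: a real version handles all %XX and reserved chars.
  static std::string Escape(std::string s) {
    for (size_t i = 0; i < s.size(); ++i)
      if (s[i] == ' ') { s.replace(i, 1, "%20"); i += 2; }
    return s;
  }
  static std::string Unescape(std::string s) {
    size_t i;
    while ((i = s.find("%20")) != std::string::npos) s.replace(i, 3, " ");
    return s;
  }

  class StdURL {  // hypothetical stand-in for nsStdURL
   public:
    // The caller says whether the string is already escaped, so we never
    // escape an escaped string or unescape an unescaped one.
    void SetSpec(const std::string& spec, bool isEscaped) {
      mSpec = spec;
      mIsEscaped = isEscaped;
    }
    // The caller says which form it wants; convert only when needed.
    std::string GetSpec(bool wantEscaped) const {
      if (wantEscaped == mIsEscaped) return mSpec;           // just copy
      return wantEscaped ? Escape(mSpec) : Unescape(mSpec);  // copy and convert
    }
   private:
    std::string mSpec;
    bool mIsEscaped;
  };

  int main() {
    StdURL url;
    url.SetSpec("file:/c|/Program%20Files", true);
    printf("%s\n", url.GetSpec(false).c_str());  // file:/c|/Program Files
    return 0;
  }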
Updated•26 years ago
Assignee: ftang → bobj
Comment 28•26 years ago
1. Let's not mix URLs for different protocols in the same bug.
2. "ftp protocol data to ftp URL" conversion and "ftp URL to ftp protocol data"
generation are done on the client side, not the server side. So the client code
should do the right thing with it: always URL-escape the ftp data before
concatenating it into an ftp URL, and unescape it when going from the URL back
to ftp protocol data.
3. HTTP URL generation is done on the server side. Therefore the client code has
no control over bad URL generation (meaning a URL that contains bytes > 0x80).
If an http server does send the client a URL which contains bytes > 0x80, the
client code should URL-escape it. But when the client accesses that URL, it
should not unescape it.
Reassign to bobj to find the right owner.
Comment 29•26 years ago
Bulk move of all Necko (to be deleted component) bugs to new Networking
component.
Comment 30•26 years ago
Reassigned to erik for M14
Updated•26 years ago
Status: NEW → ASSIGNED
Updated•26 years ago
Status: ASSIGNED → RESOLVED
Closed: 26 years ago → 26 years ago
Resolution: --- → FIXED
Comment 31•26 years ago
Fixed by Andreas' changes that I just checked in.
Updated•26 years ago
Status: RESOLVED → REOPENED
Comment 32•26 years ago
reopened because of backout
Reporter
Updated•26 years ago
Resolution: FIXED → ---
Updated•26 years ago
Status: REOPENED → ASSIGNED
Comment 34•25 years ago
Putting on PDT+ radar for beta1.
Whiteboard: waiting for reporter to verify → [PDT+]waiting for reporter to verify
Comment 35•25 years ago
The changes Warren spoke of are in again; can someone with access to this
server take a look to see if this is fixed?
Reporter
Comment 36•25 years ago
I tested this in the 2000020808 Win32 build.
The result was the same as before.
When I went to http://kaze/url/ and changed the Charset to euc-jp, the 3rd
directory name in Japanese was displayed correctly.
Then, when I clicked on the Japanese directory name link,
the location bar displayed http://kaze/url/.../ and "Not Found" showed up.
Comment 37•25 years ago
The Japanese directory under the http test does not work in Nav4 and MSIE, so
it's OK if it doesn't work in Mozilla.
The Japanese directory under the ftp test works in Nav4, MSIE and Mozilla, but
it is displayed wrong in Mozilla and MSIE, while it is displayed OK in Nav4.
So, the only thing in this bug report that needs to be fixed is the display
of FTP directory and file names.
Comment 38•25 years ago
I agree that if FTP folder & file names work OK, then we would at
least have parity with our own earlier version, and that would be
acceptable. ftang's comment on HTTP servers is well-taken.
Comment 39•25 years ago
One more thing. There are 2 files under the Japanese-named directory
(3rd from the top) on the ftp server. The content of these 2 files should
display in Mozilla. Currently, I don't see the 2 files at all.
It looks like the 2nd and 3rd directories are mixed up right now,
and thus we sometimes (not always) see the name "testb.html", which
does not exist under the 3rd directory but under the 2nd one.
Comment 40•25 years ago
FTP and non-ASCII file names are relatively minor aspects of the Net today.
I believe we should remove the beta1 keyword and PDT+ approval.
Comment 41•25 years ago
Could someone please describe a little better how the URLs look now and
how they should look?
Comment 42•25 years ago
Removed beta1 and PDT+. Please re-enter beta1 if you would like the PDT team
to re-consider.
Keywords: beta1
Whiteboard: [PDT+]waiting for reporter to verify
Comment 43•25 years ago
Jud, please re-assign this to whoever owns the FTP and/or URL code. I don't
think this is a beta1-stopper.
Assignee: erik → valeski
Status: ASSIGNED → NEW
Target Milestone: M14 → M15
Updated•25 years ago
Whiteboard: [HELP WANTED]
Comment 45•25 years ago
I'm not following this. can someone in i18n take this on?
Comment 47•25 years ago
On US Win95 on 4.72, if I type http://kaze:8000/url into the location bar
and hit return:
(1a) with the View|Character Set to either Japanese(EUC-JP) or
Japanese(Auto-Detect), then
- the 3rd directory displays the kanji characters for "nihongo" correctly
- clicking on that link does NOT work and I get the Not Found error page
- the URL in the location bar displays: http://kaze:8000/url/“ú–{Œê/
(2a) with the View|Character Set to Western(ISO-8859-1), then
- the 3rd directory displays latin1 garbage: ÆüËܸì/
- clicking on that link DOES work
- the URL in the location bar displays: http://kaze:8000/url/ÆüËܸì/
On US Win95 running the 2000042109 build and typing http://kaze:8000/url into
the location bar and hitting return
(1b) with the View|Character Set to either Japanese(EUC-JP) or
Japanese(Auto-Detect), then
- the 3rd directory displays the kanji characters for "nihongo" correctly
- clicking on that link does not work and I get the Not Found error page
- the URL in the location bar displays:
http://kaze:8000/url/%C3%A6%C2%97%C2%A5%C3%A6%C2%9C%C2%AC%C3%A8%C2%AA%C2%9E/
(2b) with the View|Character Set to Western(ISO-8859-1), then
- the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
- clicking on that link DOES work
- the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/
Additionally, if I paste the resulting URL from (2a) into the location bar:
http://kaze:8000/url/ÆüËܸì/
Seamonkey also gets the Not Found page, with this result in the location bar:
http://kaze:8000/url/%C3%86%C3%BC%C3%8B%C3%9C%C2%B8%C3%AC/
Teruko, does 4.72 behave differently on Japanese Windows?
Comment 48•25 years ago
Whoops, cut & paste error in the previous comment!
Case (2b) actually fails to find the page; it should read:
(2b) with the View|Character Set to Western(ISO-8859-1), then
- the 3rd directory displays latin1 garbage: ÆüËܸì/ (same as 4.72)
- clicking on that link does NOT work and I get the Not Found error page
- the URL in the location bar displays:
http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/
Comment 49•25 years ago
The ftp problem should be split off into a separate bug.
As noted, the ftp links work, but are not displayed correctly (instead of
Japanese kanji characters you see Latin1 garbage).
Updated•25 years ago
Assignee: valeski → ftang
Target Milestone: M16 → M17
Comment 50•25 years ago
Reassigning this back to ftang. I think we need to change the code in Layout,
but not in necko.
Comment 51•25 years ago
Similar problem to bug 30460. Patch available at http://warp/u/ftang/tmp/illurl.txt
Status: NEW → ASSIGNED
Comment 52•25 years ago
One thing I forgot to say is that the patch depends on bug 37395.
Per the ftang/waqar/troy meeting: we should move the URL-fixing code into the
content sink so we don't need to convert/escape every time. Also, we agreed to
reassign this bug to layout.
Assignee: ftang → troy
Status: ASSIGNED → NEW
Component: Networking → Layout
Comment 53•25 years ago
*** Bug 38133 has been marked as a duplicate of this bug. ***
Comment 54•25 years ago
with Troy's departure, this is at risk for M17. PDT team, is this required for
beta2?
Assignee: troy → buster
Comment 55•25 years ago
Putting on [nsbeta2+] radar for beta2 fix. Sending over to waterson.
Assignee
Comment 56•25 years ago
It seems like the right thing to do in this case is to convert/escape/mangle the
URL in the anchor tag itself. This would also make sure that the
correct thing happens if someone changes the "href" attribute using Level 1
DOM APIs.
attinasi and I talked about keeping the resolved version of the URL in the
anchor tag to deal with some style & performance issues: maybe this could just
be an extension (or precursor) to that work?
Presumably we'd need to do this for other elements that had "href" properties,
as well.
Comments?
Status: NEW → ASSIGNED
Comment 57•25 years ago
Some notes from Troy:
1) Frank's patch is expensive in terms of performance, especially because it
includes dynamic allocation. We should be able to do much better.
2) We should be able to convert the URL to an immutable 8-bit ASCII string one
time, probably at the time we process the attribute (or maybe lazily the first
time we actually use the attribute.) We would cache this converted immutable
string and hand that to necko.
3) Or, necko could just do the conversion on the fly, but that brings us right
back to the performance problems.
Assignee
Comment 58•25 years ago
So I claim (2) is the right thing to do, and that it should be done by the
anchor tag. We need to be able to handle Level 1 DOM updates, too, and the tag
itself (not the content sink) is the only one that can do that.
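A sketch of what option (2) might look like on the anchor element (hypothetical
names; ConvertAndEscape below is a naive placeholder for the real
charset-conversion-plus-escaping step discussed later in this bug): convert
lazily on first use, cache the immutable 8-bit string, and invalidate the cache
on a DOM mutation.

  #include <string>

  // Placeholder: real code would charset-encode and %-escape (see below).
  static std::string ConvertAndEscape(const std::wstring& href) {
    std::string out;
    for (size_t i = 0; i < href.size(); ++i)
      out += (char)(unsigned char)href[i];  // naive narrowing, for illustration
    return out;
  }

  class AnchorElement {  // hypothetical stand-in for nsHTMLAnchorElement
   public:
    // A Level 1 DOM update to "href" invalidates the cached conversion.
    void SetHref(const std::wstring& href) {
      mHref = href;
      mAsciiHref.clear();
    }
    // Convert once, the first time the attribute is actually used; hand
    // necko the same immutable 8-bit string every time after that.
    const std::string& GetAsciiHref() {
      if (mAsciiHref.empty() && !mHref.empty())
        mAsciiHref = ConvertAndEscape(mHref);
      return mAsciiHref;
    }
   private:
    std::wstring mHref;      // what the DOM sees (PRUnichar* in the real code)
    std::string mAsciiHref;  // cached, immutable 8-bit form for necko
  };

  int main() {
    AnchorElement a;
    a.SetHref(L"page.html");
    return a.GetAsciiHref().empty() ? 1 : 0;
  }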
Comment 59•25 years ago
Can we canonicalize it when we ASCII-ize it too? That would help in dealing with
bug 29611 (which reports that we spend too much time determining a link's
state due to conversions). I'm linking this bug to 29611 since it may help it.
Assignee
Comment 60•25 years ago
Yeah, we should absolutely canonicalize it then. (See comments above...)
Assignee
Comment 61•25 years ago
Assignee
Comment 62•25 years ago
Comment 63•25 years ago
Since ftang's 11/30/99 comment on the HTTP URL issue is on
target, I would only add an additional server example
to contrast different servers on this problem.
The above URL:
http://kaze:8000/url
points to a Netscape Enterprise server 2.01. No Netscape
Enterprise server (up to the current 4.1) supports 8-bit
path or file names, according to the server admin document.
Thus it sends the 8-bit URL portion without escaping.
How to deal with this has been documented by ftang already.
An Apache server, on the other hand, supports 8-bit names
and escapes them. See, for example, a nearly identical
example to the above on an Apache server:
http://mugen/url
There the Japanese name is escaped properly by the server --
do a view-source on the page -- and the directory can be easily
accessed.
On another issue: should we support UTF-8 URLs?
IE has a default option set to this, which the user
can then turn off. After discussing this issue with
Erik, I'm inclined to believe that UTF-8 URLs have problems
and need not be supported at this juncture.
Comment 64•25 years ago
I should add that on the Apache server, Comm 4.x, IE5,
and Netscape 6 all work OK with the Japanese path name.
Assignee
Comment 65•25 years ago
Ok, I talked to ftang yesterday, and here's what he thinks the right thing to
do is as I understand it.
Problem: an anchor tag's "href" attribute can contain non-ASCII characters, but
URLs can't. So how do we properly escape the non-ASCII characters in the URL so
that an ASCII string results?
Solution: use the document's charset to return the href attribute to its
original 8-bit encoding. Then URL-escape (e.g., " " --> "%20") the
non-ASCII printables. Then call nsIURI::Resolve with the document's base URL.
I've implemented it, and sure enough, it seems to make this test case work. It
leaves a bad taste in warren's mouth, but I burned my tongue a while back and
can't taste a thing.
Do we need to do this for *every* relative URI reference? Zoiks! Anyway, I've
implemented a layout-specific version of NS_MakeAbsoluteURI() (maybe I should
call it NS_MakeAbsoluteURIWithHRef or something) that takes a charset ID and
does the heavy lifting. It's in a new file which I'll attach to the bug. I'll
also attach new diffs to nsHTMLAnchorElement.cpp.
Comments on this approach?
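For concreteness, a self-contained sketch of that pipeline (the helpers are
hypothetical stand-ins for nsIUnicodeEncoder, the escaper, and nsIURI::Resolve;
the encoder here only narrows to Latin-1 for illustration):

  #include <cstdio>
  #include <string>

  // Stand-in for nsIUnicodeEncoder: Latin-1 narrowing only, for illustration.
  static std::string EncodeToDocCharset(const std::wstring& in) {
    std::string out;
    for (size_t i = 0; i < in.size(); ++i)
      out += (char)(unsigned char)in[i];
    return out;
  }

  // %-escape everything outside printable ASCII (" " --> %20, 0xE9 --> %E9).
  static std::string EscapeNonAscii(const std::string& in) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
      unsigned char c = (unsigned char)in[i];
      if (c <= 0x20 || c >= 0x7F) {
        out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
      } else {
        out += (char)c;
      }
    }
    return out;
  }

  // Naive stand-in for nsIURI::Resolve: handles only a relative filename.
  static std::string Resolve(const std::string& base, const std::string& rel) {
    return base.substr(0, base.rfind('/') + 1) + rel;
  }

  int main() {
    std::wstring href = L"caf\u00E9.html";  // non-ASCII href from an anchor tag
    std::string abs = Resolve("http://kaze:8000/url/index.html",
                              EscapeNonAscii(EncodeToDocCharset(href)));
    printf("%s\n", abs.c_str());  // http://kaze:8000/url/caf%E9.html
    return 0;
  }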
Assignee
Comment 66•25 years ago
Assignee
Comment 67•25 years ago
Comment 68•25 years ago
chris: this looks great. r=buster.
what about other URLs, like <img src=...>? Do those need to be treated
separately?
Comment 69•25 years ago
Looks good. Where do you want to put the "silly extra stuff"? If we put it in
necko, it will introduce a dependency on i18n (although maybe we have one
already). But maybe this is right anyway.
Any suggestions on how to specify the necko APIs (in nsNetUtil.h) so that it's
clear what you're supposed to be passing in? Maybe we just need a comment
saying that the nsString is supposed to already have all this stuff done to it
(charset encoded, escaped). Or maybe we should eliminate it in favor of the new
thing you wrote.
Assignee
Comment 70•25 years ago
I am going to put the charset-specific version of NS_MakeAbsoluteURI() in
layout; and I am going to change it to NS_MakeAbsoluteURIWithCharset() to avoid
gratuitous name overloading.
Comment 71•25 years ago
I'd like to make it a method on nsIIOService provided we already depend on
i18n. (I think we do for string bundles. - Gagan, Jud?)
Comment 72•25 years ago
is bug 40661 related?:
""files" datasource fails to open non-latin named directory"
Assignee
Comment 73•25 years ago
No, that bug is a dup of bug 28787.
Comment 74•25 years ago
I had a quick look at the new MakeAbsoluteURIWithCharset method, and it uses
Unicode conversion routines. However, Necko does not appear to use the Unicode
converters. Necko may use string bundles, but they are in a separate DLL
(strres.dll), while the Unicode converters are in uc*.dll. I don't know how
important it is to keep Necko free of uc*.dll dependencies, but this is what I
found after a quick look.
Comment 75•25 years ago
Erik: Thanks for the info. Is there any plan to bundle all (many) of the intl
DLLs into one, as we did for necko (to improve startup time and reduce
clutter)?
Comment 76•25 years ago
We have discussed that, but I don't know of any concrete plans in that area.
Frank?
Comment 77•25 years ago
I took a look at MakeAbsoluteURIWithCharset and it includes the following code
at the end:
static const PRInt32 kEscapeEverything = nsIIOService::url_Forced - 1;
nsXPIDLCString escaped;
ioservice->Escape(spec.GetBuffer(), kEscapeEverything,
                  getter_Copies(escaped));
This usage of nsIIOService::url_Forced is certainly not what I wanted it to do
and I don't believe it does anything useful this way. This method is used to
escape a specific part of an URL not a whole URL. There are different rules for
every part.
Assignee
Comment 78•25 years ago
Could you suggest an alternative?
Comment 79•25 years ago
Do we really need to escape this stuff? What characters that could damage URL
parsing can be expected from the charset conversion? Maybe you could use the
old nsEscape functions in xpcom/io as a replacement.
This issue is also relevant for simple XLinks, XPath, etc. "src" attributes.
When I implemented XPath and simple XLink, I asked ftang how to do this but he
could not convince me what is the right thing to do :) So what I do there is
just grab the Unicode string, AssignWithConversion() to nsCAutoString and pass
it to the URI-creating objects. This seems to work in basic cases, like spaces,
but I do not think it is the correct way.
What bugs me the most is this: suppose a document is in UTF-16 and a URL
contains an illegal char that does not fit in ASCII. ftang said to 1) get the
doc charset and use the nsIIOService ConvertAndEscape() function, but to me this
seems like it cannot work for UTF-16. Or does the convert thing automatically
convert UTF-16 into UTF-8 (which fits into char*)? How would we then know what
to send back to the server?
I also seem to have trouble understanding how we escape multibyte UTF-8 URLs.
If a two-byte character becomes two escaped URL chars, will the system still
work? Do we send the server back the bits it gave us?
Assignee
Comment 81•25 years ago
andreas: how is using the "old" nsEscape different from what I'm doing now? Are
there really different rules for escaping different parts of a URI? That doesn't
seem right.
heikki: nsIUnicodeEncoder takes a UCS-2 string as input and returns the result
in an 8-bit string as output. Presumably it does the right thing on UTF-16 to
round-trip the UTF-16 bytes.
Comment 82•25 years ago
The main reason for escaping is to hide certain special characters from the URL
parser that would mislead it, like a @ in a username or a / in a filename or
something similar. Depending on the position inside the URL, different
characters are special for the parser, and that is what the new escape
functions can handle: I have to tell them which part of the URL I want to
escape. Simply giving them every possible mask will not work. The old nsEscape
stuff does not look at a specific part; it can be used to escape whole URLs,
but it may escape too much or not enough.
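A sketch of the per-part idea (the character sets here are illustrative, not
the actual nsIIOService masks): the caller must say which part of the URL it is
escaping, because what counts as special depends on position.

  #include <cstdio>
  #include <string>

  enum UrlPart { Username, FileName, Query };  // illustrative subset

  // What misleads the parser depends on position: '@' ends a username,
  // '/' delimits path segments, '&' and '=' structure a query.
  static std::string SpecialFor(UrlPart part) {
    switch (part) {
      case Username: return "@:/";
      case FileName: return "/?#";
      case Query:    return "&=#";
    }
    return "";
  }

  static std::string EscapePart(const std::string& in, UrlPart part) {
    static const char hex[] = "0123456789ABCDEF";
    std::string special = SpecialFor(part);
    std::string out;
    for (size_t i = 0; i < in.size(); ++i) {
      unsigned char c = (unsigned char)in[i];
      if (c == '%' || c >= 0x80 ||
          special.find((char)c) != std::string::npos) {
        out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
      } else {
        out += (char)c;
      }
    }
    return out;
  }

  int main() {
    printf("%s\n", EscapePart("user@host", Username).c_str());  // user%40host
    printf("%s\n", EscapePart("a/b.txt", FileName).c_str());    // a%2Fb.txt
    printf("%s\n", EscapePart("a/b.txt", Query).c_str());       // a/b.txt
    return 0;
  }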
Comment 83•25 years ago
In the case of a UTF-16 document, it is not clear to me that the server is
expecting a URL-encoded (%XX) UTF-16 URL. In fact, some people working on these
issues in the standards bodies seem to be pushing UTF-8 as a convention in URLs.
See ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt. However,
blindly converting *every* part of a URL to UTF-8 has bad consequences in
today's Web, as Microsoft discovered. They do not convert the "query" part (the
part after the question mark) to UTF-8. Also, they have a preference for the
part before the question mark. In most versions, they convert that part to UTF-8
but in the Korean and Taiwanese versions they found that they had to set the
default pref to "OFF" (i.e. no conversion to UTF-8), presumably because people
in those countries were using servers that expect something other than UTF-8
in those parts of the URLs.
For now, converting from PRUnichar to the doc's charset and then %-encoding is
the best thing to do, in my opinion. We should continue to follow the emerging
specs in this area, and possibly modify our implementation accordingly. UTF-16
documents are currently still quite rare, I think, but if we are really
concerned about this, my suggestion is to make a special case for the UTF-16
doc charset and convert to UTF-8 (instead of UTF-16).
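If one wanted that UTF-16 special case in code, it could be as small as this
(hypothetical helper; charset names as commonly spelled):

  #include <string>

  // Charset used for URL %-encoding: the document charset, except that
  // UTF-16 documents fall back to UTF-8, per the suggestion above.
  std::string UrlCharsetFor(const std::string& docCharset) {
    if (docCharset == "UTF-16" || docCharset == "UTF-16BE" ||
        docCharset == "UTF-16LE")
      return "UTF-8";
    return docCharset;
  }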
Assignee
Comment 84•25 years ago
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 26 years ago → 25 years ago
Resolution: --- → FIXED
Comment 85•25 years ago
Not sure how to verify this problem. Could the reporter please check this in the
latest build?
Reporter
Comment 86•25 years ago
I verified this in the 2000-06-02-08 Win32, Mac, and Linux builds.
Status: RESOLVED → VERIFIED