Closed Bug 43852 Opened 24 years ago Closed 24 years ago

"Send URLs as UTF-8" not working

Categories

(Core :: Internationalization, defect, P3)

x86
Windows 98
defect

Tracking

()

VERIFIED FIXED
mozilla0.9

People

(Reporter: bill, Assigned: nhottanscp)

Details

(Keywords: helpwanted)

Mozilla build M14 seems to work sending URLs as UTF-8, to our 
Internationalized domain name service at http://www.nunames.nu/eu-lang-test.htm. 
But only if you type the URL into the browser's address form directly (or you 
copy and paste it) - not if you click on a link. 

The way it works is, if you type the Multilingual URL into the browser window, 
using localized non-UTF-8 encoding (say your keyboard and OS encoding is for 
ISO-8859-1, for example), the Mozilla M14 browser will convert that URL into 
UTF-8 and send the request to the name server to be resolved. For 
www.åreskutan.nu it does not display the UTF-8 in the browser window (which 
should be "www.Ã¥reskutan.nu" in encoded form) but displays 
www.%c3%a5reskutan.nu instead. Even though the browser displays these % 
encodings, it actually sends the UTF-8 to the name server query, and this name 
will resolve in our system using M14. Under the "rule of least astonishment", it 
would be nice if it actually displayed the local language keyboard encoding to 
the user, in this case, the originally typed ISO-8859-1, or www.åreskutan.nu.
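
To make the byte-level picture concrete, here is a minimal Python sketch of the conversion described above. It is an illustration only, not Mozilla's code, and the helper name `percent_encode_non_ascii` is hypothetical.

```python
def percent_encode_non_ascii(raw: bytes) -> str:
    """Keep ASCII bytes as characters and escape every other byte as %xx."""
    return "".join(chr(b) if b < 0x80 else f"%{b:02x}" for b in raw)


# "\u00e5" is the letter a-with-ring used in the test domain name.
typed_host = "www.\u00e5reskutan.nu"

# The single ISO-8859-1 byte 0xE5 becomes the two UTF-8 bytes 0xC3 0xA5.
utf8_bytes = typed_host.encode("utf-8")

# Viewed through an ISO-8859-1 lens, those two bytes display as "Ã¥".
latin1_view = utf8_bytes.decode("latin-1")

# Escaped, the same bytes read www.%c3%a5reskutan.nu.
escaped = percent_encode_non_ascii(utf8_bytes)
```

The escaped form and the "Ã¥" form are two views of the same two bytes; only the display differs.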

But using M14, if you have a link on your page (as we do at 
http://www.nunames.nu/eu-lang-test.htm) and you click on the link instead, it 
correctly displays the utf-8 encoding at the bottom left side of the browser 
(where it says "contacting  http://www.Ã¥reskutan.nu"), so it seems to be able 
to make the correct conversion to UTF-8 from a link which uses ISO-8859-1. But 
it does not actually send that UTF-8 to the resolver and the query does not work 
as a result. In this case, just as in the previous one, the UTF-8 encoding does 
*not* display in the browser window. But this time, it displays a different 
series of % encodings:

http://www.%c3%83%c2%a5reskutan.nu

And this is the UTF-8 it actually sends: "www.ÃƒÂ¥reskutan.nu", which does not 
resolve (since the actual name we are serving is encoded as www.Ã¥reskutan.nu.)
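
The %c3%83%c2%a5 sequence shown above is exactly what results when the UTF-8 bytes for å are misread as ISO-8859-1 and then converted to UTF-8 a second time. The Python sketch below reproduces that double conversion. It only shows that the bytes are consistent with this reading; it is not Mozilla's code, and the helper names are hypothetical.

```python
def percent_encode_non_ascii(raw: bytes) -> str:
    """Keep ASCII bytes as characters and escape every other byte as %xx."""
    return "".join(chr(b) if b < 0x80 else f"%{b:02x}" for b in raw)


def double_encode(text: str, misread_as: str = "latin-1") -> bytes:
    """Encode to UTF-8, misread the bytes in another charset, encode again."""
    once = text.encode("utf-8")          # correct UTF-8: C3 A5 for a-with-ring
    misread = once.decode(misread_as)    # bytes wrongly treated as Latin-1 text
    return misread.encode("utf-8")       # second conversion: C3 83 C2 A5


link_host = "www.\u00e5reskutan.nu"
correct = percent_encode_non_ascii(link_host.encode("utf-8"))
doubled = percent_encode_non_ascii(double_encode(link_host))
# correct is www.%c3%a5reskutan.nu
# doubled is www.%c3%83%c2%a5reskutan.nu
```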

When using Mozilla build M15 (the same as NN 6 Beta, I believe), it also 
correctly displays the utf-8 encoding at the bottom left side of the browser 
(where it says "contacting  http://www.Ã¥reskutan.nu"), so it also seems to be 
able to make the correct conversion from a link which uses ISO-8859-1. But it 
does not actually send *any* UTF-8 to the resolver, but sends the following % 
type encoding to the name server: "www.%c3ƒ%c2%a5reskutan.nu" as ASCII, which 
has nothing to do with the correct UTF-8 conversion it initially made, as far as 
I can see, is not actually sent as any kind of UTF-8, and is rejected by our 
system.

Two variations of a broken browser, I'd say.

Bill Semich
.NU Domain
Teruko, please try to reproduce this and confirm if reproducible.
Please also check 4.x behavior.
bill@mail.nic.nu, could you list the problems? Also, please try newer builds 
(M16 or later).
Using today's win32 M17 build, typing http://www.åreskutan.nu in the URL bar and 
hitting return sends the UTF-8 query "http://www.%D0%93%D2%90reskutan.nu/".
I am not sure what is broken.
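
One observation about the escapes in that query: %D0%93%D2%90 is what you get if the UTF-8 bytes for å (C3 A5) are read as Windows-1251 (Cyrillic) text and then converted to UTF-8 again. The Python sketch below checks the arithmetic. Nothing in this report confirms that this is what the build actually did; the Windows-1251 step is an assumption chosen because it matches the bytes.

```python
typed = "\u00e5"                       # a-with-ring, as typed in the URL bar

utf8_once = typed.encode("utf-8")      # b"\xc3\xa5"

# Assumption: the two bytes are misread as Windows-1251, where 0xC3 and 0xA5
# map to the Cyrillic letters U+0413 and U+0490.
misread = utf8_once.decode("cp1251")

utf8_twice = misread.encode("utf-8")   # b"\xd0\x93\xd2\x90"

escaped = "".join(f"%{b:02X}" for b in utf8_twice)
```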
I sent the following message to people listed in this bug report
in response to Bill. I will repeat it here for the record. 
I think we are doing the right thing mostly. But there is one
spec-related issue. We can return UTF-8 URLs in case the URL links on a
web page are not going to the same server as the page itself.
In that case, we don't have to be bound by the page/server charset.

For this small improvement, I will confirm the bug.

====
It seems to me that the current Mozilla is behaving more or less correctly with 
regard to returning/sending the URL. To summarize the current
behavior, 

1. In the location bar, there is no way we can assume the charset which the 
target server requires, so we default to UTF-8 in case the URL
entered contains 8-bit data. 

2. For web pages, if the pages are marked with the meta-charset (or if the 
server sends the charset info with the page), then we return the URL
in that charset. 

I think we currently follow the two basic principles described above. 
Your web pages are marked as follows: 

A. http://www.nunames.nu/eu-lang-test.htm  (Windows-1252) 

   Therefore we return Latin 1 encoding in sending back the URL. 

B. http://www.nunames.nu/NUregistryJP.htm (UTF-8) 

  Therefore we send UTF-8 URL back to the server -- I confirmed 
  that we indeed do this on this page. 

C. http://www.nunames.nu/lldemo (Has no charset info) 

   Therefore we will send back in whatever charset the user has 
   selected in the Character Coding menu, or the default browser 
   view charset. 
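
The two principles and the three example pages above can be summarized as a small decision function. This is a rough Python sketch of the behavior as described in this comment, not the actual Mozilla implementation; the function and parameter names are hypothetical.

```python
from typing import Optional


def charset_for_url(from_location_bar: bool,
                    page_charset: Optional[str],
                    user_default: str) -> str:
    """Pick the charset used to encode 8-bit characters in a URL."""
    if from_location_bar:
        return "utf-8"        # principle 1: no page context, default to UTF-8
    if page_charset:
        return page_charset   # principle 2: pages A and B declare a charset
    return user_default       # page C: fall back to the menu/default charset


host_char = "\u00e5"  # a-with-ring from the test domain

# Page A (Windows-1252): the link is sent as the single byte E5.
page_a = host_char.encode(charset_for_url(False, "windows-1252", "iso-8859-1"))

# Page B (UTF-8): the link is sent as the two bytes C3 A5.
page_b = host_char.encode(charset_for_url(False, "utf-8", "iso-8859-1"))

# Location bar: always UTF-8, whatever the default charset is.
typed = host_char.encode(charset_for_url(True, None, "iso-8859-1"))
```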

I think these are more or less correct but there is probably one 
improvement we can make. 

If the links on the page are going to different servers than the one 
which is hosting the page, then we probably do not have to follow the 
charset of the page in sending the URL from a link. I can think of 
returning such URLs in UTF-8. 

Perhaps we can make this bug into making such an improvement. 
What do you think -- people on this list?
Other than this, I don't see much else we can do. 
====
Status: UNCONFIRMED → NEW
Ever confirmed: true
IE5 has a preference (on by default?) Tools|Internet Options...|Advanced
  [x] Always send URLs as UTF-8

How does IE5 behave with and without this enabled?

Related bugs:
  bug 42898 iDNS support 
  bug 42899 IURI support 
FYI, there are unresolved issues with unicode canonicalization/normalization
and "case" folding with regards to iDNS.
Assignee: nhotta → ftang
Reassigning to ftang.
Bob wrote:
> 
> IE5 has a preference (on by default?) Tools|Internet Options...|Advanced
>   [x] Always send URLs as UTF-8

I received the following email from a Microsoft employee a while ago:

Subject: Re: The .nu domain's experiment with 8859-1encoded domain names.
Date: Mon, 10 Jan 2000 13:16:55 -0800
From: "Chris Wendt" <christw@microsoft.com>
To: "Erik van der Poel" <erik@netscape.com>, "Karlsson Kent - keka" <keka@im.se>
CC: <hostmaster@mail.nic.nu>, <duerst@w3.org>, <markdavis@ispchannel.com>,
    <mark.davis@us.ibm.com>, <goldsmith@apple.com>, <chrispr@microsoft.com>,
    <ftang@netscape.com>, <presnick@qualcomm.com>, <henrik.sviden@idg.se>

> > IE 5 can, apparently, always use Unicode/UTF-8 in (all of)
> > the URL, if set properly, already.

(all of) is not correct. Only in the part which comes before the first
question mark '?'.

> What does "if set properly" mean, exactly? How does IE5 deal with HTML
> forms in non-UTF-8 encodings when submitting them?

"If set properly" means that the advanced option "Always send URLs in UTF-8"
is ON. It is ON by default except for the Korean and Traditional Chinese
localized versions (major globalization faux pas, I agree :-(()

The query part (behind the first '?') is encoded in the encoding of the
document bearing the <form> or in the client machine's default code page if
the query is not submitted from a FORM. Client code can override the default
setting for non-FORM queries as you can see in the IE5 autosearch feature
where the autosearch query is ALWAYS UTF-8.

If any part of the URL is pre-escaped when IE gets it, i.e. by the HTML
author, there will be no change applied.
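
As a rough illustration of the rules described in this email (path in UTF-8, query in the document's encoding, existing escapes left alone), here is a Python sketch built on the standard library's `urllib.parse.quote`. It models only the description above with the option ON; it is not IE5's code, and the function name `encode_url_parts` is hypothetical.

```python
from urllib.parse import quote


def encode_url_parts(path: str, query: str, doc_charset: str) -> str:
    """Escape the path as UTF-8 and the query in the document's charset."""
    # "%" is treated as safe so escapes written by the HTML author are not
    # escaped a second time.
    escaped_path = quote(path, safe="/%", encoding="utf-8")
    escaped_query = quote(query, safe="=&%", encoding=doc_charset)
    return escaped_path + ("?" + escaped_query if query else "")


# A link on an ISO-8859-1 page: same character, two different encodings.
mixed = encode_url_parts("/caf\u00e9", "q=caf\u00e9", "iso-8859-1")
# mixed is /caf%C3%A9?q=caf%E9

# A pre-escaped path passes through unchanged.
untouched = encode_url_parts("/caf%C3%A9", "", "iso-8859-1")
```

Note that the same character ends up as different bytes on either side of the '?', which is why a server cannot assume one charset for the whole URL.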

I think we should look at the domain names without consideration of queries.

> (1) The Location field (URL bar) where users type the URL via keyboard.
> (2) Links in HTML pages <A HREF="...">
> For (1), we can convert the string typed by the user to UTF-8 before
> sending the domain name to the server.
>
> But for (2), what do you suggest? Should we convert it to UTF-8?

Definitely the same for both cases.
Kat, why should we treat URLs that go back to the original server differently
from URLs that go to other servers? Does some spec say this?
I don't think there is an RFC which defines that.
However, when we parse a server path (URL) which
is not escaped by the server itself, we do something like
what we are doing, i.e. assume the encoding of the
document and then escape it -- for the part below the host name
level. I think we discussed this issue in:

http://bugzilla.mozilla.org/show_bug.cgi?id=10373

So I am not surprised by what we are doing for the
domain name part of it. 

My concern for distinguishing the original server vs. some
other server is motivated by the same consideration, but
I am not sure if that is the best thing to do. That is,
should we distinguish how to deal with the domain name part 
from the rest of the server paths? 
In the absence of the real standard we can agree on, I think
we can only agree on the best practice.
The approach that Mozilla has taken when the existing browsers do not adhere to
the specs is to implement both, and switch between them based on the "Quirks
Mode" and "Standard Mode". So I guess one possibility here is to follow the
draft in Standard Mode, and follow some mixture of Nav4/MSIE in Quirks Mode.
The draft is ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt.
nhotta- I think you are the P person for URL issue in our current matrix.
Reassign back to nhotta.
We probably need to discuss what we should do with this bug.
Assignee: ftang → nhotta
Status: NEW → ASSIGNED
Keywords: helpwanted
*** Bug 49939 has been marked as a duplicate of this bug. ***
*** Bug 55303 has been marked as a duplicate of this bug. ***
Target Milestone: --- → Future
I told Mozilla 0.7 to load http://%e2%88%ae.cr.yp.to. That domain (with
three 8-bit characters in place of %e2%88%ae, of course) has an address
in DNS, namely 131.193.178.181. Try ``dig contourcname.cr.yp.to'' and
you'll see, among other things, the relevant A record.

Mozilla gave me error 804b001e, the same error that it gives for
nonexistent.cr.yp.to, and said that the host wasn't found. I had
expected it to find the host without trouble.

Positive note: The not-found dialog box had a UTF-8 display of the name.
Negative note: The ``Resolving host'' display had an ISO-8859-1 display
of the name. I would have been disappointed in that behavior even if
ISO-8859-1 had been my default character set; domain names should be
displayed the same way throughout the world.
On my Windows2000, WinAPI WSAAsyncGetHostByName (in nsDNSService.cpp) is called
with a host name in UTF-8, and it returns a success.

I also got the same error even with 131.193.178.181.
>and it returns a success.
I mean calling the API succeeded but I got the error dialog which says the name
was not found.
>I also got the same error even with 131.193.178.181.
Not the same error, I got a page which says "file does not exist" but no dialog 
appeared.

BTW, the following URLs (mentioned in the original report) are working with NS6.
http://www.%C3%B6resundsregionen.nu/
http://www.%e7%99%bb%e9%8c%b2%e6%89%80.nu/
I am not sure what is special about http://%e2%88%ae.cr.yp.to.

I've created an index.html now. If you connect to 131.193.178.181 and do
GET http://%e2%88%ae.cr.yp.to HTTP/1.1, you'll see it. But Mozilla says
the host isn't found.

Perhaps this is a UNIX-specific problem. The BIND DNS client library
chokes on unusual characters; does Mozilla still use it?
Target Milestone: Future → mozilla0.9
The issue originally filed is resolved. The remaining problem is specific to one 
site, it can be filed separately. Actually, I cannot connect to 131.193.178.181.
The original problem is fixed. 
Please file a separate bug for http://%e2%88%ae.cr.yp.to, but I see 
131.193.178.181 does not work either.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Changed QA contact to andreasb@netscape.com.  Andreas, please talk with nhotta 
about how to verify this.
QA Contact: teruko → andreasb
Original problem verified fixed in the following builds:
* 20010313 Linux
* 20010312 Win98
* 20010228 MacOS 9.1
The fix uncovered URL display problems; I am reporting new bugs for these.
Status: RESOLVED → VERIFIED