Closed Bug 102656 Opened 23 years ago Closed 23 years ago

IDN support for HREF IDN-links

Categories

(Core :: Internationalization, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla0.9.9

People

(Reporter: wil, Assigned: nhottanscp)

References

()

Details

(Keywords: intl)

Attachments

(2 files, 12 obsolete files)

3.01 KB, patch
jst
: review+
darin.moz
: superreview+
Details | Diff | Splinter Review
3.01 KB, patch
Details | Diff | Splinter Review
ToLowerCase
(http://lxr.mozilla.org/mozilla/source/netwerk/base/src/nsURLHelper.cpp#176)
is being called to downcase the hostnames for the internal representation 
of "host" in nsIURI.
This creates a problem for some internationalized domain names in their 
native encoding (like Shift-JIS, Big5).
Is there a reason why it is done this way?
This is also true in nsStdURLParser and nsNoAuthURLParser.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: nsAuthURLParser downcases hostnames → IDN support for HREF IDN-links
Another approach is to prevent the native encoding (like Shift-JIS, Big5) host
name and use UTF-8 instead. I think keeping the native encoding in nsIURI is not
a good idea, parsing is difficult, also the host eventualy needs to be unicode
in order to create ACE (ASCII Compatibe Encodeing which is needed for IDN). So,
we want to convert to unicode earlier instead of changing the parser to handle
the native encodings.
This can be done by changing NS_MakeAbsoluteURIWithCharset()in nsHTMLUtils.cpp
where URI for HREF is processed. In the function, the URI is in UCS2, then it is
converted to a doucment charset. Instead, we can keep the host name as unicode
(UTF-8). 
William, could you attach your patch for nsHTMLUtils.cpp?
In the function, there is ASCII check for URI spec. Also, some cases already use
UTF-8 URI. So, we only need to set UTF-8 host name when the URI is converted to
a doucment charset.

http://lxr.mozilla.org/seamonkey/source/content/shared/src/nsHTMLUtils.cpp#149

149 if (encoder) {
150 // Got the encoder: let's party.

I will update the patch within a day or two.
I agree with Naoki that nsIURI should be given a UTF8 hostname in the first place.
Is there any other entry-point for creating a URL other than 
NS_MAkeAbsoluteURIWithCharset?


* CC'ing Darin.
so, let me make sure i understand things correctly... if the standard url
implementation simply avoids lowercasing the hostname, then we should be OK, right?
oh.. wait.. there is one more problem.  namely that of IPv6 address literals
which can contain ':' chars.  we currently strchr for ':' and then wrap the 
hostname in []'s.  this is defined in RFC 2732.  won't this be a problem if
hostname is UTF-8?  or, perhaps i just need to use some UTF-8 aware strchr??
or, use nsCRT::IsAscii first??
Darin: I'm not sure I understand your concerns.
If we avoid lowercasing the hostname, it would work but it's not an ideal solution 
since the hostname would be in local encoding. Many interfaces are still using C 
strings which means that the hostnames are passed in its raw local charset.

IPv6 hostnames are be in ASCII, so encoding it into UTF8 should have no effect.
There is also no problem with ':' clashing since if ':' appears in UTF8 then it
is the bona fide ':'.
ok, thanks for explaining why we don't need to worry about embedded colons.

but, what i still don't understand is why removing the code which lowercase's
the hostname is not sufficient.  doesn't a UTF-8 encoded string carry all the
information needed to perform the IDN conversion?  and isn't that all we really
need at the networking layer?
If the hostname in URI was UTF8 encoded, ToLowerCase wouldn't have any effect 
on it, and we would be fine. But it is not, it is converted to the document
charset in NS_MakeAbsoluteURI. And since it is represented as a C string 
it is just as good sequence of raw bytes.
It works with i-DNS.net's software because we use guesswork to "detect" the 
encoding, but would not be elegant. But that is really just legacy support for 
non unicode-aware applications.
ok, so it sounds to me as though we can/should leave necko alone, and
concentrate on making sure that all clients of nsIURI pass in UTF-8 hostnames. 
right?
Are there many such places that instantiates nsIURI's?
I wouldn't exclude the possibility of having nsStdURL::SetSpec doing IDN 
conversion so that other places (like the current HTTP and DNS modules) do not
need to worry about it. So, nsIURI::GetHost gives the ACE hostname, nsIURI::GetSpec
gives the URL string with UTF8 hostname, and a separate interface for getting
the UTF8 hostname for display (such as the status bar).
Is that reasonable, or is that silly?
well, i'm not sure if GetSpec should return a UTF-8 hostname or an ACE encoded
hostname.  i say this because when speaking to a HTTP proxy, we need to send the
spec with an ACE encoded hostname.  perhaps nsIURI is not the right stopgap for
IDN conversion then.  i'm really not sure.
Oh yes, forgot about proxies. So let's forget about doing IDN in nsIURI, at least
for now.
ok
Attachment #54705 - Attachment is obsolete: true
Attached patch FIxed aResult leak, thanks jst. (obsolete) — Splinter Review
you can also use nsIIOService::extractUrlPart to parse out the hostname of the
UTF8 spec.  this would eliminate the need to create a second uri object.
Comment on attachment 56549 [details] [diff] [review]
fixed using nsIIOService::ExtractUrlPart as suggested by Darin. Made into a separate function.

>Index: content/shared/src/nsHTMLUtils.cpp


> 
>+/*
>+ * Extracts the hostname part of the given targetSpec,
>+ * encodes it to UTF8, and replace it in-place.
>+ */
>+nsresult
>+ConvertHostnameToUTF8(nsCAutoString& targetSpec,
>+                      const nsString& uSpec,
>+                      nsIIOService *aIOService)
>+{
>+  nsresult rv;
>+
>+  if (!aIOService) {
>+    nsCOMPtr<nsIIOService> serv;
>+    serv = do_GetIOService(&rv);
>+    if (NS_FAILED(rv)) return rv;
>+    aIOService = serv.get();
>+  }

this block of code won't work... the destructor for |serv| would have already
been called by the time you exit this block of code.  you need to give the 
nsCOMPtr declaration function scope.


>+  // Try creating a URI from the original targetSpec
>+  nsCOMPtr<nsIURI> uri;
>+  NS_NewURI(getter_AddRefs(uri), targetSpec);
>+  // This will fail for non-absolutely URIs, in that case we just ignore it.
>+  if (!uri) return NS_OK;

capturing rv is better (in the XPCOM sense) than checking for a non-null uri.


>+
>+  // convert the original spec into UTF8
>+  nsCAutoString specUTF8 = NS_ConvertUCS2toUTF8(uSpec);
>+  // Extract the UTF8 hostname
>+  nsXPIDLCString hostUTF8;
>+  rv = aIOService->ExtractUrlPart(specUTF8.get(), nsIIOService::url_Host, NULL, NULL, getter_Copies(hostUTF8));
>+  NS_ASSERTION(NS_SUCCEEDED(rv), "Unable to extract Hostname part from URL");
>+  if (NS_FAILED(rv)) return rv;
>+  // replace the original hostname with hostUTF8
>+  uri->SetHost(hostUTF8.get());
>+
>+  nsXPIDLCString tmp;
>+  uri->GetSpec(getter_Copies(tmp));
>+  // replace the targetSpec
>+  targetSpec = tmp;
>+  return NS_OK;
>+}
> 
> nsresult
> NS_MakeAbsoluteURIWithCharset(char* *aResult,
>@@ -165,6 +204,9 @@
>           encoder->Finish(p, &len);
>           p[len] = 0;
>           spec += p;


what if |aSpec| contains only ascii characters?  adding an IsAscii check
would probably be an optimization here, since NS_NewURI is expensive.

>+
>+          // iDNS support: encode the hostname in the URL to UTF8
>+          ConvertHostnameToUTF8(spec, aSpec, aIOService);
>           
>           if (p != buf)
>             delete[] p;
>@@ -174,8 +216,8 @@
>           // by default.
>           spec = NS_ConvertUCS2toUTF8(aSpec.get());
>         }
>-
>       }
>+
>       // Now we need to URL-escape the string. 
>       // XXX andreas.otte has warned that using the nsIIOService::Escape
>       // method in this way may be too conservative (e.g., it won't
Attachment #56549 - Flags: needs-work+
Thanks Darin!

> what if |aSpec| contains only ascii characters?  adding an IsAscii check
> would probably be an optimization here, since NS_NewURI is expensive.

There is an IsAscii check at the beginning of NS_MakeAbsoluteURIWithCharset()
so this function will only be called when (!IsAscii(aSpec.get()).
Attachment #55232 - Attachment is obsolete: true
Attachment #55882 - Attachment is obsolete: true
Attachment #56549 - Attachment is obsolete: true
does |spec| need to be a nsCAutoString?  why not just declare it |char *spec| to 
avoid an extra strcpy/strdup in ConvertHostnameToUTF8?
It's already an nsCAutoString in NS_MakeAbsoluteURIWithCharset, so just pass it
to ConvertHostnameToUTF8 so that it can be updated.
If ConvertHostnameToUTF8 uses |char *| wouldn't it be messy as well? I don't
get it, please elaborate.
well, when you call GetSpec, you end up malloc'ing a copy of the url spec, but
then you copy this string into |nsCAutoString &targetSpec| passed in by the
user.  does the caller need this to be a nsCAutoString, or could they use
nsXPIDLCString?  if the caller could be modified to use nsXPIDLCString then you
could avoid making the additional string copy.  make sense?
Ok, now I see where the extra strdup is. But still don't think that the caller 
can be (easily) modified to use nsXPIDLCString because it uses many methods in
nsCString (eg. AssignWithConversion, FindChar) and uses NS_ConvertUCS2toUTF8,
which is derived from nsCAutoString.

Naoki, any idea?

I'm still trying to understand the entire hierarchy of string classes, wished 
the inheritance tree was simpler.
I am not sure, changing the caller seems to involve a lot of changes. 
How about using nsXPIDLCString as an argument and remove the string copy out of
ConvertHostnameToUTF8? Later, the caller may change to use nsXPIDLCString.

BTW, is ConvertHostnameToUTF8 a local function?
yeah, i agree with nhotta: if you can't easily change the caller, then at least
make the caller responsible for the additional strdup.
Yes it's a local function, at least for now.
Comment on attachment 56810 [details] [diff] [review]
Here's a version of ConvertHostnameToUTF8 that takes an nsXPIDLCString.

>Index: content/shared/src/nsHTMLUtils.cpp

>+nsresult
>+ConvertHostnameToUTF8(nsXPIDLCString& targetSpec,

this is fine, but you could also pass |char **targetSpec|, and then use
getter_Copies at the callsite of ConvertHostnameToUTF8.

either way, sr=darin
Attachment #56810 - Flags: superreview+
Yep I think that's a good idea, I'll upload another soon.
hang on, if sr=darin, then r= who?
Johnny did say that he would sr= if I get an r=, please advise.
Attached patch targetSpec => char ** (obsolete) — Splinter Review
I think having two sr= is just fine, no r= needed for that case.
William, could you reassign the bug to yourself since you are actually working on?
I can help check in the file.
Reassigning to myself.
Assignee: neeti → william.tan
Comment on attachment 56867 [details] [diff] [review]
targetSpec => char **

sr=darin (you can use this as an r= if you prefer)
Attachment #56867 - Flags: superreview+
Attachment #56618 - Attachment is obsolete: true
Attachment #56810 - Attachment is obsolete: true
Is 0.9.6 a realistic target for this? 
If you have reviews and ready for check in, please get an approval and check in,
otherwise move to 0.9.7.
Comment on attachment 56867 [details] [diff] [review]
targetSpec => char **

r/sr=jst
Attachment #56867 - Flags: review+
Keywords: mozilla0.9.6-
move to 0.9.7
Target Milestone: --- → mozilla0.9.7
Have you run performance tests on this patch?  IIRC,
NS_MakeAbsoluteURIWithCharset is on one of the most performance-critical paths
in page loading since it is used to canonicalize URIs for lookup in global
history to determine whether links should be colored as visited or unvisited.
If that's the case, the string usage here looks unacceptably slow.  You could
improve it a bit by using the NS_ConvertUCS2toUTF8 inline rather than copying a
second time by assigning the result into an nsCAutoString.
william: also, what about checking if the hostname is ASCII?  if the hostname
were actually ASCII (padded with zero bytes) wouldn't we be able to skip this
entire function call?
What dbaron says. You ought to understand the performance impact of this,
especially on link-riddled pages (go try some LXR docs, e.g.).
The code is only executed when the URL contains non ASCII.
OK, but it would still be good not to make it too slow (e.g., by fixing the
extra copy I mentioned above).
I agree. Willam, please make the change, thanks.
my point was that while the URL may contain non-ascii characters, the hostname
may only contain ascii characters.  it might be a performance win to just scan
the hostname before going through the trouble of instantiating a new nsStdURL.

so, it might be worth it to move the call to ExtractUrlPart above the call to
NS_NewURI.  you could even capture just the hostname offset and length (ie.
don't capture the hostname as a nsXPIDLCString), then if this section of the
UTF-8 spec contains non-ascii characters, you could continue on by calling
NS_NewURI, etc.

but, moreover it seems really unfortunate to have to make an extra copy of the
spec just to support iDNS.  isn't there a better way?  seems like we really need
a way of parsing a unicode URL.  then we could locate the unicode hostname
and check if it is ascii.  if it is ascii, then there is no need to do any more
work.  unfortunately, we don't have a unicode URL parser.

i feel like i should have looked more closely at the patch in context, because
it really is adding substantial work to NS_MakeAbsoluteWithCharset just to
support an edge case.  let's try to come up with something that isn't so costly
for non-IDN links.
what's the format of |spec| when ConvertHostnameToUTF8 is called?  can we use it
to determine if the hostname is really just ascii?
This will only be triggerd if the href contains non-ASCII data tho, so the
normal case will not be any different from what's on the trunk today. With that
said, if there's a faster way to do this I'm all for doing that...
While we're at it, we should remove the IsAscii() method from nsHTMLUtils.cpp
and simply call nsCRT::IsAscii().
aren't non-ascii url paths more common than non-ascii hostnames?
Thanks to all who commented. A patch is underway.

> You could improve it a bit by using the NS_ConvertUCS2toUTF8 inline rather 
> than copying a second time by assigning the result into an nsCAutoString.

Oh I missed that. Will do.

> my point was that while the URL may contain non-ascii characters, the 
> hostname may only contain ascii characters.  it might be a performance win 
> to just scan the hostname before going through the trouble of instantiating
> a new nsStdURL.

Agreed. I'll address this one too.

> so, it might be worth it to move the call to ExtractUrlPart above the call 
> to NS_NewURI.  you could even capture just the hostname offset and 
> length (ie. don't capture the hostname as a nsXPIDLCString), then if this
> section of the UTF-8 spec contains non-ascii characters, you could continue
> on by calling NS_NewURI, etc.

Ok.

> but, moreover it seems really unfortunate to have to make an extra copy of 
> the spec just to support iDNS.  isn't there a better way?  seems like we 
> really need a way of parsing a unicode URL.  then we could locate the 
> unicode hostname and check if it is ascii.  if it is ascii, then there is 
> no need to do any more work.  unfortunately, we don't have a unicode URL
> parser.

> i feel like i should have looked more closely at the patch in context, 
> because it really is adding substantial work to NS_MakeAbsoluteWithCharset 
> just to support an edge case.  let's try to come up with something that 
> isn't so costly for non-IDN links.

Yeah there should be a better way of doing it. I'm especially concerned about
the other dozens of calls to NS_NewURI. When nsIURI uses unicode internally
then there'd be less troubles. But I think there would be a lot to change.

> what's the format of |spec| when ConvertHostnameToUTF8 is called?  can we use
it to determine if the hostname is really just ascii?

It is basically whatever in the HREF attribute. So it could be relative or
absolute. Are you suggesting an alternative to using ExtractUrlPart?

> While we're at it, we should remove the IsAscii() method from nsHTMLUtils.cpp
and simply call nsCRT::IsAscii()

Agreed. Shall I do it in the same patch?
Status: NEW → ASSIGNED
* construct NS_ConvertUCS2toUTF8 directly.
* additional check for ascii hostname (even though url may be non-ascii) before

  calling NS_NewURI
* removed IsAscii() in favor of nsCRT::IsAscii()
  (the context diff is screwed up, so it looks like ConvertHostnameToUTF8
replaced IsAscii. sorry)

We could have used ExtractUrlPart on the |char *spec| to save the UTF8 
conversion if the hostname is ascii. But i figured ExtractUrlPart is fairly
expensive as well, (for parsing anything beyond the url_Scheme).
Please advise.
Attachment #56867 - Attachment is obsolete: true
Comment on attachment 57310 [details] [diff] [review]
Incorporated many of the comments give yesterday.

don't you still need to UTF8'ize the hostname if it is absolute?
s/it/the spec/
> don't you still need to UTF8'ize the hostname if it is absolute?
The rationale is that there may be lots of non-absolute links, so we can save
on the constructing an NS_ConvertUCS2toUTF8. Does it make sense, or is it 
unnecessary?
you can have relative URLs of the form

//hostname/foo/bar.html
Sorry, I wasn't aware of that. In that case it that step would be pointless, I
can snip it.
Darin, could you please clarify which "it" do you mean in "s/it/the spec/"?
Thanks.
Nevermind, I get it now. You're referring to your previous comment :)
Attachment #57310 - Attachment is obsolete: true
Comment on attachment 57363 [details] [diff] [review]
removed the call to ExtractScheme(), thanks Darin.

r/sr=darin
Attachment #57363 - Flags: review+
Thanks everyone.
Darin: could you please check it in for me? Should I email Asa for approval?
william: you still need another r= or sr= on your latest patch.  then if you
want this for 0.9.6, you'll need to send mail to drivers@mozilla.org for
approval... though i doubt they'll take it.
0.9.6 would be too tight, i can wait.
Johnny, would you mind taking a look at the latest patch for me? TIA.
- In ConvertHostnameToUTF8(), |if| and its statement on different lines, i.e.:

+  if (nsCRT::IsAscii(hostUTF8.get())) return NS_ERROR_ABORT;

on one line is a no no, fix all those.

Also spread the code out a bit to make it more readable.

- In NS_MakeAbsoluteURIWithCharset():

+          if (NS_SUCCEEDED(rv))
+              spec = newSpec.get();

Loose the .get(), all that does is slow things down.

With that, r/sr=jst
Attached patch cleanup as suggested by Johnny. (obsolete) — Splinter Review
Thanks Johnny.
Does this look better now?
Attachment #57363 - Attachment is obsolete: true
Yeah, looks way better, but I saw one more thing when looking over it again...

Returning an error from ConvertHostnameToUTF8() just because the host name is
ASCII is just wrong, if it's ASCII, return the host name as is and return NS_OK.
Make the code that calls this method return the error code to the caller in case
ConvertHostnameToUTF8() really fails. Don't attempt to encode non-errors in
error codes...

sr=jst with that fixed, sorry for not catching this the first time.
If hostname is ascii, 2 extra strdup are needed though.
Moved the "delete[] p" portion up so that it doesn't leak in case 
ConvertHostnameToUTF8 fails.
Attachment #58090 - Attachment is obsolete: true
If you think the double strdup is a problem then make the code not do that, but
don't piggyback on the error code :-)
I agree that piggybacking on the error code is wrong.
This alternative piggybacks on the (spec==targetSpec) property,
please suggest any alternative idea you have.

But patch 58093 is still the cleanest way IMHO, can we live with the extra
strdups, Darin?
this looks pretty yuck to me... i mean, now targetSpec is sometimes freshly
malloc'd, and other times it's just a weak reference to an existing string.

how about returning NS_OK and *targetSpec = nsnull?
Agreed. Why didn't I think of that...
Attachment #58187 - Attachment is obsolete: true
Comment on attachment 58528 [details] [diff] [review]
*targetSpec = nsnull when hostname is ascii, as suggested by Darin.

r/sr=darin, with one change:

if (newSpec.get())

should be

if (!newSpec.IsEmpty())
Attachment #58528 - Flags: superreview+
Do I submit another patch for the change, and then bug you for it again?
I'm just confused about how things should go...
william: no need to submit another patch... just get another reviewer to look at
your code, and then make sure to make the change i suggested before checking in.

thx!
What Darin said, and move the:

+    *targetSpec = nsnull;

from inside the IsAscii() check to the top of the method to make sure you never
return w/o initializing the out parameter.

r=jst
Attachment #58528 - Flags: review+
Darin, could you checkin for me please? I haven't an account.
I'd just upload this anyway to make your job easier (hopefully it doesn't cause

even more trouble :)
Thanks!
Attachment #58093 - Attachment is obsolete: true
sure, i'll check this in once the tree opens.
Blocks: 102701
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
Typing IDN in the URL bar is currently broken (see bug 42898). Is this bug (IDN
links in a document) also broken?
It does not work after I applied the fix for bug 42898, so I reopen this.
With the fix of bug 42898, IDN from the URL bar works but from HREF it is
escaped. Even with the other patch in bug 42898 which remove the escape
(#64946), the link still not resolved.
Status: RESOLVED → REOPENED
Keywords: intl
Resolution: FIXED → ---
In the examples in the page,
http://playground.i-dns.net/~wil/moztest/
I see at least one link is ACE encoded but the link is not found. I am not sure
the links in the page are updated.
Willam, could you applied the patches in bug 42898 and check if this bug really
exists or it's the test URL problem?
Keywords: nsbeta1
Darin: Thanks for the fix to bug 42898. will try to update my sources and verify
the patch.
With darin's patch to remove the nsStdEscape call in nsHTMLUtils:
http://bugzilla.mozilla.org/attachment.cgi?id=64616&action=view
It works.
thx wil...

-> me (to get the patch landed)
Assignee: william.tan → darin
Status: REOPENED → NEW
thanks darin, you'll be landing the patch in 42898 right?

me -> work on display issue.
are you referring to attachment 64616 [details] [diff] [review]?  if so, then yes.
Status: NEW → ASSIGNED
Priority: -- → P2
Target Milestone: mozilla0.9.7 → mozilla0.9.9
this bug does not belong to the networking module.  i believe that necko is
doing everything correctly now in terms of supporting IDN.  this bug needs to be
fixed outside of necko.

-> nhotta
Assignee: darin → nhotta
Status: ASSIGNED → NEW
Component: Networking → Internationalization
William, could you try the latest build? I think this is fixed by darin's check
in (attachment 64616 [details] [diff] [review]).
Status: NEW → ASSIGNED
no, if you just hover over a IDN link, you'll get garbage in the status bar. 
however, that may be considered a new bug.  it's up to you :-)
The mouse over problem is bug 81024, mark this as fixed.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Verified fixed on Linux.
qa to william...
QA Contact: benc → william.tan
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: