nsStandardURL should automatically convert ACE hostnames to UTF8

VERIFIED FIXED in mozilla1.8alpha2

Status

()

Core
Networking
VERIFIED FIXED
14 years ago
13 years ago

People

(Reporter: Darin Fisher, Assigned: Darin Fisher)

Tracking

(Blocks: 1 bug)

Trunk
mozilla1.8alpha2
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [IDN])

Attachments

(1 attachment, 1 obsolete attachment)

7.71 KB, patch
Biesinger
: review+
dbaron
: superreview+
Details | Diff | Splinter Review
(Assignee)

Description

14 years ago
nsStandardURL should automatically convert ACE hostnames to UTF8.

If given an ACE hostname, nsStandardURL (::BuildNormalizedSpec, etc.) should
convert the ACE hostname to UTF8 and expose the UTF8 value via its host,
hostPort, prePath, and spec attributes.

We can make this efficient by using nsIIDNService::IsACE
(Assignee)

Updated

14 years ago
Status: NEW → ASSIGNED
Whiteboard: [IDN]
Target Milestone: --- → mozilla1.8alpha
(Assignee)

Updated

14 years ago
Blocks: 237820
[ This comment is in UTF-8. ]
It's worth noting that the "Host: " field in the "Details" part of the cookie
confirmation dialog can currently show up in ACE, e.g., for http://www.öko.de/ .
(Assignee)

Updated

14 years ago
Blocks: 238046
(Assignee)

Comment 2

14 years ago
dbaron: yeah, i think that deserves a separate bug though.  we need to go
through cookies and add IsACE + conversions to UTF-8 in places that expose
hostnames to UI.  i think this bug is about making sure that any nsStandardURL
constructed with an ACE hostname normalizes the URL's hostname field to the
UTF-8 form.  in particular this bug addresses some of the ACE "ui-leakage" that
can occur when a site redirects the browser to a URL with an IDN hostname.

patching coming up...
(Assignee)

Comment 3

14 years ago
Created attachment 147854 [details] [diff] [review]
v1 ptach

This seems to do the trick.
(Assignee)

Updated

14 years ago
Attachment #147854 - Flags: superreview?(dbaron)
Attachment #147854 - Flags: review?(jshin)
(Assignee)

Updated

14 years ago
Blocks: 243498
Comment on attachment 147854 [details] [diff] [review]
v1 ptach

My one issue with this patch is that it changes things to that if there's no
IDN service or if gIDNService->Normalize fails (which can happen in response to
user input), mHostEncoding gets set to ASCII even when the encoding isn't
ascii.	That might be a problem.  (We don't want a URL that doesn't meet the
Bidi requirements for IDN normalization to be corrupted -- we want to allow the
user to correct it when typing in the URL bar.	And perhaps other potential
issues.)
(Assignee)

Comment 5

14 years ago
That's an excellent point dbaron.  I will revise the patch.  Thanks!
(Assignee)

Updated

14 years ago
Attachment #147854 - Attachment is obsolete: true
Attachment #147854 - Flags: superreview?(dbaron)
Attachment #147854 - Flags: review?(jshin)
(Assignee)

Comment 6

14 years ago
Created attachment 150317 [details] [diff] [review]
v1.1 patch

Here's a better patch.	It's not great.  I'm not terribly happy with
NormalizeIDN since it now modifies mHostEncoding directly.  I'd prefer if it
were a real subroutine with no side-effects, but anyways it already was pretty
specific to the callsites.  It only really exists as a way to share some code,
so I decided to just accept the side-effect of setting mHostEncoding.  Maybe
there is a better way, but this seems to do the trick.	No need for too much
fancy code factoring here.
(Assignee)

Updated

14 years ago
Attachment #150317 - Flags: superreview?(dbaron)
Attachment #150317 - Flags: review?(cbiesinger)
Attachment #150317 - Flags: superreview?(dbaron) → superreview+
(Assignee)

Updated

14 years ago
Target Milestone: mozilla1.8alpha1 → mozilla1.8alpha2
Comment on attachment 150317 [details] [diff] [review]
v1.1 patch

+    PRBool   EscapeIPv6(const char *host, nsCString &result);
+    PRBool   NormalizeIDN(const nsCSubstring &host, nsCString &result);

maybe you could document what NormalizeIDN does? especially the meaning of the
return value... it seems it returns PR_TRUE when it "did something"?

+    // NOTE: As a side-effect this function sets mHostEncoding.

maybe the encoding should be an out parameter (by reference)?


+    nsCAutoString hostBuf;
+    if (EscapeIPv6(host, hostBuf)) {
+	 host = hostBuf.get();
+	 len = hostBuf.Length();
+    }
+    else if (NormalizeIDN(flat, hostBuf)) {
+	 host = hostBuf.get();
+	 len = hostBuf.Length();
     }

hm...

why does this function (SetHost) not call the ACString& version of
ReplaceSegment? Then you can get rid of these assignments to host and len, and
maybe pass escapedHost directly to ReplaceSegment; although you'd have to
assign host to hostBuf in the else-case.

r=me with or without these changes.
Attachment #150317 - Flags: review?(cbiesinger) → review+
(Assignee)

Comment 8

14 years ago
> maybe you could document what NormalizeIDN does? especially the meaning of the
> return value... it seems it returns PR_TRUE when it "did something"?

yeah, i have it on my todo list to document all the functions.  it returns true
when it puts something in the result buffer.


> maybe the encoding should be an out parameter (by reference)?

i almost did this, but it seemed overkill given that the function already knows
a lot about the callsite.  also mHostEncoding is a 2-bit wide bit-field, so i
can't pass it into NormalizeIDN.  i'd have to pass in a temporary value, and
then i'd have to assign that temporary into mHostEncoding.  it didn't seem worth
it to me.


> why does this function (SetHost) not call the ACString& version of
> ReplaceSegment? <snip>
> although you'd have to assign host to hostBuf in the else-case.

exactly.  i'm avoiding the buffer copy.  the else-case is the most common case.
(Assignee)

Comment 9

14 years ago
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Last Resolved: 14 years ago
Resolution: --- → FIXED
(Assignee)

Updated

13 years ago
Blocks: 199229

Comment 11

13 years ago
NormalizeIDN() contains the comment "normalization could result in an ASCII only
hostname".  That's true, but there's a further implication that is neglected: 
The resulting ASCII-only hostname could be an ACE.  I assume the intention of
NormalizeIDN() is never to return an ACE (unless the show_punycode pref forces it).

The ToUnicode operation in RFC-3490 handles this case.  The mozilla source code
does not seem to contain an implementation ToUnicode.  The comment for
convertACEtoUTF8() says "This is the ToUnicode operation", but it's not: it's
missing the first two steps, and it operates on entire domain names, whereas
ToUnicode operates on individual labels.  Hmmm, if a domain name contains some
ACEs and at least one bogo-ACE (something that begins with the ACE prefix but is
not a valid ACE), none of the valid ACEs will get converted.

It's not clear to me what is the underlying intention behind NormalizeIDN().  If
the input is "FooBar", then it will be left untouched, in mixed-case, but if the
input has an accent over the 'a', then it will be Nameprep'd and hence forced to
all-lower-case.  If the goal is a canonical form, then you need to lowercase
even pure ASCII inputs.  If the goal is not a canonical form, then why bother
with Nameprep for non-ASCII inputs?  Or is the goal a semi-canonical form that
can be compared using a comparison function that knows the ASCII case pairs but
no others?

If one goal is human readability, then you want to avoid returning an ACE.  The
ToUnicode operation is designed to accomplish exactly that: it takes any domain
label, in any form, and always returns an equivalent label that is not an ACE. 
If, in addition to being non-ACE, you also want a canonical form (that is, one
that can be compared using a byte-wise comparison), then you want to apply both
ToUnicode and Nameprep, in either order (the result will be the same).  If
Nameprep is applied before ToUnicode, then the first two steps of ToUnicode will
make no difference and can be optimized away.

By the way, the comment for isACE() is somewhat misleading (as is the name of
the function).  As implemented, it does not check whether the input is an ACE
(or is "ACE encoded"); it checks for the presence of the ACE prefix.  All ACEs
begin with the ACE prefix, but not all strings beginning with the ACE prefix are
ACEs (as stated in section 5 of RFC-3490).  I understand that this function is
intended only for performance optimization, so it's okay to calculate a loose
boundary around the set of ACEs, but it would be good for the comment to more
accurately describe the behavior.
You need to log in before you can comment on or make changes to this bug.