Closed Bug 237819 Opened 21 years ago Closed 21 years ago

nsStandardURL should automatically convert ACE hostnames to UTF8

Categories

(Core :: Networking, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla1.8alpha2

People

(Reporter: darin.moz, Assigned: darin.moz)

References

Details

(Whiteboard: [IDN])

Attachments

(1 file, 1 obsolete file)

nsStandardURL should automatically convert ACE hostnames to UTF8. If given an ACE hostname, nsStandardURL (::BuildNormalizedSpec, etc.) should convert the ACE hostname to UTF8 and expose the UTF8 value via its host, hostPort, prePath, and spec attributes. We can make this efficient by using nsIIDNService::IsACE
Status: NEW → ASSIGNED
Whiteboard: [IDN]
Target Milestone: --- → mozilla1.8alpha
Blocks: 237820
[ This comment is in UTF-8. ] It's worth noting that the "Host: " field in the "Details" part of the cookie confirmation dialog can currently show up in ACE, e.g., for http://www.öko.de/ .
Blocks: 238046
dbaron: yeah, i think that deserves a separate bug though. we need to go through cookies and add IsACE + conversions to UTF-8 in places that expose hostnames to UI. i think this bug is about making sure that any nsStandardURL constructed with an ACE hostname normalizes the URL's hostname field to the UTF-8 form. in particular this bug addresses some of the ACE "ui-leakage" that can occur when a site redirects the browser to a URL with an IDN hostname. patching coming up...
Attached patch v1 ptach (obsolete) — Splinter Review
This seems to do the trick.
Attachment #147854 - Flags: superreview?(dbaron)
Attachment #147854 - Flags: review?(jshin)
Blocks: 243498
Comment on attachment 147854 [details] [diff] [review] v1 ptach My one issue with this patch is that it changes things to that if there's no IDN service or if gIDNService->Normalize fails (which can happen in response to user input), mHostEncoding gets set to ASCII even when the encoding isn't ascii. That might be a problem. (We don't want a URL that doesn't meet the Bidi requirements for IDN normalization to be corrupted -- we want to allow the user to correct it when typing in the URL bar. And perhaps other potential issues.)
That's an excellent point dbaron. I will revise the patch. Thanks!
Attachment #147854 - Attachment is obsolete: true
Attachment #147854 - Flags: superreview?(dbaron)
Attachment #147854 - Flags: review?(jshin)
Attached patch v1.1 patchSplinter Review
Here's a better patch. It's not great. I'm not terribly happy with NormalizeIDN since it now modifies mHostEncoding directly. I'd prefer if it were a real subroutine with no side-effects, but anyways it already was pretty specific to the callsites. It only really exists as a way to share some code, so I decided to just accept the side-effect of setting mHostEncoding. Maybe there is a better way, but this seems to do the trick. No need for too much fancy code factoring here.
Attachment #150317 - Flags: superreview?(dbaron)
Attachment #150317 - Flags: review?(cbiesinger)
Attachment #150317 - Flags: superreview?(dbaron) → superreview+
Target Milestone: mozilla1.8alpha1 → mozilla1.8alpha2
Comment on attachment 150317 [details] [diff] [review] v1.1 patch + PRBool EscapeIPv6(const char *host, nsCString &result); + PRBool NormalizeIDN(const nsCSubstring &host, nsCString &result); maybe you could document what NormalizeIDN does? especially the meaning of the return value... it seems it returns PR_TRUE when it "did something"? + // NOTE: As a side-effect this function sets mHostEncoding. maybe the encoding should be an out parameter (by reference)? + nsCAutoString hostBuf; + if (EscapeIPv6(host, hostBuf)) { + host = hostBuf.get(); + len = hostBuf.Length(); + } + else if (NormalizeIDN(flat, hostBuf)) { + host = hostBuf.get(); + len = hostBuf.Length(); } hm... why does this function (SetHost) not call the ACString& version of ReplaceSegment? Then you can get rid of these assignments to host and len, and maybe pass escapedHost directly to ReplaceSegment; although you'd have to assign host to hostBuf in the else-case. r=me with or without these changes.
Attachment #150317 - Flags: review?(cbiesinger) → review+
> maybe you could document what NormalizeIDN does? especially the meaning of the > return value... it seems it returns PR_TRUE when it "did something"? yeah, i have it on my todo list to document all the functions. it returns true when it puts something in the result buffer. > maybe the encoding should be an out parameter (by reference)? i almost did this, but it seemed overkill given that the function already knows a lot about the callsite. also mHostEncoding is a 2-bit wide bit-field, so i can't pass it into NormalizeIDN. i'd have to pass in a temporary value, and then i'd have to assign that temporary into mHostEncoding. it didn't seem worth it to me. > why does this function (SetHost) not call the ACString& version of > ReplaceSegment? <snip> > although you'd have to assign host to hostBuf in the else-case. exactly. i'm avoiding the buffer copy. the else-case is the most common case.
fixed-on-trunk
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Blocks: 199229
NormalizeIDN() contains the comment "normalization could result in an ASCII only hostname". That's true, but there's a further implication that is neglected: The resulting ASCII-only hostname could be an ACE. I assume the intention of NormalizeIDN() is never to return an ACE (unless the show_punycode pref forces it). The ToUnicode operation in RFC-3490 handles this case. The mozilla source code does not seem to contain an implementation ToUnicode. The comment for convertACEtoUTF8() says "This is the ToUnicode operation", but it's not: it's missing the first two steps, and it operates on entire domain names, whereas ToUnicode operates on individual labels. Hmmm, if a domain name contains some ACEs and at least one bogo-ACE (something that begins with the ACE prefix but is not a valid ACE), none of the valid ACEs will get converted. It's not clear to me what is the underlying intention behind NormalizeIDN(). If the input is "FooBar", then it will be left untouched, in mixed-case, but if the input has an accent over the 'a', then it will be Nameprep'd and hence forced to all-lower-case. If the goal is a canonical form, then you need to lowercase even pure ASCII inputs. If the goal is not a canonical form, then why bother with Nameprep for non-ASCII inputs? Or is the goal a semi-canonical form that can be compared using a comparison function that knows the ASCII case pairs but no others? If one goal is human readability, then you want to avoid returning an ACE. The ToUnicode operation is designed to accomplish exactly that: it takes any domain label, in any form, and always returns an equivalent label that is not an ACE. If, in addition to being non-ACE, you also want a canonical form (that is, one that can be compared using a byte-wise comparison), then you want to apply both ToUnicode and Nameprep, in either order (the result will be the same). If Nameprep is applied before ToUnicode, then the first two steps of ToUnicode will make no difference and can be optimized away. By the way, the comment for isACE() is somewhat misleading (as is the name of the function). As implemented, it does not check whether the input is an ACE (or is "ACE encoded"); it checks for the presence of the ACE prefix. All ACEs begin with the ACE prefix, but not all strings beginning with the ACE prefix are ACEs (as stated in section 5 of RFC-3490). I understand that this function is intended only for performance optimization, so it's okay to calculate a loose boundary around the set of ACEs, but it would be good for the comment to more accurately describe the behavior.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: