Closed Bug 368989 Opened 13 years ago Closed 13 years ago

Value of nsIEffectiveTLDService.getEffectiveTLDLength meaningless in JavaScript


(Core :: Networking, defect, critical)

Not set





(Reporter: ecfbugzilla, Assigned: dwitte)



(Whiteboard: [has patch][has reviews])


(1 file, 5 obsolete files)

nsIEffectiveTLDService.getEffectiveTLDLength() uses UTF-8 and the returned TLD length also means UTF-8 characters. Now when one calls from JavaScript the conversion to UTF-8 is done automatically by XPConnect - so the length returned refers to a string we don't have. Example:

var tld 
    = Components.classes[";1"]

alert(tld.getEffectiveTLDLength(""));  // returns 2
alert(tld.getEffectiveTLDLength("domain.ру"));  // returns 4

One would of course expect this to return 2 in both cases. Requesting blocking1.9 since this issue limits the usefulness of effective TLD service considerably.
Flags: blocking1.9?
Blocks: 331510
Severity: normal → critical
Hmm. :-(

If we are going to rejig this interface, can I reopen the idea of changing the name? Publically and politically, we can't call it the "effective TLD" - the registrars will throw a fit and not cooperate with us. And we need their help to maintain the list.

My current idea is the "public prefix service" - does that sound OK?

Surely that's "suffix" rather than prefix? I'd suggest "public domain" service perhaps.

It's not even a very good interface from native code, since it can take an IDN domain name and return a length based on the ACE encoding which doesn't match the input.

I'd like to get two string-returning scriptable methods, one returning the "public domain" (e.g. "") and the other the best-guess effective domain (e.g. "" returned from ""). consumers could always calculate the latter from the former, but I suspect it's going to be a common enough desire we might as well build it into the service and debug the code only once.
The current API is as usable from javascript as from anywhere else, you just have to start with a nsIURI and use the hostASCII version to make sure it's converted the same way the effective service will convert it.
At the very least that should be documented...
(In reply to comment #3)
Yes, taking the detour through asciiHost will probably work. It doesn't make this a good interface however.
Dan: I guess it is a suffix, the way domains are normally written. So "Public Suffix Service". 

So your methods would be something like:

"Public Domain Service" has an unfortunate ambiguity because of the other meanings of "Public Domain" :-(

something like that. The length is useful only if we're really concerned about the perf/bloat impact of the string copies (which we may well be in places), and it's a little confusing that it may be a length relative to a string different from the one you passed into it.

Then we have to figure out if the returned string is an IDN domain or punycoded, and is it always one or the other or does it follow the IDN whitelist?
Blocking at least for correct documentation, if not a better API.
Flags: blocking1.9? → blocking1.9+
There is bug 368700 and bug 368702 depending on how this bug is dealt with - I didn't want to check in a consumer for nsIEffectiveTLDService if the interface is going to be refactored. But since I don't see any movement here - how should I proceed? I could check in anyway or change the patches to use the one-dot rule instead of nsIEffectiveTLDService as it is currently common in the source base.
Wladimir, can you take this bug?
Assignee: nobody → trev.moz
So I guess what should be done here is adding getEffectiveTLD and getEffectiveDomain methods, as well as marking getEffectiveTLDLength as [noscript] because of its ambiguousness when called from JavaScript (I think it should be kept for performance reasons however). Unfortunately, I doubt that I will have time for this.
Attached patch Proposed patch (obsolete) — Splinter Review
This does what I described in the previous comment. It also changes the testcase from bug 368702 to use getEffectiveTLD and getEffectiveDomain since getEffectiveTLDLength is no longer available.
Attachment #264676 - Flags: superreview?(dveditz)
Attachment #264676 - Flags: review?(darin.moz)
A similar patch is in bug 367446.
Wladimir: can you and Dave (bug 367446) cooperate to work out what we need to do? 

Whiteboard: [has patch] needs r?darin,dveditz
Target Milestone: --- → mozilla1.9alpha5
Attachment #264676 - Flags: review?(darin.moz) → review?(cbiesinger)
Dave agreed with me that development should continue in this bug. I will add truncation of output parameters that I forgot in my patch, otherwise the two patches are pretty much identical with exception of comments and variable naming. Bug 367446 comment 10 needs to be addressed as well.
Duplicate of this bug: 367446
Comment on attachment 264676 [details] [diff] [review]
Proposed patch

Clearing review requests awaiting an updated patch.

1) need to address bug 367446 comment 10 (getBaseDomain() needs a numeric "depth" argument. Current code assumes "1").

2) renaming is now or never. I don't feel strongly about the service name--especially given the PITA of moving files--but the new methods need better names. I'm happy enough with Gerv's names from comment 6

3) It would make more sense for these methods to take a nsIURI instead of a string domain name. I suspect the vast majority of the callers will already have a nsIURI and have to call getHost, and then we have to worry that each callsite will correctly drill into the innermost URI in cases like jar: uri's (offline apps, etc) before reading the host. Centralizing would be safer. Plus the URIs should already be normalized and we could skip that step.

If we think enough places might just have a string host then maybe we need two sets of interfaces, one normalizing a string, one taking the innermost host out of a URI, and both feeding into shared guts.

4) should we get rid of the GetEffectiveTLDLength method? I think it's just asking for trouble having people use it.

5) we might not actually need both getPublicSuffix() and getBaseDomain() -- if we add the "depth" argument they come out to the same thing. getBaseDomain(nsIURI, 0) is public, getBaseDomain(nsIURI, 1) is the basic private one. 

On second thought getPublicSuffix(nsIURI) would add readability to code that uses it, doesn't hurt to keep it.
Attachment #264676 - Attachment is obsolete: true
Attachment #264676 - Flags: superreview?(dveditz)
Attachment #264676 - Flags: review?(cbiesinger)
getEffectiveTLDLength should remain as a scriptable method. Otherwise this change will make it even harder to get the effective sub-domain (i.e. host minus base domain.)

(In reply to comment #5)
> (In reply to comment #3)
> Yes, taking the detour through asciiHost will probably work. It doesn't make
> this a good interface however.

It does. See bug 367446 comment 7.
I would argue for a single method, as Dan outlines in comment #5, taking an nsIURI and an integer. It's rather hard to figure out what best to call it, though. I'd be happy with getPublicSuffix(), but we might consider e.g. getPublicSuffixPlus(1, nsIURI), which returns the public suffix + 1.

If people want the effective TLD length, they can call length() on the string returned by getPublicSuffixPlus(0, nsIURI). The function which returns a number for the length is broken.


Daniel Veditz:

3) I am not so sure about making parameter an nsIURI. The one existing consumer is document.domain setter - and it doesn't have an nsIURI, neither do I see a meaningful way to construct one. Same goes for cookie handlers once they start using this service. Adblock Plus will probably have the same problem when using this interface.

This means having two methods with the only difference being the type of the input parameter. Is this really a good solution?

4) It should be fine if we have getBaseDomainLength marked as [noscript]. There is little point using this method from JavaScript anyway, but it is safe to use from C++, and it can improve performance there (saves you string copy/compare, see document.domain setter).

5) I don't think that "getPublicSuffixPlus" actually makes it more clear what the following parameter means. I am leaning towards getBaseDomain as the only method, this is the most consistent solution.
Attached patch Patch v2 (obsolete) — Splinter Review
I decided not to change hostname parameter type here, I rather prefer not to make this patch more complicated than necessary. Please file a follow-up bug if you want this fixed. Other than that - methods have been renamed and an integer parameter for the number of additional domain parts has been added.
Attachment #266257 - Flags: superreview?(dveditz)
Attachment #266257 - Flags: review?(cbiesinger)
Whiteboard: [has patch] needs r?darin,dveditz → [has patch] needs r?biesi,dveditz
Target Milestone: mozilla1.9alpha5 → mozilla1.9alpha6
Moving to A6 as this probably won't happen in time.
Blocks: 366797, 367799, 378668
(In reply to comment #20)
> meaningful way to construct one. Same goes for cookie handlers once they start
> using this service. Adblock Plus will probably have the same problem when 

actually, in cookies we always have an nsIURI to work with, so for us i think dveditz' solution of having two methods is pragmatic. getting things right should always come first, and you should encourage people to use the preferred method. (e.g. you could call the nsIURI version getBaseDomain, and the string version getBaseDomainFromString.) at the very least, if you stick with the string method you should provide a comment in the idl showing consumers exactly how to extract the string from their nsIURI properly.

i'm also in agreement with gerv re not having the length method; i think the cost of the string copy (which will often be into a stack-based string c.f. nsCAutoString) is negligible, and it makes the interface more sane. in this case, i think neatness of the interface wins over micro-optimization.

thanks for working on this; i'm currently working on a patch to use this stuff in cookie code (bug 385299). i hope you get it in for alpha!
Blocks: 385299
also, how will the aAdditionalParts parameter work in the case of ip addresses (bug 364129) - will it work as expected and give back the original ip address?
Yes, an IP address is considered to be one single TLD, the aAdditionalParts is irrelevant in this case since there is nothing to add.
i have another question about aAdditionalParts - say i give getBaseDomain a string like "" and ask for 1 additional part, expecting something back like "". it'll give me back "", and say i'm using that to check if a different URI "" is a subdomain of "". my test will now pass when it should've failed. if the consumers are deriving these strings from real URI's, this should never happen, but the getBaseDomain method takes a string so you should be prepared for anything. you could also imagine this case when asking for e.g. 5 levels on a 3-level URI - the point is that the consumer expects 5 levels and they didn't get it, which i think we should tell them.

there's no way for the consumer to check that what they got back was actually what they expected, short of re-parsing the domain string to check that the number of levels they asked for was what they got. would it be sensible to return a specific failure code if they asked for more levels than the URI contains? e.g. NS_ERROR_INSUFFICIENT_DOMAIN_LEVELS. consumers could then check for this specific error and respond accordingly. an alternative would be to make aAdditionalParts an inout, and return the actual number of parts found.

it may also make sense to have a similar error code for ip addresses, e.g. NS_ERROR_IP_ADDRESS. (in cookie code, we need to special-case ip's before we even call getBaseDomain, so this would help us there, and i think it makes sense in general.)

wladimir, dveditz, do you have any comments/ideas?
punting remaining a6 bugs to b1, all of these shipped in a5, so we're at least no worse off by doing so.
Target Milestone: mozilla1.9alpha6 → mozilla1.9beta1
no response from Wladimir -> taking bug. patch forthcoming.
Assignee: trev.moz → dwitte
Target Milestone: mozilla1.9 M8 → M1
Attached patch patch v3 (obsolete) — Splinter Review
this builds on Wladimir's work (patch v2) and addresses comment 17 (adding an nsIURI method) and comment 26. this patch also has a fix for bug 364129 rolled in (IP address handling), since it's all related.

specifically, what this patch does:
1) reworks the nsIEffectiveTLDService interface to have getBaseDomain() and getBaseDomainFromURI() methods. the latter uses NS_GetInnermostURI and GetHost.
2) defines two new error codes, NS_ERROR_INSUFFICIENT_DOMAIN_LEVELS and NS_ERROR_HOST_IS_IP_ADDRESS, to provide consumers with specific failure information. (see comment 26).
3) per bug 364129, we check for IP addresses and throw an error. in that bug, there was talk of simply returning the whole IP address as the eTLD, without failing. i think it's safer to error out, because we really don't know what the consumer wants to do with an IP address (e.g. in cookies we need to special-case it - it's useless having it silently succeed). having a special failure code here avoids the cost of the consumer checking it a second time.
4) changes the normalization semantics. i looked at the code in urlbarBindings.xml, where it works around etld's normalization of the host string by reverse engineering the result and re-applying it, and figured it's probably better to do this in the service itself. so, we normalize the host to do the hash lookup, but then we apply the result to the original unnormalized host string. thus, the result we return will be directly comparable to whatever the consumer parsed in, normalized or not.
5) updates the two consumers of eTLD in the codebase (though one is disabled).
6) updates unit tests.

biesi, dveditz, comments and suggestions appreciated! specifically on the nsIURI-related stuff, because you probably know better than me what to do there.
Attachment #266257 - Attachment is obsolete: true
Attachment #274575 - Flags: superreview?(dveditz)
Attachment #274575 - Flags: review?(cbiesinger)
Attachment #266257 - Flags: superreview?(dveditz)
Attachment #266257 - Flags: review?(cbiesinger)
Comment on attachment 274575 [details] [diff] [review]
patch v3

>+++ browser/base/content/urlbarBindings.xml	31 Jul 2007 04:43:43 -0000
>+              // get the etld of the host, and parse off the subdomain portion from the
>+              // beginning of the string
>+              // e.g. host = "" -> subdomain = "foo." and etld = ""
>+              var etld = this.tldService.getBaseDomain(host, 0);
>+              var subdomain = host.substring(0, host.length - etld.length);
>+              if (subdomain.length > 0) {
>+                this._subDomain.setAttribute("value", subdomain);
>+                host = etld;

This should detect the effective domain, not the tld, so getBaseDomain should be called with a 1 as the second argument. Also, shouldn't there be a try/catch for IP addresses?
Target Milestone: M1 → mozilla1.9 M8
Comment on attachment 274575 [details] [diff] [review]
patch v3


you should have an @throws for each of those

+     * @param   aHost       The hostname to be analyzed, in UTF-8
+    AUTF8String getBaseDomain(in ACString aHost, in PRUint32 aAdditionalParts);

aHost should be AUTF8String, so that JS actually can pass UTF-8

+     * Helper method for dealing with nsIURI's.

personally I'd avoid the apostrophe by writing "nsIURI objects" :)

+  if (beginIter != endIter) {

might be a better idea to check for an empty string at the start of the function and return NS_ERROR_INVALID_ARG then

+    do_check_eq(e.result, 0x804b0050); // NS_ERROR_INSUFFICIENT_DOMAIN_LEVELS

if you defined constants for these errors, you wouldn't need the comments :)

+    etld = tld.getBaseDomain("::", 0);

I'd use ::ffff: here, as that's the standard way for mapping IPv4 addresses to IPv6
Attachment #274575 - Flags: review?(cbiesinger) → review+
Attachment #274575 - Flags: superreview?(dveditz)
Attached patch patch v3.1 (obsolete) — Splinter Review
patch updated per comments. dveditz, any chance you can get to this soon?
Attachment #274575 - Attachment is obsolete: true
Attachment #279395 - Flags: superreview?(dveditz)
Attachment #279395 - Flags: review+
Comment on attachment 279395 [details] [diff] [review]
patch v3.1

>+     *         when the sum of the aAdditionalParts parameter and the number of
>+     *         subdomain levels in the effective TLD are greater than the total number
>+     *         of subdomain levels in aHost

I'm dubious about this aspect of the new API, will someone wanting "2" care if there's only 1 to be had?

>+    AUTF8String getBaseDomain(in AUTF8String aHost, in PRUint32 aAdditionalParts);
>+    AUTF8String getBaseDomainFromURI(in nsIURI aURI, in PRUint32 aAdditionalParts);

suggestion: reverse these so getBaseDomain() takes a URI and getBaseDomainFromHost() or getBaseDomainFromStr() takes the string. Just a thought since we're trying to encourage the use of nsIURIs rather than have people try to pass around string specs and the like.

>+  // take the unnormalized original domain string, and count off the number
>+  // of required domains. this avoids issues related to returning a normalized
>+  // version of the host string, which the consumer might try to compare to
>+  // other strings which haven't been similarly normalized.

Clever, but in an unnormalized host dots might not be 0x2e which could lead to miscounts (I guess throwing the insufficient levels error, or including too many levels in the returned domain). This is why I'm minusing the review.

I'd be more comfortable if you passed back normalized base domains. Anyone starting with a nsIURI will have a normalized host, what situation did you think people would run into trouble with a normalized return value?
Attachment #279395 - Flags: superreview?(dveditz) → superreview-
Attached patch v3.2 (obsolete) — Splinter Review
(In reply to comment #33)
thanks for the comments... here's an updated patch. i have some questions below:

> I'm dubious about this aspect of the new API, will someone wanting "2" care if
> there's only 1 to be had?

well, i'm not too fussed either way... my argument for adding this is given in comment 26. an alternative (as suggested in that comment) would be to make aAdditionalParts an inout and return success, so consumers who care can check that what they wanted was what they got.

> suggestion: reverse these so getBaseDomain() takes a URI and
> getBaseDomainFromHost() or getBaseDomainFromStr() takes the string. Just a
> thought since we're trying to encourage the use of nsIURIs rather than have
> people try to pass around string specs and the like.

ok; done.

> I'd be more comfortable if you passed back normalized base domains. Anyone
> starting with a nsIURI will have a normalized host, what situation did you
> think people would run into trouble with a normalized return value?

good point about the dots. the reason i did this was after looking at what consumer code in browser/base/content/urlbarBindings.xml was doing (see patch v3.1); it's basically doing the same thing to apply the effective tld length to its original string, although in their case it's coming from a URI and will be normalized. we can just let the consumer worry about normalization and encoding.

i don't think we can assume nsIURI's will be utf8 either; per it's possible to end up with an ACE-encoded string. so, i've added a check for ACE encoding into the normalization function. this will benefit getBaseDomainFromHost() as well. do we need to check for url escaped hosts or anything else there?

also, do we really need to keep track of a trailing dot on the host, and add it back at the end? i'm not sure why we need it...

   // add on the trailing dot, if applicable
+  if (trailingDot)
+    aBaseDomain.Append('.');
Attachment #279395 - Attachment is obsolete: true
Attachment #280305 - Flags: superreview?(dveditz)
Target Milestone: mozilla1.9 M8 → mozilla1.9 M9
Blocks: 397655
Any progress on this at all?  We need to get this in ASAP if we are going to make it into beta.  
waiting on sr=dveditz, he said he should be able to get to it within a few days.
Whiteboard: [has patch] needs r?biesi,dveditz → [has patch] needs sr?dveditz
Blocks: 223895
(In reply to comment #34)
> although in their case it's coming from a URI and will be normalized.
> we can just let the consumer worry about normalization and encoding.

Additional consumers are as likely as not to be people writing their first add-on who don't even know IDN exists let alone know they're expected to deal with it. Letting the consumer worry about it is a dangerous approach.
Comment on attachment 280305 [details] [diff] [review]

Bugzilla ate my homework :-(
I reviewed this over the weekend, but apparently it got lost. I'll try to remember my comments

I had assumed the nsIURI form would be the base form rather than having it just forward to the string form. That means the string form may or may not be normalized the same way as other code and that's the kind of difference that can lead to security errors.

Also today I overheard sdwilsh on IRC ask for a default arg for getBaseDomain(), it made me prefer the suggested split getPublicSuffix()/getBaseDomain() suggested several comments back. That makes the proposed getBaseDomain(uri,0) equivalent to this patch's getBaseDomain(uri,1) which could then be an [optional] arg in the IDL. I really prefer the clarity of having two separate names to having people try to remember what 0 or 1 mean. Most people would want either getPublicSuffix() or plain getBaseDomain(), the very few who need extra bits could pass the extra arg. getBaseDomain() would throw if given a purely public suffix.

>               var asciiHost = this._uri.asciiHost;
>+                               this.tldService.getBaseDomainFromHost(asciiHost, 0)
>+                                              .split(".").length - 1;

Doesn't look like you need asciiHost anymore, you can call getBaseDomain(this._uri, 0). Do you need to worry this might throw now? Above this chunk you try to strip out IP addresses, would it make more sense just to call this and catch it throwing the IP address error rather than try to duplicate that code with regexp's that might not capture the rules?

>Index: content/html/document/src/nsHTMLDocument.cpp
>+  if (current.Length() > aDomain.Length() &&
>+      StringEndsWith(current, aDomain, nsCaseInsensitiveStringComparator()) &&

I guess this isn't your problem per se, but setting document.domain may require different code depending on whether the browser supports IDN for the TLD or not. This can differ by browser vendor and by updates to the IDN TLD whitelist.

Site www.mü
  document.domain = "www.mü"        // works
  document.domain = "" // throws

  document.domain = "www.mü"        // throws
  document.domain = "" // works

But setting document.location  works with any of those four forms (the resulting URI gets normalized).

Seems like document.domain ought to normalize first, then do these compares. At least the "current" document.domain is guaranteed to start out normalized so it should match our eTLD list.

Ok, so the eTLD list is in UTF-8 so we need to match when we compare. Alright, let's leave it alone then.

>Index: netwerk/dns/public/nsIEffectiveTLDService.idl
>+    /**
>+     * Returns the effective TLD of a host string. Otherwise identical to getBaseDomain().
>+     * When determining the effective TLD, the hostname will be converted to UTF8
>+     * if necessary and normalized using nsIIDNService::normalize, which follows
>+     * RFC 3454. getBaseDomain() will fail if the hostname contains characters that
                -----^^^^^ getBaseDomainFromHost()

>+     * @see     getBaseDomainFromHost()
             ------^^^^^^^ getBaseDomain()

>Index: netwerk/dns/src/nsEffectiveTLDService.cpp
>@@ -107,100 +108,157 @@ nsEffectiveTLDService::~nsEffectiveTLDSe
>+nsEffectiveTLDService::GetBaseDomain(nsIURI     *aURI,
>+  return GetBaseDomainFromHost(host, aAdditionalParts, aBaseDomain);

As mentioned I thought the nsIURI form should be the preferred form so we can ensure consistent normalization

>+// nsEffectiveTLDService::GetBaseDomainInternal



(but I think getPublicSuffix() would be useful, and getBaseDomain() would make more sense if the default value were "" rather than "com")
Attachment #280305 - Flags: superreview?(dveditz) → superreview+
The normalizing code lowercases before looking things up, but effective_tld_names.dat contains some suffixes with upper-case (particularly the .it ccTLD, but there are a few others). Those aren't going to match, are they?
Thanks for digging into this, everyone.  :)
Comment on attachment 280305 [details] [diff] [review]

requesting approval to get this in for M9. i think it's worth it since developers might want to use this interface, and having it updated in beta will potentially get it more testing. it's also better to ship with the revised interface than something obsolete.

risk here is minimal since we have only two consumers, one of which is switched off.
Attachment #280305 - Flags: approvalM9?
Comment on attachment 280305 [details] [diff] [review]

nevermind, apparently i've got autoapproval!
Attachment #280305 - Flags: approvalM9?
Comment on attachment 280305 [details] [diff] [review]

> nsEffectiveTLDService::NormalizeHostname(nsCString &aHostname)
>+  if (IsASCII(aHostname)) {
>+    PRBool isACE;
>+    if (NS_SUCCEEDED(mIDNService->IsACE(aHostname, &isACE)) && isACE)
>+      return mIDNService->ConvertACEtoUTF8(aHostname, aHostname);
>+    ToLowerCase(aHostname);
>+    return NS_OK;
>+  }
>   return mIDNService->Normalize(aHostname, aHostname);

If the URI is ACE then it doesn't go through ToLowerCase(), nor does it get the lowercasing nsIDNService::Normalize() will do.

ConvertACEtoUTF8 will never return a failure code, but it can hit errors and is advertised as returning its input unchanged. I thought I stepped through once and got just a '.' back out though.
Please land and resolve ASAP.
I'll be landing this today.
Whiteboard: [has patch] needs sr?dveditz → [has patch][has reviews]
Attached patch v3.3Splinter Review
patch as checked in
Attachment #280305 - Attachment is obsolete: true
i changed the interface (to getPublicSuffix/getBaseDomain, with optional parameter) per dveditz' suggestion, fixed the ACE bug in the normalization method, and fixed the various nits. there are two remaining issues highlighted in dveditz' comment, which i'll file followup bugs for. briefly, these are:

1) normalization. the TLD file is in UTF8, but URI's can be UTF8 or ACE depending on whitelisting rules. these normalization details are buried deep in nsStandardURL, and there's no public interface that exposes them, to my knowledge. so we either have to re-normalize URI's to UTF8 for comparison (as this patch does), or expose this whitelist-based normalization so that it can be applied to the TLD file on read-in. i'll investigate the latter in the followup bug.

2) nsHTMLDocument::SetDomain() failing in some cases due to working with non-normalized strings. easy to fix!
Closed: 13 years ago
Resolution: --- → FIXED
issue 2) above is bug 400552.
for issue 1) above, filed bug 402008 and bug 402013.
Blocks: abp
You need to log in before you can comment on or make changes to this bug.