Closed Bug 331510 Opened 14 years ago Closed 14 years ago

Add knowledge of subdomains to necko (create nsEffectiveTLDService)

Categories

(Core :: Networking, enhancement)


Tracking


RESOLVED FIXED
mozilla1.8.1

People

(Reporter: darin.moz, Assigned: pamg.bugs)

References

Details

Attachments

(3 files, 5 obsolete files)

Add knowledge of TLDs to necko

To deal with bugs like bug 319643, bug 154496, bug 142179 and of course bug 263931 and bug 252342, it would help if Firefox had some local knowledge of TLDs.  This is something we have avoided doing in the past because we did not want to hardcode knowledge of TLDs in the browser.  However, with the advent of auto-update (and binary patching), it may make sense to ship the browser with a hard-coded list of TLDs.  Of course, some users are not able to benefit from auto-update (readonly installations), so a solution along these lines is not quite perfect.  It may be good enough, however.
Target Milestone: --- → mozilla1.8.1
Yngve Pettersen has just published two internet drafts which try to solve this problem without having hard-coded TLD knowledge in the browser. If we were to unite behind one or both of these methods, might that be a better solution?

The former details what they are doing now; the latter what they would like to do in future.

The drafts are located at:
  http://www.ietf.org/internet-drafts/draft-pettersen-dns-cookie-validate-00.txt
  http://www.ietf.org/internet-drafts/draft-pettersen-subtld-structure-00.txt 

Gerv
I'm familiar with the former, but I'm too concerned with reliability and DNS performance to go that route.  When I investigated that solution a couple years back there were quite a few ccTLDs that had DNS entries (e.g., "co.tv").

The globally managed TLD list is very interesting.  We could implement that today, and have our products fetch the list from Mozilla.  They could do so asynchronously, purely as an update mechanism.  If we used the same format as described in that file, then it would be possible for Mozilla to set up a CNAME or an HTTP redirect in the future to cause browsers to fetch from the global location.
Darin: there may be a few TLDs which have DNS entries, but if it's only a small minority, then a) Opera's solution is a great improvement, and b) once we both implement it, we could do a bit of evangelism. This solution does have the advantage of being fairly low maintenance.

Still, IMO either would do, and both are better than doing our own thing.

Gerv
> Darin: there may be a few TLDs which have DNS entries, but if it's only a 
> small minority, then a) Opera's solution is a great improvement...

Remember that DNS can be very slow, especially for non-existent domains.  NXDOMAIN results from DNS queries are often not cached.  This solution might work really well if you have the right kind of DNS cache between you and the internet, but it could also be extremely bad for some users.  That's just my assertion based on experience, so I could be wrong.

Gerv, what's wrong with the suggestion I made in comment #2 about having mozilla temporarily host the global TLD list?
Nothing's wrong with your suggestion. If we use Opera's format in a way which makes it possible to migrate to a global list, we aren't "doing our own thing".

Do you want to contact Yngve and chat about it? His email address is yngve@opera.com.

Gerv
That makes a lot of sense. As long as we write our internals to expect a file in the format described in Yngve's second proposal, exactly where the file or files come from and when they're fetched (one monolithic file with updates, or individual files as a new TLD URI is encountered) should be straightforward to adjust.
Summary: Add knowledge of TLDs to necko → Add knowledge of subdomains to necko
One monolithic file makes most sense to me. One maintainer who takes email updates from community members and/or TLD operators will do a much better job than relying on 200+ overworked TLD-operator-employed peons.

Yngve's examples are a bit abstract. Has anyone checked that his syntax covers all cases? For example, the Japanese arrangements are notoriously complex - see bug 252342, comment 31.

Gerv
As I read it, Yngve's proposal currently doesn't provide a single file with all TLDs in it.  They're hosted either by individual TLD maintainers or at a single site, but in any case divided into files by TLD.

With the second option, it would be simple to gather a single file with all the TLDs.  With the first, there's no way to know what TLDs exist a priori, so we'd have to grab the files for a new TLD as we encountered it in a URI.  (Of course, we could seed with the most common ones.)

Anyway, yes, it looks to me like the syntax supports the complications of .jp, as long as someone who understands them well sets up the file properly.  I wouldn't want to volunteer for that job...
I wonder... is there an existing mini-language we can repurpose, rather than invent a new one?

Gerv
Yngve reports that his proposals are still highly preliminary, and that the final file format may end up changing significantly before the dust settles.

Given that, I'd say we should go ahead and write our TLD/sub-domain service with some easily handled, hard-coded (but incrementally updateable) list of the domains we know about for now, and keep an eye on this standard for future incorporation.

So, I'm thinking of simply pulling Jo's list at http://www.neuhaus.com/domaincheck/domain_list.htm (see bug 319643) into a file.  A source file that contains only an array of suffixes would be a good compromise between ease of implementation (since it's a temporary solution) and ease of maintenance (by people who don't have to read C++).
Yes, let's go ahead with that.  Gerv?
No longer blocks: 154496
We've had this problem for years - is there any reason it's suddenly become urgent to fix? I know Yngve's proposal is preliminary, but surely we can work with him to firm it up?

http://www.neuhaus.com/domaincheck/domain_list.htm is out of date (for example, it doesn't include .me.uk). What's our error behaviour going to be for errors of omission and commission?

http://wiki.mozilla.org/TLD_List seems like a better list, and more well-maintained. If one person can get that far in a relatively short time, I think having one central list maintained by us makes most sense. The amount of data per TLD is so small, I think one file (rather than 200 files) is better.

I also think it should be updated using an HTTP request every e.g. 30 days (like Yngve's proposal) rather than using Firefox update, which some distributors disable in favour of their own package management mechanisms.

Gerv
Gerv: The impetus is Places.  It wants a solution that would allow it to provide favorable groupings in the history UI.  Places currently has a very short hardcoded list that you would probably prefer not to look at ;-)  I've also wanted to fix the cookie problem for a long time.  I'm happy to shoot for a list that we download periodically from mozilla.org servers.
Also, in parallel I want to do whatever I can to help Yngve get his proposal through.  But, I don't think we can wait for that.
It's partly a question of how long we think Yngve's proposal will take to gel, combined with how often TLDs change, how many people would be affected by those changes, and how many people have update disabled.

If the proposal is likely to be done fairly soon (whatever that might mean), then spending a lot of time on our own solution wouldn't be worthwhile -- as you say,  the problem's been around for years.  The longer we predict that it'll take for a standardized set of files to be in place, or at least for their format to be finalized, the more it's worth doing now even though it might well need to be scrapped later.

Personally, I'm inclined to be optimistic, expect that we'll have a real standard before too long, and consequently put in a fairly stop-gap solution now.  But I may be suffering from idealism.
Pam: Agreed, but shipping FF2 w/ a hard-coded list of subdomains might be considered harmful over time.  Since we cannot auto-update everybody, we probably do have to build a system to download a fresh file periodically.  This is not hard to do in the Mozilla environment.
WHATWG's local storage APIs need this information (for roughly the same reasons cookies do), so it'd be great to at least define a Necko API for it soonish.
(The WHATWG design doesn't require such a list in the way cookies does, but yes, it could be made better if it has access to such a list.)

Is the solution being proposed here able to cope with domains like uk.com and dyndns.org, or would those still be susceptible to attack? If we address this I'd like us to do a thorough job and include such domains, although that would make the list less "formal".
That would be one reason to keep the list under our control, even if we did use a common format. Pseudo-registrars like uk.com could apply to have their domains added to the list.

Gerv
Darin: Why can't we auto-update everybody? Because people turn the auto-update off? 

Think about it this way. Because of folk who disable auto-update, we write a new piece of code that background downloads this list. That code is now a new complex entity that needs to be maintained and bugfixed. And privacy folk will no doubt howl for the moon and want it turned off. 

Since right now we're "super buggy", what's the disadvantage of being only "partially buggy" for new TLDs that get created after the user's last install?
The consequences of having an out-of-date TLD list could hardly be worse than the consequences of not being up to date with security fixes.
Correct me if I'm wrong, but I don't believe SeaMonkey supports auto-updates.
beng, roc: you guys make compelling arguments.  hmm...
SeaMonkey may not support auto-updates but it could, and in any case roc's argument still holds: out-of-date users are at greater risk from the security fixes they lack than from a stale TLD list. They will eventually update.

Note that such a list will never solve the cookie problem 100% as it also applies to a lesser extent to any DNS organization (e.g. hackedlab.school.edu setting a school.edu cookie to influence app at registration.school.edu). We need to implement Cookie2 and some way to expose the domain/path/etc information to web apps (such as turn document.cookie into an array of individual cookie objects, with a setter and toString to handle compatible uses; or a new document.cookies array if that's better).
Using auto-updating files is sounding like the way to go, then.  To allow local or individual modifications, we can look for a TLD list in the user's profile first, then fall back to a default file if none is found.

Summarizing from the thread on moz.dev.platform, the planned file format is

- One TLD or "TLD-like" subdomain is listed per line.
- An asterisk * wildcard matches any valid sequence of characters, and may only appear as an entire level of a domain.
- An exclamation mark ! at the start of a line indicates an exception to a previously encountered rule, where that subdomain should be used instead.
- The last matching rule in the list is the one that will be used.
- If no item in the list matches a given hostname, the level-1 domain is considered its TLD.
- A line is only considered up to the first whitespace, leaving the rest available for comments.
- Any line beginning with # is treated as a comment.

For example:

com  # 1st-level domains are not needed in the list, but may be included for completeness
uk
*.uk
be
ac.be
jp
ac.jp
...
*.hokkaido.jp  # hosts in .hokkaido.jp can't set cookies below level 4...
*.tokyo.jp
...
!metro.tokyo.jp
!pref.hokkaido.jp  # ...except hosts in pref.hokkaido.jp, which can set level 3
...
!city.shizuoka.jp

The primary points of difference between this and Yngve's proposal (see Comment #1) are that subdomains are required to be listed one per line here, that the TLDs must be explicit since they're all in one file, and that his wildcards of the form *1 would be expressed as *.*.
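For concreteness, the matching behaviour described above can be sketched against a flat rule list. This is a hypothetical simplification (the helpers Split, Matches, and EffectiveTLD are invented for illustration; the actual patch builds a subdomain tree), but it follows the same wildcard, exception, and last/longest-match rules:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split a dotted name into labels: "a.b.c" -> {"a", "b", "c"}.
static std::vector<std::string> Split(const std::string& name) {
  std::vector<std::string> labels;
  std::stringstream stream(name);
  std::string label;
  while (std::getline(stream, label, '.')) labels.push_back(label);
  return labels;
}

// Does |rule| match a suffix of |host|?  '*' matches any single label.
static bool Matches(const std::vector<std::string>& rule,
                    const std::vector<std::string>& host) {
  if (rule.size() > host.size()) return false;
  for (size_t i = 0; i < rule.size(); ++i) {
    const std::string& r = rule[rule.size() - 1 - i];
    if (r != "*" && r != host[host.size() - 1 - i]) return false;
  }
  return true;
}

// Return the effective TLD of |hostname| under |rules|:
//  - a matching "!" exception rule wins, and drops its leftmost label;
//  - otherwise the longest matching rule is the effective TLD;
//  - with no match at all, the level-1 domain is the effective TLD.
static std::string EffectiveTLD(const std::vector<std::string>& rules,
                                const std::string& hostname) {
  std::vector<std::string> host = Split(hostname);
  size_t bestLen = 1;  // default: level-1 domain
  for (const std::string& raw : rules) {
    bool exception = !raw.empty() && raw[0] == '!';
    std::vector<std::string> rule = Split(exception ? raw.substr(1) : raw);
    if (!Matches(rule, host)) continue;
    if (exception) { bestLen = rule.size() - 1; break; }
    if (rule.size() > bestLen) bestLen = rule.size();
  }
  // Reassemble the last |bestLen| labels as the effective TLD.
  std::string result;
  for (size_t i = host.size() - bestLen; i < host.size(); ++i) {
    if (!result.empty()) result += '.';
    result += host[i];
  }
  return result;
}
```

Under these rules, a host like www.bbc.co.uk yields "co.uk" (the "*.uk" wildcard is the longest match), while www.pref.hokkaido.jp yields "hokkaido.jp" (the "!pref.hokkaido.jp" exception overrides "*.hokkaido.jp" and drops its first label).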

Pam: a bit late, perhaps, to say so, but I like your proposal. It seems like the simplest thing that will do the job, but no simpler.

Gerv
Thanks, Gerv.  It's not too late -- I got distracted by searchbox improvements for a2, so I haven't done more than barely start this, but it's still high on my list.
Then it may be worth pointing out that Yngve is now probably changing to an XML-based format for flexibility. It may be worth the two of you syncing up again quickly - either one will persuade the other, or at least if you agree to differ you will do so understanding all the issues.

Gerv
In email, Yngve reports that although he has plans to switch to an XML-based format, they're presently stalled by author time, IETF organizational issues, and pending technical design questions.  He doesn't have a new preliminary format, and he doesn't sound optimistic that anything new will be available in the near future.

So I'm going to continue with the format described above, and trust that we can either switch our internal parser to whatever wider standard comes into existence in the indeterminate future or write a converter to produce the format we want.
Sounds good to me.
Attached patch First shot at it (obsolete) — Splinter Review
See http://wiki.mozilla.org/Gecko:TLD_Service for a description of the final expected file format and parsing behavior.
Attachment #223520 - Flags: review?(darin)
Comment on attachment 223520 [details] [diff] [review]
First shot at it

netwerk/dns/public/nsITLDService.idl

shouldn't this use AUTF8String, given IDN?
I think we are making a nomenclature mistake here.

In every other context I've ever encountered it, "TLD" (Top Level Domain) means solely .uk, .com, .museum etc. - i.e. a single label. Here, we have overloaded it to mean "multi-label domain name which is permitted to have cookies set on it", which is something entirely different. Each dot-separated part of a hostname is known as a "level", and so "top level" should mean exactly that.

I would suggest you remove all references to "TLD" from the names of the files, functions and interfaces. I'm not certain about what the best replacement terminology is (without using references specific to cookies), but perhaps "Privately Controlled Domain" is a good name for what you get when you strip a hostname down as far as you can without getting into those bits of the DNS which are available for public registration?

getMinPrivatelyControlledLengthForHost()?

This isn't very good - I'm sure we can think of better - but I think using "TLD" terminology is a mistake.

Gerv
My second question is about the file format. I thought that the "!" notation meant "not", as in logical not, and that made great sense to me. However, the documentation currently says that it means "important", with no sense of the inverted meaning. So if you have:

*.hokkaido.jp

This means "you can't set cookies for domains one level below hokkaido.jp". Then if you have:

!pref.hokkaido.jp

This means "you can't set cookies for the domain pref.hokkaido.jp, and that's really important". In other words, this rule is redundant given the previous one.

I'm fairly sure that's not what's meant, but I think the spec is very unclear on this point.

Gerv
I agree with comment #34.

How about "Public Domain Suffix", abbreviated to PDS?
The problem with Public Domain Suffix is that it's not the length of the Public part of the Domain that the API returns, it's the length of the public part plus the first private label - i.e. the bit you _can_ set a cookie for.

Gerv
How about simply "effective TLD"?  That makes it clear that it's not a true TLD, but it's also a name that people will be able to understand without any additional explanation.  "Public Domain Suffix" is subject to confusion with "Public-Domain Suffix", and most people would have to have it explained to know what it was.

(In reply to comment #37)
> The problem with Public Domain Suffix is that it's not the length of the Public
> part of the Domain that the API returns, it's the length of the public part
> plus the first private label - i.e. the bit you _can_ set a cookie for.

It should be the length of the public part.  If you give it "foo.com", it returns 3, for "com".  If it's not doing that for some cases, that's a bug I need to fix.
(In reply to comment #35)
> My second question is about the file format. I thought that the "!" notation
> meant "not", as in logical not, and that made great sense to me. However, the
> documentation currently says that it means "important", with no sense of the
> inverted meaning. So if you have:
> 
> *.hokkaido.jp
> 
> This means "you can't set cookies for domains one level below hokkaido.jp".
> Then if you have:
> 
> !pref.hokkaido.jp
> 
> This means "you can't set cookies for the domain pref.hokkaido.jp, and that's
> really important". In other words, this rule is redundant given the previous
> one.

The ! has two simultaneous meanings: both "not" (*.hokkaido.jp matches any subdomain of hokkaido.jp, except that it does not match pref.hokkaido.jp) and "important" in the CSS sense (the *.hokkaido.jp rule is overridden by the pref.hokkaido.jp rule).  I'll make the documentation clearer.
I misread the spec; ignore my objection to the name Public Domain Suffix.

Pam: I think "effective TLD" would still suffer from being confused with a TLD, which these things are not. The amount of explaining which is necessary would be the same or more than that required by "Public Domain Suffix". I do think we need to disentangle the idea of a Top Level Domain from the idea of a domain in which cookies can or cannot be set - even if it means renaming all the files, wiki pages and API calls. Sorry :-)

So can we do better than Public Domain Suffix? How about Public Registration Suffix, i.e. the suffix above which public registrations are permitted? Or even just "Public Suffix"?

Gerv
About the ! - having two meanings seems somewhat complex. I had rather assumed that the file worked in a "cascade" order, a la CSS - i.e. that later rules overrode earlier rules. Would it be possible to make it work that way instead? I think that would be both easier to understand and more flexible.

Gerv
So how about the obvious compromise: Public-Level Domain, PLD? :-)
I'd buy that :-)

Gerv
Taking a cue from the Summary of this bug, what about "Public Subdomain"?  "Public Domain"?

(In reply to comment #41)
> About the ! - having two meanings seems somewhat complex. I had rather assumed
> that the file worked in a "cascade" order, a la CSS - i.e. that later rules
> overrode earlier rules. Would it be possible to make it work that way instead?
> I think that would be both easier to understand and more flexible.

It's not really that ! has two meanings, so much as that you can view it in two different ways that completely overlap: every "not" rule is also an "important" override.  If that's confusing, then forget about the "important" part.  I've already changed the wiki to read "exception" instead of "important" throughout, and I'll be changing the source too.  The fact that one rule is an exception to another rule is the, er, important part.  The other wording was just a side effect of how my brain works; forget I ever mentioned it. :)

It would be possible for the rules to cascade based on their position in the file, but it would be slower, and the flexibility it would permit is overkill.  Even the complexity of .jp is covered by allowing individual exceptions to single-level-wildcard rules.

Most of the bullet points in the wiki are an attempt to document the behavior thoroughly in case someone puts in an oddball file sometime and wonders what's going on.  In practice, a real subdomain description file should be much more straightforward than that page might imply.
I'd like to retain the concept of either "Level" or "Prefix"... A "subdomain", again, is usually thought to be a single level - e.g. "I own the subdomain gerv in the TLD .net". (I'm less sure about this one, though.) So, in descending order of preference:

Public-Level Domain
Public Registration Suffix
Public Domain
Public Subdomain
...
Top Level Domain

But to be honest, any of them would do except that last one. Gotta go into hospital now; I'm sure you'll cope without me :-)

Gerv
Comment on attachment 223520 [details] [diff] [review]
First shot at it

>Index: netwerk/dns/public/nsITLDService.idl

>+interface nsITLDService : nsISupports
...
>+    PRInt32 getHostTLDLength(in ACString aHostname);

The ACString should be an AUTF8String instead since aHostname may
be non-ASCII (aka UTF-8).

I think it would be nice to define a getHostTLD helper function as
well that returns the TLD as a string:

      AUTF8String getHostTLD(in AUTF8String aHostname);


>Index: netwerk/dns/src/nsTLDService.cpp

>+// For a complete description of the expected file format and parsing rules, see
>+// http://wiki.mozilla.org/Gecko:TLD_Service

Might be good to prefix this comment a bit to explain that this service
reads its TLD rules from a file.


>+  stopOK: If true, this node marks the end of a rule.
>+  important: If true, this node marks the end of an important rule.

It's a tad confusing that the comments below mention "stop" and "imp"
instead of "stopOK" and "important".


>+SubdomainNode mSubdomainTreeHead;

The "m" prefix is generally reserved for member variables.  For statics,
we generally use a "s" prefix:

  static SubdomainNode sSubdomainTreeHead;

That said, maybe this should be a member variable.  As we discussed, you
could then allocate the nsTLDService class statically as well.

See nsThreadManager for an example of allocating an XPCOM singleton
statically.  Note the AddRef and Release implementations.

I think it is also important to cleanup the hashtable on shutdown since
our tools look for objects left at shutdown when looking for objects that
may have leaked during a browser session.  Also, when Gecko is embedded,
it may be the case that NS_ShutdownXPCOM does not correspond to the app
shutting down (although that is unlikely).


>+// nsTLDService::Init
>+//
>+// Initializes the root subdomain node and loads the TLD file.
>+nsresult
>+nsTLDService::Init()
>+{
>+  mSubdomainTreeHead.important = PR_FALSE;
>+  mSubdomainTreeHead.stopOK = PR_FALSE;
>+  mSubdomainTreeHead.children.Init();

What if Init fails?  (i.e., out of memory)

It is generally a good idea to catch those errors and return failure.


>+
>+  nsresult rv = LoadTLDFile();
>+  return rv;

nit: "return LoadTLDFile();"


>+// nsTLDService::GetHostTLDLength
>+//
>+// The main external function: finds the length in bytes of the TLD for the
>+// given hostname.
>+NS_IMETHODIMP
>+nsTLDService::GetHostTLDLength(const nsACString &aHostname, PRInt32 *TLDLength)
>+{
>+  // Calculate a default length: either the level-0 TLD, or the whole string
>+  // length if no dots are present.
>+  PRInt32 nameLength = FindEndOfName(aHostname);

I think this can just be aHostname.Length() since you shouldn't
need to worry about trailing whitespace in the given value.


>+  if (dotLoc < 0)
>+    defaultTLDLength = nameLength;
>+  else
>+    defaultTLDLength = nameLength - dotLoc - 1;

nit: favor brackets when coding an else block


>+PRInt32
>+FindEarlierDot(const nsACString &aHostname, PRUint32 aDotLoc) {
>+  if (aDotLoc < 0)
>+    return -1;
>+
>+  // Searching for '.' one byte at a time is fine since UTF-8 is a superset of
>+  // 7-bit ASCII.
>+  const char *start;
>+  PRUint32 endLoc = NS_CStringGetData(aHostname, &start) - 1;

The APIs defined in nsXPCOMStrings.h are for use outside of Mozilla
by extension authors and embedders (when MOZILLA_INTERNAL_API is not
defined).  For internal code such as this you should just avoid the
functions declared in that header file (it's more efficient that way).

The following is similar:

  nsACString::const_iterator iter;
  aHostname.BeginReading(iter);

  const char *start = iter.get();
  PRUint32 endLoc = iter.size_forward();

(I think the string code should define a RFindChar method that you
could use to avoid having to roll your own like this!)


>+FindEndOfName(const nsACString &aHostname) {
...
>+  PRUint32 length = NS_CStringGetData(aHostname, &start);

Again, avoid NS_CStringGetData in internal code.


>+SubdomainNode *
>+FindMatchingChildNode(SubdomainNode *parent, const nsACString &aSubname,
>+                                     PRBool aCreateNewNode)
>+{
>+  nsCString name(aSubname);
>+  PRBool important = PR_FALSE;
>+
>+  // Is this node important?
>+  if (StringBeginsWith(aSubname, NS_LITERAL_CSTRING("!"))) {

Can you use the .First() method here?  It gets angry if aSubname is
empty, so watch out for that.

Also, once you have your nsACString available as a nsCString (see
"name" above), it is highly preferrable to use that directly instead
of the nsACString type.  nsCString's methods for accessing the string's
data are mostly inlined and much faster to call.


>+    important = PR_TRUE;
>+    name = Substring(aSubname, 1, aSubname.Length());

You can also call: name.Cut(0, 1)


>+  // Look for an exact match.
>+  SubdomainNode *result = nsnull;
>+  SubdomainNode *match;
>+  if (parent->children.Get(name, &match)) {
>+    result = match;
>+  }
>+
>+  // If the subname wasn't found, optionally create it.
>+  else if (aCreateNewNode) {
>+    SubdomainNode *node = new SubdomainNode;

check for out of memory


>+    node->children.Init();
>+    parent->children.Put(name, node);

ditto


>+// AddTLDEntry
>+//
>+// Adds the given domain name rule to the TLD tree.
>+nsresult
>+AddTLDEntry(const nsACString &aDomainName) {
>+  SubdomainNode *node = &mSubdomainTreeHead;
>+
>+  PRInt32 dotLoc = FindEndOfName(aDomainName);
>+  while (dotLoc > 0) {
>+    PRInt32 nextDotLoc = FindEarlierDot(aDomainName, dotLoc - 1);
>+    const nsACString &subname = Substring(aDomainName, nextDotLoc + 1,
>+                                          dotLoc - nextDotLoc - 1);

change that to |const nsCSubstring &subname| as its methods will be
more efficient to call than those on nsACString.  And, then you could
change FindMatchingChildNode to take a nsCSubstring instead too.


>+  // Open the file as an input stream.
>+  nsCOMPtr<nsIFileInputStream> fileStream =
>+      do_CreateInstance(NS_LOCALFILEINPUTSTREAM_CONTRACTID, &rv);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+
>+  // 0x01 == read-only mode
>+  rv = fileStream->Init(tldFile, 0x01, -1, nsIFileInputStream::CLOSE_ON_EOF);
>+  NS_ENSURE_SUCCESS(rv, rv);

I recommend using NS_NewLocalFileInputStream() from nsNetUtil.h


>+  nsCOMPtr<nsILineInputStream> lineStream = do_QueryInterface(fileStream, &rv);
>+  NS_ENSURE_SUCCESS(rv, rv);
>+
>+  nsCAutoString lineData;
>+  PRBool moreData = PR_TRUE;
>+  while (moreData) {
>+    rv = lineStream->ReadLine(lineData, &moreData);
>+    if (NS_SUCCEEDED(rv)) {
>+      if (! lineData.IsEmpty() &&
>+          ! StringBeginsWith(lineData, NS_LITERAL_CSTRING("//"))) {

I suggest moving the NS_LITERAL_CSTRING outside of the loop.  Use the
NAMED version like this:

  NS_NAMED_LITERAL_CSTRING(commentStart, "//");

That way you do a little less work inside the loop.


>Index: netwerk/dns/src/nsTLDService.h
...
>+  static nsTLDService* GetTLDService()

This seems to be unused.
Attachment #223520 - Flags: review?(darin) → review-
Personally, I think I like "effective TLD" the best.  With a little documentation, we can easily make clear the fact that this is not the actual TLD, as that is never going to have a dot in it.

interface nsITLDService : nsISupports {
  /**
   * Get the effective top-level domain from the given hostname.  The effective
   * TLD is the top-most domain under which domains may be registered.  The 
   * effective TLD may therefore contains dots.  For example, the effective TLD
   * of "www.bbc.co.uk" is "co.uk" because the "uk" ccTLD does not permit the 
   * registration of "bbc.uk" as a domain.  Similarly, the effective TLD of
   * "www.cnn.com" is "com".
   */
  AUTF8String getEffectiveTLD(in AUTF8String hostname);

  /**
   * blah blah...
   */
  PRUint32 getEffectiveTLDLength(in AUTF8String hostname);
};

I think explaining "effective TLD" to someone is much easier than explaining "public-level domain" because the latter doesn't stick in the mind as easily.  Moreover, PLD as an acronym is not very clear (nsIPLDService!?!) and reminds me of PLDHash!

That's just my opinion.  I'd like to know what others think.
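As an aside on the length semantics: if getEffectiveTLDLength() returns the byte length of the effective-TLD suffix (3 for "foo.com", as discussed above), then a caller can recover the string form with a plain suffix operation. A hypothetical helper sketching that relationship (ETLDFromLength is not part of the proposed interface):

```cpp
#include <cassert>
#include <string>

// Given a hostname and the byte length of its effective TLD, return the
// effective TLD itself: it is simply that many bytes off the end of the
// hostname, since the effective TLD is always a suffix.
static std::string ETLDFromLength(const std::string& hostname,
                                  size_t etldLength) {
  assert(etldLength <= hostname.size());
  return hostname.substr(hostname.size() - etldLength);
}
```

So a string-returning getEffectiveTLD() would amount to getEffectiveTLDLength() plus this substring step.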
(In reply to comment #46)

All done, except:

>   static SubdomainNode sSubdomainTreeHead;
> 
> That said, maybe this should be a member variable.  As we discussed, you
> could then allocate the nsTLDService class statically as well.
 
I can't think of a compelling reason to change it to a static object since it's already working this way, but I'll keep the idea in mind for future situations.

> I recommend using NS_NewLocalFileInputStream() from nsNetUtil.h

Done; but that returns an nsIInputStream instead of the nsIFileInputStream I was using before.  What are the tradeoffs between the two?

Also, please take a close look at the error handling.  I tried to make reasonable decisions about when to abort on an error and when to attempt to recover, but I'm not sure I'm conforming to all the conventions.
Attachment #223520 - Attachment is obsolete: true
Attachment #224274 - Flags: review?(darin)
Attached file Sample TLD rule file for testing (obsolete) —
Drop this in your "dist/bin/res/" directory for testing.
Attached file JS wrapper for testing (obsolete) —
This JS file can be used with xpcshell to test the Effective TLD Service.
Also, I renamed it to "Effective TLD Service" and changed all the references to pseudo-TLDs to "effective TLD".  It can change again, though.
A few comments on http://wiki.mozilla.org/Gecko:Effective_TLD_Service with the caveat that I've looked at the wiki documentation but not at the patch.  That document specifies a few things regarding case insensitivity that seem dangerously vague: "Rules are not case-sensitive" and "All matching is case-insensitive."  Is your definition of case-equivalence that of ASCII, a specific version of Unicode (ugly, but what IDN does), or the latest version of Unicode?

It might be a good idea to say that all entries in the file are normalized according to RFC 3491, which specifies the combination of case folding to lowercase, NFKC Unicode normalization, and some other normalization.  This might prevent people accidentally putting the wrong form in the file.

For example, consider all 16 combinations of the following 4 strings, taking one as the end of the domain to be matched and another as an entry in the file [this is in UTF-8]:

ελλάδα.gr
ελλάδα.gr
ΕΛΛΆΔΑ.GR
ΕΛΛΆΔΑ.GR
(In reply to comment #52)
> Is your definition of case-equivalence that of ASCII, a
> specific version of Unicode (ugly, but what IDN does), or 
> the latest version of Unicode?

I'd been wondering about that issue myself.  It normalizes both the file entries and the incoming hostname, currently using ToLowerCase() from nsReadableUtils.cpp.  But now that I know what to search for, it looks like it should use nsIDNService::stringPrep instead.  I'll switch to that and clarify the docs.
Following further discussions with Darin, I'm removing all the additional convenience functions from the interface, dropping back to the original getEffectiveTLDLength() only.  Others can be added back in if someone actually needs them.  In addition, the user's rule file is appended to the system file, rather than replacing it; and rules and hostnames are normalized according to RFC 3454 using the IDN service.

I'd still appreciate a close look at the error-handling.
Attachment #224274 - Attachment is obsolete: true
Attachment #224730 - Flags: review?(darin)
Attachment #224274 - Flags: review?(darin)
Includes non-ASCII rules
Attachment #224275 - Attachment is obsolete: true
Attached file JS wrapper for testing
(UTF-8): tests non-ASCII hostnames
Attachment #224276 - Attachment is obsolete: true
Comment on attachment 224730 [details] [diff] [review]
Patch addressing Darin's and David's comments

>Index: netwerk/dns/public/nsIEffectiveTLDService.idl

>+interface nsIURI;

nit: this forward decl can go away


>+     * The hostname will be normalized using nsIDNService::Normalize, which

nit: nsIIDNService::normalize ... two "I"'s :)


>+     * follows RFC 3454.  getEffectiveTLDLength() will fail, generating an
>+     * error, if the hostname contains characters that are invalid in URIs.

nit: Perhaps this error should be documented using a @throws section.


>+    PRInt32 getEffectiveTLDLength(in AUTF8String aHostname);

nit: perhaps PRUint32 is better since the length is never negative.


>Index: netwerk/dns/src/nsEffectiveTLDService.cpp

>+nsEffectiveTLDService* nsEffectiveTLDService::sInstance = nsnull;
>+SubdomainNode sSubdomainTreeHead;
>+
>+// Semaphore to prevent using tree before it is fully loaded.
>+SubdomainSemaphore sSubdomainTreeLoadState = kSubdomainTreeNotLoaded;

I think that sSubdomainTreeHead and sSubdomainTreeLoadState both
should be declared static.


>+void LOG_EFF_TLD_TREE(const char *msg = "",
>+                  SubdomainNode *head = &sSubdomainTreeHead)
>+{
>+#if defined(EFF_TLD_SERVICE_DEBUG) && EFF_TLD_SERVICE_DEBUG
>+  if (msg && msg != "")
>+    printf("%s\n", msg);
>+
>+  PRInt32 level = 0;
>+  head->children.EnumerateRead(LOG_EFF_TLD_NODE, &level);
>+#endif // EFF_TLD_SERVICE_DEBUG
>+
>+  return;
>+}

nit: kill unnecessary return statement?


>+nsresult
>+nsEffectiveTLDService::Init()
>+{
>+  // Simple lock while file is loading.
>+  if (sSubdomainTreeLoadState == kSubdomainTreeLoading)
>+    return NS_ERROR_NOT_INITIALIZED;
>+
>+  if (sSubdomainTreeLoadState == kSubdomainTreeLoaded)
>+    return NS_OK;
>+
>+  sSubdomainTreeLoadState = kSubdomainTreeLoading;
>+
>+  sSubdomainTreeHead.exception = PR_FALSE;
>+  sSubdomainTreeHead.stopOK = PR_FALSE;
>+  nsresult rv = NS_OK;
>+  if (!sSubdomainTreeHead.children.Init())
>+    rv = NS_ERROR_OUT_OF_MEMORY;
>+
>+  if (NS_SUCCEEDED(rv))
>+    rv = LoadEffectiveTLDFiles();
>+
>+  if (NS_SUCCEEDED(rv)) {
>+    sSubdomainTreeLoadState = kSubdomainTreeLoaded;
>+  }
>+  else {
>+    // In the event of an error, clear any partially constructed tree and reset
>+    // the semaphore.
>+    EmptyEffectiveTLDTree();
>+  }

I'm confused by the use of this lock flag.  How can
nsEffectiveTLDService::Init be called when sSubdomainTreeLoadState
is equal to kSubdomainTreeLoading?  Also, given that Init is called
by your XPCOM factory constructor, it should only ever be called
once.  Since this object is not threadsafe (non-threadsafe addref
and release), I'm left being confused by the locking.


>+void
>+EmptyEffectiveTLDTree()
>+{
>+  if (sSubdomainTreeHead.children.IsInitialized()) {
>+    sSubdomainTreeHead.children.EnumerateRead(DeleteSubdomainNode, nsnull);
>+    sSubdomainTreeHead.children.Clear();
>+  }
>+  sSubdomainTreeLoadState = kSubdomainTreeNotLoaded;
>+
>+  return;
>+}

nit: return statement unnecessary, but now i see a pattern... if
you are consistent with the usage, then no prob.  i just prefer
less to read ;-)


>+nsresult NormalizeHostname(nsACString &aHostname)

I recommend changing this to nsCString instead because it will be
slightly more efficient.  You will then need to change AddEffectiveTLDEntry
to also take a nsCString ref, but that should be do-able.  The IsASCII
and ToLowerCase operations will be slightly more efficient if you work
with nsCString instead of nsACString.


>+{
>+  nsACString::const_iterator iter;
>+  aString.BeginReading(iter);
>+  *start = iter.get();
>+  *length = aString.Length();

Use iter.size_forward() here instead of aString.Length() because
it will be more efficient (inline accessor instead of a function
call).


>+SubdomainNode *
>+FindMatchingChildNode(SubdomainNode *parent, const nsCSubstring &aSubname,
>+                                     PRBool aCreateNewNode)
>+{
>+  // Make a mutable copy of the name in case we need to strip ! from the start.
>+  nsCString name(aSubname);

I recommend using nsCAutoString here instead so that you leverage the
stack for string buffer storage.


>+    if (NS_SUCCEEDED(rv)) {
>+      rv = effTLDFile->AppendNative(EFF_TLD_FILENAME);
>+      NS_ENSURE_SUCCESS(rv, rv);
>+
>+      rv = effTLDFile->Exists(&exists);
>+      NS_ENSURE_SUCCESS(rv, rv);
>+    }
>+  }
>+  else {
...
>+    rv = effTLDFile->AppendNative(EFF_TLD_FILENAME);
>+    NS_ENSURE_SUCCESS(rv, rv);
>+
>+    rv = effTLDFile->Exists(&exists);
>+    NS_ENSURE_SUCCESS(rv, rv);
>+  }

It looks like this code could be factored out slightly and
then shared.  Might be good for saving on codesize.  Every
little bit helps! :-)


r=darin w/ nits picked
Attachment #224730 - Flags: review?(darin) → review+
I'd like to continue to work on http://wiki.mozilla.org/TLD_List , using the rulesets defined here. However, I won't have time until the end of July, because I'll be in the Colombian jungle until then.

Note: I'm working for a big telecommunications company, and my bosses are also interested in this work, especially Yngve's draft. Maybe it will be used in VoIP too, someday.
(In reply to comment #57)

All nits picked, except:

> nit: return statement unnecessary, but now i see a pattern... if
> you are consistent with the usage, then no prob.  i just prefer
> less to read ;-)

A |return;| at the end of a void function is like a period at the end of a sentence. :)  I prefer to make it visually and conceptually clear that the function was intended to end there.

> >+nsresult NormalizeHostname(nsACString &aHostname)
> 
> I recommend changing this to nsCString instead because it will be
> slightly more efficient.

Done.  Should the string guide (http://developer.mozilla.org/en/docs/XPCOM:Strings#As_function_parameters and the table at the end of that page) be changed?  It currently says, "It is recommended that you use the most abstract interface possible as a function parameter, instead of using concrete classes."
Attachment #224730 - Attachment is obsolete: true
(In reply to comment #58)
> I'd like to continue to work on http://wiki.mozilla.org/TLD_List , using the
> rulesets defined here. However, I won't have time until the end of July,
> because I'll be in the Colombian jungle until then.

I'm really glad you're taking that on! If someone needs a sample file larger than the one attached here before then, I can put one together.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
No longer blocks: 263931
Blocks: 342314
(In reply to comment #58)
> I'd like to continue to work on http://wiki.mozilla.org/TLD_List , using the
> rulesets defined here.

Bug 342314 now tracks creating a rule list for this service.
I don't really believe the blacklist approach would work.

The problem is that it will only work for "official" TLDs, not unofficial ones. I think unofficial pseudo-TLDs (like .com.be) should also be supported. Websites registered in those domains also have a right to security. And it would be quite hard to define a border for this. What about things like .geocities.com?

I don't know a lot about this, but when using `dig` (DNS lookup) to look up some domains, it seems that the ;; AUTHORITY SECTION always contained exactly the domain at which I would limit cookies etc.

What's wrong with this?
When using my normal DNS server (a cheap DSL router), I get no authority section at all.
When using my ISP's DNS server directly, I get the list of root servers as the authority section.
A very limited test, but it shows that you can't really trust that section: it might not be available, or it might not work.
Flags: blocking1.8.1?
Not a blocker, but if something is well baked and ready for release, please request an approval on 181.
Flags: blocking1.8.1? → blocking1.8.1-
This new interface has been on the trunk for quite a while, but it hasn't had any data file, and nothing actually uses it.  It has therefore had little to no testing apart from what I did in development, as far as I know.

I'm not sure what benefit would be had from landing it on the branch, but there should likewise be little harm apart from code size.
Please reconsider this for FF2, the globalStorage feature really needs this data.
Flags: blocking1.8.1- → blocking1.8.1?
Comment on attachment 225038 [details] [diff] [review]
Patch as checked in

dveditz, is globalStorage going to have time to start using this in FF2?

Since there's a need, I guess there's no better way to get more testing than to let it bake in beta 2 for a while.  Requesting approval.
Attachment #225038 - Flags: approval1.8.1?
Given that there's no data file yet, what's the point?  Is that expected to appear soon?
(In reply to comment #68)
> Given that there's no data file yet, what's the point?  Is that expected to
> appear soon?

Yes, work on that is progressing in bug 342314.
OK, so we'll consider the patch if there's a data file, but it's not a blocker.
Flags: blocking1.8.1? → blocking1.8.1-
Even the preliminary data file currently posted for bug 342314 would be a significant improvement (once the bad non-ASCII characters are fixed).  Any domains not listed in it simply fall back to the current behavior.

The patch here has been baking since June.
Plusing to keep on the radar if we can get a working data file and a patch together with low enough risk for FF2
Flags: blocking1.8.1- → blocking1.8.1+
Comment on attachment 225038 [details] [diff] [review]
Patch as checked in

Clearing the approval request as per previous driver comments; when there's a working data file that's been tested on the trunk, please renom!
Attachment #225038 - Flags: approval1.8.1?
Also, it seems it's actually better not to take this if we don't have a data file: if we ship in a state that has the code but not the data file, it will be harder for an extension that wants to use the code (when present) to detect whether the service is really usable...
Talking to DVeditz today, this is not wired up to anything on the 1.8 branch, and we don't have a patch ready to hook it up to cookies.  So removing this from the 1.8 list...
Flags: blocking1.8.1+ → blocking1.8.1-
Summary: Add knowledge of subdomains to necko → Add knowledge of subdomains to necko (create nsEffectiveTLDService)
Blocks: 367373
Blocks: 368700
So this interface isn't really usable from JavaScript.  At least not if you want to use it with non-ASCII TLDs...  See bug 368989.
Flags: wanted1.8.1.x+
You need to log in before you can comment on or make changes to this bug.