Closed Bug 319643 Opened 14 years ago Closed 14 years ago

TLD list: fill out with the rest

Categories

(Firefox :: Bookmarks & History, defect, P4)

RESOLVED DUPLICATE of bug 342314
Firefox 3 alpha2

People

(Reporter: brettw, Unassigned)


Places uses a list of TLDs such as ".com" and ".co.uk" to group by site (so all of mozilla.org is grouped together, even developer.mozilla.org). The result is greatly improved grouping, even though it doesn't work in all cases.

Currently, there are only a handful of the most common ones, which I manually typed in for testing. This list should be expanded to include all current TLDs, and possibly moved somewhere else where it can be shared.
I think it might make sense to put a service in netwerk/dns that can be used to extract the TLD of a hostname.  We may wish to make use of the same service in the cookies implementation (netwerk/cookies).
That shouldn't be that difficult. Information for the TLD-list could be gathered from bookmarks, cookies, history at startup time or upon request or written to an ever growing file from each resolved DNS-entry. I'll read up on Places first ...

I think it is a great idea to group cookies this way. Ever tried to find a cookie that you blocked months ago but now the site is requiring it? Happy searching ...
This is hard, because there isn't a standard. See bug 8743 (and bug 66383), which discusses the cookie RFC requirement to not access 'top-level domains'. Note the bugsplat comments from 1998.

OTOH, if we add this, then we should fix that bug, too.

I just think that keeping a list of the policies of all TLDs is a maintenance pain - some have the second level but some don't. You get to come up with a sane strategy for intranet pages, too - eg http://staff/ ;)

Anyone got a screenshot of the places stuff in question here? I haven't played with it yet.
This might help bug 252342, but we can't have a Firefox bug blocking a Core bug.
Some more details since this has already gotten a lot of comments:

We can't extract this information from the user's visited sites. In group by domain, we want to treat foo.co.uk and bar.com as having the same depth, so you'll see these two domains in your domain list. Anything like images.bar.com would go under bar.com. We don't want to make you expand ".co.uk" just to see the UK sites you visited. This means we have to know that ".com" and ".co.uk" are special.

For unknown TLDs, we'll just treat them as host names, so if you go to an intranet host "mail", that will appear in the list alongside "mozilla.org", etc. If you have intranet hosts with the same names as TLDs and are typing them as non-fully qualified, you're asking for problems.

It would obviously be ideal to have a list in netwerk which could also be used for cookies. In previous discussions, this has been a contentious issue because the list can change. We should certainly try to come up with a good solution that makes everybody happy. The best thing to do is probably to have some kind of complicated autoupdating service.

But for places, the bar to pass is much lower than for cookies since this only controls grouping in one of the possible history views. If ICANN adopts a new .xxx domain after a release, we'll treat .xxx as a "domain", and every domain you visit in there will be grouped under .xxx in the domain view. I don't think this is a huge deal.
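The grouping rule described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in nsNavHistory.cpp: the names GroupKeyForHost and kSampleTlds are made up, the list holds just three sample entries, and hosts are assumed to be lower case with no trailing dot.

```cpp
#include <set>
#include <string>

// Illustrative sample only; a real list would be far longer.
const std::set<std::string> kSampleTlds = {"com", "org", "co.uk"};

// Hypothetical helper (not the nsNavHistory code): returns the name a host
// is grouped under. List entries are written without a leading dot.
std::string GroupKeyForHost(const std::string& host,
                            const std::set<std::string>& tlds) {
  std::string::size_type labelStart = 0;  // start of the label before the tail
  std::string::size_type dot = host.find('.');
  while (dot != std::string::npos) {
    // Tails are tried longest first, so "co.uk" is found before "uk".
    if (tlds.count(host.substr(dot + 1)) > 0) {
      // Keep exactly one label in front of the known TLD.
      return host.substr(labelStart);
    }
    labelStart = dot + 1;
    dot = host.find('.', labelStart);
  }
  // No known TLD: the last label is treated like a host name, so "mail"
  // stays "mail" and an unlisted "example.xxx" lands under "xxx".
  return host.substr(labelStart);
}
```

With the sample list, "images.bar.com" groups under "bar.com", "foo.co.uk" stays "foo.co.uk", an intranet host "mail" stays "mail", and a host in an unlisted TLD such as "example.xxx" falls under "xxx", which is the behaviour described for a TLD added after a release.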
Jo - that's a dupe :)

Can we do something heuristic based on how many different sites there are under the subdomain? That would fix the common case of local sites and common domains. We then couldn't use that for fixing bug 8743, but maybe that's OK.

How far does the definition of 'top level' go? Are there any countries where some pseudo-TLDs have two dots, but others have three? Australia has <state>.gov.au, so if taken to extremes... (Although that would be tricky since federal government sites are just <site>.gov.au. That's extra complexity that probably isn't worth it.)
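One way the counting heuristic suggested above could look, as a rough sketch only: for every dot-separated tail seen in the visited hosts, count the distinct labels directly in front of it, and treat a tail as TLD-like once that count reaches a threshold. The function name GuessTldLikeSuffixes and the minChildren threshold are assumptions made for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical heuristic: a tail such as "co.uk" that has many distinct
// labels directly in front of it is guessed to be TLD-like.
std::set<std::string> GuessTldLikeSuffixes(
    const std::vector<std::string>& hosts, std::size_t minChildren) {
  // tail -> distinct labels seen immediately to its left
  std::map<std::string, std::set<std::string>> children;
  for (const std::string& host : hosts) {
    std::string::size_type labelStart = 0;
    std::string::size_type dot = host.find('.');
    while (dot != std::string::npos) {
      children[host.substr(dot + 1)].insert(
          host.substr(labelStart, dot - labelStart));
      labelStart = dot + 1;
      dot = host.find('.', labelStart);
    }
  }
  std::set<std::string> guessed;
  for (const auto& entry : children) {
    if (entry.second.size() >= minChildren) {
      guessed.insert(entry.first);
    }
  }
  return guessed;
}
```

The guess depends entirely on what the user has visited: a rarely used TLD is missed, and a single site with many subdomains can be mistaken for a TLD. That fits the remark that such a heuristic could not be used for the cookie problem in bug 8743.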
I think we can leverage the software update mechanism in FF to deploy modifications to the TLD list as needed. That's why I'm in favor of building a TLD list into FF. In the past, we did not have an easy way to update the TLD list, but now we do. So, that barrier is mostly gone.
(In reply to comment #6)
> How far does the definition of 'top level' go? Are there any countries where
> some psuedo-TLDS have two dots, but others have three? Australia has
> <state>.gov.au, so if taken to extremes... (Although that would be tricky since
> federal government sites are just <site>.gov.au. Thats extra complexity that
> probably isn't worth it.)
> 

Yes, there are: read bug 252342 for some examples.

Some countries are like the UK and always use the same number of levels (*.co.uk, *.com.co, ...); some only use it for a subset. In Belgium, we use *.be, but there's also *.ac.be (http://www.vub.ac.be for instance) for academic institutions. The same thing happens in other countries too: Japan and China have geographical domains (*.chiyoda.tokyo.jp, *.bj.cn, ...), Israel has *.k12.il, etc.

http://www.neuhaus.com/domaincheck/domain_list.htm is a very large (though incomplete) list, and many of its entries are probably unused.

The problem is that these 2nd-, 3rd- or even 4th-level domains are not standardised, so we need some kind of whitelist that can be easily updated.
We'd only be updating that with new versions, though? Also, most vendors (fedora, redhat, etc) will disable the software update mechanism in favour of their own package management systems; some (eg debian) won't take non-security fixes in a stable release cycle.

Although this is just a UI optimisation, so if it's wrong it's not exactly critical....
There is some stuff in mozilla/browser/components/nsNavHistory.cpp that could be moved to necko and then reused for cookies, etc:

static nsresult GetReversedHostname(nsIURI* aURI, nsAString& host);
static void GetReversedHostname(const nsString& aForward, nsAString& aReversed);
static void GetSubstringFromNthDot(const nsString& aInput, PRInt32 aStartingSpot, PRInt32 aN, PRBool aIncludeDot, nsAString& aSubstr);
static PRInt32 GetTLDCharCount(const nsString& aHost);
static PRInt32 GetTLDType(const nsString& aHostTail);
static void GetUnreversedHostname(const nsString& aBackward, nsAString& aForward);
static PRBool IsNumericHostName(const nsString& aHost);
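
For readers without the Mozilla tree at hand, here is a rough standalone sketch of what two of these helpers might do, using std::string instead of the Mozilla string classes. It assumes that "reversed" means a plain character-by-character reversal and that "numeric" means digits and dots only; neither detail is stated in this bug, and the names ReverseHost, UnreverseHost and LooksNumeric are made up.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Assumed behaviour: reverse the host character by character, so that hosts
// sharing a domain share a prefix ("gro.allizom.") and sort next to each other.
std::string ReverseHost(const std::string& forward) {
  return std::string(forward.rbegin(), forward.rend());
}

// Reversing twice gives back the original host.
std::string UnreverseHost(const std::string& backward) {
  return ReverseHost(backward);
}

// Assumed behaviour: a host made only of digits and dots (for example an
// IPv4 literal) has no TLD worth looking up.
bool LooksNumeric(const std::string& host) {
  if (host.empty()) {
    return false;
  }
  return std::all_of(host.begin(), host.end(), [](unsigned char c) {
    return c == '.' || std::isdigit(c) != 0;
  });
}
```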

I now think the only suitable way for common TLDs is a whitelist that gets updated if necessary.
> Although this is just a UI optimisation, so if its wrong its not exactly
> critical.... 

If it were used for cookies too, outdatedness could mean a couple of things. If a new TLD were added, older mozillae would still allow cookies to be set for those domains (nothing worse than we already do now). If one were removed (a much rarer case, I imagine) we would deny domain cookies (not host cookies) from that domain.
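
The two outdated-list cases can be made concrete with a small sketch. It is a simplification under stated assumptions: MaySetDomainCookie and kShippedSuffixes are hypothetical names, the list holds only sample entries, and the usual check that the requesting host actually belongs to the cookie domain is left out.

```cpp
#include <set>
#include <string>

// Illustrative sample of the list shipped with some older build.
const std::set<std::string> kShippedSuffixes = {"com", "org", "co.uk"};

// Hypothetical check: a domain cookie is refused when the requested domain
// is itself on the list; anything else is allowed as it is today.
bool MaySetDomainCookie(const std::string& domain,
                        const std::set<std::string>& suffixes) {
  // Cookie domains are often written with a leading dot (".co.uk").
  const std::string bare =
      (!domain.empty() && domain[0] == '.') ? domain.substr(1) : domain;
  return !bare.empty() && suffixes.count(bare) == 0;
}
```

A TLD created after the list shipped is simply not in the set, so a cookie for it is still accepted, which is no worse than having no list. An entry that has since been retired stays in the set, so domain cookies for it keep being refused.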

>There is some stuff in mozilla/browser/components/nsNavHistory.cpp that could
>be moved to necko and then reused for cookies, etc:

I can help out with patching cookies to use the common code, if needed.
(In reply to comment #11)
> > Although this is just a UI optimisation, so if its wrong its not exactly
> > critical.... 
> 
> If it were used for cookies too, outdatedness could mean a couple of things. If
> a new TLD were added, older mozillae would still allow cookies to be set for
> those domains (nothing worse than we already do now). If one were removed (a
> much rarer case, I imagine) we would deny domain cookies (not host cookies)
> from that domain.

I think domains (or, for that matter, IP addresses) that do not match any of the whitelisted TLDs would be grouped under "somewhere" or something similar.
 
> >There is some stuff in mozilla/browser/components/nsNavHistory.cpp that could
> >be moved to necko and then reused for cookies, etc:

Ooops ... it was mozilla/browser/components/places/src/nsNavHistory.cpp
 
> I can help out with patching cookies to use the common code, if needed.
> 
I'm currently collecting a list of 3rd- and 4th-level domains at <http://wiki.mozilla.org/TLD_List>. It will take a few more days, since I'm visiting every NIC for every TLD (luckily I can read most of them, even when they're not in English). But information is scarce, so I expect some errors in the list. We would need a way to quickly update it when errors are found or new domains appear.

After the list is complete, I'll build a datafile (like au->gov->nsw instead of *.nsw.gov.au), and I'll try to think of a way to verify it. Maybe SOA lookups through a Perl script or something. But there's no single download from which we could retrieve the list of 3rd-level domains.
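
A datafile organised as au->gov->nsw maps naturally onto a tree keyed by labels read from right to left. The sketch below is one possible in-memory form, not the planned file format: SuffixNode, AddSuffix, SuffixLabelCount and BuildSampleTree are hypothetical names, and the sample entries are limited to suffixes mentioned in this bug.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct SuffixNode {
  bool isSuffix = false;  // true if the path from the root is a listed suffix
  std::map<std::string, std::unique_ptr<SuffixNode>> children;
};

// Splits "nsw.gov.au" into {"au", "gov", "nsw"}.
std::vector<std::string> LabelsRightToLeft(const std::string& name) {
  std::vector<std::string> labels;
  std::string::size_type start = 0;
  while (true) {
    const std::string::size_type dot = name.find('.', start);
    if (dot == std::string::npos) {
      labels.push_back(name.substr(start));
      break;
    }
    labels.push_back(name.substr(start, dot - start));
    start = dot + 1;
  }
  std::reverse(labels.begin(), labels.end());
  return labels;
}

void AddSuffix(SuffixNode& root, const std::string& suffix) {
  SuffixNode* node = &root;
  for (const std::string& label : LabelsRightToLeft(suffix)) {
    std::unique_ptr<SuffixNode>& child = node->children[label];
    if (!child) {
      child.reset(new SuffixNode());
    }
    node = child.get();
  }
  node->isSuffix = true;
}

// Number of trailing labels of `host` that form the longest listed suffix,
// or 0 if none matches.
std::size_t SuffixLabelCount(const SuffixNode& root, const std::string& host) {
  const SuffixNode* node = &root;
  std::size_t depth = 0;
  std::size_t best = 0;
  for (const std::string& label : LabelsRightToLeft(host)) {
    const auto it = node->children.find(label);
    if (it == node->children.end()) {
      break;
    }
    node = it->second.get();
    ++depth;
    if (node->isSuffix) {
      best = depth;
    }
  }
  return best;
}

// Sample data only, taken from suffixes mentioned in this bug.
SuffixNode BuildSampleTree() {
  SuffixNode root;
  for (const char* s : {"au", "gov.au", "nsw.gov.au", "be", "ac.be"}) {
    AddSuffix(root, s);
  }
  return root;
}
```

With this sample tree, "www.health.nsw.gov.au" matches three suffix labels while "www.australia.gov.au" matches two, so mixed depths inside one country code are handled by the same lookup.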

Priority: -- → P2
*** Bug 324052 has been marked as a duplicate of this bug. ***
Severity: normal → major
Target Milestone: --- → Firefox 2 beta1
I think we need to get this done before the first beta or we'll be flooded with bug reports on grouping.
Target Milestone: Firefox 2 beta1 → Firefox 2 alpha2
(In reply to comment #15)
> I think we need to get this done before the first beta or we'll be flooded with
> bug reports on grouping.

Oh, wait, that's what "beta 1" target means. I'm dumb.
Target Milestone: Firefox 2 alpha2 → Firefox 2 beta1
Currently Firefox lists every .nl site under the same group, which is rather odd because most people in Holland visit more Dutch sites than English ones. Older users in particular visit only Dutch sites, and middle-aged people are heavily represented among internet users here, although I don't know how many older people use Firefox.
The history list in 1.5 does this correctly.
*** Bug 330136 has been marked as a duplicate of this bug. ***
Okay, I'm going to take a swing at pulling the TLD stuff out of browser/components/places/src/nsNavHistory.cpp and into a netwerk/dns service that both Places and cookies can use, and extending it to include all the primary TLDs in Jo's list.  I'm pretty new to Mozilla development, though, and I have some design questions.

What's the best way to include a list of TLDs that can be easily updated later?  What level of granularity does the software update mechanism mentioned by Darin apply to? Can the list be updated easily enough if it's hard-coded into the .cpp file, as it is now?  Or does it need to be pulled out into a text file that can be updated later?  If so, where should that go?
Depends on: 331510
Wouldn't it be better to just use the entire URL after http:// and without www if it is there instead of trying to make a list of all the TLDs in every locale? Also, what about sites that are on shared hosting and have a subdomain, like perhaps mysite.example.org? Especially if example.org was a hosting site that had nothing to do with its clients, it doesn't make sense to list mysite.example.org under example.org if it is totally unrelated. Now I know that the majority of large sites are not under a subdomain but it is still an inconvenience nonetheless. How does the cookie system work? That system seems to list everything by site fine.
Improving the current behaviour - grouping all domains of one country code into one site rather than grouping by domain - would not require a TLD list.

However, I'm not sure if the added complexity of trying to maintain a TLD list is worth it for the possible benefit. Subdomains are usually logically different from  each other, and in some cases are even required to make sense of the domain (i.e. there's nothing at "icio.us").
(In reply to comment #21)
> However, I'm not sure if the added complexity of trying to maintain a TLD list
> is worth it for the possible benefit. Subdomains are usually logically
> different from  each other, and in some cases are even required to make sense
> of the domain (i.e. there's nothing at "icio.us").
> 

I second that.
(In reply to comment #21)
> Subdomains are usually logically
> different from  each other, and in some cases are even required to make sense
> of the domain (i.e. there's nothing at "icio.us").

I'm not sure I understand.  That sounds like an argument in favor of an effective TLD/subdomain list, rather than against it.  Perhaps this bug is misleadingly named; it's really about adding a service to identify "TLD-like" subdomains, not only true TLDs.

Likely everything at icio.us should be grouped together, as it will be since .us is a TLD.  But grouping everything in co.uk together as if it were part of the "co" domain makes less sense.  And some TLDs have a more confusing mix, with some subdomains acting like "effective TLDs" and others not.

Maybe your point is that we don't need "Group by Domain", only "Sort by URL"?  Could be.  (The service still needs to be built for cookies, though; see bug 331510.)
Sorry for being unclear. I'm not debating if this feature is needed for cookies, but since bugs relating to the "Group by Domain" feature are being duped to this one, I was posting about the latter.

Cookies, of course, should be tied to the lowest-level domain that can be owned (such as "google.co.uk" or "icio.us") but History should be grouped by subdomain (such as "images.google.co.uk" or "del.icio.us"). Hence, I do not think the Group by Domain feature for History requires the TLD feature at all, and using it would be counterintuitive.
> History should be grouped by
> subdomain (such as "images.google.co.uk" or "del.icio.us"). Hence, I do not
> think the Group by Domain feature for History requires the TLD feature at all,
> and using it would be counterintuitive.

This is exactly what the current history sidebar does. You prefer that behavior over the current places behavior (when it works)? I find that there are many more cases where we want TLD grouping than you want full hostname grouping. For example, for slashdot you get all these names "developer.slashdot.org" and "apple.slashdot.org" that seldom have any importance to the user and hurt finding things on "slashdot". Many companies have a similar problem, such as "www-1.ibm.com". Using both the old history sidebar's grouping and our new grouping, I am *much* happier with the new one, and I think it more closely matches a typical user's expectation.
(In reply to comment #25)
> > History should be grouped by
> > subdomain (such as "images.google.co.uk" or "del.icio.us"). Hence, I do not
> > think the Group by Domain feature for History requires the TLD feature at all,
> > and using it would be counterintuitive.
> 
> This is exactly what the current history sidebar does. You prefer that behavior
> over the current places behavior (when it works)? I find that there are many
> more cases where we want TLD grouping than you want full hostname grouping. For
> example, for slashdot you get all these names "developer.slashdot.org" and
> "apple.slashdot.org" that seldom have any importance to the user and hurt
> finding things on "slashdot". Many companies have a similar problem, such as
> "www-1.ibm.com". Using both the old history sidebar's grouping and our new
> grouping, I am *much* happier with the new one, and I think it more closely
> matches a typical user's expectation.
> 

I suggest a combination of these two methods. Have them listed the way places currently does it, and then break those down into subdomains. This breaking them down could be a per domain option. Another way to determine where to break things down is actually quite simple: if the user types something, such as "del.icio.us" into the address bar, it is most likely that it is being used as a domain name and should not be broken down. This solves the problem of awkward domain names like the above, and still keeps everything else fine. Of course, I have no idea how complex this is to implement but it doesn't seem like that much of a big deal to keep track of things that a user types into the address bar.
I implemented something very much like what you are describing in an early version of the places history service. It would always group by toplevel then by each subdomain, collapsing items that were unique (for example, if you only ever visited "www.mozilla.org", that would appear in the list and you wouldn't have to expand "mozilla.org" and "www" to get at it).

However, I and most of the people I showed it to agree that the experiment was a failure. There were too many things to expand and it was too complicated. Some domains and browsing patterns work great with it, but it seems most don't. I can't imagine a per-domain option because it would be complicated, hard to find, and almost nobody would use it.
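
The collapsing of unique items mentioned in this comment can be illustrated with a much simplified, two-level sketch. It is not the early Places implementation: HostsByDomain, SampleHistory and TopLevelRows are hypothetical names, and the hosts are assumed to be already grouped by domain.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// domain -> distinct hosts visited under that domain
using HostsByDomain = std::map<std::string, std::set<std::string>>;

// Sample data only, using hosts mentioned in this bug.
HostsByDomain SampleHistory() {
  HostsByDomain history;
  history["mozilla.org"].insert("www.mozilla.org");
  history["slashdot.org"].insert("apple.slashdot.org");
  history["slashdot.org"].insert("developer.slashdot.org");
  return history;
}

// Rows shown at the top level: a domain with a single visited host is
// collapsed to that host, anything else stays an expandable domain row.
std::vector<std::string> TopLevelRows(const HostsByDomain& history) {
  std::vector<std::string> rows;
  for (const auto& entry : history) {
    const std::set<std::string>& hosts = entry.second;
    if (hosts.size() == 1) {
      rows.push_back(*hosts.begin());
    } else {
      rows.push_back(entry.first);
    }
  }
  return rows;
}
```

For the sample history this gives a direct "www.mozilla.org" row and an expandable "slashdot.org" row. The full experiment applied the same idea at every subdomain level, which is where the extra expanding came from.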
Then I suppose you could just do the checking for unique items and not the collapsing, because there are many cases where you would only visit a subdomain of a certain site and not the domain, usually with shared hosting or perhaps a blog such as someone.blogspot.com.
Severity: major → normal
Priority: P3 → P4
Target Milestone: Firefox 2 beta1 → Firefox 3 alpha2
Bug 331510 added an effective-TLD service to the trunk that should obviate the need for the list described in the original report.
Assignee: brettw → nobody
This was sorted over in bug 342314.

Gerv

*** This bug has been marked as a duplicate of 342314 ***
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Bug 451915 - move Firefox/Places bugs to Firefox/Bookmarks and History. Remove all bugspam from this move by filtering for the string "places-to-b-and-h".

In Thunderbird 3.0b, you do that as follows:
Tools | Message Filters
Make sure the correct account is selected. Click "New"
Conditions: Body   contains   places-to-b-and-h
Change the action to "Delete Message".
Select "Manually Run" from the dropdown at the top.
Click OK.

Select the filter in the list, make sure "Inbox" is selected at the bottom, and click "Run Now". This should delete all the bugspam. You can then delete the filter.

Gerv
Component: Places → Bookmarks & History
QA Contact: places → bookmarks