Create an automated way of generating the list of top sites we ship with Firefox

Status: NEW, Unassigned
Type: defect
Priority: P3    Severity: normal

(Reporter: Gijs, Unassigned, NeedInfo)

Tracking: (Blocks 2 bugs)
Version: 53 Branch
Points: ---
Firefox Tracking Flags: (Not tracked)

(Whiteboard: [fxsearch])

There's a small python script in bug 858829 that we could build on. Alexa now also has a paid-for API that we could consider using.

There are a number of considerations here. Here's a (potentially incomplete) list:

1) how many items do we want to ship?
2) for which items can we use https by default
3) for which items should we use "www.<whatever>" rather than "<whatever>"? Ideally we should avoid server-side redirects with the default autocomplete entries, which will then clutter up future autocomplete results.
4) we should remove adult content (potentially using the hardcoded list mobile uses, but it's not clear to me how often those domains are likely to change and how it's generated).
5) should we include titles?
6) should the list be localized in some way so e.g. French users get lemonde.fr but English users don't?
7) when / how often should the list be updated?


My proposal would be to use 100 items and include titles, to fix 2-4, and punt on 6/7.

I'd use 100 items because I think after 100 we get into the long tail and doing this becomes a lot less interesting. The remaining 900 (or 999,900...) add to processing time, disk space, memory usage, and clutter up results, but are probably only 1-10% of actual visits for users, so I don't think there's much of a point. Users will visit those pages on their own.

So ship 100 items to everyone, use the mobile list to remove adult content, and use marionette to at least detect redirects and whether shipping with http or https, and with or without 'www.', makes more sense (so load the http URL without 'www.', then determine if the final URL has https and/or www, and use that information in the final URL we use).
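As a rough illustration of that decision rule, here's a sketch (not tied to Marionette; the helper name and signature are my own invention). It takes the bare domain we started from and the final URL after all redirects, and decides whether the shipped entry should use https and/or the 'www.' prefix:

```python
from urllib.parse import urlparse

def canonical_from_final(domain, final_url):
    """Given the bare domain we loaded (over plain http, without 'www.')
    and the final URL after all redirects, build the URL we'd ship."""
    final = urlparse(final_url)
    # Ship https only if the redirect chain ended on https.
    scheme = "https" if final.scheme == "https" else "http"
    # Ship the 'www.' prefix only if the site redirected us to it.
    host = "www." + domain if (final.hostname or "").startswith("www.") else domain
    return "%s://%s/" % (scheme, host)
```

So if loading http://facebook.com ends up at https://www.facebook.com/, we'd ship "https://www.facebook.com/"; the final URL itself could come from Marionette or simply from `urllib.request.urlopen(...).geturl()`.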

I would punt on localizing or automatically updating the list for now to reduce complexity - this should be reasonably straightforward to do in the future if we want to, once we have an automated way of compiling it.

In the future, we could extend the script to also store favicon information to help with shipping those (not just for the location bar suggestions, but maybe also for places / new tab when we have imported history).

For the titles, I think we should just use the document title once the load completes, strip anything after the first punctuation ('-', '.', ':', etc.) that we encounter (so you end up with "Facebook", and not "Facebook - Log In or Sign Up") and trim off any remaining whitespace.
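A minimal sketch of that title cleanup (the exact separator set is an assumption; extend it as needed):

```python
import re

def clean_title(document_title):
    """Keep only the text before the first separator punctuation,
    then trim any surrounding whitespace."""
    # Assumed separators: '-', '.', ':', '|'; split once at the first match.
    head = re.split(r"[-.:|]", document_title, maxsplit=1)[0]
    return head.strip()
```

With this, "Facebook - Log In or Sign Up" becomes "Facebook", and titles without separators pass through unchanged.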

Marco and Sveta, does that sound like a sane plan to you? (Sveta, don't worry if it's not immediately clear how you would implement some of these things or what marionette is, I can help with that. :-) )
Flags: needinfo?(sveta.orlik.code)
Flags: needinfo?(mak77)
(In reply to :Gijs from comment #0)
> Ideally we should avoid server-side redirects with the default autocomplete
> entries, which will then clutter up future autocomplete results.

True, even if now we properly handle simple redirects in frecency (after the fix for bug 737836). So the problem is less critical.

> 4) we should remove adult content

If we use an API like Alexa's, maybe they provide a parental rating we could use? Maybe other APIs provide that info given a URL, so we could combine two APIs?

> 6) should the list be localized in some way so e.g. French users get
> lemonde.fr but English users don't?

Indeed this was my main concern. Not all US websites are likely to properly redirect an Italian user, for example. If I go to amazon.com, it opens the US Amazon and shows a giant banner asking me whether I wanted to go to amazon.it instead. This is not a great experience.
So it looks like the lists should come with each locale.

> 7) when / how often should the list be updated?

Basically it's release update vs. out-of-band update, where the latter would basically block this bug forever (PSL list?). IMO, considering the limited risk of shipping a broken URL that disappears after 14 days, we could just go with "updated with every release".

> I'd use 100 items because I think after 100 we get into the long tail and
> doing this becomes a lot less interesting.

I tried to look at the Italian list, and after removing obvious NSFW stuff, the list of interesting things is smaller than 50. I suspect that for most locales even 100 pages would be excessive once you do some filtering. I'd vote to begin with 50 and eventually expand to 100. But maybe we should see the filtered lists first.

> I would punt on localizing or automatically updating the list for now to
> reduce complexity - this should be reasonably straightforward to do in the
> future if we want to, once we have an automated way of compiling it.

It should probably be auto-generated first; then localization should be able to customize it, maybe through a whitelist of things that should always appear and a blacklist of things that should never appear. That way localizers wouldn't have to touch the original file or continuously remove/add the same stuff after an update; they'd just edit the white and black lists.
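The merge itself would be trivial; something like this hypothetical sketch (function and argument names are mine):

```python
def localize_list(generated, whitelist, blacklist, limit=50):
    """Apply per-locale overrides to the auto-generated top-sites list:
    drop blacklisted domains, then put whitelisted domains first."""
    blocked = set(blacklist)
    kept = [d for d in generated if d not in blocked]
    # Whitelisted entries not already present go to the front.
    forced = [d for d in whitelist if d not in kept]
    return (forced + kept)[:limit]
```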
Flags: needinfo?(mak77)
1) I would agree on a smaller number than mobile uses. We took Alexa top 500 and pared down that list.
4) This was a manual process: we put the domains in an etherpad or gdoc and annotated each of them. There is more than just adult content to consider: there are link shorteners like t.co and bit.ly, ad-traffic domains like doubleclick, and other sites that just don't have user content.
7) the list has never been updated since shipping the feature 26 months ago.
Let me know if there's anything I can assist with on the Marionette front here. It is certainly possible to access URLs, titles, and favicons (by switching to chrome context).
(In reply to Marco Bonardo [::mak] from comment #1)
> (In reply to :Gijs from comment #0)
> > Ideally we should avoid server-side redirects with the default autocomplete
> > entries, which will then clutter up future autocomplete results.
> 
> True, even if now we properly handle simple redirects in frecency (after the
> fix for bug 737836). So the problem is less critical.

Google.com redirects me to URLs like this:
https://www.google.ru/?gfe_rd=cr&ei=m4quWIvKCIz67gTc6o3oBA
I think it's a unique identifier every time.
What should we do with redirects in such a case?
(In reply to Svetlana Orlik from comment #5)
> (In reply to Marco Bonardo [::mak] from comment #1)
> > (In reply to :Gijs from comment #0)
> > > Ideally we should avoid server-side redirects with the default autocomplete
> > > entries, which will then clutter up future autocomplete results.
> > 
> > True, even if now we properly handle simple redirects in frecency (after the
> > fix for bug 737836). So the problem is less critical.
> 
> Google.com redirects me to URLs like this:
> https://www.google.ru/?gfe_rd=cr&ei=m4quWIvKCIz67gTc6o3oBA
> I think it's unique identifier every time.
> What should we do with redirecting in such case?

We should ignore everything apart from the hostname and protocol. :-)
So in this case we don't avoid the redirect?
(In reply to Svetlana Orlik from comment #7)
> Hence in this case we don't avoid redirect?

That's right, but we can't ship a single unique identifier to all Firefox users. We just try to avoid as many as we can. This one we can't avoid. So we just have to live with it.

Note also that changes to the hostname and protocol *always* result in extra network requests. Because of history.pushState ( https://developer.mozilla.org/en-US/docs/Web/API/History/pushState ) and similar APIs, changes to any other URL components like the path or query string or hash (#foo) can be completely client-side, so if the website is smart they will not require a complete redirect including a new network request, even though the URL changes.
Priority: -- → P3
I agree with Marco and Kevin: a smaller number of URLs is more than adequate. We could even go lower than 50; we need to get the big head right. And that makes me wonder about the importance of localization sooner rather than later. Getting 10-20 good results per country may be more important than going deeper. We know what our priority markets are this year; we could tailor for the top markets and then give general or regional top sites if it's too impractical to localize everywhere.

I pinged Marshall/Elvin to see if the legal clearance they gave us last year is still good.
Also agree that right now, updating the list when we ship a new Firefox is all that we need to do. The top sites will stay relatively stable, and even with a little fluctuation it's delivering the added value we seek. That said: we need to make sure that this fails gracefully in the event of bad or malformed URLs.
In Munich we agreed to cut this from scope for v1. For v1 we do not need to operationalize list creation; we'll build it by hand according to the approved policy. We'll do it for only a few top locales at v1, and it might just be EN-US to start.
Whiteboard: [fxsearch]
Duplicate of this bug: 1404075
Duplicate of this bug: 1422967
Blocks: 1427533
(In reply to :Gijs (mostly out until Jan 3) from comment #6)
> We should ignore everything apart from the hostname and protocol. :-)

I think this is also the format we'd want for finally updating the mobile topdomains list in bug 1427533.