Closed Bug 858829 Opened 8 years ago Closed 5 years ago
Ship popular domains by default
Building on top of bug 858340, it would be nice to ship the top X domains (either in history or in a special set of domains to search?). Attaching a little crawler that will pull down Alexa's top 500 sites (for the US because I think we should localize this). A lot of them are porn. Some are not useful for real people (t.co is in there). I'm going to prune it down a bit to things that are probably more... appropriate for our audience.
Weeded out some porn and junk. Will do a more thorough search before I'm done. I wonder if someone has a list of the top MOBILE sites somewhere as well. A simple search turned up a lot of junk and no real results. Also need to come up with a better way to localize this. I think we can store the list in locales/en-US and copy it to res/raw at build time?
Whoops. forgot to qref.
Attachment #734152 - Attachment is obsolete: true
Moved the list into Gecko so that it's localizable without doing ugly Android reflection hacks. Pike, does that look ok? To minimize messages, I hijacked the SearchEngine messages and made them into more general AwesomeBar:Data messages. Also, I went through this list a few times to weed out content I don't think we want (porn and unused domains). I'm not sure how these basically empty domains even make it into Alexa's list. There are also a lot of site building pages, and more than a few ad sites (some of which I removed, like doubleclick.net; who would ever want to go to doubleclick?). Before this lands, I want to go through one more time and load basically every page in the list that I'm not familiar with (to really make sure we're not shipping XXX sites?), but I'm curious if we want some greater criteria here for what we'll ship. piratebay and movie2k.to are in here. delta-search.com is basically the google homepage with yahoo search results (?). search-results.com IS google with a slightly changed ui. There are numerous dating sites. adam4adam.com (from its description of itself) sorta walks the line between dating site and live sex cam. Some of the sites don't have mobile-ready sites... Any opinions? I don't believe that some of these sites are popular, let alone popular on mobile. Can we find a better list somewhere?
Comment on attachment 737664 [details] [diff] [review] Patch v1 Thanks for trying, but I really think there's not that much value in localizing this, beyond a few obvious urls. If I look at the amount of work you're putting in to figure out what the right level of sanity should be, I'm scared. Also, Alexa itself already covers popularity across locales, right? For the locales we can ship, I think one general list is probably fine. AFAICT, bug 858340 goes into history and searches for domains? Can we add search engines on top of that? That'll give the locally installed wikipedia ;-), or just hard-code ab-CD.wikipedia.org.
Attachment #737664 - Flags: feedback?(l10n) → feedback-
How is this list going to be maintained in the long run for link relevancy to reflect actual rank differences? Can't we use an Alexa API?
My thoughts: it should be pretty trivial to fix up the tool at the top to filter on a blacklist, as well as compare to a whitelist, so that we can spit out a list of "what's new" each time. Then we can update the two. We can do that every 6 weeks if we want? I can do that for separate locales for Alexa too, but Pike is right. The amount of effort required (at least for the first pass) removing things we don't want would be high.
Comment on attachment 737664 [details] [diff] [review] Patch v1 I'm not a fan of abusing SearchEngines like this. So if removing the l10n requirement means we can put this back in Java, I'd be happier. If not, then I think 2 messages might be in order. The domains never need to be sent to Java more than once, right?
Here's a little updated crawler that does what I described.
1.) Reads a blacklist from blacklist.txt and whitelist from topSites.txt
2.) If the site isn't in the blacklist, it adds it to a new file newSites.txt
3.) If the site isn't in the blacklist or whitelist, prints it to the console so you can easily see what's new.
This makes it pretty easy to compare the US and, say, Algeria's list. Turns out, they don't match at all. At all. That says we need to localize if we do this (although maybe that's a V2 feature?). I'm going to expand this and try to make it better at filtering out bad sites (try to dig more info out of Alexa?) so that maybe we can do most of the localization automatically and just have locales hand tweak the list a little if they want.
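For readers without access to the attachment, the three steps above could look roughly like the following. This is only a sketch, not the attached crawler: the function names (read_domains, filter_sites, update_lists) are made up for illustration, the Alexa download step is left out, and the input is assumed to be an already-fetched, ranked list of bare domains. Only the file names blacklist.txt, topSites.txt and newSites.txt come from the comment.

```python
from pathlib import Path


def read_domains(path):
    """Read one domain per line into a set, ignoring blank lines and case."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def filter_sites(crawled, blacklist, whitelist):
    """Return (kept, unseen), preserving the crawled rank order.

    kept   -- crawled domains that are not blacklisted (step 2)
    unseen -- kept domains that are not in the whitelist either (step 3)
    """
    normalized = [d.strip().lower() for d in crawled]
    kept = [d for d in normalized if d and d not in blacklist]
    unseen = [d for d in kept if d not in whitelist]
    return kept, unseen


def update_lists(crawled, blacklist_path="blacklist.txt",
                 whitelist_path="topSites.txt", out_path="newSites.txt"):
    """Apply the three steps to a ranked list of domains (hypothetical helper)."""
    blacklist = read_domains(blacklist_path)   # step 1
    whitelist = read_domains(whitelist_path)   # step 1
    kept, unseen = filter_sites(crawled, blacklist, whitelist)
    Path(out_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    for domain in unseen:
        print(domain)  # new since the last review, so a human should look at it
    return unseen
```

Running update_lists once per locale's ranking and comparing the printed output is one way to see how little two regional lists overlap.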
Attachment #734120 - Attachment is obsolete: true
Another random thought, to what extent would the dictionary on the device help us? I could picture that a lot of popular sites have their domain names actually included. Another tangent, would there be sites we'd want to show in one locale but not show in another?
We're going to put a hold on this bug/patch. We want to wait and see if this pre-built list will be necessary. If the user uses Sync, or just browses for a few days, the browsing history should become good enough to power the history-based domain autocompletion.
Comment on attachment 737664 [details] [diff] [review] Patch v1 Putting patch on back burner. The cost of managing and localizing the list might not be worth the end result.
Comment on attachment 734154 [details] [diff] [review] Java only WIP Let's bring this patch back into discussion
We should start looking at this idea again. Wes created a Java-only version of this patch that bundles a single list of domains. We should only search this list if a person's history search returns nothing (i.e. it's a fallback). The reason to bring this back is that it takes a while to accumulate enough history to make domain auto-completion work well. The first week of usage needs whatever help we can give to keep people engaged.
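To make the fallback order concrete, here is a minimal sketch of "history first, bundled list only if history finds nothing". It is written in Python for brevity and is not the Java patch; autocomplete_domain and its parameters are hypothetical names, and the matching is a plain prefix test that ignores whatever URL normalization the real code does.

```python
def autocomplete_domain(typed, history_domains, fallback_domains):
    """Return the first domain that starts with `typed`, preferring history.

    The bundled fallback list is consulted only when no history entry
    matches. Returns None if `typed` is empty or nothing matches.
    """
    prefix = typed.strip().lower()
    if not prefix:
        return None
    # Order matters: history is searched completely before the fallback list.
    for source in (history_domains, fallback_domains):
        for domain in source:
            if domain.startswith(prefix):
                return domain
    return None
```

With an empty history, typing "f" against a fallback list that contains facebook.com would complete to facebook.com; once the user's own history has a matching entry, that entry wins and the bundled list is never consulted.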
Should we make an Aha! card for this? The user benefit here would be better domain autocomplete results before collecting an extensive browser history.
CCing Javaun, I wonder if there's a complementary or related way to deal with popular sites and what we've learned about gathering a list of sites that do tracking. Also mconnor, because there's an aspect of regional defaults, not just language.
Attachment #738347 - Attachment mime type: text/x-python → text/plain
Updated version of Wes' Java-only patch. Compiles and works well on device.
(In reply to :Margaret Leibovic from comment #16)
> Should we make an Aha! card for this?
>
> The user benefit here would be better domain autocomplete results before
> collecting an extensive browser history.
Just so for me to understand this better, this won't be user-facing, i.e. via another panel or something. The only reason this was suggested is because it will speed up typing URL in the address bar and getting results sooner. If you could put a % to it, are we talking about more than 50% speed improvement while searching history etc.?
Flags: needinfo?(bbermes) → needinfo?(margaret.leibovic)
(In reply to Barbara Bermes [:barbara] from comment #19)
> (In reply to :Margaret Leibovic from comment #16)
> > Should we make an Aha! card for this?
> >
> > The user benefit here would be better domain autocomplete results before
> > collecting an extensive browser history.
>
> Just so for me to understand this better, this won't be user-facing, i.e.
> via another panel or something. The only reason this was suggested is
> because it will speed up typing URL in the address bar and getting results
> sooner. If you could put a % to it, are we talking about more than 50% speed
> improvement while searching history etc.?
I can't put a number on this. The improvement would be that where you used to not see any inline autocomplete for a URL, now you would see one (e.g. you type "f", we autocomplete "facebook.com", and if that's what you wanted, all you need to do is hit enter). mfinkle mentioned adding telemetry to see how often people use this domain autocomplete feature, but given how often people type in the urlbar and hit enter (22.6% of all URL loads), this would likely be a noticeable feature, especially since it's even more noticeable for new users.
Flags: needinfo?(margaret.leibovic) → needinfo?(bbermes)
ok, we had a chat in IRC and I now understand the rationale behind this and will create an Aha card for this. I would like to follow what mfinkle said and measure the use of this feature via Telemetry.
Another thought, wrt fat fennec etc. Should we bundle this list with each apk each update we ship, or is there a better path via some of our gofaster data thingies?
(In reply to Axel Hecht [:Pike] from comment #22)
> Another thought, wrt fat fennec etc. Should we bundle this list with each
> apk each update we ship, or is there a better path via some of our gofaster
> data thingies?
With a single list (if we start with that) the size is very small. If we decide to move to localized lists, then we should consider using downloadable content. We'd need to remember that we want the list on device as fast as possible, so new users get the benefit right away.
Wes started using the Alexa rankings, but that doesn't seem to allow for a "mobile" category. I stumbled upon a site called Quantcast that does allow a filter for "mobile": https://www.quantcast.com/top-mobile-sites/US?userView=Public The list is somewhat different.
Grabbing these lists off the CDN and tracking updates via Kinto does make sense. So does shipping a small list in-product to start with. A hybrid approach is possible to satisfy mfinkle's timeliness point: ship a default top N-hundred as a default, then download larger/per-locale/per-country lists from the CDN. (A user using es-MX in Oregon should have a different candidate set than a user using es-MX in Oaxaca…) This seems like something both we and the Kinto folks would want to do, so CCing Tarek, and we'll file a follow-up.
This patch adds a top500 US list to use as a fallback for domain auto-completion. I pulled the list from Alexa today. I removed any sites listed in the Alexa top500 adult sites. Then I manually looked through the list and removed several more adult sites.
Comment on attachment 8694241 [details] [diff] [review] fallback-domain-search v0.1
Review of attachment 8694241 [details] [diff] [review]:
-----------------------------------------------------------------
The code part looks fine to me. I did a skim of the list to see if anything looks weird, but we probably want to get other people to review that before we ship it.
::: mobile/android/base/home/BrowserSearch.java
@@ +561,5 @@
> return uriSubstringUpToMatchedPath(url, hostOffset, hostOffset + searchLength);
> }
> } while (searchCount < MAX_AUTOCOMPLETE_SEARCH && c.moveToNext());
>
> + // If we can't find an autocompletion domain from history, let's try using the fallback list.
I like this approach to prefer your local history. I was worried about the autocomplete suggestions watering down the value of suggestions, but this will help us make suggestions before you've typed too much.
::: mobile/android/base/resources/raw/topdomains.txt
@@ +271,5 @@
> +redfin.com
> +emgn.com
> +weibo.com
> +alibaba.com
> +pinimg.com
Some of these domains are not things users would visit... like this one for example, redirects to pinterest.com. I know it's hard to filter through these, but I wonder if we're creating a worse experience if these non-useful domains prevent us from showing useful suggestions. IIRC, we only show an inline autocomplete if there's a single history entry that matches, so by including all these extra domains, we may be requiring the user to type *more* to get to the suggestion they want. Perhaps we should include an even shorter, more heavily-filtered list. Although I suppose that no matter what, even if the user has to type out the whole first part of the domain, we'll autocomplete the ".foo" suffix.
@@ +441,5 @@
> +independent.co.uk
> +drugs.com
> +rotoworld.com
> +nationalgeographic.com
> +4chan.org
Hm, this is an example of one we may want to exclude.
Attachment #8694241 - Flags: review?(margaret.leibovic) → review+
t.co - is twitter's link shortener, do we expect users to type t.co urls?
googleusercontent.com - as a bare domain is a 404.
taboola.com - is an ad network. no user content.
onclickads.net - looks to be some sort of ad network. no user content.
blackboard.com - do users directly go to this? thought colleges/universities host their own version
netteller.com - loads a blank page. some sort of banking software?
taleo.net - 403 to http://www.oracle.com/us/products/applications/taleo/enterprise/overview/index.html
constantcontact.com - bulk email sender service. little to no user use.
fbcdn.net - facebook's cdn
popads.net - ad network
files.wordpress.com - we already have wordpress.com above. is this useful?
infusionsoft.com - ad network?
directrev.com - ad/content network
dmv.org - odd privately owned link site to many state dmvs. seems shady
bfads.net - Black friday ads. specific to this month's traffic?
bp.blogspot.com - content host for blogspot? not loadable as a bare domain
cloudfront.net - CDN
okcupid.com, pof.com, match.com dating sites might be a contentious thing to add to the domain autocomplete
Can we dump these in a Google Spreadsheet somewhere so we can collaboratively blacklist/whitelist these? Pretty sure we don't want to do this manually through bug comments every time we decide to do a refresh…
(In reply to Kevin Brosnan [:kbrosnan] from comment #28)
I removed the domains on your list, with these exceptions:
> blackboard.com - do users directly go to this? thought colleges/universities
> host their own version
Still seems legit
> okcupid.com, pof.com, match.com dating sites might be a contentious thing to
> add to the domain autocomplete
Still seems legit
(In reply to :Margaret Leibovic from comment #27)
> > +pinimg.com
>
> Some of these domains are not things users would visit... like this one for
> example, redirects to pinterest.com.
> > +4chan.org
>
> Hm, this is an example of one we may want to exclude.
Removed!
Updated crawler script
Attachment #738347 - Attachment is obsolete: true
blocklist used to block domains
I landed this behind a Nightly flag.
tracking-fennec: --- → Nightly+
Updated the blocklist
Attachment #8694575 - Attachment is obsolete: true
I wouldn't mind adding some telemetry for both autocomplete and autocomplete-fallback, but I don't have a plan for it yet.
Thanks for the spreadsheet, this will help legal to review as well. Adding Elvin to this bug, as he will be discussing this with the legal team tomorrow.
(In reply to Mark Finkle (:mfinkle) from comment #36) > I landed this behind a Nightly flag. I don't think we track closed bugs with the Nightly+ flag, so I filed bug 1229862 to track enabling this.
Comment on attachment 8694241 [details] [diff] [review] fallback-domain-search v0.1
Approval Request Comment
[Feature/regressing bug #]: None
[User impact if declined]: This patch improves firstrun experience
[Describe test coverage new/current, TreeHerder]: Landed on Fx45 and has been working fine
[Risks and why]: Low - the code is scoped to a narrow use case.
[String/UUID change made/needed]: None
If this needs legal review why are we uplifting it?
(In reply to Kevin Brosnan [:kbrosnan] from comment #43)
> If this needs legal review why are we uplifting it?
It's still behind a Nightly flag, so we'll need to fix bug 1229862 before this will ship, even if we uplift this patch here. Although this requires legal review, we're hopeful we can resolve that soon, and as soon as it's resolved we should be able to ship this.
Please don't ship before legal signs off. Generally, we can't manually curate these lists and add/remove sites that may or may not be objectionable. We don't want to put ourselves in a position where we need to decide if sites like piratebay or adam4adam are on or off the list.
Evlin, I wasn't sure who to reach out to from legal team. Would you be able to address the concern raised in comment 45? How should we review the list of sites here? Has it already been blessed by Legal? Any help is appreciated.
(In reply to Ritu Kothari (:ritu) from comment #46)
> Evlin, I wasn't sure who to reach out to from legal team. Would you be able
> to address the concern raised in comment 45? How should we review the list
> of sites here? Has it already been blessed by Legal? Any help is appreciated.
We are working out the legal details in bug 1232746.
Thanks, Margaret. :) Marshall and I plan to stay plugged in as the details of list sourcing/generation/maintenance are worked out. It will require investigation and diligence on the part of the product team to explore and vet different options.
(In reply to :Margaret Leibovic from comment #47)
> (In reply to Ritu Kothari (:ritu) from comment #46)
> > Evlin, I wasn't sure who to reach out to from legal team. Would you be able
> > to address the concern raised in comment 45? How should we review the list
> > of sites here? Has it already been blessed by Legal? Any help is appreciated.
>
> We are working out the legal details in bug 1232746.
Thanks Margaret. Should I leave the beta:? uplift request untouched in the meantime? Or can I A- and then your team can renominate once the legal review is done? Either option works for me.
Comment on attachment 8694241 [details] [diff] [review] fallback-domain-search v0.1 Not ready for uplift. Pending legal review.
Attachment #8694574 - Attachment mime type: text/x-python-script → text/plain
Is this resolved by bug 1240599? Looks like it probably is. Thanks.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #51)
> Is this resolved by bug 1240599? Looks like it probably is. Thanks.
Yep. Resolved.
Closing this out from the legal side. This is good to move forward. The general approach approved by legal is to 1) take the list of global sites, 2) subtract the alexa adult sites, and then 3) subtract any additional sites that satisfy the DMOZ adult content definitions. There are additional steps described in bug 1240599.
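As a rough illustration of the three-step approach described above (and not the process actually implemented in bug 1240599), the subtraction could be expressed like this. The function name build_fallback_list is hypothetical, and the sketch assumes that the sites matching the DMOZ adult content definitions have already been collected into a list; how that classification is done is not covered here.

```python
def build_fallback_list(global_sites, alexa_adult_sites, dmoz_adult_sites):
    """Return global_sites minus both exclusion lists, keeping rank order.

    global_sites      -- ranked list of domains (step 1)
    alexa_adult_sites -- domains to subtract in step 2
    dmoz_adult_sites  -- domains to subtract in step 3 (assumed precomputed)
    """
    excluded = set(alexa_adult_sites) | set(dmoz_adult_sites)
    return [site for site in global_sites if site not in excluded]
```

Because the exclusion is purely list-driven, nobody has to decide by hand whether an individual site stays on the shipped list.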