Dedupe http(s) in highlights; dedupe multiple bookmarks for the same url in top sites

RESOLVED FIXED in Firefox 57

Status

()

Firefox for Android
General
P1
normal
RESOLVED FIXED
10 months ago
9 months ago

People

(Reporter: sdaswani, Assigned: mcomella)

Tracking

(Blocks: 1 bug)

unspecified
Firefox 57
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox57 fixed)

Details

(Whiteboard: [MobileAS] 1.27 [summary in comment 48 49])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(6 attachments)

(Reporter)

Description

10 months ago
Created attachment 8888014 [details]
Screenshot_20170719-102410.png
(Reporter)

Updated

10 months ago
Blocks: 1377247
Whiteboard: [MobileAS]
Susheel, can you click on the menu icons of the entries, copy the address and post them here? We are deduplicating highlights and content from the same domain. But this seems to fail here for some kind of reason.
Flags: needinfo?(sdaswani)
(Reporter)

Comment 2

10 months ago
Sure for the two Feedly's in Highlights:

https://feedly.com/
https://www.feedly.com/

For the top sites:
https://news.google.com/
http://news.google.com/
Flags: needinfo?(sdaswani)
I have a few duplicates in Top Sites on my test device too:
https://gizmodo.com/us-tangles-with-the-world-over-climate-change-at-g20-1796741069
https://gizmodo.com/frozen-couple-unearthed-by-climate-change-and-a-ski-li-1797024985
http://gizmodo.com

https://www.rockpapershotgun.com/2017/07/13/civilization-6-post-mortem/
https://www.rockpapershotgun.com/

http://www.bbc.co.uk/
http://www.bbc.com/
http://www.bbc.com/news/world-us-canada-40662772 (different favicon from the other two)

---

It looks like it's partially because my articles are showing up as top sites which would presumably clear with more realistic browsing scenarios (but perhaps not – what if you only open external links?).
Created attachment 8888074 [details]
Duplicates with iOS label algo (bug 1377287)

Here's what the duplicates look like for me (via bug 1377287 comment 17).
Another thought: in the event that there are duplicates and you want to pin one (e.g. I hate that I have 3 bbc's so I want to pin bbc.co.uk), it's impossible to get the url without 1) clicking it (which might reorder the top sites, argh!) or 2) copying the data from the context menu.

It'd be great if there was a way to see the actual URL of the page (and it's likely there are other use cases where you might want to see the URL too).
Those are two different things: (Android) Top sites can have (and always had) domain duplicates. Highlights tries to prevent that. Our top sites algorithm is unchanged (except for pins that work a little bit differently now).

Comment 7

10 months ago
We definitely avoid having dupes on desktop for Top Sites and we should follow suit in the mobile experiences as well. Right now we dedupe based on domain. Also, maybe related, is we weight .com's higher than pages with full paths (in this case, http://gizmodo.com would weight higher than https://gizmodo.com/us-tangles-with-the-world-over-climate-change-at-g20-1796741069). Weighting the .com higher AND deduping based on domain would give us the experience we're shooting for.  

For more reference, here's a reference to the code for Top Sites on Desktop: http://searchfox.org/mozilla-central/source/toolkit/modules/NewTabUtils.jsm#935
Adding to triage so we can discuss the trade-offs: we could fix this on the engineering side to dedupe, we could fix it on the UI side (make titles more permissive), etc.
tracking-fennec: --- → ?
Let's compare this against desktop & iOS to see what they're doing and see if we can do something similar.
tracking-fennec: ? → ---
Priority: -- → P2
Iteration: --- → 1.27
Priority: P2 → P1
Whiteboard: [MobileAS] → [MobileAS] 1.27
Assignee: nobody → michael.l.comella
From a UX perspective, this relates to bug 1386902: "TopSites tiles' titles do not accurately indicate site (from PageMetadata provider)".
See Also: → bug 1386902
I see a few more duplicates on my personal device:

kotaku.com
kotaku.com

rockpapershotgun.com
rockpapershotgun.com

(^ Both of these are bookmarked twice - I wonder if we don't dedupe that)

accounts.craigslist.org
sfbay.craigslist.org
accounts.craigslist.org/login/home

(^ All three of these share a title, "craigslist", and a favicon so they're indistinguishable. The first two can't be fixed in this bug but can be fixed with bug 1386902).
In bug 1386902, we're going to change the naming scheme from provider -> subdomain.domain, which should fix the duplicates within top sites.

For this bug, I spoke to bbell about the vision for Top Sites/Highlights:
- Top Sites should favor domain names (kind of like "apps") & Highlights should favor paths (pages in the "apps").
- Top sites should be the pages you've visited absolutely the most (based on the frecency algorithm)
- Highlights should be constrained within some temporal parameters (e.g. past 72 hours)

These are different pools of pages so there should be little overlap and thus duplication.

Note: top sites can display pages if you visit that page frequently enough. Deduping should generally weight domains with no path higher though.
I created a document with all of the duplication examples so far in this bug and a list of how they can be fixed [1], the latter copied below:

Things to change
- Highlights: dedupe “www.domain” and “domain”
- Top sites: dedupe “http” & “https”
- Top sites: ensure multiple bookmarks don’t create duplicates
- Top sites: subdomain.domain (bug 1386902)

Awaiting UX
- Top sites: display path when two pages from same domain appear (how interact with plain domain?)
- Screenshots for pages?

Open questions:
- bbc.co.uk vs. bbc.com

---

Unless someone provides a counter-example I think this means the top sites and highlights algorithms might be working just fine right now - it's just the details that need to be fixed.

[1]: https://docs.google.com/document/d/1UvjKXC86azj5LptWRmU5D7jyY_B7Owig9mHrQhgmWzo/edit?usp=sharing
> - Highlights: dedupe “www.domain” and “domain”
> - Top sites: dedupe “http” & “https”

fwiw, I think these are not very likely to occur on real profiles and just happen to appear on our test profiles because we don't have many visits to outweigh redirects.
> - Highlights: dedupe “www.domain” and “domain”

Did this because it's likely to happen - we only have to visit a Highlight once for it to appear and if that visit was redirect from "www.", we'd get a dupe.

> - Top sites: dedupe “http” & “https”

This is annoying to do because we're passing a Cursor directly to the Adapter (e.g. create CursorWrapper-like thing, process dupes, and skip over dupe entries). This will happen with light use of the browser or a clean profile so it'll be slightly unpolished but I think it's unlilkely to happen with regular use of the browser and so, due to code complexity, I'm going to skip it.

> - Top sites: ensure multiple bookmarks don’t create duplicates

Looking into this.

> Awaiting UX
> - Top sites: display path when two pages from same domain appear (how interact with plain domain?)

Still awaiting UX - will probably be a new bug.

> - Screenshots for pages?

Moved to bug 1389259.
Bryan, a few questions:

How would we handle the case of "bbc.co.uk" and "bbc.com"? They both have the same favicon and would both appear as "bbc". Perhaps we should we include the TLD in this case?

To be explicit, were you still figuring out what the title should be for two top sites that have the same path? Or did we decide on "github/mozilla-mobile/focus-android" (in the case of https://github.com/mozilla-mobile/focus-android)? We could also remove parts of the path if they overlap, e.g. "github/focus-android" & "github/focus-ios" Note that I have the bug open for using screenshots for pages (bug 1389259).
Flags: needinfo?(bbell)

Comment 18

10 months ago
(In reply to Michael Comella (:mcomella) from comment #17)
> Bryan, a few questions:
> 
> How would we handle the case of "bbc.co.uk" and "bbc.com"? They both have
> the same favicon and would both appear as "bbc". Perhaps we should we
> include the TLD in this case?
> 
> To be explicit, were you still figuring out what the title should be for two
> top sites that have the same path? Or did we decide on
> "github/mozilla-mobile/focus-android" (in the case of
> https://github.com/mozilla-mobile/focus-android)? We could also remove parts
> of the path if they overlap, e.g. "github/focus-android" &
> "github/focus-ios" Note that I have the bug open for using screenshots for
> pages (bug 1389259).

Remove parts of the path if they overlap feels right to me. You might even consider going one level further, and drop the domain for sites with a duplicate and only show the parts of the URL that are unique. 

e.g. "focus-android" & "focus-ios"
Flags: needinfo?(bbell)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Still TODO: ^ bbell's suggestion.

Bryan, how would we handle the case of "bbc.co.uk" and "bbc.com"? They both have the same favicon and would both appear as "bbc". Perhaps we should we include the TLD in this case?
Flags: needinfo?(bbell)
Created attachment 8896452 [details]
Title override for shared host screenshot

Bryan, I overrode the title with the unshared path (e.g. "focus-ios" & "focus-android") when two top sites shared the same host and it doesn't appear to be very useful:

- It's hard to understand which site we're looking at sometimes ("signin/v2/ide..." is accounts.google.com sign in but that URL may not have a URL)
- The unshared path isn't always useful, as in "2017/08/02/..." for RockPaperShotgun and "describecom..." for bugzilla.
- If we were to also include the domain, I think we're more likely to overflow than we are to actually show the difference in the paths

Any thoughts?

Caveat 1: this is a clean profile so 1) we see more duplicates than we're likely to see otherwise and 2) less commonly visited sites, which might not make top sites, are appearing like accounts.google.com sign-in.

Caveat 2: This patch isn't perfect: there are two support.google.com entries because their paths are the same but their query parameters are different and I only process paths at the moment.
For comparison, looking at my top sites on desktop, it looks like they're more strict about only showing one site per host, e.g. I saw http://github.com/mozilla-mobile/prox is in my top sites even though I'm sure I visit other github sites just as often like Prox's issues as well as only one bugzilla.mozilla, mail.google top site.

From a user's perspective, however, I'm not convinced this is better behavior (e.g. if I only use my browser for github, I'd want to see all of those sites).
To summarize the consequences of comment 23, I think there are two variables we can tweak are:

1) Revise the UI to help distinguish multiple sites with same host (e.g. allow titles to take up multiple lines only if there is another top site with the same host)

2) Change the top sites algorithm so we have fewer sites with the same host.

In my opinion, I think we should focus on #1: afaik, there's been no desire to modify our existing top sites query so it's probably pretty good (caveat: users used to be able to enter custom sites). The approach of limiting the number of same-host sites that appear in our top sites doesn't seem to be a better user experience to me (e.g. if I use my browser primarily to look at github; and how my top sites on desktop doesn't seem to accurately represent my browser usage).
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Duplicate of this bug: 1320521
Duplicate of this bug: 1388595
FYI that reddit's subreddits is another (and probably common) use-case for having multiple pages:

(In reply to Cameron McCormack (:heycam) from bug 1388595 comment #3)
> To be clear, for me, all of similarly labelled tiles for me are to different
> sub-reddits, and I want to be able to distinguish/select them individually.
Duplicate of this bug: 1385651
Another case:

(In reply to Paul from comment bug 1385651 #5)
> In this bug, the Top Sites are NOT duplicated. The screenshot shows four
> separate pages from the same weather site, all four of which I use very
> frequently. The problem is that all four icons look the same, despite
> belonging to legitimately different pages. That's why a fix of bug #1332621
> will not fix this bug.
Comment on attachment 8896070 [details]
Bug 1382332: Add MapUtils.putIfAbsent.

https://reviewboard.mozilla.org/r/167354/#review172978
Attachment #8896070 - Flags: review?(liuche) → review+
Comment on attachment 8896071 [details]
Bug 1382332: Rm duplicate www. hosts from Highlights.

https://reviewboard.mozilla.org/r/167356/#review172982

Some confusing naming, but other than that this seems good.

::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:255
(Diff revision 3)
>  
>      /**
> -     * Remove candidates that are pointing to the same host.
> +     * Remove candidates that are pointing to the same host, with special restrictions for "www." hosts.
>       */
>      @VisibleForTesting static void dedupeSites(List<HighlightCandidate> candidates) {
> -        final Set<String> knownHosts = new HashSet<String>();
> +        final Map<String, HighlightCandidate> knownHostToCandidate = new HashMap<>();

I don't really like "candidate" in this name because I think it's a little confusing - maybe you can add a brief comment with key/value example, and possibly change it to be knownHostToHighlight or HighlightCandidate?

::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:256
(Diff revision 3)
>      /**
> -     * Remove candidates that are pointing to the same host.
> +     * Remove candidates that are pointing to the same host, with special restrictions for "www." hosts.
>       */
>      @VisibleForTesting static void dedupeSites(List<HighlightCandidate> candidates) {
> -        final Set<String> knownHosts = new HashSet<String>();
> +        final Map<String, HighlightCandidate> knownHostToCandidate = new HashMap<>();
> +        final List<HighlightCandidate> wwwHosts = new ArrayList<>();

Just to be consistent, shouldn't this be wwwCandidates? or wwwHighlightCandidates? because it literally is the candidate, not the host.

::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:275
(Diff revision 3)
>          });
> +
> +        // Include "www." hosts only if (see ifs below):
> +        for (final HighlightCandidate wwwCandidate : wwwHosts) {
> +            final String wwwCandidateHostNoWWW = wwwCandidate.getHost().substring(WWW.length()); // non-null b/c we check above.
> +            final HighlightCandidate correspondingCandidate = knownHostToCandidate.get(wwwCandidateHostNoWWW);

Maybe name this knownCandidate? to match with the knownHost?
Attachment #8896071 - Flags: review?(liuche) → review+

Updated

9 months ago
Iteration: 1.27 → 1.28

Comment 39

9 months ago
mozreview-review
Comment on attachment 8896072 [details]
Bug 1382332: Remove duplicate URLs from AS top sites.

https://reviewboard.mozilla.org/r/167358/#review173684

OK, I'm assuming that the additional table scan (will it sort things, too?) won't be much of a problem since it'll run on a filtered data set.

Sidenote, we should clean up the top sites query(ies) for the post-positioned pins world. A lot of it seems to deal with that side of things.
Attachment #8896072 - Flags: review?(gkruglov) → review+
To follow up on comment 25, spoke with bbell: we're going to try using "subdomain.domain" when the page does not have a path and use the page title when the page has a path.

fwiw, he mentioned nchapman has some code that will extract the more relevant parts out of a url (e.g. would pull out the Bug # for Bugzilla) but we haven't had a chance to talk to him and to me it sounds like a heuristic rabbit hole so we'll try ^ first.
Flags: needinfo?(bbell)
I'm going to land what we have and continue working.

Code from the future fixes will build on top of the code from bug 1386902.
Depends on: 1386902
Keywords: leave-open
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
(In reply to Michael Comella (:mcomella) from comment #41)
> I'm going to land what we have and continue working.
> 
> Code from the future fixes will build on top of the code from bug 1386902.

mozreview team said it's better to open new bugs because mozreview doesn't handle pushing two sets of commits to a bug so I'll file another one.
No longer depends on: 1386902
Keywords: leave-open

Comment 46

9 months ago
Pushed by michael.l.comella@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5bb1644f7b89
Add MapUtils.putIfAbsent. r=liuche
https://hg.mozilla.org/integration/autoland/rev/c252481fa62a
Rm duplicate www. hosts from Highlights. r=liuche
https://hg.mozilla.org/integration/autoland/rev/cd7fce820293
Remove duplicate URLs from AS top sites. r=Grisha

Comment 47

9 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/5bb1644f7b89
https://hg.mozilla.org/mozilla-central/rev/c252481fa62a
https://hg.mozilla.org/mozilla-central/rev/cd7fce820293
Status: NEW → RESOLVED
Last Resolved: 9 months ago
status-firefox57: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57

Updated

9 months ago
See Also: → bug 1332621
Changing the title to reflect what I've actually changed in this bug.

Things that are still a problem:
- Dedupe "http(s)" in top sites (this bug was highlights only) (bug 1332621)
- Dedupe "(www.)url" in top sites (bug 1332621)
- bbc.com & bbc.co.uk look identical in top sites (bug 1391479)
See Also: → bug 1391479
Summary: Both Highlights & Top Sites Have Duplicate Entries → Dedupe http(s) in highlights; dedupe multiple bookmarks for the same url in top sites
Also, to summarize the relevant labeling changes:
- Use subdomain.domain for top sites page titles (bug 1386902)
- Follow-up to ^: Show the page title for pages that have a path (bug 1390372)
- Screenshots could help distinguish sites (bug 1389259)

Between these two comments, it should summarize all the changes I've made or are considering making for these issues: distinguishing similar top sites entries and removing actual duplicates.
Whiteboard: [MobileAS] 1.27 → [MobileAS] 1.27 [summary in comment 48 49]
(In reply to Michael Comella (:mcomella) from comment #49)
> Also, to summarize the relevant labeling changes:

One more:
- Distributions should use page titles (bug 1391413)
You need to log in before you can comment on or make changes to this bug.