Open Bug 1418080 Opened 7 years ago Updated 5 months ago

Consider making "mozIAsyncFavicons.getFaviconURLForPage" and page-icon: redirect-aware

Categories

(Toolkit :: Places, enhancement, P3)

57 Branch
enhancement

Tracking

()

People

(Reporter: nanj, Unassigned)

References

Details

(Whiteboard: [sng])

This is a feature request for this API, would like to hear your thoughts.

Let's take "python.org" as an example, typing this in the awesome bar will generate 3 records in moz_places, "python.org/", "https://python.org/", and "https://www.python.org/", respectively.

Two icons (a reguar one and a rich one) will be fetched and stored in the favicons database, the associated page_url is the final redirection "https://www.python.org/". However, query favicon with the original URL with getFaviconURLForPage won't work unless you pass it the exact URL "https://www.python.org/".
I was thinking a potential alternative is when we set the icon for a page, we could set the icon for the pages it redirected from. This trades off read time with write time.

But then I wonder if this is generally desirable… E.g., a search result page could redirect to some other website, but that doesn't always mean the source and destination should have the same icon.

In the particular case of activity stream top sites, we would be requesting favicons only for some small set of high frecency pages, so that would seem to favor not impacting write time in the common case. Although for activity stream's own performance, we would want reads to be fast too without potentially expensive redirect traversals…
Just playing around with queries… Start with the most recent visit to a url and follow any next visits that are redirects. Show the visit id and url of each visit along the way:

WITH RECURSIVE path(visit)
AS (
  SELECT v.id
  FROM moz_places h
  JOIN moz_historyvisits v
    ON v.place_id = h.id
  WHERE h.url = "https://twitch.tv/"
    AND v.visit_date = h.last_visit_date

  UNION

  SELECT id
  FROM moz_historyvisits
  JOIN path
    ON visit = from_visit
  WHERE visit_type IN (5, 6)
)
SELECT visit, (
  SELECT (
    SELECT url
    FROM moz_places
    WHERE id = place_id)
  FROM moz_historyvisits
  WHERE id = visit)
FROM path;

where pretend https://twitch.tv/ redirects to https://ed.twitch.tv/ to https://go.twitch.tv/ to https://www.twitch.tv/

4|https://twitch.tv/
5|https://ed.twitch.tv/
6|https://go.twitch.tv/
7|https://www.twitch.tv/
And that query's explain query plan:

2|0|1|SCAN TABLE moz_historyvisits AS v USING COVERING INDEX moz_historyvisits_placedateindex
2|1|0|SEARCH TABLE moz_places AS h USING INTEGER PRIMARY KEY (rowid=?)
3|0|0|SCAN TABLE moz_historyvisits
3|1|1|SCAN TABLE path
1|0|0|COMPOUND SUBQUERIES 0 AND 0 USING TEMP B-TREE (UNION)
0|0|0|SCAN SUBQUERY 1
0|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 4
4|0|0|SEARCH TABLE moz_historyvisits USING INTEGER PRIMARY KEY (rowid=?)
4|0|0|EXECUTE CORRELATED SCALAR SUBQUERY 5
5|0|0|SEARCH TABLE moz_places USING INTEGER PRIMARY KEY (rowid=?)
:Mardak wow, that's beautiful. On my commute, I've figured out a similar query that only returns the final redirection.

This should resolve the issue for activity stream for now, shall we wrap this into the places's API so that it could be used in other scenarios?
:mak, I wonder If we could piggyback bug 1365913 to include the referrers as well as all the redirections, which could be done based on Mardak's query.

WITH RECURSIVE path(visit)
AS (
  SELECT v.id
  FROM moz_places h
  JOIN moz_historyvisits v
    ON v.place_id = h.id
  WHERE h.url_hash = hash(:url) AND h.url = :url
    AND v.visit_date = h.last_visit_date

  UNION

  SELECT id
  FROM moz_historyvisits
  JOIN path
    ON visit = from_visit
  WHERE visit_type IN (${History.TRANSITIONS.REDIRECT_PERMANENT}, ${History.TRANSITIONS.REDIRECT_TEMPORARY})
)
SELECT visit, (
  SELECT (
    SELECT url
    FROM moz_places
    WHERE id = place_id)
  FROM moz_historyvisits
  WHERE id = visit)
FROM path;
I was also thinking if we could use those redirected URLs instead of the original ones when we generate the top sites for AS. That'll save us some trouble like the icon fetching.

What do you think? :Mardak
(In reply to Ed Lee :Mardak from comment #1)
> I was thinking a potential alternative is when we set the icon for a page,
> we could set the icon for the pages it redirected from. This trades off read
> time with write time.
> 
> But then I wonder if this is generally desirable… E.g., a search result page
> could redirect to some other website, but that doesn't always mean the
> source and destination should have the same icon.

Indeed, we do something like that for bookmarks and in some cases it ends up storing the wrong favicon.
You could probably do that if both pages are from the same base host (ignoring sub domains).

(In reply to Ed Lee :Mardak from comment #2)
> Just playing around with queries… Start with the most recent visit to a url
> and follow any next visits that are redirects. Show the visit id and url of
> each visit along the way:
> 
> WITH RECURSIVE path(visit)

It doesn't seem particularly efficient :(
In the past I thought of ways to more quickly fetch all the redirects chain, and I was evaluating to reuse the sessionid field in moz_historyvisits for that. Basically we'd store a monotonic integer there and every time a redirect chain starts we annotate all the visits in that chain with that session id. Though, that requires to fetch the largest integer before adding the first visit in the session, and when we are a redirect target, store the right session id from our source. That in the end may slow down visits insertions a little bit, but would allow very quick fetching of the redirects chain.

There's still the problem of unrelated pages redirecting though, and we will never be 100% sure that even sub domains are strictly related.

What we could potentially do, is ignore scheme and www. when storing icons and consider all of those pages the same. Similar to how I used fixed_icon_url_hash, we could replace page_url and page_url_hash with the "fixed" version. One downside is that the relation with moz_places will need some translation layer in the middle.
Priority: -- → P3
Severity: normal → S3

This is still a valid problem, and lately I've heard various teams dealing with it, so we may want to do something sooner.

One example is having a page url of
https://www.yahoo.com/sports/jets-qb-aaron-rodgers-has-strong-words-for-insecure-sean-payton-over-nathaniel-hackett-comments-183609249.html
that redirects to
https://sports.yahoo.com/jets-qb-aaron-rodgers-has-strong-words-for-insecure-sean-payton-over-nathaniel-hackett-comments-183609249.html
We store a favicon for the latter, not for the former.
In this case even if we'd have a root icon as https://sports.yahoo.com/favicon.ico, it would not apply to our page-url that is in www.yahoo.com domain. We only ignore "www.", not other subdomains.

On the storing side, when we're not setting a root icon, we could probably walk up all redirect levels and set the icon on all the sources IFF the top level matches, and there's no root icon for the redirecting urls (so they'd be icon-less). This could not fix the www.yahoo.com => sports.yahoo.com root icon problem, cause we likely don't want to copy over a root icon of a subdomain to the main domain.

On the fetching side we could probably walk up 1 level and eventually accept imperfect root icons. But maybe the storing improvement will be good enough to not require this.

Priority: P3 → P2
Summary: Consider making "mozIAsyncFavicons.getFaviconURLForPage" redirect-aware → Consider making "mozIAsyncFavicons.getFaviconURLForPage" and page-icon: redirect-aware
Whiteboard: [sng]
Severity: S3 → N/A
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.