Last Comment Bug 704977 - Awesomeness calculation should use nicer recency decay
: Awesomeness calculation should use nicer recency decay
Status: RESOLVED FIXED
:
Product: Firefox for Android
Classification: Client Software
Component: General (show other bugs)
: unspecified
: ARM Android
: P3 normal (vote)
: Firefox 14
Assigned To: :Margaret Leibovic
:
Mentors:
: 722709 (view as bug list)
Depends on: 736171
Blocks: 722709
  Show dependency treegraph
 
Reported: 2011-11-23 14:13 PST by Johnathan Nightingale [:johnath]
Modified: 2016-04-26 01:04 PDT (History)
17 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
fixed
beta+
11+


Attachments
Linear vs. e^(-x^2) (14.16 KB, image/png)
2011-11-23 14:13 PST, Johnathan Nightingale [:johnath]
no flags Details
Linear vs. e^(-x^2) vs Cauchy (14.01 KB, image/png)
2012-03-12 09:18 PDT, Johnathan Nightingale [:johnath]
no flags Details
(1/2) Clean up filterAllSites (4.02 KB, patch)
2012-03-14 11:36 PDT, :Margaret Leibovic
lucasr.at.mozilla: review-
Details | Diff | Splinter Review
(2/2) Sort by new frencency calculation (2.47 KB, patch)
2012-03-14 11:40 PDT, :Margaret Leibovic
lucasr.at.mozilla: review+
Details | Diff | Splinter Review
(1/3) Clean up filterAllSites (v2) (4.27 KB, patch)
2012-03-15 11:10 PDT, :Margaret Leibovic
lucasr.at.mozilla: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review
(2/3) Sort by new frencency calculation (rebased) (2.38 KB, patch)
2012-03-15 11:11 PDT, :Margaret Leibovic
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review
(3/3) Update frecency calcuation in profile migrator (2.57 KB, patch)
2012-03-15 11:14 PDT, :Margaret Leibovic
gpascutto: review+
akeybl: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Johnathan Nightingale [:johnath] 2011-11-23 14:13:45 PST
Created attachment 576621 [details]
Linear vs. e^(-x^2)

Right now we calculate our version of frequency by taking visit count and then applying a multiplier based on the recency of the last visit. Basically

score = numVisits * (120-daysSinceLastVisit)

Blassey and I were discussing this in IRC, and I asserted that we wanted a different calculation for the date multiplier. In particular, we wanted the last couple days to all be relatively equivalent, but then it should fall off more quickly than linear (no reason for 3 month old stuff to be 30x more potent than 4 month old stuff).

After consulting with my local math ph.d, I learned that I wanted gaussian decay (e^-(x^2)).

See attachment:
- x axis is "days since last visit"
- y axis is multiplier to apply
- blue is current linear
- orange is gaussian decay with some constant factors to fit our calendars and such - 120*e^(-((x/30)^2))

Code that does the calculation currently lives here:
http://hg.mozilla.org/projects/birch/annotate/fec276ae77d5/mobile/android/base/AwesomeBarTabs.java#l520

I think we should change the calculation to use 120*e^(-((x/30)^2)) (though at this point the 120 is arbitrary, could just as easily be 100, or 10, or omitted altogether depending on how much we want recency to amplify visit count)
Comment 1 Madhava Enros [:madhava] 2011-11-23 14:16:32 PST
Also - we should give more weight to bookmarked sites. Probably also to sites that are open right now on one of your other computers. Possibly also to things you've shared.

But at least to bookmarks.
Comment 2 Matt Brubeck (:mbrubeck) 2011-11-23 14:17:30 PST
https://wiki.mozilla.org/User:Jesse/NewFrecency proposes a a similar exponential decay function for frecency, including a way to implement it efficiently.
Comment 3 Johnathan Nightingale [:johnath] 2011-11-23 18:28:44 PST
(In reply to Matt Brubeck (:mbrubeck) from comment #2)
> https://wiki.mozilla.org/User:Jesse/NewFrecency proposes a a similar
> exponential decay function for frecency, including a way to implement it
> efficiently.

Yeah - very similar approach, though Jesse's still requires computing a sum of each visit's score, whereas we may not (??) have the luxury of storing such things using the system history APIs.

The approach in comment 0 is utterly at the mercy of the last visit, but it has the advantage of being in closed form, so as far as efficiency goes it ought to be O(1) (though in actual fact it won't be since it requires knowing the last visit to the page, which means history queries, which I hope are O(log N) on the size of the history. I think, though, that at least *one* hit to the history DB will be required of any awesomeness computation :)
Comment 4 Johnathan Nightingale [:johnath] 2011-11-25 12:07:42 PST
After a bit of coefficient twiddling in a scratchpad, I'd assert that this gives us approximately what we'd want -- yesterday is almost as good as today, 3 days ago is still pretty strong, a week ago is lower but still solid. Visits from any time in the last month still get amplified, but older stuff flatlines.


function multiplier(age) {
 var score = Math.exp( -(Math.pow((age/15),2) ))
 return Math.max(1, 100*score);
}

multiplier(0); /* 100 */
multiplier(1); /* 99.55654174830929 */
multiplier(3); /* 96.07894391523232 */
multiplier(7); /* 80.43041560655723 */
multiplier(14);/* 41.84863060425645 */
multiplier(30);/* 1.8315638888734178 */
multiplier(60);/* 1 */


Of course, it may all be moot, because I don't think sqlite has built in pow and exp functions, and I suspect (??) that we're not in the mood to add new stored functions to the system sqlite?
Comment 5 Madhava Enros [:madhava] 2012-01-18 09:53:43 PST
Additionally - do we, on desktop, give extra weight in the results when your entered string matches the _beginning_ of a URL or title? Versus a set of characters in the middle of a string, which is less likely how a user is going to be looking for things.
Comment 6 Madhava Enros [:madhava] 2012-01-18 10:01:48 PST
Actually - that should be matching to the beginning of a word in the title or section of a URL.
Comment 7 Johnathan Nightingale [:johnath] 2012-01-18 10:02:01 PST
I should note that the math in comment 4 is just a question of how to weight visit count - we should probably also be deliberate about how much weight to give a bookmark here. How should a bookmark weigh against a site you visited 5 times yesterday?

If I check my gut, I'd say

- 3 visits in the last day is worth more than a bookmark
- a bookmark is worth more than 1 visit last week, but probably not more than 1 visit yesterday

to that end, I suspect "being a bookmark" should be worth a 100 point bonus (presumably they will also get visit points, because presumably you also visit your bookmarked sites). 

Also, the code that does this now lives here: http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/db/LocalBrowserDB.java#94

The link in comment 0 dead ends, now that birch has been resorbed into the collective.
Comment 8 Wesley Johnston (:wesj) 2012-01-19 07:59:55 PST
If it helps, Math people will cringe, but we used to use Lorentzians (or apparently Cauchy distributions to smarty math people) to approximate exponentials.

http://en.wikipedia.org/wiki/Cauchy_distribution

I think they fall off a little faster than Gaussians though, but something like:

 var score = 15^2 / (age^2 + 15^2)
Comment 9 Madhava Enros [:madhava] 2012-01-26 11:56:46 PST
I've opened Bug 721476 because we seem not to strain out duplicates in top sites or search results.
Comment 10 Mark Finkle (:mfinkle) (use needinfo?) 2012-02-02 09:28:36 PST
*** Bug 722709 has been marked as a duplicate of this bug. ***
Comment 11 :Margaret Leibovic 2012-03-09 15:16:40 PST
So, I started looking into this, and I'm trying to figure out where/how we'll compute this new score multiplier. As johnath mentioned in comment 4, we don't have built-in pow or exp functions in sqlite, so should we look into adding those? (Are we using our own sqlite now? Would this be easy or hard? I'm not familiar with how to go about doing this.)

Alternately, now that we're dependent on LocalDB, we could change our DB schema to add a frecency column, and compute these values in the background. This option sounds like it would be more complicated to implement, and we'd have to be careful to avoid regressing performance, but it might be better in the long run if it could lead to things like syncing frecency. (Did XUL fennec do that? To do that, would we need the same frecency calculation as desktop?)

Anyway, anyone care to advise on what our course of action should be here?
Comment 12 Gian-Carlo Pascutto [:gcp] 2012-03-10 01:12:25 PST
>So, I started looking into this, and I'm trying to figure out where/how we'll 
>compute this new score multiplier. As johnath mentioned in comment 4, we don't 
>have built-in pow or exp functions in sqlite, so should we look into adding those? 
>(Are we using our own sqlite now? Would this be easy or hard? I'm not familiar 
>with how to go about doing this.)

We can use our own SQLite, but you will likely need to add connection caching to it if you want to use it for performance sensitive stuff (right now we use it for compatibility with Gecko-originating databases).

Comment 8 explains how to approximate it with math that uses basic operations. This should allow you to do the calculation in current SQLite - if you choose to go that route.
Comment 13 Johnathan Nightingale [:johnath] 2012-03-12 09:18:20 PDT
Created attachment 604976 [details]
Linear vs. e^(-x^2) vs Cauchy

Cauchy, in green, is a not-bad approximation, as Wes says, and only requires multiplication. I'd say ship it (with the rest of this bug's bookmark bonuses and such), it's gotta be an improvement over linear.
Comment 14 Lucas Rocha (:lucasr) 2012-03-12 09:26:18 PDT
One route could be to update a frecency score column in the history table on each URL visit. Then the query would be a simple SORT BY frecency. Need to investigate implications for sync though.
Comment 15 Johnathan Nightingale [:johnath] 2012-03-12 09:29:01 PDT
(In reply to Lucas Rocha (:lucasr) from comment #14)
> One route could be to update a frecency score column in the history table on
> each URL visit. Then the query would be a simple SORT BY frecency. Need to
> investigate implications for sync though.

I think that's a better long term play, if it's robust and not-sync-breaking. To me it feels like follow-up work, though. If my vague mathematical conjectures hold water, we can make things better right now by just replacing the math with different math. Do you agree?
Comment 16 Mark Finkle (:mfinkle) (use needinfo?) 2012-03-12 09:29:41 PDT
(In reply to Johnathan Nightingale [:johnath] from comment #15)
> (In reply to Lucas Rocha (:lucasr) from comment #14)
> > One route could be to update a frecency score column in the history table on
> > each URL visit. Then the query would be a simple SORT BY frecency. Need to
> > investigate implications for sync though.
> 
> I think that's a better long term play, if it's robust and
> not-sync-breaking. To me it feels like follow-up work, though. If my vague
> mathematical conjectures hold water, we can make things better right now by
> just replacing the math with different math. Do you agree?

I agree.
Comment 17 Lucas Rocha (:lucasr) 2012-03-12 09:33:45 PDT
(In reply to Johnathan Nightingale [:johnath] from comment #15)
> (In reply to Lucas Rocha (:lucasr) from comment #14)
> > One route could be to update a frecency score column in the history table on
> > each URL visit. Then the query would be a simple SORT BY frecency. Need to
> > investigate implications for sync though.
> 
> I think that's a better long term play, if it's robust and
> not-sync-breaking. To me it feels like follow-up work, though. If my vague
> mathematical conjectures hold water, we can make things better right now by
> just replacing the math with different math. Do you agree?

Agree. The only limitation we'd have is that the frecency calculation wouldn't be able to account for bookmarks though (as they're in a separate table). But, yes, that's a follow-up.
Comment 18 Tony Chung [:tchung] 2012-03-13 13:25:08 PDT
*** Bug 732432 has been marked as a duplicate of this bug. ***
Comment 19 Johnathan Nightingale [:johnath] 2012-03-14 10:49:43 PDT
Via comment 4's scratchpad, here's what the updated math does:

function multiplier(age) {
// var score = Math.exp( -(Math.pow((age/15),2) ))
 var score = 225 / ( age*age + 225); 
 return Math.max(1, 100*score);
}

multiplier(0);  /* 100 */
multiplier(1);  /* 99.5575221238938 */
multiplier(3);  /* 96.15384615384616 */
multiplier(7);  /* 82.11678832116789 */
multiplier(14); /* 53.44418052256532 */
multiplier(30); /* 20 */
multiplier(60); /* 5.88235294117647 */
multiplier(90); /* 2.7027027027027026 */

More gradual fall off, but still decently steep.
Comment 20 :Margaret Leibovic 2012-03-14 11:36:45 PDT
Created attachment 605866 [details] [diff] [review]
(1/2) Clean up filterAllSites

The query in filterAllSites was really difficult to read, so I decided to break it apart using temp variables. There is some repetition here now, but I feel like it's worth it to understand what's going on. (Also, it was confusing that there was both a |urlFilter != null ? ... | and |urlFilter == null ? ...| in the query, so this makes it clearer what's going on in each case).
Comment 21 :Margaret Leibovic 2012-03-14 11:40:57 PDT
Created attachment 605875 [details] [diff] [review]
(2/2) Sort by new frencency calculation

This updates sortOrder to use the Cauchy approximation we've been talking about. Since we need bug 721731 before we can take bookmarks into account, we can split off that twiddling into a follow-up.
Comment 22 Gian-Carlo Pascutto [:gcp] 2012-03-14 12:01:26 PDT
Margaret, can you patch your new formula into the Profile Migrator? Look for the comment "// see BrowserDB.filterAllSites for this formula".

If you figure out a way to share this so we don't have to do it in the future, that's even more awesome.
Comment 23 Lucas Rocha (:lucasr) 2012-03-15 04:25:45 PDT
Comment on attachment 605866 [details] [diff] [review]
(1/2) Clean up filterAllSites

Review of attachment 605866 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/db/LocalBrowserDB.java
@@ +123,5 @@
> +            int limit, CharSequence urlFilter) {
> +        // The history selection queries for sites with a url or title
> +        // containing the constraint string
> +        String historySelection = "(" + History.URL + " LIKE ? OR " +
> +                                        History.TITLE + " LIKE ?)";

final String

@@ +124,5 @@
> +        // The history selection queries for sites with a url or title
> +        // containing the constraint string
> +        String historySelection = "(" + History.URL + " LIKE ? OR " +
> +                                        History.TITLE + " LIKE ?)";
> +        String historySelectionArg = "%" + constraint.toString() + "%";

final String

@@ +129,5 @@
> +
> +        // ORDER BY is number of visits times a multiplier from 1 - 120 of how recently the site
> +        // was accessed with a site accessed today getting 120 and a site accessed 119 or more
> +        // days ago getting 1
> +        String sortOrder = History.VISITS + " * MAX(1, (" +

final String

@@ +133,5 @@
> +        String sortOrder = History.VISITS + " * MAX(1, (" +
> +                           History.DATE_LAST_VISITED + " - " + System.currentTimeMillis() + ") / 86400000 + 120) DESC";
> +
> +        Cursor c;
> +        if (urlFilter != null) {

I'd prefer something like:

String selection = historySelection;
String[] selectionArgs = new String[] { historySelectionArg, historySelectionArg};
if (urlFilter != null) {
    selection = DBUtils.concatenateWhere(selection, "(" + History.URL + " NOT LIKE ?));
    selectionArgs = DBUtils.appendSelectionArgs(selectionArgs, new String[] { urlFilter.toString() });
}

Cursor c = cr.query(historyUriWithLimit(limit),
                    projection,
                    selection,
                    selectionArgs,
                    sortOrder);

Simpler and no code duplication.
Comment 24 Lucas Rocha (:lucasr) 2012-03-15 04:28:06 PDT
Comment on attachment 605875 [details] [diff] [review]
(2/2) Sort by new frencency calculation

Review of attachment 605875 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.

::: mobile/android/base/db/LocalBrowserDB.java
@@ +132,5 @@
> +        // Since we're limited by the math we can do with sqlite, we're calculating this
> +        // approximation using the Cauchy distribution: multiplier = 15^2 / (age^2 + 15^2).
> +        // Using 15 as our scale parameter, we get a constant 15^2 = 225. Following this math,
> +        // frecencyScore = numVisits * max(1, 100 * 225 / (age*age + 225)). (See bug 704977)
> +        String age = "(" + History.DATE_LAST_VISITED + " - " + System.currentTimeMillis() + ") / 86400000";

final String
Comment 25 Lucas Rocha (:lucasr) 2012-03-15 11:01:11 PDT
Before I forget, please file a follow-up bug to write tests for the frecency code.
Comment 26 :Margaret Leibovic 2012-03-15 11:10:54 PDT
Created attachment 606296 [details] [diff] [review]
(1/3) Clean up filterAllSites (v2)

Thanks for the DBUtils tip. That makes this nicer.
Comment 27 :Margaret Leibovic 2012-03-15 11:11:30 PDT
Created attachment 606297 [details] [diff] [review]
(2/3) Sort by new frencency calculation (rebased)
Comment 28 :Margaret Leibovic 2012-03-15 11:14:53 PDT
Created attachment 606298 [details] [diff] [review]
(3/3) Update frecency calcuation in profile migrator

I wasn't sure if there was a way to factor out the age calculation like I did in my other patch, so I just did it inline here.
Comment 29 Lucas Rocha (:lucasr) 2012-03-16 10:23:03 PDT
Comment on attachment 606296 [details] [diff] [review]
(1/3) Clean up filterAllSites (v2)

Review of attachment 606296 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.
Comment 30 Gian-Carlo Pascutto [:gcp] 2012-03-16 10:51:51 PDT
Comment on attachment 606298 [details] [diff] [review]
(3/3) Update frecency calcuation in profile migrator

Review of attachment 606298 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/ProfileMigrator.java
@@ +152,5 @@
>          "       places.title            AS p_title, "     +
>          "       MAX(history.visit_date) AS h_date, "      +
>          "       COUNT(*) AS h_visits, "                   +
>          // see BrowserDB.filterAllSites for this formula
> +        "       MAX(1, 100 * 225 / (" + 

nit: trailing whitespace

@@ +446,5 @@
>              try {
>                  final String[] queryParams = new String[] {
>                      /* current time */
>                      Long.toString(System.currentTimeMillis()),
> +                    Long.toString(System.currentTimeMillis()),

nit: factor this out. Although it doesn't matter, there's subtleties here like the times not necessarily being equal.
Comment 33 :Margaret Leibovic 2012-03-19 11:39:38 PDT
Comment on attachment 606296 [details] [diff] [review]
(1/3) Clean up filterAllSites (v2)

[Approval Request Comment]
Mobile-only. Blocking fennec-1.0.
Comment 34 :Margaret Leibovic 2012-03-19 11:40:01 PDT
Comment on attachment 606297 [details] [diff] [review]
(2/3) Sort by new frencency calculation (rebased)

[Approval Request Comment]
Mobile-only. Blocking fennec-1.0.
Comment 35 :Margaret Leibovic 2012-03-19 11:40:22 PDT
Comment on attachment 606298 [details] [diff] [review]
(3/3) Update frecency calcuation in profile migrator

[Approval Request Comment]
Mobile-only. Blocking fennec-1.0.
Comment 36 Alex Keybl [:akeybl] 2012-03-20 13:04:06 PDT
Comment on attachment 606296 [details] [diff] [review]
(1/3) Clean up filterAllSites (v2)

[Triage Comment]
Mobile only - approved for Aurora 13.

Note You need to log in before you can comment on or make changes to this bug.