Closed Bug 1135738 Opened 9 years ago Closed 1 month ago

Build followup newtab telemetry experiment that can compute uniques

Categories

(Content Services Graveyard :: Tiles, defect, P1)

defect
Points: 5

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: Mardak, Assigned: mzhilyaev)

References

Details

(Whiteboard: .004)

Attachments

(1 file, 3 obsolete files)

One of the main drawbacks of bug 1062708's experiment implementation was that it prevented some types of analysis that require data once from each user instead of all impressions from all users.

For example, when we tried to analyze co-occurrence, a single user who has siteA and siteB together and opened 1000 new tabs would skew the data heavily.

The thinking is we can measure uniques by having the experiment flag a submission as "first"; on the server, we can then do analysis on all "first" submissions. This avoids the need to send any persistent IDs to de-dupe/find firsts on the server.
There are two reasons for re-running the telemetry experiment: audience sizing and co-occurrence analysis. The newtab impression data collected in the previous experiment did not help with either.
The details of why, and notes on how the new telemetry experiment will help, are below.
Also note the argument for increasing the experiment sample size and using clear-text frecency scores.

Audience sizing:

We met with partner-facing sales to understand the minimal data collection needed to satisfy
the immediate business needs of the "related tiles" product. The minimum deliverable is the ability to estimate unique users and impressions per marketing campaign.

- A partner provides a list of targeted sites
  (an example is here: https://bugzilla.mozilla.org/show_bug.cgi?id=1132534#c12)
- The browser checks whether any of the targeted sites occur in the top 100 sites of the user's history
- If so, the browser plays that campaign's creative

The partner needs estimates of:
- unique users whose top frecent sites match one or more sites being targeted
- potential impressions the campaign may generate

To provide such estimates we have to collect the top frecent sites per user, because we must aggregate the same data that the actual targeting uses. The current in-browser targeting implementation is in https://bugzilla.mozilla.org/show_bug.cgi?id=1126183. The new experiment uses the same code in PlacesLinkProvider to access the top links from user history.

Besides extending the number of sites sent to the endpoint, we also need to aggregate impressions from distinct users as well as total impressions.

Distinct-user aggregation is achieved by setting first-ping and day-ping flags on the impression record:
- first-ping is set on the first record sent to the endpoint
- day-ping is set on the first record seen after the GMT day changes

This allows us to compute daily and overall distinct-user counts per site or campaign, and it enables co-occurrence analysis to remove noise from impressions generated by the same user.
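
For illustration, a minimal sketch of how the flags could be set (the pref names below match the ones used in the attached patch, but the payload field names and the pref branch are placeholders, not the actual experiment code):

const {utils: Cu} = Components;
Cu.import("resource://gre/modules/Services.jsm");

const PREF_BRANCH = "experiments.newtab-data-beta-v2.";  // placeholder branch name
const FIRST_PING_SENT = PREF_BRANCH + "first-ping-sent";
const LAST_PING_DAY = PREF_BRANCH + "last-ping-day";
const DAY_MILLISECONDS = 24 * 60 * 60 * 1000;

function annotatePing(payload) {
  // first-ping: set exactly once, on the very first record this profile sends
  if (!Services.prefs.prefHasUserValue(FIRST_PING_SENT)) {
    payload.firstPing = true;
    Services.prefs.setBoolPref(FIRST_PING_SENT, true);
  }
  // day-ping: set on the first record after the GMT day changes
  let today = Math.floor(Date.now() / DAY_MILLISECONDS);  // days since epoch, GMT
  let lastDay = Services.prefs.prefHasUserValue(LAST_PING_DAY) ?
                Services.prefs.getIntPref(LAST_PING_DAY) : -1;
  if (lastDay != today) {
    payload.dayPing = true;
    Services.prefs.setIntPref(LAST_PING_DAY, today);
  }
  return payload;
}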

Co-occurrence analysis:

The results and issues encountered are listed in https://bugzilla.mozilla.org/show_bug.cgi?id=1129938
The conclusion is that one user generating many newtab impressions fools the system into recognizing sites that co-occur in that user's history as related (bug 1129938 has a doc attached that details the problem).

Another severe limitation is that we only collected sites actually shown on newtab (15 max). This placed many utility sites (booking, ticketing, e-commerce, banking, etc.) off our radar: a relationship may exist between booking.com and ticketmaster.com, but we could not identify it because neither site is visited consistently enough by a substantial number of users to make it into a newtab tile. For example, booking.com was shown only 946 times on newtab during the previous experiment. Given that many (if not most) of these impressions came from just a handful of users, this is an incredibly low number (especially for a month's worth of data). We are cautiously optimistic that collecting top frecent sites (instead of just the shown ones) will greatly improve our related-site clustering quality.

Number of users to collect data from:

The previous experiment covered only 25% of the beta population. Unless there are strong privacy implications, I suggest we increase it to 100%. I already mentioned that the impression count (and user count) for booking.com is statistically insignificant. Collecting top frecent sites will definitely help, but we do not know to what degree.

Below are impression counts for golf sites from the prior experiment:
golfnow.com,220
golfsmith.com,169
pga.com,70
golf.com,43
lpga.com,38
golfchannel.com,37
golfdigest.com,19
rockbottomgolf.com,5
golfwrx.com,4

Should there be a partner wanting to target golfers, we can't show them these numbers - we will be laughed at. Out of 500K users and billions of impressions, there are only 605 impressions from the golfing audience?! We should be aggressive in data collection, for we need to serve clients wanting to target narrow audiences like golfers (or hobbyists, or jazz enthusiasts), and that requires larger coverage. If we can go for 100%, let's go for 100%.

Frecency scores:

Frecency scores were obfuscated in the prior experiment, which made them virtually unusable.
I do not see how this helps privacy when we collect the 100 top frecent sites.
Hence, I suggest reporting frecency scores as they are.
It will help us study the tail distribution of low-scored sites, which is critical to understanding how sites become available for targeting and how the targeting algorithm works. For example, if booking.com targets expedia.com, should we play the booking.com ad when expedia.com has a low score or a high score, or does it matter? Can we catch the user's intent before their history is flooded with ticketing sites, yet avoid playing the ad on accidental visits?

I am not saying frecency scores will answer these questions, but they might. Which is why I suggest collecting them in the clear.
Attached patch new-tab telemetry experiment V2 (obsolete) — Splinter Review
Attachment #8568825 - Flags: review?(felipc)
Note that after conversing with Mardak, we decided to keep frecency scores obfuscated as in the previous experiment.
(In reply to maxim zhilyaev from comment #1)
> The previous experiment covered only 25% of beta population. Unless there
> are strong privacy implications, I suggest we increase it to 100%.
If there's enough statistical random sampling, I would have assumed we could just multiply the results by 4 to go from 25% to 100%.

The low counts on golf might just be that they don't appear in many users' top 12 but might appear in the top 100, which is what we're using to determine if a suggested tile should be shown.
(In reply to Ed Lee :Mardak from comment #4)
> (In reply to maxim zhilyaev from comment #1)
> > The previous experiment covered only 25% of beta population. Unless there
> > are strong privacy implications, I suggest we increase it to 100%.
> If there's enough statistical random sampling, I would have assumed we could
> just multiply the results by 4 to go from 25% to 100%.

The key here is "enough statistical random sampling". 
What's enough and what's not turned out to be a tricky question that took me a day to answer.
The number sorcery can be found here (warning: reading this doc would be quite masochistic):
https://docs.google.com/document/d/1rZ9LRFXu5nAdT85VJZwr1dGX8TgTwSVHUuPeOkQlikI/edit#

The end result of that study is the following.
There are 4 parameters of importance:
N - size of the FX population (in a given country/locale)
K - size of the targeted audience we want to predict from the telemetry experiment
E - error limit that we want to guarantee (for example, we guarantee that the estimated audience will deviate from the real audience by no more than 25%)
S - sample size needed to provide such an error-limit guarantee

These parameters are related via this formula: K/N = 9 / (E*E*S + 9)
Note that K/N is the audience size as a fraction of the total population (quoted as a percentage below).
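
As a quick sanity check against the country figures below, the relation can be evaluated directly (a small sketch; the function name is just for illustration):

function minAudienceFraction(S, E) {
  // minimal K/N that a sample of size S can estimate with relative error at most E
  return 9 / (E * E * S + 9);
}

minAudienceFraction(400000, 0.25);  // ~0.00036 -> ~0.036% (US full beta)
minAudienceFraction(100000, 0.25);  // ~0.0014  -> ~0.14%  (US 25% of beta)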

So, what is the minimal audience size we can estimate with a 25% error guarantee?

US: total FX users 30M, beta users 400K
full beta - 0.036% of population
25% of beta - 0.14% of population

Canada: total FX users 4M, beta users 40K
full beta - 0.3% of population
25% of beta - 1.4% of population

Spain: total FX users 5M, beta users 20K
full beta - 0.7% of population
25% of beta - 2.7% of population

Imaginary Max's island: total FX users 1M, beta users 5K
full beta - 2.7% of population
25% of beta - 10% of population

For countries/locales less represented in beta, the minimum audience size we can estimate with acceptable accuracy grows pretty rapidly. It's entirely possible that a marketer wants to target golfers in Chile.
What do we tell them? If Chile has only 5K beta users, we can't provide the customer reliable estimates of their targeted audience unless it's 10% of all FX users in Chile!

Which is why I strongly suggest running the experiment on full beta.
If we start sizing audiences across the globe and do not want to embarrass Mozilla and our customers with estimates that are orders of magnitude wrong, I see no other way but to use full beta. And even this will be insufficient in numerous cases.
Also note that a 25% error guarantee could be way too optimistic. Could partners sensibly budget their campaigns if we tell them that their audience is somewhere between 75K and 125K users?
Priority: -- → P1
Hi Maxim, wrong patch posted? Or does the code live somewhere else?
Flags: needinfo?(mzhilyaev)
Attached patch new-tab telemetry experiment v2 (obsolete) — Splinter Review
reattaching correct patch
Attachment #8568825 - Attachment is obsolete: true
Attachment #8568825 - Flags: review?(felipc)
Attachment #8571528 - Flags: review?(felipc)
Felipe, please take a look at the new patch.
Iteration: 39.1 - 9 Mar → 39.2 - 23 Mar
Comment on attachment 8571528 [details] [diff] [review]
new-tab telemetry experiment v2

clearing need info flag from felipc, as new attachment is provided
Flags: needinfo?(mzhilyaev)
Comment on attachment 8571528 [details] [diff] [review]
new-tab telemetry experiment v2

Review of attachment 8571528 [details] [diff] [review]:
-----------------------------------------------------------------

All minor issues, but I'd like to see the updated patch once before giving final r+

::: experiments/newtab-data-beta-v2/code/bootstrap.js
@@ +22,5 @@
> +// Allowed ping actions remotely stored as columns: case-insensitive [a-z0-9_]
> +const PING_ACTIONS = ["block", "click", "pin", "sponsored", "sponsored_link", "unpin", "view"];
> +
> +// Preferences
> +let rootPrefsBranch = Components.classes["@mozilla.org/preferences-service;1"].getService(Components.interfaces.nsIPrefService);

You can shorten this to Cc or Ci, or better, Services.prefs.
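
For reference, the two shorter forms look roughly like this (illustrative only, assuming the usual destructuring at the top of bootstrap.js):

const {classes: Cc, interfaces: Ci, utils: Cu} = Components;
Cu.import("resource://gre/modules/Services.jsm");

// Cc/Ci shorthand:
let viaShorthand = Cc["@mozilla.org/preferences-service;1"].getService(Ci.nsIPrefService);
// or, simpler, via Services.jsm:
let viaServices = Services.prefs;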

@@ +29,5 @@
> +const FIRST_PING_SENT = "first-ping-sent";
> +const LAST_PING_DAY = "last-ping-day";
> +
> +// leading www. replacement regex
> +const WWW_REGEX = /www\./;

probably want to make this /^www\./, to remove www. only from the beginning of a url, right?

@@ +32,5 @@
> +// leading www. replacement regex
> +const WWW_REGEX = /www\./;
> +
> +// number of milliseconds in a day
> +const DAY_MILISECONDS = 86400000000

this looks multiplied by 1000 too much. Please use the expression "24*60*60*1000" to make it easier to verify.

@@ +197,5 @@
> +function uninstall(data, reason) {
> +  removeAllPrefs(data.id);
> +}
> +
> +function startup(data, reason) {

please use the following pattern:

let gStarted = false;

function startup(data, reason) {
  if (gStarted) {
    return;
  }
  gStarted = true;
  ...
}

There's a bug that causes startup() to be called twice for experiments at install time.

::: experiments/newtab-data-beta-v2/manifest.json
@@ +2,5 @@
> +  "publish"     : true,
> +  "priority"    : 5,
> +  "name"        : "New Tab Data V2",
> +  "description" : "An experiment to analyze the data on about:newtab, see bugs 1062708 and 1135738.",
> +  "info"        : "<p><a href=\"https://bugzilla.mozilla.org/show_bug.cgi?id=1062708\">Related bug</a></p>",

update this related bug

@@ +6,5 @@
> +  "info"        : "<p><a href=\"https://bugzilla.mozilla.org/show_bug.cgi?id=1062708\">Related bug</a></p>",
> +  "manifest"    : {
> +    "id"               : "newtab-data-beta-v2@experiments.mozilla.org",
> +    "startTime"        : 1418169600,
> +    "endTime"          : 1420848000,

This endTime is 2015-01-09. Needs to be updated

@@ +12,5 @@
> +    "appName"          : ["Firefox"],
> +    "channel"          : ["beta"],
> +    "minVersion"       : "33.0",
> +    "maxVersion"       : "37.*",
> +    "sample"           : 0.99

is 99% intentional? If you want 100% you can omit "sample" from the json.
Attachment #8571528 - Flags: review?(felipc) → feedback+
Attached patch V3 of new-tab-v2 experiment (obsolete) — Splinter Review
Fixed the issues from the reviewer comments; the experiment is set to run between March 20 and April 20 of 2015, on the full beta population.
Attachment #8571528 - Attachment is obsolete: true
Attachment #8577011 - Flags: review?(felipc)
Comment on attachment 8577011 [details] [diff] [review]
V3 of new-tab-v2 experiment

Review of attachment 8577011 [details] [diff] [review]:
-----------------------------------------------------------------

::: experiments/newtab-data-beta-v2/code/install.rdf
@@ +12,5 @@
> +    <em:targetApplication>
> +      <Description>
> +        <em:id>{ec8030f7-c20a-464f-9b0e-13a3a9e97384}</em:id>
> +        <em:minVersion>33.0a1</em:minVersion>
> +        <em:maxVersion>37.0</em:maxVersion>

beta is 37 now, but until the end of the experiment it will become 38
Attachment #8577011 - Flags: review?(felipc) → review+
Comment on attachment 8577011 [details] [diff] [review]
V3 of new-tab-v2 experiment

>+++ b/experiments/newtab-data-beta-v2/manifest.json
>+    "startTime"        : 1426896000,
>+    "endTime"          : 1429574400,
>+    "maxActiveSeconds" : 604800,
>+    "appName"          : ["Firefox"],
>+    "channel"          : ["beta"],
>+    "minVersion"       : "33.0",
>+    "maxVersion"       : "41.*",
maksik currently has the experiment running on 100% of beta users from March 21 to April 21 (4 weeks) for 7 days.

bsmedberg, would it make more sense to run it for a shorter time, e.g., March 21 to April 7 (2 weeks) to avoid blocking out other telemetry experiments from running? We want to be able to measure the effects of weekday vs weekend traffic, so that's why the 7 days for a given user. But running the experiment for a longer time is more to catch the users who don't use Firefox every day, so running it for 2 additional weeks vs just the first 2 weeks has relatively small benefits.

Just making sure there aren't other telemetry experiments that might be happening around the same time. (Although I'm not entirely sure what happens with overlap other than only one runs at a time.)
Flags: needinfo?(benjamin)
We typically don't run any experiment at 100%. Can we do this with a 20% sample?
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> We typically don't run any experiment at 100%. Can we do this with a 20%
> sample?

Benjamin, 

Let me try to argue the need for a sample larger than 20%, preferably 100%.

It is my understanding that our advertisement clients will rely heavily on our inventory projections, to the extent of budgeting campaigns based on the numbers we provide. If our projections are wildly wrong, clients will under- or over-budget, which will erode industry trust in the Mozilla brand. Other publishers provide exact inventory projections, which makes the issue even more sensitive.

We project inventory by summing up newtab impressions for targeted sites collected via the telemetry experiment. The previous experiment did not have data for many sites that comprise some targeting lists. Consider “Parenting” sites and the corresponding newtab impressions from the prior experiment (1 month on 25% of beta):

parenting sites      newtab impressions
---------------------------------------
babycenter.com       125
education.com        113
parents.com           88
whattoexpect.com      36
parenting.com         24
todaysparent.com       4
babiesonline.com       0
babyzone.com           0
parenting.org          0
thebabycorner.com      0
todaysparent.com       0

Yes, theoretically, we can add up counts and multiply by 500, but for half of the list we plainly have no data! And this is typical for many common targeting categories: Golf, Weddings, Tennis, Cooking, Design, to name a few:

golf sites           newtab impressions
---------------------------------------
pgatour.com          243
pga.com               70
golf.com              43
golfchannel.com       37
golfdigest.com        19
golfillustrated.com    0
golfreview.com         0
golftipsmag.com        0
pga.com                0
pgatour.com            0
usga.org               0
usopen.com             0

Moderately popular sites with focused content have a hard time getting represented in beta! This is a serious issue for us - usopen.com has visitors, but we have no way of telling how many. The second experiment will collect more history, which will ease the problem, but I doubt it will eliminate it.

Statistical literature typically suggests a 5% sample of the population to be representative; sometimes a 1% sample is mentioned as the lowest limit. But 20% of US beta is roughly 0.02% of the full US FX population. Given the potential business impact, projections based on a 0.02% sample put us at serious risk. I would much rather use full beta now and compare the estimates with reports from the real targeting campaigns that we will run for Mozilla-internal tiles. This way we will learn how accurate our estimates are, and how we can reduce the sample size in the future.


my 2c
The alternative would be to run a series of experiments, each having a lower sample %.
For example, we may run 4 experiments each taking 25% sample.

25% from March 15 to April 1
25% from April 1  to April 15
25% from April 15 to May 1
25% from May 1 to May 15

In order for this solution to work, we need to ensure that the same users do not participate in multiple experiments, otherwise our counting will come out wrong. Is there a way to engineer something like that?
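
One possible way to engineer that (purely a sketch on my part, not something the experiments system provides out of the box as far as I know) is to assign each profile a persistent random bucket once, and have each of the four experiments only collect data for its own bucket:

const {utils: Cu} = Components;
Cu.import("resource://gre/modules/Services.jsm");

const BUCKET_PREF = "experiments.newtab-data.cohortBucket";  // placeholder pref name
const NUM_BUCKETS = 4;

function getCohortBucket() {
  // Pick a bucket once per profile and remember it, so the same user can
  // never fall into more than one of the four 25% experiments.
  if (!Services.prefs.prefHasUserValue(BUCKET_PREF)) {
    Services.prefs.setIntPref(BUCKET_PREF, Math.floor(Math.random() * NUM_BUCKETS));
  }
  return Services.prefs.getIntPref(BUCKET_PREF);
}

// Experiment N (0..3) would then only report if getCohortBucket() == N.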
> but for half of the list we plainly have no data! 

The reason for us not having data for the sites in my examples is that they are not included in the eTLD1.whitelist of the 50K most popular sites! Which is why we miss them :(

The only suspect is babyzone.com, which is in the eTLD1.whitelist and is missing from the en-US site counts, but it's a Netherlands site, which explains why the US data has 0 for it.

The examples in my prior comment are wrong, but the problem still exists.
I went through all sites listed in our EdRules categorization list here: https://github.com/Mardak/interestReference/blob/master/interestReference.json

I extracted sites with 0 count and joined them with eTLD1.whitelist and then with Alexa ranking.
Below is the list of sites with 0 count for US:

site                 alexa rank

nme.com              4687
pastemagazine.com    7602
good.is              8263
mymodernmet.com      11336
thefashionspot.com   11744
99u.com              15985
roughguides.com      17685
dwell.com            20945
sqlite.org           21175
flexonline.com       21379
dimemag.com          22158
popphoto.com         23432
contemporist.com     24748
coolhunting.com      25357
startupnation.com    26633
uxmag.com            28272
weddingchicks.com    28461
automobilemag.com    30512
cameralabs.com       31692
boxingnews24.com     32804
lover.ly             33994
allmovie.com         35654
ruffledblog.com      39000
onstartups.com       43747
gardenista.com       48116
weddingchannel.com   56089
baseballamerica.com  72511


So, out of 285 targeted sites listed in the eTLD1.whitelist, we are missing 27, which is about 10% of targeted sites; perhaps that's bearable. However, for another 10% of our site selection the counts are low (under 10 impressions). Perhaps that's bearable too, but we do not know which targeting customers will want to execute, and it could be something we do not have data for. So, I still feel very uncomfortable doing estimations on 0.02% of the FX population, and would still like to run on full beta.
The unknowns are too many, the risks are too high... We have no idea what sample size is needed for our projections to be accurate...

Unless there are serious reasons not to run on full beta, why don't we take all the data we can?
Depends on: 1144815
Depends on: 1144821
The table below shows how the scaling error depends on beta sample size for audiences of decreasing size.
The table is computed for a US population of 45M and a full beta sample of 400K.
jterry, please provide feedback on the lowest sample size that supports acceptable precision for audience estimation for potential customers.

+---------------+-----------+----------+----------+----------+----------+
| audience size |                   scaling error in %                  |
+               +-----------+----------+----------+----------+----------+
|               | full beta | 75% beta | 50% beta | 25% beta | 20% beta |
+---------------+-----------+----------+----------+----------+----------+
|        200000 |       4.7 |      5.5 |      6.7 |      9.5 |     10.6 |
|        100000 |       6.7 |      7.7 |      9.5 |     13.4 |       15 |
|         75000 |       7.7 |      8.9 |       11 |     15.5 |     17.3 |
|         50000 |       9.5 |       11 |     13.4 |       19 |     21.2 |
|         25000 |      13.4 |     15.5 |       19 |     26.8 |       30 |
|         10000 |      21.2 |     24.5 |       30 |     42.4 |     47.4 |
|          5000 |        30 |     34.6 |     42.4 |       60 |     67.1 |
|          1000 |      67.1 |     77.5 |     94.9 |    134.2 |      150 |
+---------------+-----------+----------+----------+----------+----------+

For reference - all supporting math is attached to Bug# 1144836
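
For what it's worth, the table values appear consistent with a simple approximation: scaling error ≈ 2 / sqrt(expected number of audience members captured in the sample), i.e. roughly a 95% bound on a binomial count. A sketch of that reconstruction (mine, not the math attached to bug 1144836):

function scalingErrorPercent(audience, sampleSize, population) {
  let expectedInSample = audience * sampleSize / population;
  return 200 / Math.sqrt(expectedInSample);
}

scalingErrorPercent(200000, 400000, 45e6);  // ~4.7  (full beta row)
scalingErrorPercent(10000, 80000, 45e6);    // ~47.4 (20% beta row)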
Flags: needinfo?(jterry)
Iteration: 39.2 - 23 Mar → 39.3 - 30 Mar
We're currently building campaigns for the Engagement team, FENNEC, MDN, and other Mozilla groups, and we're trying to build out accurate expectations for campaign performance based on user clusters. In order for us to accurately deliver messaging that is relevant for particular audiences (for either internal or external partners), we need to be able to incorporate Boolean logic into our targeting parameters, which means that audience sizes will be smaller and the potential for inaccurate projections is greatly increased. The industry standard for such estimates is +/- 5%. I know we might not be able to get to that level of confidence for smaller audiences, but the closer we get, the more accurately we can gauge potential performance and opportunity.

Judging by the error table above, running on full beta seems required. If full beta is not possible, we need to know the largest sample size telemetry would allow.
Flags: needinfo?(jterry)
Benjamin,

The business would like to run on full beta as the two last comments suggest.
However, if this is not possible, could you give us the largest possible sample we can use?
Flags: needinfo?(benjamin)
Users can only run with a single experiment at a time. We have other experiments pending, and so it really is not practical to ship this as an experiment and operate on large samples of the beta population.

If you want to run on all the beta population, I think you should just build this into the train (gated on telemetry) instead of shipping it as an experiment.

Otherwise, 50% is typically the most we'd allow for a single experiment.
Flags: needinfo?(benjamin)
Iteration: 39.3 - 30 Mar → 40.1 - 13 Apr
- Retested on 38.0 and fixed a bug in the implementation of NewTabUtils.getPlacesSortedLinks()
- Replaced the Alexa whitelist with the similar-sites whitelist
Attachment #8577011 - Attachment is obsolete: true
Iteration: 40.1 - 13 Apr → 40.2 - 27 Apr
Iteration: 40.2 - 27 Apr → 40.3 - 11 May
Iteration: 40.3 - 11 May → ---
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → INCOMPLETE
