Closed Bug 1136461 Opened 9 years ago Closed 5 years ago

Include the site triggering a suggested tile in the tiles data ping

Categories

(Firefox :: New Tab Page, defect)


Tracking


RESOLVED WONTFIX

People

(Reporter: Mardak, Unassigned)

References

Details

Attachments

(1 file)

Following the decision in bug 1132534, where it became our responsibility to choose/augment the list of sites that trigger a suggested tile, we'll need data about how the suggested tile performs for each site in order to improve the experience for users.

We want to be able to add more sites to increase audience coverage while at the same time improving overall user privacy. We also want to make sure those added sites are relevant, because otherwise users will get annoyed by bad recommendations.

One straightforward approach is to send the site that triggered the suggested tile. This data would be treated the same way we've been treating tiles user data: remove personal data, compute metrics, then delete the raw data within 7 days. None of the data is associated back to any one user, as we're interested in aggregates.

The performance (most likely click-through rates) of each triggering site will be used internally to optimize, and clients will get the overall performance across all triggering sites (as they don't have control over exactly which sites we match against).

Some slight tweaks on this functionality:
- don't send triggering site data if DNT is enabled ?
- only send triggering site data if telemetry is enabled ?
- have a newtab customize menu option for users to not-send/send ?

I'm not sure if this is complicating things too much (premature-privacy-policy-optimization?), but if we do have a send/not-send control in the newtab customize menu, should it just use an existing pref... Telemetry? So by default nightly/aurora/beta would show [x] Improve new tab and Firefox experience by sharing usage data with Mozilla; and release would default to [ ] Improve new tab ....
Since we do not track users, we do not know which newtab impression reported to the backend came from which user. In this case, the procedure of adding Laplacian noise to all sites in the targeted list, described in http://www.cs.columbia.edu/igert/courses/E6898/dwork_cacm.pdf, will mask the real site triggering a suggested tile to the prescribed differential privacy guarantee.

Following the paper logic:
- there are N sites in a tile targeted list
- Suppose that a tile was shown because of site A 
- Browser creates a vector of N 0s and sets 1 for site A (call it a "triggering" vector)
- Browser adds Laplacian noise to each vector cell and sends back to Mozilla
- Laplacian noise is described here: http://en.wikipedia.org/wiki/Laplace_distribution 


What are the parameters of Laplacian noise?

Suppose a site B caused a tile to show; then its "triggering" vector will be 0 for A, 1 for B, and 0 for the rest. Hence, the maximum difference between any two "triggering" vectors is 2.
According to the paper, if newtab impressions can't be traced back to the same user (and sites in the list are not correlated), then for a prescribed differential privacy guarantee G, the noise should be:

Lap(2/G): f(z) = (G/4) * e^(-G*|z|/2)
Here z is the noise value added to each cell, a real random variable drawn from this density.
 
I have no idea what G should be, but RAPPOR (http://arxiv.org/pdf/1407.6981v2.pdf) uses G=ln(3), so we use the same (although I am totally puzzled by how G should be chosen).

Adding up values for site A will give us a distribution with expected value equal to the number of impressions suggested by site A, with standard deviation (sigma) == sqrt(8)/G = sqrt(8)/ln(3) = 2.6.
Taking a close look at the Lap(2/G) distribution shows that pretty much all of the probability mass is contained within the [-4, 4] interval. Since the number of "viewed" (or clicked) impressions per site reaches thousands, an error of +/- 4 makes for VERY precise estimations.
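
For illustration, here is a minimal sketch of the procedure in the steps above (plain JS, not the patch attached to this bug; the function names are made up, Math.random() is used only for brevity, and the scale b = 2/G follows the formula above):

// Hypothetical sketch of the procedure from this comment: build the length-N
// "triggering" vector and add independent Laplace noise to every cell.
const G = Math.log(3);        // privacy guarantee, as suggested above
const SCALE = 2 / G;          // Laplace scale b = 2/G

// Draw one sample from Laplace(0, b) via the inverse CDF.
function sampleLaplace(b) {
  let u = Math.random() - 0.5;                       // uniform in (-0.5, 0.5)
  return -b * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// numSites = N sites in the tile's targeted list; triggerIndex = position of
// the site that actually triggered the tile.
function noisyTriggeringVector(numSites, triggerIndex) {
  let vector = new Array(numSites).fill(0);
  vector[triggerIndex] = 1;
  return vector.map(v => v + sampleLaplace(SCALE));
}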

Here is a realistic example - the technology sites collection
arstechnica.com   5158
slashdot.org      3941
engadget.com      2749
gizmodo.com       1861
lifehacker.com    1779
wired.com         1582 

The numbers represent impression counts for 25% of the beta en-US/US population.
There are 40M Firefox users in the US and 400K US beta users, so the measured sample is roughly 100K users.
Hence we should expect roughly 400x these counts across the full US audience:

arstechnica.com   2063200
slashdot.org      1576400
engadget.com      1099600
gizmodo.com        744400
lifehacker.com     711600
wired.com          632800

For a reasonable CTR of 0.04%, the expected clicks per site should roughly be:
arstechnica.com 825.28
slashdot.org    630.56
engadget.com    439.84
gizmodo.com     297.76
lifehacker.com  284.64
wired.com       253.12
   
These are still high enough to keep the error around 2% even in the worst case.

Even if we are not measuring clicks precisely, since we optimize the ratio of "clicks" to "views", it is the (much larger) number of "views" that drives the optimization, not the clicks. Suppose we know "views" exactly and "clicks" with some error; then:

clicks/views = (real_clicks + Error) / views = real_clicks/views + Error/views

But if real_clicks/views = 0.0004, and the Error is 0.1 against views in the thousands, then Error/views is on the order of 0.00004, which is probably negligible. If CTRs differ across sites, they need to differ by more than 0.004% for the difference to drive an optimization decision.
(In reply to maxim zhilyaev from comment #1)
> Adding up values for site A, will give us a distribution with expected value
> equal to number of impressions suggested by site A, with standard deviation
> (sigma) == sqrt(8)/G = sqrt(8)/ln(3) = 2.6

This statement and the number example that follows it are COMPLETELY wrong.
The deviation of 2.6 does not refer to the sum of Laplace distributions at all.
Please ignore everything after "Adding up values for site A".
I am looking through the literature to figure out how number queries work and how accurate they are for a given privacy guarantee.

If anyone has good pointers, please share.
(In reply to maxim zhilyaev from comment #1)
In fact, the whole comment does not make sense (not just the part related to the noisy summation for site "A").

The "differential privacy" is about adding noise to already known query result, so that the noise covers up the difference between most dissimilar records participating in the query. What we have with site reporting is different: we want to obfuscate individual records so that we, ourselves, can't extract private information from any record, but still can reason about how many times a given site caused related tile to appear.  Due to my horrid illiteracy in privacy, I am failing to see the direct relation between these two scenarios.
Deferring this for now, as it's a nice-to-have optimization of our bucket composition and the initial suggested tiles are just the affiliate Fennec ones.
Blocks: 1140185
No longer blocks: 1120311
Whiteboard: .?
Depends on: 1142386
From bug 1149734:

Per discussion in bug 1142386, it's sufficient to run "randomized response" on ad-group sites on each ping. The algorithm is as described in this comment: https://bugzilla.mozilla.org/show_bug.cgi?id=1142386#c6

- Client matches the adGroup sites associated with a shown/clicked tile against the user's top frecent sites.
- Client constructs a JSON object in which each site in the adGroup has a value of 1 or 0.
- If an adGroup site is found in the frecency list, its bit is set to 1; otherwise 0.
- Client runs "randomized response" on each value in the JSON object.
- Client attaches the "randomized" JSON object to the tile ping, which is sent to the onyx server.
- Client re-runs "randomized response" on every ping (see the sketch below).
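
A minimal sketch of that per-ping procedure (plain JS for illustration, not the attached patch; the 0.5 truth probability and all names here are illustrative):

// Hypothetical sketch of the randomized response step described above.
const TRUTH_PROBABILITY = 0.5;     // chance of reporting the true bit

function randomizeBit(bit) {
  if (Math.random() <= TRUTH_PROBABILITY) {
    return bit;                              // tell the truth
  }
  return Math.random() < 0.5 ? 1 : 0;        // otherwise flip a fair coin
}

// adGroupSites: sites that can trigger this tile; frecentSites: user's top sites.
function buildRandomizedPayload(adGroupSites, frecentSites) {
  let frecent = new Set(frecentSites);
  let payload = {};
  for (let site of adGroupSites) {
    payload[site] = randomizeBit(frecent.has(site) ? 1 : 0);
  }
  return payload;   // attached to the tiles ping; re-generated for every ping
}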
Blocks: 1149727
Comment on attachment 8588211 [details] [diff] [review]
Randomized Response for targeted tiles V1

Review of attachment 8588211 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good overall. I think you'll need to ask adw for a formal review.

::: browser/modules/DirectoryLinksProvider.jsm
@@ +409,5 @@
>        data[action] = actionIndex;
>      }
>  
> +    // generate randomised list of targeted sites
> +    if (suggestedLink !== undefined) {

Perhaps we can do 'if (link.targetedSite)' here then use 'link' directly instead of 'suggestedLink' below. Saves a couple of lines.

@@ +410,5 @@
>      }
>  
> +    // generate randomised list of targeted sites
> +    if (suggestedLink !== undefined) {
> +      // attach randomised response and auxiliary data to ping

nit: randomised -> randomized

@@ +682,5 @@
>        // Choose the first site a user has visited as the target. In the future,
>        // this should be the site with the highest frecency. However, we currently
>        // store frecency by URL not by site.
> +      targetedSite: chosenTargetedSites.size == 0 ? null :
> +                    chosenTargetedSites.entries().next().value[0],

I believe you can do '[...chosenTargetedSites][0]' here, which is a little more concise.

@@ +708,5 @@
> +      if (Math.random() <= RANDOMIZED_RESPONSE_FREQUENCY) {
> +        // do not lie - return same bit
> +        return bit;
> +      }
> +      // otherwise return 1 or 0 with equal chance

I'm just wondering, so 50% of the time we definitely don't lie, but the other 50% of the time we may or may not lie? What if we end up not telling any lies? Do we not need to guarantee that some percentage is a lie?

@@ +709,5 @@
> +        // do not lie - return same bit
> +        return bit;
> +      }
> +      // otherwise return 1 or 0 with equal chance
> +      return (Math.random() < 0.5);

Can we use RANDOMIZED_RESPONSE_FREQUENCY instead of 0.5 here?

@@ +714,5 @@
> +    };
> +
> +    let reportedSites = [];
> +
> +    // walk targetedSites first, since thier value is true

nit: thier -> their

@@ +715,5 @@
> +
> +    let reportedSites = [];
> +
> +    // walk targetedSites first, since thier value is true
> +    for (let index in allFrecentSites) {

Since you're only using index to look up the site, you can do 'for (let site of allFrecentSites)'

::: browser/modules/test/xpcshell/test_DirectoryLinksProvider.js
@@ +1106,5 @@
> +  // start with single site set and two-site frecency list
> +  let targetedSet = new Set(["present.com"]);
> +  let frecencyList = ["present.com", "absent.com"];
> +  let N = 1000;
> +  let f = 0.5;

Could we have more descriptive names for these two variables? And/or a comment on what they are.
Attachment #8588211 - Flags: review?(msamuel) → feedback+
(In reply to Marina Samuel [:emtwo] from comment #8)
> > +      if (Math.random() <= RANDOMIZED_RESPONSE_FREQUENCY) {
> > +        // do not lie - return same bit
> > +        return bit;
> > +      }
> > +      // otherwise return 1 or 0 with equal chance
> the other 50% of the time we may or may not lie?
I believe the point is that the server doesn't know whether any given response is a lie. So even if we happen to send the same bit that we would have sent when telling the truth, the server doesn't learn anything more.

> @@ +709,5 @@
> > +      // otherwise return 1 or 0 with equal chance
> > +      return (Math.random() < 0.5);
> Can we use RANDOMIZED_RESPONSE_FREQUENCY instead of 0.5 here?
It just happens that the tweakable RANDOMIZED_RESPONSE_FREQUENCY is 0.5, and that's also the probability for the equal-chance 0 or 1. The random bit should be 0 or 1 with equal likelihood to get the desired privacy properties.
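
(For what it's worth, the aggregate count stays recoverable regardless of which individual responses happen to be lies. A hedged sketch of the server-side correction, assuming the client's truth probability f is known (0.5 as in the patch); the function name is illustrative:)

// Hypothetical server-side estimator for randomized response counts.
// A true 1 is reported as 1 with probability f + (1 - f) / 2;
// a true 0 is reported as 1 with probability (1 - f) / 2.
function estimateTrueOnes(reportedOnes, totalReports, f = 0.5) {
  // Solve E[reportedOnes] = trueOnes * f + totalReports * (1 - f) / 2 for trueOnes.
  return (reportedOnes - totalReports * (1 - f) / 2) / f;
}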
I have some questions about this code.

1. If I am reading it correctly, you are re-randomizing with each ping, right?
Doesn't that mean that repeated queries from the same browser quickly reveal
the true answer for any site that's either frequently or infrequently visited?
I.e., if the underlying value is always 1 or always 0, then you are going to
quickly converge on the true answer.

2. You are using Math.random but this seems to demand a cryptographic PRNG, right?
Adding rbarnes because we are going to want security review of this.
(In reply to Eric Rescorla (:ekr) from comment #10)
> 1. If I am reading it correctly, you are re-randomizing with each ping,
Correct. In the discussion and meeting with mmc/rbarnes and team about bug 1142386, the privacy protections of RAPPOR's memoization step and instantaneous response are relatively quickly defeated given the quantity of new tab pings we would send. Our existing data policies around deleting raw data and not tracking users would be needed for the more complicated RAPPOR algorithm just as they are for randomized response.

rbarnes is indeed working with maksik on confirming that, and on the general privacy aspects of randomized response for new tab under our data policy.

> 2. You are using Math.random but this seems to demand a cryptographic PRNG,
Maybe? We're not really encrypting data, but it should be simple enough to just call a better random(). Probably something with nsICryptoHMAC and an appropriate seed?
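
As a sketch of one option (this is not what the patch does, and the helper name is made up): chrome JS can pull NSS-backed random bytes from nsIRandomGenerator instead of building something on nsICryptoHMAC:

// Hypothetical sketch: a crypto-strength random bit for chrome code.
const { classes: Cc, interfaces: Ci } = Components;
const gRNG = Cc["@mozilla.org/security/random-generator;1"]
               .getService(Ci.nsIRandomGenerator);

function cryptoRandomBit() {
  // generateRandomBytes returns an array of octets; use the low bit as the coin.
  let [byte] = gRNG.generateRandomBytes(1);
  return byte & 1;
}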
(In reply to Eric Rescorla (:ekr) from comment #10)
> I have some questions about this code.
> 
> 1. If I am reading it correctly, you are re-randomizing with each ping,
> right?

This is correct.  The rationale for doing it is described here: https://docs.google.com/a/mozilla.com/document/d/1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk

As Ed mentioned, RAPPOR's instantaneous randomization alone is not protective when many reports are sent from the same browser, and RAPPOR's memoization leads to the switching attack described in the same paper. Hence, we did not find RAPPOR to be any more protective than plain-vanilla RR, while RR is less noisy; hence the current implementation. Nevertheless, please do suggest improvements/modifications, or critiques of the document's logic.

> 2. You are using Math.random but this seems to demand a cryptographic PRNG,
> right?

What's the exploit that we're trying to protect against with a cryptographic PRNG?
(In reply to maxim zhilyaev from comment #13)
> (In reply to Eric Rescorla (:ekr) from comment #10)
> > I have some questions about this code.
> > 
> > 1. If I am reading it correctly, you are re-randomizing with each ping,
> > right?
> 
> This is correct.  The rational for doing it is described here:
> https://docs.google.com/a/mozilla.com/document/d/
> 1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk
> 
> As Ed mentioned, RAPPOR's instantaneous randomization is not protective
> under many reports sent from same browser, and RAPPOR memorization leads to
> switching attack described in the same paper.  Hence, we did not find RAPPOR
> to be any more protective then plain vanilla RR, but RR is less noisy.
> Hence, the current implementation. Nevertheless, do please suggest
> improvements/modifications, or critiques to the document logic. 

Well, I think the answer here is to use *only* permanent randomization, not to use
only instantaneous randomization. Providing only permanent randomization allows
fingerprinting, but providing only instantaneous randomization allows extraction
of the true value, which is the worse failure.


> > 2. You are using Math.random but this seems to demand a cryptographic PRNG,
> > right?
> 
> What's the exploit that we trying to protect against with PRNG?

Well, in this implementation, if you know the random sequence then you are
able to discard the bogus values and extract the true value. Accordingly
it's important for the PRNG to be unpredictable, which is normally a property
associated with cryptographic PRNGs.

If you're looking for a concrete example, consider the case in which you
have a pile of sites which have a broad distribution of frequencies.
So, for instance, sites 1..N have overall probability .001 and N..M
have .01. With high probability any 1 value for 1..N is therefore
a lie and you can use this to learn information about the PRNG state
and then make predictions about the PRNG sequence for N..M.
To be clear, I think the *actual* right answer is to use both kinds of randomization:
the permanent to provide hiding of the true value and the instantaneous to let you measure
stuff over time. But if you're only going to do one it should be permanent for
the privacy reasons mentioned in the previous comment.
(In reply to Eric Rescorla (:ekr) from comment #15)

> Well, I think the answer here is *only* to use permanent randomization

Could you describe what "permanent randomization" is, exactly?

Specifically, for a scenario of changing history:
1. There is a bucket that contains two sites, A and B.
2. Initially the user had site A in their history, so the tile was triggered by A.
3. Eventually site A dropped out of the user's frecency, so the tile was not shown at all.
4. Then site B appeared in the user's history, and it triggered the tile.

Over that period of time, the user's client needs to report A being a trigger, B being a trigger, and nothing at all. All user reports are available to an attacker. So how exactly is "permanent randomization" used in this scenario? What's the procedure for assigning bits to each state of the user's history?
You will also need to lie about which tiles were shown.
rbarnes describes RAPPOR-like "permanent randomization" here: https://docs.google.com/a/mozilla.com/document/d/1CiYoGKG8agm9l7gOUUtPYoKS4WfjiZ065ThroUj4cbM/edit#heading=h.mo5dixudk1ik

Randomization is done by creating a "static", randomized bit vector for each site in the ad-group; then, for a trigger site, taking its "static" bit vector, subjecting it to "instantaneous randomization", and sending the result to the server.

This seems to protect against the bit-switching attack on the original "basic" RAPPOR, where the state of each site in the ad-group was encoded in a separate bit of the "static" bit vector to be reported.
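
Roughly, a toy sketch of that two-level idea (every constant, the vector length, and the site encoding below are made up for illustration; real RAPPOR uses Bloom-filter encoding and carefully chosen parameters):

// Hypothetical sketch: memoized "permanent" randomization per site plus
// per-report "instantaneous" randomization.
const VECTOR_LENGTH = 8;
const PERM_FLIP = 0.25;    // chance each permanent bit lies, fixed once per site
const INST_FLIP = 0.25;    // chance each reported bit is re-randomized per ping

let permanentVectors = new Map();   // site -> memoized noisy bit vector

function encodeSite(site) {
  // Toy encoding: derive VECTOR_LENGTH bits from a simple string hash.
  let hash = 0;
  for (let ch of site) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return Array.from({ length: VECTOR_LENGTH }, (_, i) => (hash >>> i) & 1);
}

function permanentVector(site) {
  if (!permanentVectors.has(site)) {
    // Flip some bits once and remember the result ("permanent" randomization).
    let noisy = encodeSite(site).map(
      bit => Math.random() < PERM_FLIP ? 1 - bit : bit);
    permanentVectors.set(site, noisy);
  }
  return permanentVectors.get(site);
}

function reportTrigger(site) {
  // Re-randomize the memoized vector on every report ("instantaneous" step).
  return permanentVector(site).map(
    bit => Math.random() < INST_FLIP ? (Math.random() < 0.5 ? 1 : 0) : bit);
}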
Moving site reporting to suggested tiles 1.1 for now.
Blocks: 1145418
No longer blocks: 1140185
No longer blocks: 1145418

Hello!

This bug has been closed due to inactivity and/or the potential for this bug to no longer be an issue with the new Discovery Stream-powered New Tab experience.

Please help us triage by reopening if this issue still persists and should be addressed.

Thanks!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX