Closed Bug 1136461 Opened 9 years ago Closed 5 years ago

Include the site triggering a suggested tile in the tiles data ping

Categories

(Firefox :: New Tab Page, defect)


Tracking


RESOLVED WONTFIX

People

(Reporter: Mardak, Unassigned)

References

Details

Attachments

(1 file)

Following the decision in bug 1132534, where it became our responsibility to choose/augment the list of sites that trigger a suggested tile, we'll need data about how the suggested tile performs for each site in order to improve the experience for users.

We want to be able to add more sites to increase audience coverage while at the same time improving overall user privacy. We also want to make sure those added sites are relevant, because otherwise users will get annoyed by bad recommendations.

One straightforward approach is to send the site that triggered the suggested tile. This data would be treated the same way we've been treating tiles user data: remove personal data, compute metrics, then delete the raw data within 7 days. None of the data is associated back to any one user, as we're interested in aggregates.

The performance (most likely click-through rates) of each triggering site will be used internally to optimize, and clients will get the overall performance across all triggering sites (as they don't have control over exactly which sites we match against).

Some slight tweaks on this functionality:
- don't send triggering site data if DNT is enabled ?
- only send triggering site data if telemetry is enabled ?
- have a newtab customize menu option for users to not-send/send ?

I'm not sure if this is complicating things too much (premature-privacy-policy-optimization?), but if we do have a send/not-send control in the newtab customize menu, should it just use an existing pref... Telemetry? So by default nightly/aurora/beta would show [x] Improve new tab and Firefox experience by sharing usage data with Mozilla; and release would default to [ ] Improve new tab ....
Since we do not track users, we do not know which newtab impression reported to the backend came from which user. In this case, the procedure of adding Laplacian noise to all sites in the targeted list, described in http://www.cs.columbia.edu/igert/courses/E6898/dwork_cacm.pdf, will mask the real site triggering a suggested tile to the prescribed differential privacy guarantee.

Following the paper logic:
- there are N sites in a tile targeted list
- Suppose that a tile was shown because of site A 
- Browser creates a vector of N 0s and sets 1 for site A (call it a "triggering" vector)
- Browser adds Laplacian noise to each vector cell and sends back to Mozilla
- Laplacian noise is described here: http://en.wikipedia.org/wiki/Laplace_distribution 


What are the parameters of Laplacian noise?

Suppose a site B caused a tile to show; then its "triggering" vector will be 0 for A, 1 for B, and 0 for the rest. Hence, the maximum difference between any two "triggering" vectors is 2.
According to the paper, if newtab impressions can't be traced back to the same user (and sites in the list are not correlated), then for a prescribed differential privacy guarantee G, the noise should be:

Lap(2/G): f(z) = (G/4) * e^(-G*|z|/2)
Here z is the noise value added to each cell, a real random variable drawn from this density.
 
I have no idea what G should be, but RAPPOR (http://arxiv.org/pdf/1407.6981v2.pdf) uses G=ln(3), so we use the same (although I am totally puzzled by how G should be chosen).

Adding up values for site A will give us a distribution with expected value equal to the number of impressions suggested by site A, with standard deviation (sigma) == sqrt(8)/G = sqrt(8)/ln(3) = 2.6.
Taking a close look at the Lap(2/G) distribution shows that pretty much all of the probability mass is contained within the [-4, 4] interval. Since the number of "viewed" (or clicked) impressions per site reaches thousands, an error of +/- 4 makes for VERY precise estimations.
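
For illustration, here is a minimal sketch of the procedure in the steps above (plain JS, not the patch attached to this bug; the function names are made up, Math.random() is used only for brevity, and the scale b = 2/G follows the formula above):

// Hypothetical sketch of the procedure from this comment: build the length-N
// "triggering" vector and add independent Laplace noise to every cell.
const G = Math.log(3);        // privacy guarantee, as suggested above
const SCALE = 2 / G;          // Laplace scale b = 2/G

// Draw one sample from Laplace(0, b) via the inverse CDF.
function sampleLaplace(b) {
  let u = Math.random() - 0.5;                       // uniform in (-0.5, 0.5)
  return -b * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// numSites = N sites in the tile's targeted list; triggerIndex = position of
// the site that actually triggered the tile.
function noisyTriggeringVector(numSites, triggerIndex) {
  let vector = new Array(numSites).fill(0);
  vector[triggerIndex] = 1;
  return vector.map(v => v + sampleLaplace(SCALE));
}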

Here is a realistic example - the technology sites collection
arstechnica.com   5158
slashdot.org      3941
engadget.com      2749
gizmodo.com       1861
lifehacker.com    1779
wired.com         1582 

The numbers represent impression counts for 25% of the beta en-US/US population.
There are 40M Firefox users in the US and 400K US beta users, so the measured sample is roughly 100K users.
Hence we should expect roughly 400x these counts across the full US audience:

arstechnica.com   2063200
slashdot.org      1576400
engadget.com      1099600
gizmodo.com        744400
lifehacker.com     711600
wired.com          632800

For a reasonable CTR of 0.04%, the expected clicks per site should roughly be:
arstechnica.com 825.28
slashdot.org    630.56
engadget.com    439.84
gizmodo.com     297.76
lifehacker.com  284.64
wired.com       253.12
   
These are still high enough to keep the error around 2% even in the worst case.

Even if we are not measuring clicks precisely, since we optimize the ratio of "clicks" to "views", it is the (much larger) number of "views" that drives the optimization, not the clicks. Suppose we know "views" exactly and "clicks" with some error; then:

clicks/views = (real_clicks + Error) / views = real_clicks/views + Error/views

But if real_clicks/views = 0.0004, and the Error is 0.1 against views in the thousands, then Error/views is on the order of 0.00004, which is probably negligible. If CTRs differ across sites, they need to differ by more than 0.004% for the difference to drive an optimization decision.
(In reply to maxim zhilyaev from comment #1)
> Adding up values for site A, will give us a distribution with expected value
> equal to number of impressions suggested by site A, with standard deviation
> (sigma) == sqrt(8)/G = sqrt(8)/ln(3) = 2.6

This statement and the number example that follows it are COMPLETELY wrong.
The deviation of 2.6 does not refer to the sum of Laplace distributions at all.
Please ignore everything after "Adding up values for site A".
I am looking through the literature to figure out how number queries work and how accurate they are for a given privacy guarantee.

If anyone has good pointers, please share.
(In reply to maxim zhilyaev from comment #1)
In fact, the whole comment does not make sense (not just the part related to the noisy summation for site "A").

The "differential privacy" is about adding noise to already known query result, so that the noise covers up the difference between most dissimilar records participating in the query. What we have with site reporting is different: we want to obfuscate individual records so that we, ourselves, can't extract private information from any record, but still can reason about how many times a given site caused related tile to appear.  Due to my horrid illiteracy in privacy, I am failing to see the direct relation between these two scenarios.
Deferring this for now, as it's a nice-to-have optimization of our bucket composition and the initial suggested tiles are just the affiliate Fennec ones.
Blocks: 1140185
No longer blocks: 1120311
Whiteboard: .?
Depends on: 1142386
From bug 1149734:

Per discussion in bug 1142386, it's sufficient to run "randomized response" on ad-group sites on each ping. The algorithm is as described in this comment: https://bugzilla.mozilla.org/show_bug.cgi?id=1142386#c6

- Client matches the adGroup sites associated with a shown/clicked tile against the user's top frecent sites.
- Client constructs a JSON object in which each site in the adGroup has a value of 1 or 0.
- If an adGroup site is found in the frecency list, its bit is set to 1; otherwise 0.
- Client runs "randomized response" on each value in the JSON object.
- Client attaches the "randomized" JSON object to the tile ping, which is sent to the onyx server.
- Client re-runs "randomized response" on every ping (see the sketch below).
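
A minimal sketch of that per-ping procedure (plain JS for illustration, not the attached patch; the 0.5 truth probability and all names here are illustrative):

// Hypothetical sketch of the randomized response step described above.
const TRUTH_PROBABILITY = 0.5;     // chance of reporting the true bit

function randomizeBit(bit) {
  if (Math.random() <= TRUTH_PROBABILITY) {
    return bit;                              // tell the truth
  }
  return Math.random() < 0.5 ? 1 : 0;        // otherwise flip a fair coin
}

// adGroupSites: sites that can trigger this tile; frecentSites: user's top sites.
function buildRandomizedPayload(adGroupSites, frecentSites) {
  let frecent = new Set(frecentSites);
  let payload = {};
  for (let site of adGroupSites) {
    payload[site] = randomizeBit(frecent.has(site) ? 1 : 0);
  }
  return payload;   // attached to the tiles ping; re-generated for every ping
}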
Blocks: 1149727
Comment on attachment 8588211 [details] [diff] [review]
Randomized Response for targeted tiles V1

Review of attachment 8588211 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good overall. I think you'll need to ask adw for a formal review.

::: browser/modules/DirectoryLinksProvider.jsm
@@ +409,5 @@
>        data[action] = actionIndex;
>      }
>  
> +    // generate randomised list of targeted sites
> +    if (suggestedLink !== undefined) {

Perhaps we can do 'if (link.targetedSite)' here then use 'link' directly instead of 'suggestedLink' below. Saves a couple of lines.

@@ +410,5 @@
>      }
>  
> +    // generate randomised list of targeted sites
> +    if (suggestedLink !== undefined) {
> +      // attach randomised response and auxiliary data to ping

nit: randomised -> randomized

@@ +682,5 @@
>        // Choose the first site a user has visited as the target. In the future,
>        // this should be the site with the highest frecency. However, we currently
>        // store frecency by URL not by site.
> +      targetedSite: chosenTargetedSites.size == 0 ? null :
> +                    chosenTargetedSites.entries().next().value[0],

I believe you can do '[...chosenTargetedSites][0]' here, which is a little more concise.

@@ +708,5 @@
> +      if (Math.random() <= RANDOMIZED_RESPONSE_FREQUENCY) {
> +        // do not lie - return same bit
> +        return bit;
> +      }
> +      // otherwise return 1 or 0 with equal chance

I'm just wondering, so 50% of the time we definitely don't lie, but the other 50% of the time we may or may not lie? What if we end up not telling any lies? Do we not need to guarantee that some percentage is a lie?

@@ +709,5 @@
> +        // do not lie - return same bit
> +        return bit;
> +      }
> +      // otherwise return 1 or 0 with equal chance
> +      return (Math.random() < 0.5);

Can we use RANDOMIZED_RESPONSE_FREQUENCY instead of 0.5 here?

@@ +714,5 @@
> +    };
> +
> +    let reportedSites = [];
> +
> +    // walk targetedSites first, since thier value is true

nit: thier -> their

@@ +715,5 @@
> +
> +    let reportedSites = [];
> +
> +    // walk targetedSites first, since thier value is true
> +    for (let index in allFrecentSites) {

Since you're only using index to look up the site, you can do 'for (let site of allFrecentSites)'

::: browser/modules/test/xpcshell/test_DirectoryLinksProvider.js
@@ +1106,5 @@
> +  // start with single site set and two-site frecency list
> +  let targetedSet = new Set(["present.com"]);
> +  let frecencyList = ["present.com", "absent.com"];
> +  let N = 1000;
> +  let f = 0.5;

Could we have more descriptive names for these two variables? And/or a comment on what they are.
Attachment #8588211 - Flags: review?(msamuel) → feedback+
(In reply to Marina Samuel [:emtwo] from comment #8)
> > +      if (Math.random() <= RANDOMIZED_RESPONSE_FREQUENCY) {
> > +        // do not lie - return same bit
> > +        return bit;
> > +      }
> > +      // otherwise return 1 or 0 with equal chance
> the other 50% of the time we may or may not lie?
I believe the point is that the server doesn't know whether any given response is a lie. So even if we happen to send the same bit that we would have sent when telling the truth, the server doesn't learn anything more.

> @@ +709,5 @@
> > +      // otherwise return 1 or 0 with equal chance
> > +      return (Math.random() < 0.5);
> Can we use RANDOMIZED_RESPONSE_FREQUENCY instead of 0.5 here?
It just happens that the tweakable RANDOMIZED_RESPONSE_FREQUENCY is 0.5, and that's also the probability for the equal-chance 0 or 1. The random bit should be 0 or 1 with equal likelihood to get the desired privacy properties.
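
(For what it's worth, the aggregate count stays recoverable regardless of which individual responses happen to be lies. A hedged sketch of the server-side correction, assuming the client's truth probability f is known (0.5 as in the patch); the function name is illustrative:)

// Hypothetical server-side estimator for randomized response counts.
// A true 1 is reported as 1 with probability f + (1 - f) / 2;
// a true 0 is reported as 1 with probability (1 - f) / 2.
function estimateTrueOnes(reportedOnes, totalReports, f = 0.5) {
  // Solve E[reportedOnes] = trueOnes * f + totalReports * (1 - f) / 2 for trueOnes.
  return (reportedOnes - totalReports * (1 - f) / 2) / f;
}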
I have some questions about this code.

1. If I am reading it correctly, you are re-randomizing with each ping, right?
Doesn't that mean that repeated queries from the same browser quickly reveal
the true answer for any site that's either frequently or infrequently visited?
I.e., if the underlying value is always 1 or always 0, then you are going to
quickly converge on the true answer.

2. You are using Math.random but this seems to demand a cryptographic PRNG, right?
Adding rbarnes because we are going to want security review of this.
(In reply to Eric Rescorla (:ekr) from comment #10)
> 1. If I am reading it correctly, you are re-randomizing with each ping,
Correct. In the discussion and meeting with mmc/rbarnes and team about bug 1142386, the privacy protections of RAPPOR's memoization step and instantaneous response are relatively quickly defeated given the quantity of new tab pings we would send. Our existing data policies around deleting raw data and not tracking users would be needed for the more complicated RAPPOR algorithm just as they are for randomized response.

rbarnes is indeed working with maksik on confirming that, and on the general privacy aspects of randomized response for new tab under our data policy.

> 2. You are using Math.random but this seems to demand a cryptographic PRNG,
Maybe? We're not really encrypting data, but it should be simple enough to just call a better random(). Probably something with nsICryptoHMAC and an appropriate seed?
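
As a sketch of one option (this is not what the patch does, and the helper name is made up): chrome JS can pull NSS-backed random bytes from nsIRandomGenerator instead of building something on nsICryptoHMAC:

// Hypothetical sketch: a crypto-strength random bit for chrome code.
const { classes: Cc, interfaces: Ci } = Components;
const gRNG = Cc["@mozilla.org/security/random-generator;1"]
               .getService(Ci.nsIRandomGenerator);

function cryptoRandomBit() {
  // generateRandomBytes returns an array of octets; use the low bit as the coin.
  let [byte] = gRNG.generateRandomBytes(1);
  return byte & 1;
}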
(In reply to Eric Rescorla (:ekr) from comment #10)
> I have some questions about this code.
> 
> 1. If I am reading it correctly, you are re-randomizing with each ping,
> right?

This is correct.  The rationale for doing it is described here: https://docs.google.com/a/mozilla.com/document/d/1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk

As Ed mentioned, RAPPOR's instantaneous randomization alone is not protective when many reports are sent from the same browser, and RAPPOR's memoization leads to the switching attack described in the same paper. Hence, we did not find RAPPOR to be any more protective than plain-vanilla RR, while RR is less noisy; hence the current implementation. Nevertheless, please do suggest improvements/modifications, or critiques of the document's logic.

> 2. You are using Math.random but this seems to demand a cryptographic PRNG,
> right?

What's the exploit that we're trying to protect against with a cryptographic PRNG?
(In reply to maxim zhilyaev from comment #13)
> (In reply to Eric Rescorla (:ekr) from comment #10)
> > I have some questions about this code.
> > 
> > 1. If I am reading it correctly, you are re-randomizing with each ping,
> > right?
> 
> This is correct.  The rational for doing it is described here:
> https://docs.google.com/a/mozilla.com/document/d/
> 1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk
> 
> As Ed mentioned, RAPPOR's instantaneous randomization is not protective
> under many reports sent from same browser, and RAPPOR memorization leads to
> switching attack described in the same paper.  Hence, we did not find RAPPOR
> to be any more protective then plain vanilla RR, but RR is less noisy.
> Hence, the current implementation. Nevertheless, do please suggest
> improvements/modifications, or critiques to the document logic. 

Well, I think the answer here is to use *only* permanent randomization, not to use
only instantaneous randomization. Providing only permanent randomization allows
fingerprinting, but providing only instantaneous randomization allows extraction
of the true value, which is the worse failure.


> > 2. You are using Math.random but this seems to demand a cryptographic PRNG,
> > right?
> 
> What's the exploit that we trying to protect against with PRNG?

Well, in this implementation, if you know the random sequence then you are
able to discard the bogus values and extract the true value. Accordingly
it's important for the PRNG to be unpredictable, which is normally a property
associated with cryptographic PRNGs.

If you're looking for a concrete example, consider the case in which you
have a pile of sites which have a broad distribution of frequencies.
So, for instance, sites 1..N have overall probability .001 and N..M
have .01. With high probability any 1 value for 1..N is therefore
a lie and you can use this to learn information about the PRNG state
and then make predictions about the PRNG sequence for N..M.
To be clear, I think the *actual* right answer is to use both kinds of randomization:
the permanent to provide hiding of the true value and the instantaneous to let you measure
stuff over time. But if you're only going to do one it should be permanent for
the privacy reasons mentioned in the previous comment.
(In reply to Eric Rescorla (:ekr) from comment #15)

> Well, I think the answer here is *only* to use permanent randomization

Could you describe what "permanent randomization" is, exactly?

Specifically, for a scenario of changing history:
1. There is a bucket that contains two sites, A and B.
2. Initially the user had site A in their history, so the tile was triggered by A.
3. Eventually site A dropped out of the user's frecency, so the tile was not shown at all.
4. Then site B appeared in the user's history, and it triggered the tile.

Over that period of time, the user's client needs to report A being a trigger, B being a trigger, and nothing at all. All user reports are available to an attacker. So how exactly is "permanent randomization" used in this scenario? What's the procedure for assigning bits to each state of the user's history?
You will also need to lie about which tiles were shown.
rbarnes describes RAPPOR-like "permanent randomization" here: https://docs.google.com/a/mozilla.com/document/d/1CiYoGKG8agm9l7gOUUtPYoKS4WfjiZ065ThroUj4cbM/edit#heading=h.mo5dixudk1ik

Randomization is done by creating a "static", randomized bit vector for each site in the ad-group; then, for a trigger site, taking its "static" bit vector, subjecting it to "instantaneous randomization", and sending the result to the server.

This seems to protect against the bit-switching attack on the original "basic" RAPPOR, where the state of each site in the ad-group was encoded in a separate bit of the "static" bit vector to be reported.
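
Roughly, a toy sketch of that two-level idea (every constant, the vector length, and the site encoding below are made up for illustration; real RAPPOR uses Bloom-filter encoding and carefully chosen parameters):

// Hypothetical sketch: memoized "permanent" randomization per site plus
// per-report "instantaneous" randomization.
const VECTOR_LENGTH = 8;
const PERM_FLIP = 0.25;    // chance each permanent bit lies, fixed once per site
const INST_FLIP = 0.25;    // chance each reported bit is re-randomized per ping

let permanentVectors = new Map();   // site -> memoized noisy bit vector

function encodeSite(site) {
  // Toy encoding: derive VECTOR_LENGTH bits from a simple string hash.
  let hash = 0;
  for (let ch of site) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return Array.from({ length: VECTOR_LENGTH }, (_, i) => (hash >>> i) & 1);
}

function permanentVector(site) {
  if (!permanentVectors.has(site)) {
    // Flip some bits once and remember the result ("permanent" randomization).
    let noisy = encodeSite(site).map(
      bit => Math.random() < PERM_FLIP ? 1 - bit : bit);
    permanentVectors.set(site, noisy);
  }
  return permanentVectors.get(site);
}

function reportTrigger(site) {
  // Re-randomize the memoized vector on every report ("instantaneous" step).
  return permanentVector(site).map(
    bit => Math.random() < INST_FLIP ? (Math.random() < 0.5 ? 1 : 0) : bit);
}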
Moving site reporting to suggested tiles 1.1 for now.
Blocks: 1145418
No longer blocks: 1140185
No longer blocks: 1145418

Hello!

This bug has been closed due to inactivity and/or the potential for this bug to no longer be an issue with the new Discovery Stream-powered New Tab experience.

Please help us triage by reopening if this issue still persists and should be addressed.

Thanks!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX