Closed Bug 1142386 Opened 9 years ago Closed 9 years ago

Explore using randomized response / RAPPOR for sending pings for suggested site data

Categories

(Content Services Graveyard :: Tiles, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: Mardak, Assigned: mzhilyaev)

References

Details

(Whiteboard: .003)

http://en.wikipedia.org/wiki/Randomized_response
http://research.google.com/pubs/pub42852.html

Potentially we can protect user privacy by not allowing a server to know for sure whether a user has a site that triggered a suggested tile.

We need to get experience implementing the techniques to make sure they're working correctly in both generating the response from the client as well as measuring what we expect.

For example, we can pretend our raw records are:

100x { siteA: true, siteB: false, siteC: false }
200x { siteA: false, siteB: true, siteC: false }
300x { siteA: false, siteB: false, siteC: true }

The privacy protecting transformations will result in some distribution of:

{ siteA: false, siteB: false, siteC: false }
{ siteA: false, siteB: false, siteC: true }
{ siteA: false, siteB: true, siteC: false }
{ siteA: false, siteB: true, siteC: true }
{ siteA: true, siteB: false, siteC: false }
{ siteA: true, siteB: false, siteC: true }
{ siteA: true, siteB: true, siteC: false }
{ siteA: true, siteB: true, siteC: true }

I would guess we count up the trues for each site individually, do some math, and compare?
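
A minimal sketch of what "count up the trues, do some math, and compare" could look like, assuming a simple per-bit randomized response in which each bit is replaced by a fair coin flip with probability f. The function names, the default f, and the seed are illustrative assumptions and are not taken from any existing Tiles code.

```python
import random


def randomize(record, f, rng):
    """Per-bit randomized response: with probability f, replace the bit
    with a fair coin flip; otherwise report the true value."""
    return {
        site: (rng.random() < 0.5) if rng.random() < f else value
        for site, value in record.items()
    }


def estimate_true_counts(reports, f):
    """Invert the noise. For n reports and c observed trues for a site,
    E[c] = t * (1 - f) + n * f / 2, so t is estimated as
    (c - n * f / 2) / (1 - f)."""
    n = len(reports)
    estimates = {}
    for site in reports[0]:
        observed = sum(1 for r in reports if r[site])
        estimates[site] = (observed - n * f / 2) / (1 - f)
    return estimates


def simulate(f=0.5, seed=0):
    """Run the 100/200/300 example from this comment through the transform."""
    rng = random.Random(seed)
    raw = (
        [{"siteA": True, "siteB": False, "siteC": False}] * 100
        + [{"siteA": False, "siteB": True, "siteC": False}] * 200
        + [{"siteA": False, "siteB": False, "siteC": True}] * 300
    )
    reports = [randomize(r, f, rng) for r in raw]
    return estimate_true_counts(reports, f)


if __name__ == "__main__":
    # Estimates should land near 100, 200 and 300, with visible noise.
    print(simulate())
```

Comparing the printed estimates against the known raw counts is one way to check both the client-side transform and the server-side math.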
maksik has put together a doc:

https://docs.google.com/document/d/1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk/edit?usp=sharing

One of the findings is that the instantaneous response doesn't seem to help protect us for new tab pings where we're likely to get many of them.

maksik, you should try reaching out to Úlfar asking about use cases where hundreds of pings could be sent from the same memoized randomized response.
Simulation scripts can be found here: https://github.com/mzhilyaev/testrappor
Iteration: 39.2 - 23 Mar → 39.3 - 30 Mar
We discussed RAPPOR's protection (and the lack of it in the related tiles context) with the privacy team on March 24, 2015. The consensus appears to be:

- RAPPOR does not provide any better protection than plain Randomized Response
- RAPPOR is far noisier than Randomized Response, which will inhibit our ability to extract aggregates for low traffic targeted sites

Therefore, the privacy team suggests using "Randomized Response" on every report, and aggregating counts inside Onyx servers so that sensitive user records are not stored.

Monica, Christoph, could you please comment/verify that I expressed the consensus correctly?
Flags: needinfo?(mmc)
Flags: needinfo?(ckerschbaumer)
(In reply to maxim zhilyaev from comment #3)
> We discussed RAPPOR's protection (and the lack of it in related tiles
> context) with privacy team on March 24, 2015. The consensus appears to be:
> 
> - RAPPOR does not provide any better protection than plain Randomized
> Response

In this case, yep. There are some cases where RAPPOR is a better fit but this is not one of them.

> - RAPPOR is far noisier than Randomized Response, which will inhibit our
> ability to extract aggregates for low traffic targeted sites
> 
> Therefore, the privacy team suggests using "Randomized Response" on every
> report, and aggregate counts inside Onyx servers so that sensitive user
> records are not stored.
> 
> Monica, Christoph, could you, please, comment/verify that I expressed the
> consensus correctly.

Yep. I recommend re-iterating your data retention policies and technical means for performing the aggregation (on the fly, without storing identifiers, right?) as supporting privacy measures when you describe how RR works for related tiles. Thanks for taking the time to meet with us and summarizing your notes!
Flags: needinfo?(mmc)
I think you rather want the response from Richard - forwarding my needinfo to him!
Flags: needinfo?(ckerschbaumer) → needinfo?(rlb)
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #4)

> Yep. I recommend re-iterating your data retention policies and technical
> means for performing the aggregation (on the fly, without storing
> identifiers, right?) as supporting privacy measures when you describe how RR
> works for related tiles. Thanks for taking the time to meet with us and
> summarizing your notes!

The proposed privacy protection consists of:

1) Subjecting adGroup vector for related tile to "randomized response" on the client:

- Client matches adGroup sites associated with a tile shown/clicked with user top frecent sites.
- Client constructs a json object where each site in adGroup has value of 1 or 0.
- If an adGroup site is found in the frecency list, its bit is set to 1, otherwise 0
- Client runs "randomized response" on each value in json object
- Client attaches "randomized" json object to tile ping which is sent to onyx server
- Client re-runs "randomized response" on every ping
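
The client steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the ping field names ("tile", "action", "sites"), the function names, and the randomization parameter f are hypothetical, and the real client would be Firefox code rather than Python.

```python
import json
import random


def build_adgroup_vector(adgroup_sites, frecent_sites):
    """Map each adGroup site to 1 if it is in the user's top frecent
    sites, otherwise 0."""
    frecent = set(frecent_sites)
    return {site: 1 if site in frecent else 0 for site in adgroup_sites}


def randomized_response(vector, f, rng=random):
    """With probability f, replace a bit with a fair coin flip;
    otherwise keep the true bit."""
    out = {}
    for site, bit in vector.items():
        if rng.random() < f:
            out[site] = rng.randint(0, 1)
        else:
            out[site] = bit
    return out


def build_ping(tile_id, action, adgroup_sites, frecent_sites, f=0.5, rng=random):
    """Build one tile ping. The randomization is re-run on every call,
    so two pings for the same tile may carry different vectors."""
    vector = build_adgroup_vector(adgroup_sites, frecent_sites)
    payload = {
        "tile": tile_id,      # hypothetical field name
        "action": action,     # e.g. "view" or "click" (assumed values)
        "sites": randomized_response(vector, f, rng),
    }
    return json.dumps(payload)
```

Because the randomization is re-run per ping, nothing needs to be memoized on the client in this sketch.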

2) Aggregating "randomized responses" by Mozilla

We initially thought that aggregation would happen immediately inside Onyx servers.
However, such aggregation would require Onyx servers to map IP addresses to country codes.
This functionality is not available yet, hence I will provide two scenarios: one for Onyx aggregation and the other for cluster aggregation.

2.a) Cluster aggregation
- Onyx server receives a ping and passes to cluster unchanged
- "randomized" responses are aggregated via infernyx map-reduce job
- all user records (including "randomized" responses) are removed from cluster after 7 days

This scheme is missing pan-privacy protection allowed by Onyx aggregation.
However, it avoids an immediate significant change in Onyx architecture caused by the need to map IP to country code. If the IP is checked in the Onyx server, then the IP could be removed altogether and never stored on the cluster. We see this privacy overhaul as much needed, but not immediately available to us due to resource constraints. An additional advantage of implementing cluster aggregation first is our ability to debug the "randomized response" extraction algorithm (or any other problems which may arise).

We would like the privacy team to approve cluster aggregation as an intermediate solution, which will be replaced with pan-privacy protection inside Onyx.

2.b) Onyx aggregation   

- Onyx server receives the ping
- maps IP to country
- removes "randomized" json object from ping
- aggregates "randomized" json object into data structure keyed by locale,country,tile,site where the data portion contains count and sum fields
- if the count for an aggregated site reaches a certain threshold (like 100), the onyx server sends a locale,country,tile,site:{count,sum} record to the cluster, and zeros out the data fields for the site.
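
A rough sketch of the in-memory structure described in 2.b, assuming the IP-to-country mapping has already happened upstream. The class name, the emit callback, and the method signature are hypothetical; only the key (locale, country, tile, site), the count and sum fields, and the threshold of 100 come from the description above.

```python
from collections import defaultdict


class SiteAggregator:
    """Accumulates randomized bits per (locale, country, tile, site) and
    flushes a {count, sum} record once count reaches the threshold."""

    def __init__(self, threshold=100, emit=print):
        self.threshold = threshold
        self.emit = emit  # stand-in for "send record to the cluster"
        self.cells = defaultdict(lambda: {"count": 0, "sum": 0})

    def add_ping(self, locale, country, tile, randomized_sites):
        """Fold one ping's randomized site vector into the aggregates.
        No per-user record is kept after this call returns."""
        for site, bit in randomized_sites.items():
            key = (locale, country, tile, site)
            cell = self.cells[key]
            cell["count"] += 1
            cell["sum"] += bit
            if cell["count"] >= self.threshold:
                # Ship the aggregate, then zero out the data fields.
                self.emit(key, dict(cell))
                cell["count"] = 0
                cell["sum"] = 0
```

Only the flushed {count, sum} records would reach the cluster in this scheme, which is what gives the pan-privacy property that cluster aggregation lacks.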
mmc, after chatting with the engineers, our current architecture has onyx (server that accepts tiles pings) as a stateless machine designed for high throughput. It'll be significantly more scalable to do the randomized response site aggregation in the existing cluster aggregation where we process the raw data that's to be deleted in 7 days. As we do now, longer term storage of aggregate results is done without any user identifiers.

Just wanted confirmation that our new randomized response approach + existing data retention policies is acceptable.
Flags: needinfo?(mmc)
Iteration: 39.3 - 30 Mar → 40.1 - 13 Apr
(In reply to Ed Lee :Mardak from comment #7)
> mmc, after chatting with the engineers, our current architecture has onyx
> (server that accepts tiles pings) as a stateless machine designed for high
> throughput. It'll be significantly more scalable to do the randomized
> response site aggregation in the existing cluster aggregation where we
> process the raw data that's to be deleted in 7 days. As we do now, longer
> term storage of aggregate results is done without any user identifiers.
> 
> Just wanted confirmation that our new randomized response approach +
> existing data retention policies is acceptable.

Hi Ed,

Today's my last day and unfortunately I won't have time to give you feedback on the proposed changes before I leave. I did run into Marshall and ekr yesterday and we talked about this. One thing ekr rightly mentioned is that it's difficult to fully understand the full architecture of related tiles in bug comments, and it would be great to have a design document that fully laid out all of the moving parts and their privacy implications. The doc that Max started is a good start, but focuses on a narrow part of the problem and doesn't really describe things like the cluster aggregation and the retention policy in a way that's easy for new readers to understand.

He also mentioned that since your group is still figuring out what products to build, it is hard to track changes to data that is being collected and what data may be collected in the future. I think it is a good idea to have a list of these written down in the design doc as well. Needinfo'ing ekr so that he follows up with you soon.

Thanks,
Monica
Flags: needinfo?(mmc.bugzilla) → needinfo?(ekr)
Iteration: 40.1 - 13 Apr → 40.2 - 27 Apr
Depends on: 1156959
Iteration: 40.2 - 27 Apr → 40.3 - 11 May
1.  RAPPOR simulation code: https://github.com/mzhilyaev/testrappor
2.  Findings document: https://docs.google.com/document/d/1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk
3.  Theoretical results are verified empirically using simulator  

I ran the numbers on the "Financial News Bucket", which show the % error for all trigger sites:
https://docs.google.com/document/d/1vsLa1A4tqivyyquQsqEhkff6qK9Y4ymlwkyXQNZ7_yk/edit#heading=h.emiplfubl0ex

"click" error is too high - we can't compute accurate CTR for most of the triggers, hence no optimization.  Given that I used very liberal RAPPOR settings (f = 0.1 and no instantaneous randomization), I doubt that RAPPOR will help us.

The matter will worsen dramatically when we start using paths, because there will be many such triggers and each one will generate a smaller fraction of "clicks".
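
A back-of-the-envelope way to see why triggers with few clicks suffer, assuming plain per-bit randomization with parameter f (each bit replaced by a fair coin flip with probability f) and ignoring RAPPOR's Bloom-filter encoding. The symbols N (number of reports), t (reports whose true bit is 1), and c (observed 1s) are introduced here for illustration only.

```latex
\begin{align*}
\mathbb{E}[c] &= t\left(1 - \tfrac{f}{2}\right) + (N - t)\,\tfrac{f}{2}
               = (1 - f)\,t + \tfrac{N f}{2} \\
\hat{t} &= \frac{c - N f / 2}{1 - f} \\
\operatorname{Var}(c) &= N \cdot \tfrac{f}{2}\left(1 - \tfrac{f}{2}\right)
  \quad\text{(every report is Bernoulli with } p = \tfrac{f}{2}
  \text{ or } 1 - \tfrac{f}{2}\text{, which have equal variance)} \\
\operatorname{sd}(\hat{t}) &= \frac{\sqrt{N f (2 - f)}}{2\,(1 - f)},
\qquad
\frac{\operatorname{sd}(\hat{t})}{t} = \frac{\sqrt{N f (2 - f)}}{2\,(1 - f)\,t}
\end{align*}
```

Under these assumptions the absolute noise depends on N but not on t, so the relative error grows as a trigger's share of the reports shrinks, which is the regime that path-level triggers would push us into.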

That raises the question of what the next alternative to RAPPOR is:
- do we send clicks in clear?
- do we establish (outside Mozilla) a proxy server that removes user identifiable data?
- do we simply not optimize?

Looking for privacy team review and input on the issue
Just to weigh in here after reviewing the data with Max, I believe there is a way we could leverage the RAPPOR data to inform Tiles optimization.

The noise levels do appear quite high, but I think there's still enough valuable information to make viable adjustments if we account for noise levels in the following ways:

- If we set a max "noise" threshold, it could be possible for us to determine a relative CTR from which we could make fairly informed optimizations. We would need to determine what that ideal threshold would be, but there should be enough data there to identify exceptionally poor performing URLs based on "best case scenario" numbers based upon the noise threshold.

- To account for privacy concerns regarding URLs with low traffic, it would be acceptable to set audience size thresholds as well. We simply state that if a URL does not have sufficient data, it will not be a candidate for optimization. From the data it appears that low traffic = higher noise levels, so this actually helps bolster the point made above.

The combination of Noise Thresholds, Relative CTR, and click volume thresholds would help us reduce the amount of "waste" and provide a significantly better user experience, by allowing us to stop showing certain content to users who are not responsive to the content being optimized.

This is most certainly not the ideal method of optimization, but it would certainly be better than no optimization at all.
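
One possible reading of the threshold idea above, as a sketch only. The function names, the z multiplier, and every threshold value are hypothetical, and the inputs are assumed to be noise-corrected estimates with their standard deviations rather than raw counts.

```python
def best_case_ctr(clicks_est, clicks_sd, impressions_est, impressions_sd, z=2.0):
    """Optimistic ("best case scenario") CTR: clicks at the top of their
    noise band, impressions at the bottom. Returns None if the impression
    estimate is indistinguishable from zero."""
    high_clicks = max(clicks_est + z * clicks_sd, 0.0)
    low_impressions = impressions_est - z * impressions_sd
    if low_impressions <= 0:
        return None
    return min(high_clicks / low_impressions, 1.0)


def decide(clicks_est, clicks_sd, impressions_est, impressions_sd, *,
           min_impressions, max_relative_noise, poor_ctr, z=2.0):
    """Classify a URL as 'drop', 'keep', or not a candidate at all."""
    # Audience size threshold: low-traffic URLs are never optimized.
    if impressions_est < min_impressions:
        return "skip: low traffic"
    # Noise threshold: do not act on estimates dominated by noise.
    if impressions_sd / impressions_est > max_relative_noise:
        return "skip: too noisy"
    ctr = best_case_ctr(clicks_est, clicks_sd, impressions_est, impressions_sd, z)
    if ctr is None:
        return "skip: too noisy"
    # Only drop URLs that look poor even under the optimistic estimate.
    return "drop" if ctr < poor_ctr else "keep"
```

The design choice here is to act only on URLs that perform poorly even in the best case, so that noise alone should rarely cause content to be dropped.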
Iteration: 40.3 - 11 May → ---
Flags: needinfo?(ekr)
RAPPOR is likely to be unneeded for meromorphic solution being tested by ekr
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Clearing out my needinfo queue - since this became a WONTFIX I am clearing the needinfo flag here.
Flags: needinfo?(rlb)
(In reply to maxim zhilyaev from comment #3)
> We discussed RAPPOR's protection (and the lack of it in related tiles
> context) with privacy team on March 24, 2015. The consensus appears to be:
> 
> - RAPPOR does not provide any better protection than plain Randomized
> Response

This conjecture has since been proven incorrect.

> - RAPPOR is far noisier than Randomized Response, which will inhibit our
> ability to extract aggregates for low traffic targeted sites
> 
> Therefore, the privacy team suggests using "Randomized Response" on every
> report, and aggregate counts inside Onyx servers so that sensitive user
> records are not stored.
> 
> Monica, Christoph, could you, please, comment/verify that I expressed the
> consensus correctly.