1379195 - Simulate how changing RAPPOR parameters affects the quality of the data

Reporter

Description

•

8 years ago

This bug is about running RAPPOR simulation locally with randomly generated data. The RAPPOR simulation plan is available at: https://docs.google.com/document/d/1fQN17tVLnW2DA71Uax7JMqON_tK3f6UFYCX0RKy1bco/edit#

Alessio Placitelli [:Dexter]

Reporter

Updated

•

8 years ago

Blocks: 1379180

Points: --- → 3

Priority: -- → P2

Alessio Placitelli [:Dexter]

Reporter

Updated

•

8 years ago

Whiteboard: [measurement:client]

Alessio Placitelli [:Dexter]

Reporter

Updated

•

8 years ago

Assignee: nobody → alexrs95

Alessio Placitelli [:Dexter]

Reporter

Updated

•

8 years ago

Status: NEW → ASSIGNED

Priority: P2 → P1

Alejandro Rodriguez Salamanca [:alejandro]

Assignee

Comment 1

•

8 years ago

This simulations have been performed in a controlled environment, with generated data. The main purpose of this is to confirm that the algorithm works as expected. In a real world situation, in which the ground truth is not known, the results can be different. - Values needed: The obvious answer is as much as possible. But a lower bound is needed. This lower bound depends on other parameters, such as the number of cohorts, the number of unique values. The higher the number of unique values, the higher the number of values reported should be. Depending on the purpose of the simulation, we can have two situations: - We want to see the most used values in the data: We don’t really care about the frequency of a value in the data, but if this value appears, and how this value is related with other values reported. In this case, as we can see in the second simulation, with 500K values reported and 100 unique values (0.0002%) can be enough. - We want a good approximation of the data: Here, the more values reported, the better. With 10M values reported and 100 unique values (0.00001%), we get a good approximation, as seen in the second simulation. - If each client only sends the value once, one-time RAPPOR can be used. The results with one-time RAPPOR are supposed to be more accurate, and the complexity of the algorithm is reduced, as the IRR step is skipped. - The optimum number of hash functions is 2. - The number of cohorts should be chosen carefully. When the number of cohorts is too small, then collisions are still quite likely, while when it is too large, then each individual cohort provides insufficient signal due to its small sample size (approximately N/m, where N is the number of reports). It’s better to have a high number of cohorts, because as shown in the simulation, when the number of cohorts is too small, the results obtained are useless, but when it increases, the results get better, and in some situations, it converges. We must not forget, however, that a large number of cohorts will result in no values detected. - The values of the probabilities p, q and f doesn’t really matter. What matter is the level of Differential Privacy. It should be chosen taking into account that if the level of DP is strong, the noise in the data could be too high, but if it’s too weak, the privacy of the user can be compromised. This bug will be closed once dzeber and mlopatka finish to audit the results.

Alejandro Rodriguez Salamanca [:alejandro]

Assignee

Comment 2

•

8 years ago

Given the previous conclusion, a new simulation was run. Initially, the idea was to run a simulation with 10M clients, 10K unique values, 1 value per client, 10K bloom filter size and 200 cohorts, with f = 0, p = 0.35, q = 0.65, and 2 hash functions. This configuration has an epsilon value of 2.476. Due to the large number of clients and size of the bloom filter, running this simulation is unfeasible. The Python code were then optimized to run the code using 8 processes[1], and the parameters were reduced to 10M clients, 100 cohorts, and 128 bits for the bloom filter. The new bottleneck is the analysis part, as the code in R is slow. Another change was introduced in the code. Now, the data needed to build the plot with the results is stored in a CSV file. This allows us to build more plots without running the analysis. The results for this simulations are: - Given the large number of unique values (10K), compared with the number of clients (10M), RAPPOR was able to detect the first 50 values. - The chosen probabilities (f=0, p=0.35, q=0.65) provide a level of differential privacy of 2.476 - 100 cohorts and 2 hash functions is enough to minimize the collisions using a bloom filter of size 128 bits. Dave, Martin, are this parameters the ones we are going to use for the study? [1]: https://github.com/Alexrs95/rappor/blob/master/tests/rappor_sim.py#L159

Flags: needinfo?(mlopatka)

Flags: needinfo?(dzeber)

mlopatka

Comment 3

•

8 years ago

I am happy with the choice of parameters supported by your simulations. f = 0.0 p = 0.35 q = 0.65 cohorts = 100 bloom_filter_size = 128 bits

Flags: needinfo?(mlopatka)

mlopatka

Comment 4

•

8 years ago

oh yeah, and 2 hash functions

Alessio Placitelli [:Dexter]

Reporter

Comment 5

•

8 years ago

Closing this as FIXED as there is a clear choice of parameters to use for the algorithm.

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Flags: needinfo?(dzeber)

Resolution: --- → FIXED

Bugzilla

Simulate how changing RAPPOR parameters affects the quality of the data

Categories

(Toolkit :: Telemetry, defect, P1)

Tracking

()

People

(Reporter: Dexter, Assigned: alejandro)

References

Details

(Whiteboard: [measurement:client])

Crash Data

Security

(public)

User Story

Description

Updated

Updated

Updated

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5