Open Bug 1192833 Opened 5 years ago Updated 4 years ago

(k-randomization reporting) [META] Investigate k-rounds randomization reporting

Categories

(Content Services Graveyard :: General, defect)

defect
Not set
Points:
2

Tracking

(Not tracked)

Iteration:
43.2 - Sep 7

People

(Reporter: mzhilyaev, Assigned: mzhilyaev)

References

(Blocks 4 open bugs)

Details

(Whiteboard: [story])

Investigate privacy and extraction precision aspects of user data collection scheme where users reports RR-randomized categorical vectors k times. The reports are then anonymized and mixed. This scheme allows for sqrt(k) reduction of noise. It also appears that its privacy guarantee is proportional to L*k/N, where N is the number of distinct users, and L is length of categorical vector.  

For large audiences this method has a promise of delivering accurate user aggregates without significant loss in privacy.
Blocks: 1189754
the resulting proves are here:

https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#

noise is reduced by sqrt(k)
privacy guarantee is:  L*k / N, where 
L - length of bit vector reported
k - randomization rounds
N - distinct users
Blocks: 1193831
No longer blocks: 1189754
The results listed above could be incorrect. I might have discovered an attack that will invalidate privacy guarantee limits above.
Blocks: Sprint_CS_S1
Points: --- → 2
Blocks: Sprint_CS_S2
Iteration: --- → 43.2 - Sep 7
Flags: needinfo?(ekr)
Notes to reviewer:

The doc is mostly finished, the section on "Relaxing Privacy bounds" is still being worked on:
https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#heading=h.3b41pubv2rqo

There seems to be a way to relax required collection size significantly without opening to high-cardinality attack. Stay tuned.
The research is complete and ready for review.
Final version https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#heading=h.ivaimes7hdrg

Main finding:

L - length of bit vector reported
k - randomization rounds
N - distinct users
p,q - parameters of randomization

1. noise is reduced by sqrt(k)

2.  Breaking point 
If N*q^L >= 1, the observer can't determine the presence of an outlier - privacy guarantee is 0 regardless of k.

3.  Tipping point
If N >= (p/q)^L, the observer may suspect the collection contains a real outlier, but the observer can't distinguish synthetic output from a real user and the noise of the collection, hence attack against RR is impossible.  Privacy guarantee is 2k, but the observer can't conduct the attack against randomized responses from the distinct user. 

4.  When L is large, it's possible to decrease p to cover up for privacy and increase k to cover up for precision. Hence, enabling prescribed precision without privacy loss.

5.  Regardless of L. Mutually exclusive categorical bit vectors occupy only 2 bits of privacy. Categorical vectors that have no more than S bits, occupy 2*S bits of privacy.  

6.  It's possible to collect search queries and user histories by randomizing string representations and running a dictionary attack against randomized corpus of strings

7.  Privacy bounds can be further reduced case by case bases to satisfy infrastructure cost limitations.
(In reply to maxim zhilyaev from comment #4)
> 
> 3.  Tipping point
> If N >= (p/q)^L, the observer may suspect the collection contains a real
> outlier, but the observer can't distinguish synthetic output from a real
> user and the noise of the collection, hence attack against RR is impossible.
> Privacy guarantee is 2k, but the observer can't conduct the attack against
> randomized responses from the distinct user. 
> 

Correction:  Privacy guarantee is k*ln(2)
Blocks: 1202820
No longer blocks: 1202820
Blocks: 1202820
Blocks: 1205313
No longer blocks: 1205313
Flags: needinfo?(ekr)
You need to log in before you can comment on or make changes to this bug.