Closed Bug 1138022 Opened 9 years ago Closed 5 years ago

Add support for telemetry of sensitive data using RAPPOR

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

RESOLVED INVALID

People

(Reporter: gal, Unassigned)

Details

(Keywords: privacy-review-needed, Whiteboard: [measurement:client])

Attachments

(1 file, 6 obsolete files)

For those who are too lazy to read the whole paper, here is the basic idea of RAPPOR:

Sensitive data is hashed into a bit vector. Each client then does a coin toss for each bit of the vector and randomly decides either to report the real value of the bit or to lie. This first coin toss permanently masks bits from the response: for any bit that this coin toss filtered out, this particular client will always lie. _Other_ clients may tell the truth for the same bit. Over a large population the real result can be reconstructed, but each client has deniability. This partially true value is called b_ in the code.

When actually sending the data, the client does an additional coin toss for each bit and further randomizes its response. This second coin toss makes it very difficult to observe b_ for a single client. Depending on how the parameters are configured, roughly 100,000 observations are needed to reliably predict b_. Keep in mind that b_ is a lie, though. So even after 100,000 observations the server only knows what the client decided to tell, not the actual value (b).
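
The two layers described above can be sketched roughly as follows. This is an illustration only, not the code in the patch: the function names and the small seeded generator are assumptions made for the example, the hashing step is left out (the input is the already-hashed bit vector b), and f, p, q are the parameter names used in the RAPPOR paper.

```javascript
// Small seeded generator (mulberry32) so the example is repeatable.
// It is NOT cryptographically secure; it only stands in for a per-client PRNG.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// First coin toss: with probability f the bit is replaced by a fixed lie
// (half of the time 1, half of the time 0); otherwise the true bit is kept.
function permanentResponse(bits, f, rand) {
  return bits.map((bit) => {
    const r = rand();
    if (r < f / 2) return 1;
    if (r < f) return 0;
    return bit;
  });
}

// Second coin toss, repeated for every report: a bit that is 1 in the
// permanent response is sent as 1 with probability q, a 0 bit with probability p.
function instantaneousResponse(permanentBits, p, q, rand) {
  return permanentBits.map((bit) => (rand() < (bit ? q : p) ? 1 : 0));
}

const b = [1, 0, 0, 1, 0, 1, 1, 0];                          // hashed value (b)
const bPrime = permanentResponse(b, 0.5, seededRandom(42));  // b_, computed once
const report = instantaneousResponse(bPrime, 0.25, 0.75, Math.random);
console.log(bPrime.join(""), report.join(""));
```

In a real client the permanent layer would be computed once per value and remembered, while the second layer would use fresh randomness for every report.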

The patch contains a good number of tests to verify the various assumptions. The code is modeled pretty closely after Google's implementation.
Assignee: nobody → gal
Attachment #8570896 - Attachment is obsolete: true
Attachment #8570901 - Flags: review?(dougt)
Before you wrote this, I started taking a look at Google's implementation
in Chrome. It's embedded but it shouldn't be that hard to pull out.

What's your opinion on that approach versus this one?
Google's code comes with a bunch of dependencies and support infrastructure. It's not obvious where to draw the line, and we would have to add a bunch of XPConnect goop since most of the places where we want to create reports from are in JS. That goop alone is probably more code than the whole patch I wrote. I followed Google's code quite closely and used the paper mostly as reference. I intend to use the same analysis code they use.

That said ... as usual I have no attachment to code. If you volunteer to port the Google code and make that work instead of this, be my guest :)
I cleaned up the code a bit and used a bit more modern constructs for readability and compactness.
Attachment #8570901 - Attachment is obsolete: true
Attachment #8570901 - Flags: review?(dougt)
Attachment #8570913 - Flags: review?(dougt)
(In reply to Andreas Gal :gal from comment #5)
> Google's code comes with a bunch of dependencies and support infrastructure.
> Its not obvious where to draw the line and we would have to add a bunch of
> XPConnect goop since most the places where we want to create reports from
> are in JS. That goop alone is probably more code than the whole patch I
> wrote. I followed Google's code quite closely and used the paper mostly as
> reference. I intend to use the same analysis code they use.
> 
> That having said ... as usual I have no attachments to code. If you
> volunteer to port the Google code and make that work instead of this, be my
> guest :)

I got the Google code compiling last night. Basically, we will need to shim
out their crypto code and replace it with NSS and then will need to do a
bit of work on rappor_service. The advantage is that we could make it
a standalone thing. The disadvantage is that it's more work. I've reached
out to Google to see if they want to make it standalone. If they don't
reply affirmatively fast, I suggest we go with yours.
Barnes,

Andreas and I talked online and we'd like to replace the makePRNG function.
This looks like it's approximately (perhaps exactly but I haven't checked)
HKDF. Can you replace it with HKDF, perhaps rewriting in WebCrypto and
making an isolated HKDF module that we can pull in, if justified?
Flags: needinfo?(rlb)
Missing some background. Why does Mozilla need to know this?
Attachment #8570913 - Flags: review?(rlb)
Let's first discuss the "how", then the "why". The purpose of RAPPOR is to collect data in a way that is better for the user's privacy. With a daily telemetry ping it takes about 100,000 collections from a user with the current parameters to tell what the homepage is that user claims to have (again, that's B' and a lie, not even the real B). That's about 273 years of observation to reveal the first layer of protection, and there is a 2nd behind that.
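
The arithmetic behind that figure is simple enough to check. The helper name below is made up for illustration, and the 100,000 figure is the one quoted above, not something derived here.

```javascript
// Years an observer needs to collect a given number of reports from one client.
function yearsOfObservation(observationsNeeded, pingsPerDay) {
  return observationsNeeded / pingsPerDay / 365;
}

// One ping per day, 100,000 observations: just under 274 years.
console.log(yearsOfObservation(100000, 1));
// A client that pinged 24 times a day would shorten this considerably.
console.log(yearsOfObservation(100000, 24));
```

The result depends directly on ping frequency, so the protection of the first layer is only as strong as the assumption that reports are sent about once a day.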

The why is simple. We can improve our current privacy practices with this, and we can collect more information to make Firefox better without affecting the privacy of individual users.

For the specific homepage example, we have a significant problem with malicious add-ons doing all sorts of sketchy things to Firefox. I would like to add probes to understand that better (homepage, search, ad injection). With the "how" here the individual privacy risk to our users is essentially zero.
I understood the how, just didn't know why.  Thanks for the explanation.  I could see us using this RAPPOR system in cases where we want to report information, but don't because of users' privacy.  This will be a pretty awesome system to have.
Comment on attachment 8570913 [details] [diff] [review]
Patch to implement RAPPOR in Firefox and report the currently set homepage using the new infrastructure.

Review of attachment 8570913 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryEnvironment.jsm
@@ +485,5 @@
> +  },
> +
> +  /**
> +   * Return a report for the host portion of the given url string reduced to
> +   * eTLD+1 and obfuscated via RAPPOR.

If we're using RAPPOR, why do we need to truncate the urls?

@@ +488,5 @@
> +   * Return a report for the host portion of the given url string reduced to
> +   * eTLD+1 and obfuscated via RAPPOR.
> +   */
> +  _reportHost: function(name, url) {
> +    let host = null;

set this to "invalid" and you can ignore the null test later in the function.

@@ +490,5 @@
> +   */
> +  _reportHost: function(name, url) {
> +    let host = null;
> +    if (typeof url === "string") {
> +      if (url === "about:home") {

what about the other about: urls?

@@ +493,5 @@
> +    if (typeof url === "string") {
> +      if (url === "about:home") {
> +        host = url;
> +      } else {
> +        let uri = Services.io.newURI(url, null, null);

uri not used.

@@ +527,5 @@
>          channel: updateChannel,
>          enabled: Preferences.get(PREF_UPDATE_ENABLED, true),
>          autoDownload: Preferences.get(PREF_UPDATE_AUTODOWNLOAD, true),
>        },
> +      homepage: this._reportHost("homepage", Preferences.get(PREF_HOMEPAGE, null)),

Did you consider appending the idl so that anyone using the telemetry API can use RAPPOR?  Probably a good follow-on bug, right?
(In reply to Doug Turner (:dougt) from comment #12)
> Comment on attachment 8570913 [details] [diff] [review]
> Patch to implement RAPPOR in Firefox and report the currently set homepage
> using the new infrastructure.
> 
> Review of attachment 8570913 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryEnvironment.jsm
> @@ +485,5 @@
> > +  },
> > +
> > +  /**
> > +   * Return a report for the host portion of the given url string reduced to
> > +   * eTLD+1 and obfuscated via RAPPOR.
> 
> If we're using RAPPOR, why do we need to truncate the urls?

In RAPPOR analyzability and privacy go hand in hand. A lot of users have to report back the same url for us to be able to see it in the noise. If users set something like homepage.com/?userid=gal then we won't be able to see those urls. Not enough users report them back. In this particular case I don't think it matters much but in general I think ETLD+1 or ETLD+2 makes sense.

> 
> @@ +488,5 @@
> > +   * Return a report for the host portion of the given url string reduced to
> > +   * eTLD+1 and obfuscated via RAPPOR.
> > +   */
> > +  _reportHost: function(name, url) {
> > +    let host = null;
> 
> set this to "invalid" and you can ignore the null test later in the function.

Done.

> 
> @@ +490,5 @@
> > +   */
> > +  _reportHost: function(name, url) {
> > +    let host = null;
> > +    if (typeof url === "string") {
> > +      if (url === "about:home") {
> 
> what about the other about: urls?

They would end up being "invalid". If someone has about:config set as their home, I don't think we care. If invalid comes back as > 0% we can dig into things deeper. Anything rare we can't see anyway. The current settings will go into the noise around 0.1% of our user base or that ballpark. I doubt 0.1% use about:config ...

> 
> @@ +493,5 @@
> > +    if (typeof url === "string") {
> > +      if (url === "about:home") {
> > +        host = url;
> > +      } else {
> > +        let uri = Services.io.newURI(url, null, null);
> 
> uri not used.

Ugh, that's a bug. I will see why the test didn't catch that.

> 
> @@ +527,5 @@
> >          channel: updateChannel,
> >          enabled: Preferences.get(PREF_UPDATE_ENABLED, true),
> >          autoDownload: Preferences.get(PREF_UPDATE_AUTODOWNLOAD, true),
> >        },
> > +      homepage: this._reportHost("homepage", Preferences.get(PREF_HOMEPAGE, null)),
> 
> Did you consider appending the idl so that anyone using the telemetry API
> can use RAPPOR?  Probably a good follow-on bug, right?

Yeah I wanted to push this into the field and once it reports back and we have the analysis part in place we should expand this.
Fixed the review issues and added a test for the bug in ETLD+1 stripping.
Attachment #8570913 - Attachment is obsolete: true
Attachment #8570913 - Flags: review?(rlb)
Attachment #8570913 - Flags: review?(dougt)
Fix a couple issues in the comments.
Attachment #8570992 - Attachment is obsolete: true
Attachment #8570993 - Flags: review?(rlb)
Attachment #8570993 - Flags: review?(dougt)
(In reply to Andreas Gal :gal from comment #13)

> > Did you consider appending the idl so that anyone using the telemetry API
> > can use RAPPOR?  Probably a good follow-on bug, right?
> 
> Yeah I wanted to push this into the field and once it reports back and we
> have the analysis part in place we should expand this.

Would probably need to ensure that the unified FHR/Telemetry pipeline's UUIDs don't get used for this.
(In reply to Andreas Gal :gal from comment #13)
> In RAPPOR analyzability and privacy go hand in hand. A lot of users have to
> report back the same url for us to be able to see it in the noise. If users
> set something like homepage.com/?userid=gal then we won't be able to see
> those urls. Not enough users report them back. In this particular case I
> don't think it matters much but in general I think ETLD+1 or ETLD+2 makes
> sense.

No one uses homepage.com/?userid=gal or equivalent, since the LiveJournal debacle ~10 years ago. Same origin for all users, bad times.

Google and other bigs do use foo.google.com vs. bar.google.com and you probably want to see foo and bar, if one is causing hot problems that can get through RAPPOR. Why truncate the FQDN?

/be
Pretty easy to not limit to ETLD+1, I have no objections. I will change the patch.
> Would probably need to ensure that the unified FHR/Telemetry pipeline's
> UUIDs don't get used for this.

Why not? From my understanding of the paper, knowing the unique identity of the sender doesn't change the privacy guarantees of RAPPOR.
Send back full host name, not just ETLD+1.
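
A rough sketch of what "send back the full host name" amounts to, based on the review discussion above. This is not the patch's `_reportHost`: `reportableHost` is a hypothetical name, and it uses the standard `URL` class rather than `Services.io.newURI` so that it runs outside Firefox. The "invalid" fallback and the about:home special case follow the review comments.

```javascript
// Reduce a homepage preference to the string that would be fed into RAPPOR.
// Paths and query strings are dropped; the full host name is kept.
function reportableHost(url) {
  if (typeof url !== "string") return "invalid";
  if (url === "about:home") return url;
  try {
    const host = new URL(url).hostname;
    return host || "invalid"; // e.g. other about: pages have no host
  } catch (e) {
    return "invalid";         // not parseable as a URL
  }
}

console.log(reportableHost("http://homepage.com/?userid=gal"));
console.log(reportableHost("https://foo.google.com/path"));
console.log(reportableHost("about:config"));
```

Dropping the path and query keeps per-user values such as `?userid=gal` out of the report, while keeping the full host lets foo.google.com and bar.google.com be told apart.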
Attachment #8570993 - Attachment is obsolete: true
Attachment #8570993 - Flags: review?(rlb)
Attachment #8570993 - Flags: review?(dougt)
Attachment #8571008 - Flags: review?(rlb)
Flags: needinfo?(rlb)
Attachment #8571008 - Flags: review?(dougt)
(In reply to Andreas Gal :gal from comment #19)
> > Would probably need to ensure that the unified FHR/Telemetry pipeline's
> > UUIDs don't get used for this.
> 
> Why not? From my understanding of the paper knowing the unique identity of
> the sender doesn't change the privacy guarantees of RAPPOR.

I'll read it again. My impression was that the privacy guarantee did not include the scenario of repeated payloads using the same UUID. We'll double-check. Nevertheless, even if it is an issue it's surmountable by various methods on either the client or server.
(In reply to John Jensen from comment #21)
> (In reply to Andreas Gal :gal from comment #19)
> > > Would probably need to ensure that the unified FHR/Telemetry pipeline's
> > > UUIDs don't get used for this.
> > 
> > Why not? From my understanding of the paper knowing the unique identity of
> > the sender doesn't change the privacy guarantees of RAPPOR.
> 
> I'll read it again. My impression was that the privacy guarantee did not
> include the scenario of repeated payloads using the same UUID. 

This is one of the key privacy properties of RAPPOR (they call it
longitudinal attacks). See Section 1.3.
Ok, so my current understanding of FHR and telemetry is that FHR is opt-out, but telemetry is opt-in. Both will share the same code on the client and server side.

As long as the above assumption is correct, I will move any RAPPOR reported data from telemetry to metrics (post landing, I want this landed, enabled and tested first). RAPPOR offers such a high degree of protection for the user that opt-out seems appropriate. We will work with Jensen to ensure that we use the right parameters and that discovering even just a few partial bits of information (remember, RAPPOR never discloses the truth, it always lies even over time) would take years of consistent observation of a client.
Hello,
A few things I'd like to point out with respect to the RAPPOR paper [1] and the setup we have at Mozilla.

1. When a telemetry or FHR ping is sent back, the JSON payload contains a profile specific UUID. Type about:healthreport, click on Raw and see 'clientID'.

2. Step (3) in the RAPPOR paper (Rp for short) is designed to send the same result back but encoded differently (by randomly bit flipping). This prevents the response from becoming a fingerprint. See page 4, second column, second paragraph of Rp.

3. Because of (1), step (3) doesn't provide us anything beyond step(2) of Rp. We can drop step 3 in the implementation.

4. Rp doesn't return the actual values but only the encoded bit vectors. To arrive at the original values (in the distribution, you can't infer the profile's original distribution), we would have to apply the bloom filters to "candidate examples", e.g. the top 1000 Alexa websites for about:home, and then check for presence.

5. The counts (e.g. of different values of about:home ) are estimated via regression. Firstly, the encoded candidate examples are the columns (covariates). The authors model the number of times a bit is set against the set of covariates. A) they use a statistical tool (LASSO) to reduce the number of columns and use B) Least Squares Regression to estimate the counts. Keep in mind, that the estimation will be dependent on all the assumptions that make up these statistical tools.

6. Also note, in section 2.1 of Rp, for simple categorical variables (e.g. is about:home mozilla.org, mozilla.com, or other), one can use the simple Randomized Response Method (introduced at the beginning of the paper). For numerical variables, the methods described in the paper (adding noise) suffice.

7. Another suggestion, is that we try this on data already present in FHR e.g. binary extension names and compare the estimated counts via RAPPOR vs true counts from the FHR data.

8. I will provide parameters p, q, and f (though given (3) above, p should be 0 and q be 1) but it is driven by epsilon. Choices for epsilon are in Table 1 of [2]. It appears good choices of epsilon are 0.05 - 1.

I'll update the bug with 'f' tomorrow (and p and q if we insist on step 3).


[1] http://arxiv.org/pdf/1407.6981v2.pdf
[2] http://arxiv.org/pdf/1402.3329v1.pdf
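
Before the regression in point 5, the per-bit counts have to be corrected for the noise that f, p and q add. A minimal sketch of that correction, as I read the estimator in [1]; the function names are made up for this example, and the LASSO and least-squares steps are not shown.

```javascript
// Probability that a client reports 1 for a bit whose true Bloom bit is 0.
function reportOneGivenZero(f, p, q) {
  return p + 0.5 * f * q - 0.5 * f * p;
}

// Expected number of clients reporting a 1, if trueOnes of n clients really
// have the bit set: each set bit raises the report probability by (1-f)(q-p).
function expectedReportedOnes(trueOnes, n, f, p, q) {
  return reportOneGivenZero(f, p, q) * n + trueOnes * (1 - f) * (q - p);
}

// Invert the relation above to estimate the true count from the observed one.
function estimateTrueOnes(reportedOnes, n, f, p, q) {
  return (reportedOnes - reportOneGivenZero(f, p, q) * n) / ((1 - f) * (q - p));
}

// 1000 clients, 200 of which truly have the bit set, f=0.5, p=0.25, q=0.75.
const reported = expectedReportedOnes(200, 1000, 0.5, 0.25, 0.75);
console.log(reported, estimateTrueOnes(reported, 1000, 0.5, 0.25, 0.75));
```

Note that with p = 0 and q = 1 (step 3 dropped, as suggested in point 3) the divisor reduces to 1 - f, so only the permanent layer has to be undone.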
For implementation reasons I only support values for f, p and q of 0.25, 0.5, and 0.75. If you can work with those, that would be great.
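
The thread does not spell out what the implementation reasons are. One common way such a restriction arises, and this is an assumption rather than a description of the patch, is that biased bits are built by combining two uniformly random words with AND and OR, which yields exactly these three probabilities. `weightedRandomBits` and `randomWord` are names invented for this sketch.

```javascript
// Return a 32-bit word in which each bit is 1 with the given probability.
// Only 0.25, 0.5 and 0.75 are supported: they fall out of combining two
// uniform words (AND: both bits set, OR: at least one bit set).
function weightedRandomBits(
  probability,
  randomWord = () => Math.floor(Math.random() * 2 ** 32)
) {
  const a = randomWord() >>> 0;
  const b = randomWord() >>> 0;
  switch (probability) {
    case 0.25:
      return (a & b) >>> 0;
    case 0.5:
      return a;
    case 0.75:
      return (a | b) >>> 0;
    default:
      throw new RangeError("only 0.25, 0.5 and 0.75 are supported");
  }
}

console.log(weightedRandomBits(0.75).toString(2).padStart(32, "0"));
```

Other probabilities would need more random words per output word or a per-bit comparison, which is the trade-off such a scheme avoids.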

I am trying to think through the implications of the fingerprint. Ekr, mmc, could you comment? I see two ways to deal with this: either drop the 2nd layer of randomization, or invent a new ping that doesn't have a fingerprint and use it for population surveys only.
(In reply to "Saptarshi Guha[:joy]" from comment #25)
> Hello,
> A few things I'd like to point wrt to the RAPPOR paper [1] and the setup we
> have at Mozilla.
> 
> 1. When a telemetry or FHR ping is sent back, the JSON payload contains a
> profile specific UUID. Type about:healthreport, click on Raw and see
> 'clientID'.
> 
> 2. Step (3) in the RAPPOR paper (Rp for short) is designed to send the same
> result back but encoded differently (by randomly bit flipping). This
> prevents the response from becoming a fingerprint. See page 4, second
> column, second paragraph of Rp.
> 
> 3. Because of (1), step (3) doesn't provide us anything beyond step(2) of
> Rp. We can drop step 3 in the implementation.
> 
> 4. Rp doesn't return the actual values but only the  encoded bit vectors. To
> arrive at the original values (in the distribution, you cant infer the
> profiles original distribution), we would have to apply the bloom filters to
> "candidate examples" e.g. top 1000 Alex websites for the about:home and then
> check for presence.
> 
> 5. The counts (e.g. of different values of about:home ) are estimated via
> regression. Firstly, the encoded candidate examples are the columns
> (covariates). The authors model the number of times a bit is set against the
> set of covariates. A) they use a statistical tool (LASSO) to reduce the
> number of columns and use B) Least Squares Regression to estimate the
> counts. Keep in mind, that the estimation will be dependent on all the
> assumptions that make up these statistical tools.
> 
> 6. Also note, in section 2.1 of Rp, for simple categorical variables e.g. is
> about:home mozilla.org, mozilla.com, other, one can use the simple
> Randomized Response Method (introduced at the beginning of the paper). For
> numerical variables, the methods described in the paper (adding noise)
> suffice
> 
> 7. Another suggestion, is that we try this on data already present in FHR
> e.g. binary extension names and compare the estimated counts via RAPPOR vs
> true counts from the FHR data.
> 
> 8. I will provide parameters p,q, and f (though given (3) above, p should be
> 0 and q be 1) but it is driven by epsilon. Choices for epsilon are in Table
> 1 of [2]. It appears good choices of epsilon are 0.05 - 1.
> 
> I'll update the bug with 'f' tomorrow (and p and q if we insist on step 3).
> 
> 
> [1] http://arxiv.org/pdf/1407.6981v2.pdf
> [2] http://arxiv.org/pdf/1402.3329v1.pdf

Actually I think (3) is incorrect. The additional noise makes it impossible to observe B' directly without a long running observation of a few years. So even with a fingerprint nearby that property is preserved.
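
To make the "long running observation" concrete: an observer who can link many reports to one client can only recover B' by averaging them. The sketch below uses a naive midpoint threshold, which is an assumption made for illustration and not the analysis from the paper; `guessPermanentBits` is a hypothetical name.

```javascript
// Guess B' from many linked reports of the same client. A bit that is 1 in B'
// is reported as 1 with probability q, a 0 bit with probability p, so the
// observed frequency of ones drifts towards q or p as reports accumulate.
function guessPermanentBits(reports, p, q) {
  const threshold = (p + q) / 2;
  const width = reports[0].length;
  const guess = [];
  for (let i = 0; i < width; i++) {
    let ones = 0;
    for (const report of reports) ones += report[i];
    guess.push(ones / reports.length > threshold ? 1 : 0);
  }
  return guess;
}

// Four linked reports of a two-bit vector, with p = 0.25 and q = 0.75.
console.log(guessPermanentBits([[1, 0], [1, 0], [1, 1], [0, 0]], 0.25, 0.75));
```

With only a handful of reports such a guess is unreliable, which is the point being made: the closer p and q are to each other, the more reports are needed before the frequencies separate.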
If I understand correctly:
a. A's response has a bloom filter applied and then undergoes a randomized response treatment (steps 1 and 2).
b. Step 2 is enough to hide A's true response (it is a randomized response).
c. The output from step 2 is memoized.
d. If A's response is queried repeatedly without step (3) [in RAPPOR], then the same unique bit string is sent back every time. This would fingerprint the data packet being sent back from A.
e. Step 3 in RAPPOR serves to remove this constant fingerprint by changing what is sent for B' on every report. It does hide the value of B', but B' never revealed the original B. So A's original information is hidden with just step (2).

So, yes, step (3) hides B'. But why do we need to hide B'?
(In reply to "Saptarshi Guha[:joy]" from comment #25)
> 1. When a telemetry or FHR ping is sent back, the JSON payload contains a
> profile specific UUID. Type about:healthreport, click on Raw and see
> 'clientID'.

Per Bug 1120981, not all submissions will contain clientID, so it would still be useful to prevent fingerprinting for those (future?) document types that do not include an ID.
Group: mozilla-employee-confidential
Group: mozilla-employee-confidential
Group: mozilla-employee-confidential
bah. private restricts the group that can see it to a very tiny group, but it does include non-moz employees (trusted, but not sure how restrictive we need to be with numbers).  mcote, any thoughts?
Flags: needinfo?(mcote)
dougt: you're talking about restricting the visibility of this bug?  To whom exactly?  We could always create a new group if need be.
Flags: needinfo?(mcote)
basically, i only want employees seeing comment #30
Flags: needinfo?(mcote)
glob?
Flags: needinfo?(mcote) → needinfo?(glob)
(In reply to Doug Turner (:dougt) from comment #36)
> basically, i only want employees seeing comment #30

unfortunately comment visibility is boolean - public, or visible to the security-group only (which, as you pointed out, includes non-mozilla employees).

perhaps the best option is to edit the comment to remove the sensitive numbers, or delete the comment completely.  bmo admins have rights to do both, let me know if this is what you want to do.
Flags: needinfo?(glob)
Comment on attachment 8571008 [details] [diff] [review]
Patch to implement RAPPOR in Firefox and report the currently set homepage using the new infrastructure.

Review of attachment 8571008 [details] [diff] [review]:
-----------------------------------------------------------------

sgtm.  someone needs to review the crypto bits.  rlb, can you start on this or reassign?
Attachment #8571008 - Flags: review?(dougt) → review+
maksik did some analysis of RAPPOR for use in new tab tile pings in bug 1142386 and concluded that with so many new tab impressions, the additional instantaneous response did not provide much benefit over randomized response. So in bug 1136461, he has a patch to do randomized response on our existing bitvector of whether a site triggered a suggested tile (i.e., no need for bloom filter).
Looking more closely at the patch in this bug, we would probably want to reuse the clever PRNG bf_prr by converting our sites data into a bitvector (skipping the bf_signal and bf_irr steps of create_report). Should we cherrypick bf_prr and related functions into bug 1136461 for now? Or would it be reasonable to call TelemetryRappor.internal.bf_prr when this lands?
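
For reference, the plain randomized response that the RAPPOR paper introduces before the full scheme can be sketched as below. Whether bug 1136461 uses exactly this coin-flip variant is not stated here, so treat it as an illustration; both function names are made up.

```javascript
// One client, one yes/no fact. First coin: heads means answer truthfully.
// Tails means answer with a second fair coin, regardless of the truth.
function randomizedAnswer(truth, rand = Math.random) {
  if (rand() < 0.5) return truth;
  return rand() < 0.5;
}

// Across many clients P(yes) = 0.5 * trueFraction + 0.25, so the true
// fraction of "yes" can be estimated from the observed one.
function estimateTrueFraction(observedYesFraction) {
  return 2 * (observedYesFraction - 0.25);
}

const answers = Array.from({ length: 10000 }, (_, i) => randomizedAnswer(i % 5 === 0));
const observed = answers.filter(Boolean).length / answers.length;
console.log(observed, estimateTrueFraction(observed));
```

Applied per bit of an existing bit vector, this gives each client deniability for every bit without needing a bloom filter, at the cost of needing many reports for an accurate estimate.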
Comment on attachment 8571008 [details] [diff] [review]
Patch to implement RAPPOR in Firefox and report the currently set homepage using the new infrastructure.

Review of attachment 8571008 [details] [diff] [review]:
-----------------------------------------------------------------

Not reviewing right now, because I think this might be OBE given some other developments.  r- though, because I do want to review if we go down this path.
Attachment #8571008 - Flags: review?(rlb) → review-
Assignee: gal → automation
Not accessible to reporter
Assignee: automation → nobody
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [measurement:client]
No need for this to be private.
Group: mozilla-employee-confidential
We have some new projects upcoming to handle sensitive data now (e.g. Prio).
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID