Closed Bug 718066 Opened 13 years ago Closed 11 years ago

Initial landing of Firefox Health Report

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

Tracking

(firefox19 disabled, firefox20 fixed, firefox21 fixed)

RESOLVED FIXED
Firefox 20

People

(Reporter: dre, Assigned: gps)

Details

(Keywords: meta, user-doc-needed)

Attachments

(2 files, 3 obsolete files)

The anonymous metrics collected through this feature are focused on the day-to-day usage of Firefox and allow Mozilla to improve the program.

For example, understanding how crash risk, startup times, and UI responsiveness change over the life of a profile is critical to optimizing the browser.

Separating the metrics data ping into a discrete feature will allow us to simplify and more easily maintain the critical security code for Firefox updates, and plug-in or add-on blocklisting.

Submission happens once per day when the product is running.  Metrics that have not yet been submitted before the application exits are stored locally to be submitted when the application is restarted.
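
As a rough illustration of the submission behavior described above, the following sketch queues metrics locally and sends them at most once per day. The names `createSubmitter`, `storage`, and `send` are hypothetical and not taken from the actual implementation; `storage` stands in for whatever is persisted in the profile between runs.

```javascript
// Hypothetical sketch; none of these names come from the real feature.
const DAY_MS = 24 * 60 * 60 * 1000;

function createSubmitter(storage, send) {
  // storage: { pending: [], lastSubmit: 0 }, assumed to be persisted locally.
  return {
    // Remember a metric until the next successful submission.
    record(metric) {
      storage.pending.push(metric);
    },
    // Called at startup and periodically while the application runs.
    maybeSubmit(now) {
      if (storage.pending.length === 0) return false;
      if (now - storage.lastSubmit < DAY_MS) return false; // at most once per day
      send(storage.pending.slice());
      storage.pending = [];
      storage.lastSubmit = now;
      return true;
    },
  };
}
```

Because the pending list lives in `storage` rather than in the submitter object, anything recorded before the application exits is still there when a new submitter is created after a restart.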

In terms of privacy and data safety, the proposed MDP implementation includes a uuid for longitudinal analysis. There is no personally identifiable information or demographics included in the user data being collected by the ping. Users have the ability to view the information being collected and to opt out of MDP.  Notices will be updated to explain how MDP works, the data being collected, the benefits enabled through MDP, and how to opt out.
Depends on: 718067
Depends on: 707970
Depends on: 719484
> the proposed MDP implementation includes a uuid for longitudinal analysis.
> There is no personally identifiable information

Sorry, but that's a contradiction. If the user's browser gets a unique ID, that *is* PII.

"Personally Identifiable Information (PII), as used in information security, is information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual."

A UUID for a user or user device is always PII, and therefore highly problematic in the sense of privacy. This must be dropped.
It is not necessary for an ID to be linked to a real name to be a privacy problem. Please consider the difference between anonymous and pseudonymous. If you have an ID, it's a pseudonym and never anonymous.
Please also see the German privacy law, as well as the US term "potentially personally identifiable information".

Note that the user cannot verify whether you or anybody getting the data actually does it, only that it is theoretically possible, and it *must not be possible*.

Having a UUID would allow, for example, to track all my dynamic IP addresses over time, and allow to build a profile, when combined with access logs. If I have a notebook or mobile browser, it would even allow to track the places where I go based on IP geolocation / whois data.

Instead of building the history on the server, the client should build the history and only submit results. E.g. if you need to know whether things improved, you can let the client keep some old data and submit "12 crashes last week. One week before: 12% more. One year before: 50% less." It should not include exact history numbers either, because they, too, would allow to puzzle the numbers together and allow to again build a history of IP addresses for a given user.
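
A minimal sketch of the client-side summary suggested above, assuming the client keeps a few past crash counts locally. The function and field names are made up for illustration. Only the latest count and relative changes leave the client; a positive percentage means the earlier period had more crashes than last week.

```javascript
// Hypothetical sketch: the history stays on the client, only the summary is sent.
function percentChange(reference, other) {
  if (reference === 0) return null; // relative change is undefined
  return Math.round(((other - reference) / reference) * 100);
}

function summarize(history) {
  // history: { lastWeek, weekBefore, yearBefore } crash counts kept locally
  return {
    crashesLastWeek: history.lastWeek,
    weekBeforePct: percentChange(history.lastWeek, history.weekBefore),
    yearBeforePct: percentChange(history.lastWeek, history.yearBefore),
  };
}
```

For instance, with 20 crashes last week, 25 the week before, and 10 in the same week a year earlier, the summary reports 20 crashes, 25% more the week before, and 50% fewer a year before.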
> Having a UUID would allow, for example, to track all my dynamic IP addresses over
> time, and allow to build a profile, when combined with access logs. If I have a
> notebook or mobile browser, it would even allow to track the places where I go based
> on IP geolocation

But no IP address or geoloc is being kept. Also, I guess to personally identify someone one needs ip address or some form of demographics neither of which is being collected or stored.
> But no IP address or geoloc is being kept.

That is immaterial. 1) I have no way to verify that claim and 2) somebody intercepting that transfer could store it and 3) you might have a court order that demands you to give that information out.

As I said: What you do with the information does not matter much. It must not be there in the first place.
1) I think the server side source code would be public
2) the blocklist ping goes out for everybody, and that can be intercepted too. It is turned on by default and can be turned off.
3) the same applies to (2)
The MDP can be turned off just as the blocklist ping can be.
> > 1) I have no way to verify that claim
> 1) I think the server side source code would be public

Again, I cannot verify that you indeed run that exact code and that you don't have some other component before it that's logging.

> 2) the blocklist ping goes out for everybody

Does it have a unique ID per browser? I don't think so, and if it does, I would go and file a bug.

---

FWIW, Google Chrome used to have unique ID, but then they got major pushback for it, esp. in Germany, and they *removed* it because people were so upset about it. Unique IDs are red flags even for the press, and Germany is one of Firefox's biggest user bases and highly privacy sensitive, so this is likely to cause serious backlash and to end up being a PR nightmare. Apart from being just wrong and possibly even violating German law.

It's sad to see that stuff like that is still being proposed here at Mozilla, that privacy still isn't part of the Mozilla DNA. FWIW, I oppose the whole idea here. But unique IDs are a complete no-go. Even Google understood that. We're right at history repeating.
(In reply to Dão Gottwald [:dao] from bug 718067 comment #20)
> (In reply to Dão Gottwald [:dao] from bug 718067  comment #17)
> > (In reply to Saptarshi Guha from bug 718067  comment #14)
> > > But even with user consent, is it reasonable to think that the user has
> > > inspected the ping to look for PIIs?
> > 
> > Probably not... So yes, making sure the decision is an informed one is
> > another problem, but no good reason for doing it without consent.
> > 
> > > I think the key thing here is that they can *easily* turn the feature off.
> > 
> > I don't think privacy works like this on a large scale. We can't expect
> > everyone who happens to be identifiable to make a self-motivated decision.
> > People trust Mozilla not to leak data like that by default.
> 
> ping?

We cannot reasonably stop every way that PII already exists or could possibly be introduced into the browser by a power user or developer.  We can ensure that it is easy for them to discover if there is undesired information available to us and correct it, and we can ensure that it is not something that would happen for the majority of our users, and most importantly, we can ensure that we are not using that data in any way that harms the user's privacy.  That means not leaking it, not sharing it, not using any of the data we collect to identify or track individual users.  That is what we have worked with the privacy and security teams to commit to this and to communicate it through our policies.
> we can ensure that we are not using that data

Sorry, but that stance "We'll get it, but we won't store or use it, we promise!" doesn't work. Too often, it's a lie (Facebook was proven lying recently, ditto Google and others), or intentionally goes wrong, or you get a court order and can't prevent it.

This is an old problem: Once data exists, it *will* be used eventually. That's why German law *mandates* that problematic data is not even gathered. You can't prevent the IP address, but there are alternatives without unique ID, and you *MUST* use those alternatives. It's not your decision, that "design decision" was made by law.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #7)
> (In reply to Dão Gottwald [:dao] from bug 718067 comment #20)
> > (In reply to Dão Gottwald [:dao] from bug 718067  comment #17)
> > > (In reply to Saptarshi Guha from bug 718067  comment #14)
> > > > But even with user consent, is it reasonable to think that the user has
> > > > inspected the ping to look for PIIs?
> > > 
> > > Probably not... So yes, making sure the decision is an informed one is
> > > another problem, but no good reason for doing it without consent.
> > > 
> > > > I think the key thing here is that they can *easily* turn the feature off.
> > > 
> > > I don't think privacy works like this on a large scale. We can't expect
> > > everyone who happens to be identifiable to make a self-motivated decision.
> > > People trust Mozilla not to leak data like that by default.
> > 
> > ping?
> 
> We cannot reasonably stop every way that PII already exists or could
> possibly be introduced into the browser by a power user or developer.

You could just not send such data.

This is not at all about developers. I created those add-ons as a power user only for my personal use. It's not okay to leak power users' identities. Also note that things like the add-on builder and personas significantly lower the bar for creating custom add-ons.

> We
> can ensure that it is easy for them to discover if there is undesired
> information available to us and correct it,

How exactly would you do that? Is this already part of the current plan?

> and we can ensure that it is not
> something that would happen for the majority of our users,

This doesn't help the affected minority.

> and most
> importantly, we can ensure that we are not using that data in any way that
> harms the user's privacy.  That means not leaking it, not sharing it, not
> using any of the data we collect to identify or track individual users.

We can say that this is our intent, but we cannot ensure it. For instance, the US government could make Mozilla hand in the data.
(In reply to Ben Bucksch (:BenB) from comment #6)
> FWIW, Google Chrome used to have unique ID, but then they got major pushback
> for it, esp. in Germany, and they *removed* it because people were so upset
> about it.

If you have any information about how they implemented their replacement, please share it.  I was not able to find details, but it sounds to me like they switched to using some form of fingerprinting instead.

(In reply to Ben Bucksch (:BenB) from comment #2)
> Instead of building the history on the server, the client should build the
> history and only submit results. E.g. if you need to know whether things
> improved, you can let the client keep some old data and submit "12 crashes
> last week. One week before: 12% more. One year before: 50% less." It should
> not include exact history numbers either, because they, too, would allow to
> puzzle the numbers together and allow to again build a history of IP
> addresses for a given user.

Unfortunately, this method does not allow for understanding what installations are no longer in use, and it doesn't allow us to accurately understand how many installations there are across any time frame larger than the submission interval.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #10)
> (In reply to Ben Bucksch (:BenB) from comment #6)
> > FWIW, Google Chrome used to have unique ID, but then they got major pushback
> > for it, esp. in Germany, and they *removed* it because people were so upset
> > about it.
> 
> If you have any information about how they implemented their replacement,
> please share it.  I was not able to find details, but it sounds to me like
> they switched to using some form of fingerprinting instead.

http://mynetx.net/2878/google-chrome-soon-without-unique-user-id says they only used the id for counting users, so there would be no point in fingerprinting. Also, Firefox shouldn't send data that's easy to fingerprint...

> (In reply to Ben Bucksch (:BenB) from comment #2)
> Unfortunately, this method does not allow for understanding what
> installations are no longer in use, and it doesn't allow us to accurately
> understand how many installations there are across any time frame larger
> than the submission interval.

There's always more we could know. I think we need to err on the side of privacy.
(In reply to Dão Gottwald [:dao] from comment #9)
> (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #7)
> > We can ensure that it is easy for them to discover if there is undesired
> > information available to us and correct it,
> 
> How exactly would you do that? Is this already part of the current plan?

Yes.  That is the about:metrics page blocking this one (bug 719484) and the ability for the user to not only opt-out but to trigger the deletion of the data associated with their installation.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #12)
> (In reply to Dão Gottwald [:dao] from comment #9)
> > (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #7)
> > > We can ensure that it is easy for them to discover if there is undesired
> > > information available to us and correct it,
> > 
> > How exactly would you do that? Is this already part of the current plan?
> 
> Yes.  That is the about:metrics page blocking this one (bug 719484) and the
> ability for the user to not only opt-out but to trigger the deletion of the
> data associated with their installation.

When I read "easy for them to discover", I expected some kind of notification. Requiring users to do research doesn't sound right to me.
Sorry, opt-out doesn't fly. Per German law, any data gathering has to be opt-in. And even then, it has to be justified and some things are not allowed even when you ask.
Also, decide whether you want to be "anonymous" (bug summary) or have a unique ID. A unique ID is not anonymous, per definition.
The metrics team is working on creating a project page that we can use to share with people the need driving our work on this feature, the due diligence and process already done and to be done, and provide a useful place to gather discussion about the issues in a productive and transparent manner.  I really hope that we can get ideas and help from our community through this process.  I'll post a link here as soon as the page is up, I expect, by Wednesday 2012-01-25.
need to schedule a full review with the team
Whiteboard: [secr:curtisk]
The public project page went up yesterday.  It contains some background information as well as links to various supporting resources.
Curtis Koenig [:curtisk]
> need to schedule a full review with the team

Please include me and Dao.
as soon as you all pick when you want your review I can invite whomever you want
Last week, we reserved the slot for Wednesday, 21:00 UTC (13:00 US-Pacific, 16:00 US-Eastern).

If this review will be looking at the general privacy scope as well as security, then we should make sure that people from Alex Fowler's team and Ben Adida are invited as well since they were part of the initial review by the User Data Council.
Also, please note the updated section on the wiki page regarding replacing the static UUID approach with a per-document identifier: https://wiki.mozilla.org/MetricsDataPing#Document_Identifier_Strategy
I was just talking a lot on the phone with Daniel :dre. Main facts I get out of it are:
* It's vitally important to know data about Firefox installations that are no longer in use. Did they crash a lot? Was the startup time slow (and maybe slower than before)? Did they have many addons installed, and which ones, and are they maybe related to the previous problems? Knowing that might help understand why people switch away from Firefox. ("Retention" problem.)
* We discussed a few algorithms that can gather data anonymously, e.g. the amount of users via time of last submission or/and time of installation, or the number (just the counter) of crashes since the last submission.
* I still believe that the needed data can be submitted in a way that doesn't allow to track individual users (apart from the >1% fringe cases with custom UA strings etc.), i.e. is completely anonymous, but still gives the answers to the important questions.

* If I or we can come up with such a new scheme that submits the data specifically about Retention as mentioned above, in an anonymous fashion, Daniel promised me that they would adopt it.

I think and hope that we can solve both interests at the same time, and I am happy about Daniel's promise. It makes me hopeful.

I will think about it to find a way, and I encourage others to do the same.
I just need to make one small clarification. I am happy to have more people looking at the problem and challenge, and I would love to see a mechanism that provides a feasible alternative to the current ID-centric solution.  The only thing I can honestly promise is to collaborate on the thinking of, and consider such a solution if presented, and if it meets the stated needs of the project, I will happily and vigorously advocate its adoption to all parties involved.  This was the spirit of my promise, but I don't have the final say on what does or does not go into the product.
> I will happily and vigorously advocate its adoption to all parties involved.

Yeah, thanks. :)
Daniel, I've been thinking about it and I think I have the solution to the problem you have stated, in an anonymous fashion, and I posted it at
https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative
I'll post this below as well - merely as a historic reference, because wiki software tends to be replaced and short-lived. Sorry for the spam.

= Anonymous alternative =

The following is an alternative approach, proposed by Ben Bucksch:

For simplicity, I will take the number of crashes (e.g. in the last week or overall) as the data point that you want to gather. The data itself is anonymous and cannot (apart from fingerprinting, more on that later) identify a single user.

== Avoiding UUID ==

You wanted to know which profiles are not used anymore (dormant, retention problem) and which characteristics they have. This is inherently difficult without tracking individual users (installations), but it is possible with the following algo:

The client submits:

* Date of last submission - e.g. 2012-01-18
* Current date (from client perspective) - only date, not time - e.g. 2012-01-20
* Age of profile (Firefox installation) in days - e.g. 500
* (Last submitted age is implied or explicit - e.g. 498 )
* Number of crashes - e.g. 15
* Number of crashes submitted last time - e.g. 10

Then, on the server, you write that information in a database, as such:
 Date of submission | Age of installation | Crash count | Number of users
 2012-01-20         | 500                 | 15          | 100000
Any additional user also submitting today the same combination "age 500, crash count 15" increases the "number of users" column by 1, new value is 100001.
Also, you look up the row for the last submission, namely
 2012-01-18         | 498                 | 10          | 20000
and decrease the number of users by 1, new value is 19999.

If the user later that day decided that there were too many crashes and switches to Chrome, he will now be stranded on the row
 2012-01-20         | 500                 | 15          | 5000
while other users who have continued to use FF have been subtracted after a while. So, you can say with certainty that there were 5000 users who used Firefox the last time on 2012-01-20, after having used Firefox for 500 days, and they had 15 crashes (per day/week/total, whatever you submit) when they stopped using Firefox.

That is exactly the information you are so desperately seeking. There, you have it. Without tracking any individual user: it's completely anonymous.
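
The bookkeeping described in this section can be sketched as follows. This is only an illustration of the increment/decrement idea, using an in-memory `Map` as a stand-in for the server-side table; all names (`counts`, `bump`, `recordSubmission`, and the submission fields) are hypothetical.

```javascript
// Hypothetical in-memory stand-in for the server-side table:
// key = "date|age|crashes", value = number of users currently on that row.
const counts = new Map();

function rowKey(date, ageDays, crashes) {
  return `${date}|${ageDays}|${crashes}`;
}

function bump(date, ageDays, crashes, delta) {
  const k = rowKey(date, ageDays, crashes);
  counts.set(k, (counts.get(k) || 0) + delta);
}

function recordSubmission(s) {
  // Add the client to the row matching today's values.
  bump(s.currentDate, s.ageDays, s.crashes, +1);
  // Take it off the row it occupied after its previous submission, if any.
  if (s.lastDate !== null) {
    bump(s.lastDate, s.lastAgeDays, s.lastCrashes, -1);
  }
}
```

A client that stops submitting is never subtracted again, so it stays counted on its last row, which is the "stranded" effect described above. The sketch does not store anything that links the two rows to each other.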

== Avoiding Fingerprinting ==

Now, what about all the other information that you need: startup times, addons, etc.? If we just add all that information to the same table and row, it would allow fingerprinting. But that is not necessary. You merely make one table per atomic information. I.e.
 Table A
 Date of submission | Age of installation | Crash count | Number of users
 Table B
 Date of submission | Age of installation | Startup time | Number of users
or of course whatever other database schema you want, as long as each value is separate. That takes care of the fingerprinting.

At least on the server side, not on the submission side. I would have to trust you, and anything between you and me. It would be possible to separate the calls and submit each value separately, but I think that would be overdoing it.
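
As a sketch of the "one table per atomic information" idea, assuming the same kind of counter tables as above: each metric from a submission is counted in its own table, so the values from one client are never stored next to each other. The table and metric names (`crashes`, `startupMs`) are placeholders.

```javascript
// Hypothetical sketch: one counter table per metric.
const tables = {
  crashes: new Map(),
  startupMs: new Map(),
};

function recordMetrics(date, ageDays, metrics) {
  for (const [name, value] of Object.entries(metrics)) {
    const table = tables[name];
    if (!table) continue; // metrics without a table are dropped, not stored
    const k = `${date}|${ageDays}|${value}`;
    table.set(k, (table.get(k) || 0) + 1);
  }
}
```

As noted above, this only separates the values on the server side; the submission itself still carries them together.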
Here's the result of the security review:
https://wiki.mozilla.org/Security/Reviews/MetricsDataPing

{{SecReviewInfo
|SecReview name=Metrics Data ping
|SecReview target=https://bugzilla.mozilla.org/show_bug.cgi?id=718066 https://wiki.mozilla.org/MetricsDataPing
}}
{{SecReview
|SecReview feature goal=* MetricsDataPing- get important metrics (see wiki)
** this data is a critical need for Moz for a variety of reasons
* orig plans focused around a collection of metrics on client side to Moz servers once per day
** for effective retention data and longitudinal study we need a cumulative view (over time)
** initial proposal a UUID associated with installations profile, submitted each time so it can be merged with past data
** the data set is opt-out vs opt-in to avoid self selection bias
* changes
** UUID removed and replaced with a document identifier, generated per request (per profile)
** data accumulated client side vs. server side
** sent with new ID and previous ID, which allows us to remove the older documents with the old ID
|SecReview alt solutions=* UUID vs. Document ID (above)
* blocklist ping - provides ADI, current metrics system, lots of attributes, only point in time, no time analysis; owners don't want other data collection on top, no retention analysis
* telemetry - default opt-out nightly/aurora, but opt-in on others, focused on performance data; not designed for time analysis, or retention
* Test Pilot - double opt in, large self selection bias, skewed towards power user or early adopter not typical user
* opt in vs. opt out - based on research bias on self selection
* funnelcake - designed for adoption/retention, blocklist ping was the last part, but we were losing this data
* an actual representative "sample" rather than the full population
** problem of keeping the sample stable and representative over time
|SecReview solution chosen=* need for longitudinal analysis & retention analysis
* we can look at them and see if there were problems if data stops coming
|SecReview threats considered=* UUID could be used if disclosed to find information about the user from the server system
** this would persist across a backup, thus changed
* Server side:
** public unauthenticated system, write only or request to delete
|SecReview threat brainstorming=* Obvious Privacy stuff
* why does the system have retrieval?
** current system with document identifiers does not, maybe in a future version to allow client to get aggregate info so a user can compare things themselves
** so user can see the data and remove if they want
*  Does the  about:metrics / user data retrieval feature have to go out at  the same time as the metrics collection on our servers?  
* What are the compliance issues mentioned on the wiki in regards to the data retrieval?
** EU / Ger: privacy compliance regulations, even data about the functioning of the product without a user facing feature to support it
* Where is the uuid/document identifier stored?  Do webpages have access to UUID/Document ID?
** Stored as a preference in about:config - accessible to "chrome" code, not regular web pages.  Hence website fingerprinting not an issue.
** a user could mess up the data by fiddling with about:config, could cause bogus data
* How often will data be sent?  
** not more than once per 24 hours
* What API is used?  
** simple post request to data collection system, same as telemetry.(data.mozilla.com)
* Are there signatures on the request/responses?  Is it over ssl?
** yes over SSL, signature does not matter
* is the certificate checked or basic SSL auth?
** basic SSL Auth.  Perhaps we could extend this.
*  When the server receives a new Document ID, it deletes the previous ID and data  associated with it.  Do we no longer need that data, or do we just  delete the previous ID and retain the data?
** each submission is a cumulative view from the client, there is only one doc at any time that represents that installation
** allows for expiration of documents
* What's the risk of other add-ons grabbing and using the Document ID as a unique identifier, much as iOS apps have been caught doing?
** document ID changes every day, so not likely useful to other chrome privileged processes unless they check all the time
*** but the add-on can just grab the document ID every day and chain them.
**** if they had chrome privileges, they could just create a uuid themselves and use it anyway.
* how random are the document IDs
** uses UUID mechanism, same as crash stats

SecReviewActionStatus
SecReview action item status=In Progress
Feature version=Firefox 12
SecReview action items= code review (about:metrics)
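
The document-identifier scheme summarized in the review notes above (a fresh ID per submission, sent together with the previous ID so the server can drop the older document) could be sketched roughly like this. All names are hypothetical, and the ID generator is passed in by the caller; the notes only say that the real IDs use the UUID mechanism.

```javascript
// Hypothetical sketch of the per-submission document ID rotation.

// Server side: at most one cumulative document per installation.
const documents = new Map();

function handleSubmission({ newId, previousId, payload }) {
  if (previousId) documents.delete(previousId); // drop the superseded document
  documents.set(newId, payload);
}

// Client side: state.lastId is the ID used for the previous submission.
function buildSubmission(state, payload, generateId) {
  const submission = {
    newId: generateId(), // assumed to return a fresh random ID (e.g. a UUID)
    previousId: state.lastId,
    payload, // cumulative view, accumulated on the client
  };
  state.lastId = submission.newId;
  return submission;
}
```

Since every payload is cumulative, deleting the previous document loses no data, and documents whose ID is never superseded can simply expire.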
Another idea that came up during the security review is to take a much smaller sample, a random (!) subset of 10000 users. I.e. you need data from about 10,000, you have roughly 200 million users, so on startup, you check whether the "participate" pref is user-set or not. If not, you use var participate = Math.random() * 20000 < 1 and write that to the pref, so you get 1 out of 20000 users, and you keep that person permanently for as long as the study goes. You should get roughly 10000 users submitting, and they are guaranteed to be a random and statistically representative sample.

It makes a huge difference whether you collect data from 200,000,000 people or just 10,000.
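
A small sketch of the sampling idea, assuming a simple key/value pref store. The pref name `metrics.participate` and the `Map`-based store are placeholders, not Firefox's real preference API. The draw succeeds with probability 1 in 20000, and the result is written back so the decision is made only once per profile.

```javascript
// Hypothetical sketch; "prefs" is any Map-like store, not the real pref service.
function decideParticipation(prefs, random = Math.random, oneIn = 20000) {
  // An existing value (set by the user or by an earlier draw) always wins.
  if (prefs.has("metrics.participate")) {
    return prefs.get("metrics.participate");
  }
  // random() is in [0, 1), so this is true with probability 1/oneIn.
  const participate = random() * oneIn < 1;
  prefs.set("metrics.participate", participate);
  return participate;
}
```

With roughly 200 million profiles, a 1-in-20000 draw gives on the order of 10,000 participants.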
Daniel, please leave my proposal on the page. Normally, I would respect your ownership of the page, but in this particular case this is so critical that I think it must be visible to everybody. It's intentionally at the bottom and clearly marked "Proposal by Ben Bucksch", to avoid any confusion with what is your own plan.

UUID:

You reject my proposal on the grounds that fingerprinting would still be possible. I had intentionally differentiated the two problems in my proposal: to remove the UUID and to counter the risk of fingerprinting. Even if fingerprinting is a risk, a UUID would still make it a 100% definite problem. So, I think the "Avoiding UUID" part is still viable, even if you think it's pointless.

Fingerprinting:

As for fingerprinting, the goal should be to get the data to such a level that fingerprinting wouldn't allow to identify most individual users.

For example, the current proposal sends the installation date of each addon. This information alone would highly likely lead to a unique fingerprint. However, I claim that this information might be useful in some cases, but is not absolutely critical to answer the question of "Why do people leave Firefox?". The list of addons, while sensitive, is highly useful, and I had a bad feeling, but I can see the high value, and therefore I don't object there, but the install date of each addon shouldn't be relevant.
Also note that if you separate data points as I had suggested, you would clearly see relations between installation of a certain addon and a user leaving Firefox. You would see that many users who installed Skype plugin left Firefox, but others had a much better retention, so in fact separating data points *is* the analysis. Again, yes, there may be cases where you would need certain information, but if that information is highly problematic and only useful in some cases, it must be dropped. I claim the installation date of each addon is such a data point. Similarly, the installation time of Firefox down to second level is irrelevant, but rather day granularity is sufficient.

In other words, there are many ways to mitigate fingerprinting. That's why I think my proposal is viable.

So, I believe there is a way to get you the data you need to answer the most pressing questions - even if not *all* questions that are interesting for you -, while avoiding fingerprinting to the largest extent.

Now, I know your job is data collection and you want as much data as possible, but the users have a law-given *right* to their privacy, and we have to find a middle ground. I don't think your proposal of specifically tracking individual users over time represents such a middle ground.
(In reply to Ben Bucksch (:BenB) from comment #33)
> Daniel, please leave my proposal on the page. Normally, I would respect your
> ownership of the page, but in this particular case this is so critical that
> I think it must be visible to everybody. It's intentionally at the bottom
> and clearly marked "Proposal by Ben Bucksch", to avoid any confusion with
> what is your own plan.

I don't actually care that much about the "ownership" of the page.  The thing that I object to is putting discussion, opinions, and technical information regarding alternatives on a page that is supposed to be about the status and technical information of the project as it exists right now.  This is exactly the sort of thing that the *discussion* portion of every page is for. It is in fact visible to everyone with a clear discussion link at the top as well as call-outs to various sections of the discussion relevant to specific topics.  This is the way it works on other project pages.  Filling the project page with discussion and technical details about alternatives just makes it harder for anyone coming in to understand what the current state is.

> 
> UUID:
> 
> You reject my proposal on the grounds that fingerprinting would still be
> possible. I had intentionally differentiated the two problems in my
> proposal: to remove the UUID and to counter the risk of fingerprinting. Even
> if fingerprinting is a risk, a UUID would still make it a 100% definite
> problem. So, I think the "Avoiding UUID" part is still viable, even if you
> think it's pointless.

Please put in the discussion page which parts of my reasoning you disagree with. My argument is that having several hundred bytes of previous and current values is equivalent to having generated document IDs for the previous and current values.  I tried to lay it out with a concrete example rather than just stating that I think it is pointless.

> Fingerprinting:
> 
> As for fingerprinting, the goal should be to get the data to such a level
> that fingerprinting wouldn't allow to identify most individual users.

We could review in the discussion page how minimal a set of data would have to be before it would be reasonable to believe it wouldn't have a fingerprint.  I believe that even the 16 data points I listed in my example would be enough.  That is not even considering the trust that we not use IP and user agent string information as well.

> For example, the current proposal sends the installation date of each addon.
> This information alone would highly likely lead to a unique fingerprint.
> However, I claim that this information might be useful in some cases, but is
> not absolutely critical to answer the question of "Why do people leave
> Firefox?". The list of addons, while sensitive, is highly useful, and I had
> a bad feeling, but I can see the high value, and therefore I don't object
> there, but the install date of each addon shouldn't be relevant.
> Also note that if you separate data points as I had suggested, you would
> clearly see relations between installation of a certain addon and a user
> leaving Firefox. You would see that many users who installed Skype plugin
> left Firefox, but others had a much better retention, so in fact separating
> data points *is* the analysis. Again, yes, there may be cases where you
> would need certain information, but if that information is highly
> problematic and only useful in some cases, it must be dropped. I claim the
> installation date of each addon is such a data point. Similarly, the

Your alternative proposal does not speak much to add-ons at all.  If you care to put something in the discussion page, I would be happy to provide several reasons that having the installation and update dates of add-ons is critical to being able to attribute performance, stability, and retention results as correlated to a specific add-on (or set of add-ons).

> installation time of Firefox down to second level is irrelevant, but rather
> day granularity is sufficient.

My team has already discussed dropping the resolution of all timestamps down to the date level.  The only value in having any time portion for things like install or update times would be to include the hour so that a user could have the correct date for their time zone when doing local analysis.  It could be confusing to show the user that they updated on Tuesday (UTC) when it was actually Wednesday for them.  That is a minor matter though.  

The only reason it hasn't been changed yet is because we are still waiting for general code review to figure out what other parts of the code might need to be changed.
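The trade-off between date-level and hour-level resolution described above can be sketched as follows. This is an illustration only, assuming timestamps are Unix seconds and the user's zone has a fixed UTC offset; the function names and the example update time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def utc_day(ts):
    """Truncate a Unix timestamp (seconds) to a UTC calendar date string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()

def utc_hour(ts):
    """Truncate a Unix timestamp (seconds) down to the start of its UTC hour."""
    return int(ts) - int(ts) % 3600

def local_day(ts, utc_offset_hours):
    """Calendar date of a Unix timestamp in a zone with a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz=tz).date().isoformat()

# Hypothetical update time: Tuesday 2012-02-07 23:30 UTC.
update_ts = datetime(2012, 2, 7, 23, 30, tzinfo=timezone.utc).timestamp()

# Date-only resolution reports Tuesday in UTC.
print(utc_day(update_ts))                  # 2012-02-07
# Keeping the hour lets a client at UTC+10 recover its local Wednesday.
print(local_day(utc_hour(update_ts), 10))  # 2012-02-08
```

Hour truncation only recovers the local date reliably for whole-hour offsets; zones with half-hour offsets could still land on the wrong day near local midnight.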

> In other words, there are many ways to mitigate fingerprinting. That's why I
> think my proposal is viable.
> 
> So, I believe there is a way to get you the data you need to answer the most
> pressing questions - even if not *all* questions that are interesting for
> you -, while avoiding fingerprinting to the largest extent.

> Now, I know your job is data collection and you want as much data as
> possible, but the users have a law-given *right* to their privacy, and we
> have to find a middle ground. I don't think your proposal of specifically
> tracking individual users over time represents such a middle ground.

Neither I nor the Metrics team nor Mozilla Corporation in general "want as much data as possible".  I find that statement offensive since it should be clear to anyone that there are hundreds of data points that could potentially be very interesting and could potentially be very helpful in improving our understanding of the browser, but we have focused on the bare minimum that we have decided are most likely to be critical to our current analysis needs.

When we went to the User Data Council with our proposal, we focused on data points that are already visible in other areas such as Blocklist, Services.AMO, AUS, Telemetry, Test Pilot and Crash Stats, but were not able to answer the questions we must answer because they do not support longitudinal and retention analysis in those other forms.  We committed to keeping this list minimal, to going through the UDC cycle for any future metrics we might request, and to going through periodic UDC cycles to evaluate the potential removal of any data points that have not turned out to be useful for analysis.


Finally, one other thing I would like to see in the discussion wiki page is what potential abuses could be made of the data as we have committed in our proposal to storing it.  I stated there that I believe there must be a level of trust and expectation that we will do what we say we will do with the data, and not attempt to deceive the user and attempt to store IP address or personal information.

Looking at the proposed data set with a document ID, if Mozilla or even a party with the ability to request or steal a snapshot of that data were to examine it with the most dubious of intent, what would they possibly be able to extract?  If there are specific concerns there, then it would be well worth our time to look at either mitigating those concerns or deciding if we needed to give up those specific data points.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #34)
> When we went to the User Data Council with our proposal, we focused on data
> points that are already visible in other areas such as Blocklist,
> Services.AMO, AUS, Telemetry, Test Pilot and Crash Stats, but were not able
> to answer the questions we must answer because they do not support
> longitudinal and retention analysis in those other forms.

I don't think we're at a point where we can accept that having super-accurate answers to those questions is a must-have. It seems like a super-nice-to-have to me -- software can be successful without it. When talking about a core principle such as privacy, it needs to be clear that one possible outcome is that we won't collect that data. It doesn't seem right to rule that out from the beginning.

> We committed to
> keeping this list minimal, to going through the UDC cycle for any future
> metrics we might request, and to going through periodic UDC cycles to
> evaluate the potential removal of any data points that have not turned out
> to be useful for analysis.

... and additions of new data, I assume. ;) What's minimal depends on the usefulness of the collected data as a whole. If it turns out to be not very useful, a minimal useful list would need to be longer. So, I'm curious, what makes us believe the current list is going to be sufficient? Wouldn't it make sense to assume that people leave Firefox for a large variety of reasons, most of which would be more interesting to know than, say, the fact that Firefox crashes frequently? We're already tracking crashes and fixing them as fast as we can (I hope).
Daniel, your argument continues to be that "We are collecting so much data that fingerprinting is trivial and clearly possible, so avoiding a UUID is pointless, so we can use a UUID alright". My point is that these are 2 separate issues, and that fingerprinting should be reduced to a level where it is no longer unique. You can't just wave that away.

> anyone coming in to understand what the current state is.

Exactly. Anyone coming in must see that this is highly controversial. Otherwise, just the project page alone would make such a bad impression of the project on others that it may do damage to the whole project. Even more so when shipped. Apparently you're not seeing that, but this is a ticking PR bomb.

> we have focused on the bare minimum

Not from what I see, see my last comment.

> Telemetry, Test Pilot and Crash Stats ... were not able
> to answer ... longitudinal and retention analysis

Yes, but my proposal can do that.

Apart from that, see Dao. "I want/need it" is not an argument.
> if Mozilla or even a party with the ability to request or steal a snapshot of that
> data were to examine it with the most dubious of intent, what would they possibly
> be able to extract?  If there are specific concerns there

About tracking:
https://wiki.mozilla.org/MetricsDataPing#Impact_for_user
It was there all the time, but I've just extended it a bit.

(And, FWIW, I don't care what Google or Facebook does - from a German perspective, they are illegal. That is the official position of the government offices charged with these matters. And Facebook, and Google earlier, took a major beating in the press over it in recent months.)

Also, knowing the exact minor version of certain software means knowing where exactly you are vulnerable. That may be valuable data to protect users, but there's also clear potential for abuse (you were asking for attack scenarios).
(In reply to Ben Bucksch (:BenB) from comment #37)

> (And, FWIW, I don't care what Google or Facebook does - from a German
> perspective, they are illegal. That is the official position of the
> government offices charged with these matters.

Ben, if you're going to continue to make this claim, can you provide some citations? The specific regulations and any cases demonstrating actual violations and resulting enforcement activity would be a good start. An actual legal expert you know who is willing to step in and provide actual legal analysis would be even better.
I don't think it would be easy to encapsulate the entire history leading up to this project, but I will say that it definitely isn't some quick hack we threw together to try to answer some occasional problems.  We have been struggling to answer these questions for four years, mostly using the variants of the data points found in the other services I mentioned above and even with the evolution of data sources and the addition of new ones, we have not answered these questions that the organization feels are must-have.

I'll add a few more comments to the discussion page in various places and then call it a weekend.  I am beat.
(In reply to Ben Bucksch (:BenB) from comment #6)

> FWIW, Google Chrome used to have unique ID, but then they got major pushback
> for it, esp. in Germany, and they *removed* it because people were so upset
> about it. Unique IDs are red flags even for the press, and Germany is one of
> Firefox's biggest user bases and highly privacy-sensitive, so this is likely
> to cause serious backlash and to end up being a PR nightmare. Apart from
> being just wrong and possibly even violating German law.

Chrome doesn't need a GUID because Google associates a unique identifier with users when they search. The overwhelming majority of Chrome and Firefox users use Google to search. Ergo Google has better usage data on Firefox than Mozilla has itself.

Imagine if Ford couldn't record its car sales or safety data, but GM got a detailed daily report on both -- not just for their cars, but for Ford's cars as well. I expect we'd see a lot more GM owners on the road.

Check it before you wreck it (http://www.google.com/intl/en/policies/privacy/faq/).
> > (And, FWIW, I don't care what Google or Facebook does - from a German
> > perspective, they are illegal. That is the official position of the
> > government offices charged with these matters.

> Ben, if you're going to continue to make this claim, can you provide some citations.

This has been going through the press in Germany for months. Search for "Thilo Weichert Facebook"
* Decree from privacy commissioner to have Facebook fan pages of German organizations closed http://www.sueddeutsche.de/digital/profilerstellung-von-nicht-mitgliedern-datenschuetzer-will-facebook-fanseiten-schliessen-lassen-1.1132874
* Viviane Reding offended by new Google privacy policy and Facebook Timeline being mandatory http://www.sueddeutsche.de/digital/datenschutz-bei-google-und-facebook-sie-machen-was-sie-wollen-1.1272375
I picked sueddeutsche, because it's the biggest German newspaper. heise.de is the biggest German IT site and has even more critical news items. There are hundreds of stories like that in the last month.
"months", sorry.

Blake Cutler wrote:
> Chrome doesn't need a GUID

That's beside the point. The point was that this was considered *the* major drawback of Google Chrome, before they removed it, and hindered its spread in Europe. You will see that Google Chrome is a lot more popular in the US. The reason is of philosophical nature.
(In reply to Ben Bucksch (:BenB) from comment #41)
> > > (And, FWIW, I don't care what Google or Facebook does - from a German
> > > perspective, they are illegal. That is the official position of the
> > > government offices charged with these matters.
> 
> > Ben, if you're going to continue to make this claim, can you provide some citations.
> 
> This is going through the press in Germany since months. Search for "Thilo
> Weichert Facebook"
> * Decree from privacy commissioner to have Facebook fan pages of German
> organizations closed
> http://www.sueddeutsche.de/digital/profilerstellung-von-nicht-mitgliedern-
> datenschuetzer-will-facebook-fanseiten-schliessen-lassen-1.1132874
> * Viviane Reding offended by new Google privacy policy and Facebook Timeline
> being mandatory
> http://www.sueddeutsche.de/digital/datenschutz-bei-google-und-facebook-sie-
> machen-was-sie-wollen-1.1272375
> I picked sueddeutsche, because it's the biggest German newspaper. heise.de
> is the biggest German IT site and has even more critical news items. There
> are hundreds of stories like that in the last month.


Not one of these articles cites laws or cases with court outcomes. Nor are they even talking about the same kind of activity we're talking about. You keep saying that this is against the law. What law? What activities have been successfully prosecuted under that law? Either cite some law and cases, or stop making the claim that what we're planning here is illegal. Pointing to opinion pieces in the press about completely different activities is not citing laws or court cases. Perhaps, since you seem to not know what laws or cases apply, you can find a German lawyer or someone at least familiar with German privacy laws and ask them to join the conversation here.
Asa, I said "That is the official position of the government offices charged with these matters.". You asked for citations. These *are* citations from government officials on "what Google or Facebook does".

> What law?

Bundesdatenschutzgesetz

Specifically:
http://www.gesetze-im-internet.de/bdsg_1990/__3a.html
"The collection, processing and usage of data related to persons and the selection and design of information system are to be chosen on the goal of collecting as little person-related data as possible. …"

> What activities have been successfully prosecuted under that law?

Again, that is US perspective. We have a code law system.

But the Supreme Court has consistently upheld this, in fact extended it to a constitutional right.

Again, what I try to say is that obeying the law is the absolute minimum barrier, and this is either just barely scratching it or outright crashing it. We shouldn't be on the same side as Google, but on the opposite side. I don't care what Google does, because it *upsets* everybody here.

---

We need to be *different*, totally, 100% different. So far, we are *already* perceived as very different. As I said in comment 42, this is why we have so many users in Europe. It's for ideological reasons. The telemetry and this bug removes that ideological reason and will result in a loss of market share. I can certify you that - how many German newspapers do *you* read?
I should add: the term "person-related data" ("personenbezogene Daten") is roughly the same as the English term "Personally Identifiable Information" (PII), of which I had posted the definition in comment 1. I.e. a UUID for a browser profile would be, the IP address is considered by officials as covered, the mere crash count alone would not be.
(In reply to Ben Bucksch (:BenB) from comment #44)
> I don't care what Google does, because it *upsets* everybody here.

Maybe not everybody: http://chrome.blogspot.com/2012/02/german-federal-office-of-information.html
What is the German context of that announcement rather than the Google PR statement?

> We need to be *different*, totally, 100% different. So far, we are *already*
> perceived as very different. As I said in comment 42, this is why we have so
> many users in Europe. It's for ideological reasons. The telemetry and this
> bug removes that ideological reason and will result in a loss of market
> share. I can certify you that - how many German newspapers do *you* read?

I certainly agree that we need to be different.  I believe that providing the user with the ability to see and use and control their data rather than just taking it and using it behind a curtain is different.  I understand that you don't believe the same.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #46)
> I certainly agree that we need to be different.  I believe that providing
> the user with the ability to see and use and control their data rather than
> just taking it and using it behind a curtain is different.  I understand
> that you don't believe the same.

It makes some difference. I don't think it's game-changing. There are masses of users who hardly ever open the options dialog. They aren't going to make informed decisions. We decide for them pretty much like Google does for its users.
(In reply to Ben Bucksch (:BenB) from comment #42)

> That's beside the point. The point was that this was considered *the* major
> drawback of Google Chrome, before they removed it, and hindered its spread
> in Europe. 

Would you mind elaborating on this? I don't understand why server-side GUIDs are okay but client-side GUIDs are not. Do users understand the difference between the two? Should they understand the difference between the two? 

Google's statisticians lost almost nothing by removing the GUID from Chrome. 

I hope Mozilla doesn't get caught up fighting battles of perception. Without better data, Firefox's quality will continue to decline with respect to its competitors. 

Mozilla can't advocate for users if it doesn't have users. I want to be involved in a project that can change the world -- not a project that allows philosophical battles to cripple its product. The road to irrelevance is paved with good intentions.
(In reply to Ben Bucksch (:BenB) from comment #28)
> Daniel, I've been thinking about it and I think I have the solution to the
> problem you have stated, in an anonymous fashion, and I posted it at
> https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative

This is an interesting idea, but it would severely restrict Mozilla's ability to draw causal inferences. We need to ask "why," not just "what."
(In reply to Blake Cutler from comment #48)
> (In reply to Ben Bucksch (:BenB) from comment #42)
> > That's beside the point. The point was that this was considered *the* major
> > drawback of Google Chrome, before they removed it, and hindered its spread
> > in Europe. 
> 
> Would you mind elaborating on this? I don't understand why server-side GUIDs
> are okay but client-side GUIDs are not. Do users understand the difference
> between the two? Should they understand the difference between the two? 

Ben is pointing out that GUIDs are perceived as a privacy problem. Did Google get away with a server-side GUID? Maybe. Good for them, but not Ben's point. Ben didn't cite Google as a good example. We should implement neither client- nor server-side GUIDs (which means to not send a fingerprint).

> Mozilla can't advocate for users if it doesn't have users. I want to be
> involved in a project that can change the world -- not a project that allows
> philosophical battles to cripple its product. The road to irrelevance is
> paved with good intentions.

Google or Facebook could easily state the same. Philosophical grounds make Mozilla different. The idea that this must be lethal doesn't seem compatible with how I understand Mozilla.

Ironically, the Metrics Data Ping relies a lot on good intentions and as you said, all kinds of bad things are paved with good intentions. This is why technology matters as well as philosophy. What the client actually sends and what could be done with that data on the server side (a black box from the user's POV) is crucial.

(In reply to Blake Cutler from comment #49)
> (In reply to Ben Bucksch (:BenB) from comment #28)
> > Daniel, I've been thinking about it and I think I have the solution to the
> > problem you have stated, in an anonymous fashion, and I posted it at
> > https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative
> 
> This is an interesting idea, but it would severely restrict Mozilla's
> ability to draw causal inferences. We need to ask "why," not just "what."

I'm not sure how you're drawing the line between why and what here. The key difference seems to be that you'd get accumulated statistics directly rather than building them from fine-grained causal data. You need to be open to such restrictions.
(In reply to Dão Gottwald [:dao] from comment #50)
> We should implement neither
> client- nor server-side GUIDs (which means to not send a fingerprint).

(Let's leave the legal issue to a separate - but of course important - discussion.)

Can you explain the issue you see from an actual (not perceived) privacy point of view? My take is that issues arise when either (a) confidential data is revealed directly or (b) unexpected correlations between datasets cause confidential data to be revealed.

On (a), I believe the metrics ping payload has been pared down to a fairly minimal dataset that is not, on its own, problematic.

On (b), the key point to keep in mind is that we are proposing to use a GUID generated purely for the purposes of metrics, never exposed to anything else, in particular not web content.

It's worth being very careful about (b), in particular because future decisions could, if we're not careful, allow correlations against this GUID. I propose that, at the very least, we make a hard rule that the metrics ping GUID cannot be reused anywhere else.

But right now, as best as I can tell, the GUID is not revealing any information because nothing else is tied to it. What you see in the metrics ping is what you get.

I'll add one important thing: even if we solve (a) and (b) fully, that would not be sufficient. With the help of the Privacy team at Mozilla, we've agreed on Data Safety principles that begin with user benefit. Sharing data, even data that would not constitute a privacy problem, is something we should consider doing only if there is a user benefit. We believe there is a significant user benefit here: laying the groundwork to let you discover *exactly* why your browser performs well or poorly, e.g. which add-ons are causing you slowdowns, crashes, etc.
(In reply to Ben Adida [:benadida] from comment #51)
> (In reply to Dão Gottwald [:dao] from comment #50)
> > We should implement neither
> > client- nor server-side GUIDs (which means to not send a fingerprint).
> 
> (Let's leave the legal issue to a separate - but of course important -
> discussion.)

I actually don't think it's overly important.... We shouldn't aim for something barely legal. Mozilla should have a higher standard.

> Can you explain the issue you see from an actual (not perceived) privacy
> point of view? My take is that issues arise when either (a) confidential
> data is revealed directly or (b) unexpected correlations between datasets
> causes confidential data to be revealed.
> 
> On (a), I believe the metrics ping payload has been pared down to a fairly
> minimal dataset that is not, on its own, problematic.

I'd consider add-ons problematic, partly because the IDs alone can let you track down a person, partly because the use of some add-ons could be illegal in some countries. I also second Ben's view that IP addresses + GUIDs need to be considered personally identifiable information. You say you don't store IP addresses, but this just brings us back to good intentions vs. systems that inherently protect privacy by just not sending out problematic data.

Besides, see the second part of comment 35.

> On (b), the key point to keep in mind is that we are proposing to use a GUID
> generated purely for the purposes of metrics, never exposed to anything
> else, in particular not web content.

I'm not concerned about web content. I am concerned about laws forcing Mozilla to reveal data when asked for.

> We believe there is a
> significant user benefit here: laying the groundwork to let you discover
> *exactly* why your browser performs well or poorly, e.g. which add-ons are
> causing you slowdowns, crashes, etc.

The client has the list of installed add-ons, knows about crashes and could be told what to consider "slow". Providing it with a list of add-ons that generally tend to be problematic would probably cover 99.9+%. It's unclear why this requires fine-grained data from hundreds of millions of users.
benadida wrote:
> Can you explain the issue you see from an actual (not perceived) privacy point of view?

https://wiki.mozilla.org/MetricsDataPing#Impact_for_user

> the metrics ping payload has been pared down to a fairly minimal dataset

No, the current data set is far from "minimal", see comment 33.

Blake Cutler wrote:
> Without better data [... bla ...] doesn't have users

Please don't portray this as "either we do it exactly as we have proposed it, or Mozilla will die". That's just not true.

I have proposed an alternative that gives you the data you need urgently.
Daniel had promised to use it, see comment 25 - 27. Daniel challenged me, I provided a solution that covers both interests fairly (not perfectly). It's a fair compromise, *please* use it.

> This is an interesting idea, but it would [not be able to do ABC]

Thanks. As I said, yes, you can't have everything you wish, but that's the nature of a compromise. I am not happy that this project exists at all either, but I understand the need. I think what I propose would give *enough* (!) data to know why people stop using Firefox, or how severe the problems are that people have and how many people are affected.

You can even still correlate crashes or startup time with any given addon.
(In reply to Dão Gottwald [:dao] from comment #52)
> (In reply to Ben Adida [:benadida] from comment #51)
> > On (a), I believe the metrics ping payload has been pared down to a fairly
> > minimal dataset that is not, on its own, problematic.
> 
> I'd consider add-ons problematic, partly besides the IDs alone can let you
> track down a person, partly because the use of some add-ons could be illegal
> in some countries. I also second Ben's view that IP addresses + GUIDs need
> to be considered personally identifiable information. You say you don't
> store IP addresses, but this just brings us back to good intentions vs.
> systems that inherently protect privacy by just not sending out problematic
> data.

Based on your feedback, we removed persona and theme IDs from the list of data submitted.  We also implemented the honoring of the setting that an add-on developer can put into the manifest to prevent submitting the add-on ID to Mozilla services.  That preference was originally set up as part of the services.addons.mozilla.org features that support the Add-on manager.


> > We believe there is a
> > significant user benefit here: laying the groundwork to let you discover
> > *exactly* why your browser performs well or poorly, e.g. which add-ons are
> > causing you slowdowns, crashes, etc.
> 
> The client has the list of installed add-ons, knows about crashes and could
> be told what to consider "slow". Providing it with a list of add-ons that
> generally tend to be problematic would probably cover 99.9+%. It's unclear
> why this requires fain-grained data from hundreds of millions of users.

That presumes that we can know with accuracy what add-ons tend to be problematic for most of our users.  If we don't collect data from the general usage base, the best we could ever hope to know is what AMO hosted add-ons cause problems on our own specific test machines and what add-ons people have told us cause problems for them.  In the latter case, we can't even reliably know what specific characteristics of their installation might be causal factors, and we also have no idea how many users might have no problem at all with said add-on.
(In reply to Ben Bucksch (:BenB) from comment #53)
> benadida wrote:
> > Can you explain the issue you see from an actual (not perceived) privacy point of view?
> 
> https://wiki.mozilla.org/MetricsDataPing#Impact_for_user
> 
> > the metrics ping payload has been pared down to a fairly minimal dataset
> 
> No, the current data set is far from "minimal", see comment 33.

It is the minimal set of data that we feel is vital to the analysis.


> Blake Cutler wrote:
> > Without better data [... bla ...] doesn't have users
> 
> Please don't portray this as "either we do it exactly as we have proposed
> it, or Mozilla will die". That's just not true.

We have already made changes where we agreed that things could be improved.

> I have proposed an alternative that gives you the data you need urgently.
> Daniel had promised to use it, see comment 25 - 27. Daniel challenged me, I
> provided a solution that covers both interests fairly (not perfectly). It's
> a fair compromise, *please* use it.

I did not promise to use your alternative.  I promised to consider any alternative and promote the adoption of one that could fill the requirements that we have listed for analysis.

I replied to your proposal in the discussion page of the wiki with several issues that make it unsuitable, not the least of which is that you state that for your alternative to be viable, it must contain no data that could possibly be used as a fingerprint.


> > This is an interesting idea, but it would [not be able to do ABC]
> 
> Thanks. As I said, yes, you can't have everything you wish, but that's the
> nature of a compromise. I am not happy that this project exists at all
> either, but I understand the need. I think what I propose would give
> *enough* (!) data to know why people stop using Firefox, or how severe the
> problems are that people have and how many people are affected.
> 
> You can even still correlate crashes or startup time with any given addon.

Please put information on the discussion page of the wiki that counters my statements that just the existence of an add-on is not valid to make a positive correlation with stability or performance.
(In reply to Dão Gottwald [:dao] from comment #50)
> I'm not sure how you're drawing the line between why and what here. The key
> difference seems to be that you'd get accumulated statistics directly rather
> than building them from fine-grained causal data. You need to be open to
> such restrictions.

Thanks for the comment Dão. This is an incredibly important distinction.

The problem is that Ben's alternative data collection mechanism isn't a compromise. It would cause Mozilla to lose 90% of the value in the data it wants to collect. Why? The short answer is that correlation is not causation. Analyzing aggregate stats and correlations won't tell us much.

Mozilla will make better product decisions if it builds statistical models for user growth, retention, stability, customer satisfaction, and feature adoption. Mozilla can't build these models without installation-level data. "How" data is collected is far more important than "how much" data is collected. 

I'm sorry for not doing a better job of explaining this. I honestly believe if we all had the same set of knowledge, this thread would not be contentious. Our aims are the same.
(In reply to Ben Bucksch (:BenB) from comment #53)
> Blake Cutler wrote:
> > Without better data [... bla ...] doesn't have users
> 
> Please don't portray this as "either we do it exactly as we have proposed
> it, or Mozilla will die". That's just not true.

Hi Ben, a couple of points I'd like to clarify.

1) I'm a community member, not a Mozilla employee.

2) I didn't say Mozilla is going to die. I implied it's headed toward irrelevance. Let's look at the numbers:
* Webkit's market share is already 10 points higher than Gecko's.
* Gecko is losing 0.5% market share per month and has no meaningful presence on mobile devices.
* Webkit is gaining over 1% market share per month and dominates mobile browsing.
* Mobile browsing is rapidly overtaking desktop browsing (gaining nearly 1% share per month)

I hate to be the bearer of bad news, but let's be realistic. If you have evidence to support your claim, please share.
I have started a thread in mozilla.dev.planning about some of the issues here. I don't think that bugzilla is a good forum for discussing large issues such as this. Please continue the discussion in the group.
> The problem is that Ben's alternative data collection mechanism isn't a
> compromise. It would cause Mozilla to lose 90% of the value in the data wants
> to collect.

Sorry, but I must object here. I built the proposal *exactly* after the requirement that Daniel gave me. See comment 25. My proposal does fulfill the requirement set out there.

Of course you can keep coming up with new requirements until your solution is the only possible solution, but that's not helpful in finding a *solution* that takes care of the valid concerns I and others raised.

> 2) I didn't say Mozilla is going to die. I implied it's headed toward irrelevance.

You implied that your metrics project would stop that, and that it is the only way to stop it, exactly as you propose it. I claim neither is true.

E.g. the link you cited yourself recommends Chrome because Chrome has a better sandbox and therefore is inherently more secure. You don't need end-user metrics to find out why they prefer Chrome, they *told* you. And the reason they mention is nothing you would ever find in the data you gather. And all the users that take this recommendation as a reason to consider or to install Chrome also couldn't be measured by that data, the data would in fact mislead you.

I claim that Firefox is so incredibly popular in Europe (compared to US, and despite Chrome being technically better) because of mainly philosophical reasons, one of them being privacy. And you are destroying them with this project, running a serious risk of *actively decreasing* market share. I spoke with others from Germany and they agree with this perception.

So, please stop these claims of "either we do what we propose or else", because it's just not true. There *are* alternatives that give you what you need and don't have the serious privacy impact that your current proposal has. Instead of fighting me, please let's find a solution.

> I have started a thread in mozilla.dev.planning about some of the issues here.

That would be discussion place number 4: this bug, wiki page, wiki discussion page, mailing list. OK, fine.
It seems the sense of this bug report -- in its use of a UUID -- conflicts with the philosophy of bug #572650.  In any case, how is this a Major bug ("major loss of function") and not an RFE?
Mozilla certainly has to make sure to abide by the implementations of the http://en.wikipedia.org/wiki/Data_Protection_Directive, especially considering that the data is being exported from the EU, and considering that, since an IP address has already been considered PII (AG Berlin Mitte, ref. 5 C 314/06), a GUID is probably even likelier to be considered PII. In this case opt-in is also the only way to go (if only by asking for acceptance of certain terms at the first start of a Firefox version shipping with this).

You should also keep in mind the potentially harmful press this might create. I can already see news sources reviving the Netscape privacy fiasco (which was, BTW, used as an example in a lecture I attended on privacy law).
Interestingly enough, I also see nothing mentioned anywhere about encrypting the pings. What about the use case where you don't want to be identifiable to MITMs? Example: without encryption (with authentication, of course) of the GUID, it might be possible for people who run Tor exit nodes to easily group the traffic of their users, something which Tor users probably don't approve of.
Whiteboard: [secr:curtisk] → [secr:curtisk:in progress]
How is the Severity of this bug Major, which means "major loss of function"?  What intended functionality has been lost?

Should the Severity be Enhancement (RFE) instead of Major?
David: The severity field in Bugzilla isn't really used by anyone for anything of importance. I recommend you ignore it, like everyone else does.
Whiteboard: [secr:curtisk:in progress] → [sec-assigned:curtisk:in progress]
Flags: sec-review?(curtisk)
From Mozillians:

All,

We're always striving to make Firefox better. However, we've had limited visibility into the health of Firefox in the field, given varying hardware capabilities, the effect of add-ons on quality and performance, and other factors that can affect how Firefox performs. I wanted to explain a bit about Firefox Health Report, a feature we're working on that should drive significant improvements to Firefox, deliver an innovative self-diagnostic tool to our users and maintain our high standards for user choice and control.

As you know, we've been working for several months on a feature codenamed the Metrics Data Ping (MDP). We made our initial proposal for MDP in February. This kicked off a rich discussion about how to get the product insights we need while protecting the privacy of people who use Firefox. Achieving both of these aims took a lot of discussion in newsgroups, town halls, and with global privacy advocates and the broader industry. Mitchell posted to her blog today <https://blog.lizardwrangler.com/2012/09/21/firefox-health-report/> on how developments like MDP align with the Mozilla mission.

The result of this process is that we now have an approach that will meet our privacy principles, give us more visibility into the health of Firefox, and deliver direct benefits to our users. We're retiring the "MDP" codename, and will be calling this feature the Firefox Health Report. We're moving forward with the implementation of this feature in Firefox (i.e., the team will open bugs, write and review code, do security and privacy reviews, etc.), and we expect that it will soon land in Nightly builds.

In addition to improving Firefox for everybody, Firefox Health Report also enables each person to see how their instance of Firefox stacks up to the aggregate performance of other instances with comparable configurations. This will be a valuable troubleshooting and optimization tool for Firefox users, plus a powerful way for our users to contribute to the continued innovation that they've come to expect from Firefox. You can read more about the feature here <http://blog.mozilla.org/metrics/2012/09/21/firefox-health-report/> in a blog post from Gilbert FitzGerald, Sr. Director of Analytics at Mozilla.

This FAQ <http://blog.mozilla.org/metrics/fhr-faq/> may help answer any questions you have about FHR, how it works, the proposed data set associated with the feature and how we'll protect and use that data.

Please send any questions to fhr at mozilla.com. We will be talking about this more as the feature develops and will have an open town hall on FHR in the next couple of weeks.

How you can help:

- Email any thoughts you have to fhr at mozilla.com - the Mozilla metrics team, engineers and other people involved will address your questions directly

- We ask that you think carefully about any public posts you draft regarding FHR and if you want guidance, please email press at mozilla.com before posting

- As always, please direct any press inquiries about Firefox Health Report to press at mozilla.com

 
Thanks,

Jay Sullivan
Aside from better phrasing, the substance of what we discussed above is the same:
- It will collect data about usage and send it back to Mozilla.
- It will be opt-out, not opt-in - but with a user-visible notification (this is an improvement)
- It uses Daniel Einspanjer's "Document ID" idea, including the link between
  the old and new document, i.e. gives the server the ability to track individual users.
- My proposal on how to collect data anonymously was completely ignored. See comment 25 - 27.
This is an overview of the code in bug 718067.  It's compatible with Markdown syntax so you can generate a nicer-looking HTML version if you prefer.
Attachment #666026 - Flags: feedback?(benjamin)
Whiteboard: [sec-assigned:curtisk:in progress]
Severity: major → normal
Target Milestone: mozilla12 → ---
Version: 12 Branch → unspecified
Attachment #666026 - Flags: feedback?(benjamin) → feedback?(rnewman)
Keywords: meta
Hi Mark - I just added a privacy review bug for this (799552).  It's assigned to Tom Lowenthal on the privacy team, but we also need a point person from the product team for our review.  Would that be you?
Depends on: 799552
Why is that bug not accessible to the public?
Tom Lowenthal explained via private email that the bug being hidden is a side effect of how the "Privacy" Bugzilla component was originally set up (as part of the Legal product). The privacy team is working with the b.m.o admins to create a new Privacy Bugzilla product that doesn't have these hard-coded access restrictions - that is tracked in bug 764156.
Depends on: 802914
(In reply to Stacy Martin [:stacy] from comment #68)
> Hi Mark - I just added a privacy review bug for this (799552).  It's
> assigned to Tom Lowenthal on the privacy team, but we also need a point
> person from the product team for our review.  Would that be you?

I am happy to help with this any way I can.  Gregory Szorc (gps) is now the point person for the client code, so he should also be able to help with the review.  Can you add us both to bug 799552?
Depends on: 803655
Depends on: 803699
Depends on: 804745
I'd like to add that "anonymous ID" is a contradiction in itself. An ID is pseudonymous, not anonymous. Any system that profiles with an ID is by definition not anonymous.
I agree with Ben on comment #72. Laws may differ among countries, and so may the definition of what "anonymous" is supposed to mean, but any identifier that is unique (whether random but constant or a hash of some PII like an e-mail address) shouldn't be acceptable (and used) for the purpose of gathering such data. This is reflected in https://wiki.mozilla.org/MetricsDataPing#User_identification but it is hard for me to tell how far (if at all) it is considered in what's envisioned here.

People are getting more sensitive about "phoning home" and any user profiling performed by applications, even if it's just for gathering performance data.
The current plan is:

1) Client generates a random UUID, A
2) Client uploads data keyed under A
3) Wait a day
4) Client generates a random UUID, B
5) Client uploads data keyed under B and requests that A be deleted

If a client opts out of uploading, the client issues a delete request for the most recent UUID stored on the server.

If we remove IDs completely, clients do not have the ability to delete previously uploaded data, as they have no handle on the data to be deleted! Control over your own data is important.

If the server were to generate the ID, clients would need to record the ID to support deleting it. And, once you have access to the client, you can correlate with server data. I don't think this is any different from having the client generate the ID.
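The five steps above, plus the opt-out delete, can be sketched roughly as follows. This is only an illustration under assumptions: the `Server` and `Client` classes, their method names, and the in-memory dictionary are hypothetical stand-ins, not the actual FHR client or server code, and the one-day wait is elided.

```python
import uuid


class Server:
    """Toy in-memory stand-in for the upload server (hypothetical)."""

    def __init__(self):
        self.documents = {}

    def upload(self, doc_id, payload, delete_id=None):
        # Store the new document, then honor the optional delete request.
        self.documents[doc_id] = payload
        if delete_id is not None:
            self.documents.pop(delete_id, None)

    def delete(self, doc_id):
        self.documents.pop(doc_id, None)


class Client:
    """Toy client that generates a fresh document ID for every upload."""

    def __init__(self, server):
        self.server = server
        self.last_id = None  # most recent ID known to be on the server

    def daily_upload(self, payload):
        new_id = str(uuid.uuid4())  # steps 1 and 4: fresh random UUID
        # Steps 2 and 5: upload under the new ID and ask for the old one
        # to be deleted (no delete is requested on the very first upload).
        self.server.upload(new_id, payload, delete_id=self.last_id)
        self.last_id = new_id
        return new_id

    def opt_out(self):
        # Opting out deletes the most recent document held by the server.
        if self.last_id is not None:
            self.server.delete(self.last_id)
            self.last_id = None


server = Server()
client = Client(server)
id_a = client.daily_upload({"day": 1})
id_b = client.daily_upload({"day": 2})  # "wait a day" is elided here
```

In this sketch the only state the client keeps is the last ID it uploaded, which is exactly the handle it needs to issue a delete later.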
> 5) Client uploads data keyed under B and requests that A be deleted

As discussed before, this is practically a stable ID, because the server can make the connection.
See bug 802914 comment 5 and 6.
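To make the concern concrete, here is a small sketch of how a server could chain successive IDs if each upload arrives together with the ID to delete. It assumes the server logs `(new_id, deleted_id)` pairs; the `link_ids` function and the log format are hypothetical and say nothing about what Mozilla's servers actually record.

```python
def link_ids(request_log):
    """Group document IDs into chains using (new_id, deleted_id) pairs.

    request_log is a list of (new_id, deleted_id) tuples in arrival order;
    deleted_id is None for a client's first upload.
    """
    chain_of = {}  # document ID -> the chain (list) it belongs to
    chains = []
    for new_id, deleted_id in request_log:
        chain = chain_of.get(deleted_id)
        if chain is None:
            # No known predecessor: this looks like a new client.
            chain = []
            chains.append(chain)
        chain.append(new_id)
        chain_of[new_id] = chain
    return chains


# Two hypothetical clients: one uploads A, B, C and the other X, Y.
log = [("A", None), ("X", None), ("B", "A"), ("C", "B"), ("Y", "X")]
chains = link_ids(log)
```

Each resulting chain behaves like a stable identifier for one client, even though no single ID lives longer than a day.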
(In reply to Ben Bucksch (:BenB) from comment #75)
> > 5) Client uploads data keyed under B and requests that A be deleted
> 
> As discussed before, this is practically a stable ID, because the server can
> make the connection.
> See bug 802914 comment 5 and 6.

Mozilla won't do this.
OK, then please guarantee this with a client that doesn't allow these two things to be linked, not even by a time+IP correlation.
It would improve matters a lot.
Attachment #666026 - Flags: feedback?(rnewman)
Depends on: 801950
Component: General → Metrics and Firefox Health Report
Product: Toolkit → Mozilla Services
Depends on: 804491
Depends on: 807842
No longer depends on: 807842
Depends on: 807842
Depends on: 808109
Depends on: 808219
No longer depends on: 804491
No longer depends on: 802914
Depends on: 808635
Depends on: 809089
Depends on: 809094
Depends on: 809930
Depends on: 809954
Depends on: 810053
Assignee: mreid → gps
Depends on: 810132
Depends on: 811159
Attached patch Pref off. (obsolete) — Splinter Review
Initial landing pref-off.
Attachment #683405 - Flags: review?(gps)
Attached patch Build disable. (obsolete) — Splinter Review
Weird, that patch didn't apply. Fixxored.
Attachment #683405 - Attachment is obsolete: true
Attachment #683405 - Flags: review?(gps)
Attachment #683406 - Flags: review?(gps)
Comment on attachment 683406 [details] [diff] [review]
Build disable.

Review of attachment 683406 [details] [diff] [review]:
-----------------------------------------------------------------

::: b2g/confvars.sh
@@ +16,5 @@
>  MOZ_OFFICIAL_BRANDING_DIRECTORY=b2g/branding/official
>  # MOZ_APP_DISPLAYNAME is set by branding/configure.sh
>  
>  MOZ_SAFE_BROWSING=
>  MOZ_SERVICES_COMMON=1

IMO we should keep this enabled so the code (and tests!) becomes part of the build. The bit we should be disabling is the healthreport.serviceEnabled pref. That will short-circuit the XPCOM service. Alternatively, we could remove the service registration from the .manifest file for /nearly/ the same effect. The tests should pass in both instances since the tests use custom prefs branches.
Attachment #683406 - Attachment description: Pref off. → Build disable.
Attachment #683406 - Attachment filename: pref-off-fhr → disable-fhr
Attachment #683406 - Flags: review?(gps)
Attached patch Pref off.Splinter Review
Similar effect to build-disable, but pure pref change.
Attachment #683422 - Flags: review?(gps)
Attached patch Build disable.Splinter Review
… and don't even build for non-B2G.
Attachment #683406 - Attachment is obsolete: true
Attachment #683424 - Flags: review?(gps)
Comment on attachment 683422 [details] [diff] [review]
Pref off.

Review of attachment 683422 [details] [diff] [review]:
-----------------------------------------------------------------

This short-circuits the XPCOM service so it is a no-op upon start-up. This constitutes "land preffed off."
Attachment #683422 - Flags: review?(gps) → review+
Comment on attachment 683424 [details] [diff] [review]
Build disable.

Review of attachment 683424 [details] [diff] [review]:
-----------------------------------------------------------------

And this prevents the health report files from being part of the build and shipped on everything except B2G. We will back this out once FHR officially lands, but likely only on m-c. We should strive to revert this on m-c ASAP so we can get test coverage.
Attachment #683424 - Flags: review?(gps) → review+
Note about disabling: strictly speaking, xpcshell-tests should cause a problem here.

Sez our friendly local build meister:

21:52:54 <@gps> the manifest parser has a "strict" mode that among other things determines whether a missing include: file is fatal
21:53:03 <@gps> the xpcshell runner does not run in strict mode

… so with the feature disabled, tests won't bomb out. Eeexcellent.
I was commenting on how every other part of the build system uses constructs like "ifdef MOZ_SERVICES_HEALTHREPORT" to conditionally do things if the health report feature is enabled in the build system. The main xpcshell.ini manifest (testing/xpcshell/xpcshell.ini) does not support preprocessing and the inclusion of services/healthreport/tests/xpcshell.ini is unconditional. The xpcshell test runner loads the master xpcshell.ini in such a way that missing included files are not treated as fatal errors. This is slightly troubling because it makes it easier for us to move things around and not error when running tests, resulting in tests not being executed.

That being said, I'm sure there is a long complicated history about why things are the way they are. And, changing behavior is outside the scope of this bug, especially since we need to uplift to Beta at this juncture. So, if things work, great. If not, we'll probably resort to build system hackery to get this code working on Beta for B2G and will unhack things in m-c when FHR lands for real.
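The difference between strict and non-strict handling of a missing included manifest can be illustrated with a toy loader. This is a simplified sketch: the `include:<path>` line syntax and the `load_manifest` function are made up for illustration and are not the real xpcshell.ini format or the real manifest parser's API.

```python
import os
import tempfile


def load_manifest(path, strict=False):
    """Return test names from a toy manifest, following include: lines.

    Lines of the form "include:<relative path>" pull in another manifest;
    every other non-empty line is treated as a test name.
    """
    tests = []
    base = os.path.dirname(path)
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith("include:"):
                child = os.path.join(base, line[len("include:"):].strip())
                if not os.path.exists(child):
                    if strict:
                        raise FileNotFoundError(child)
                    continue  # non-strict: silently skip the missing file
                tests.extend(load_manifest(child, strict))
            else:
                tests.append(line)
    return tests


with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "master.ini")
    with open(master, "w") as handle:
        handle.write("test_a.js\ninclude:missing/xpcshell.ini\ntest_b.js\n")

    lenient = load_manifest(master)  # the missing include is ignored
    try:
        load_manifest(master, strict=True)
        strict_failed = False
    except FileNotFoundError:
        strict_failed = True
```

The non-strict call returns only the tests it could find, which is the behavior that lets a build without the feature pass, and also the behavior that can hide tests that silently stopped running.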
Comment on attachment 683422 [details] [diff] [review]
Pref off.

Pref-off for initial landing.
Attachment #683422 - Flags: approval-mozilla-beta?
Attachment #683422 - Flags: approval-mozilla-aurora?
Attachment #683424 - Flags: approval-mozilla-beta?
Attachment #683424 - Flags: approval-mozilla-aurora?
Target Milestone: --- → mozilla20
Initial push, minus desktop UI, preffed off on all platforms, only built on B2G:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=16f9ee8cdfea
Not everything has landed yet. Keeping this open as our tracking bug.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
QA Contact: gps
Comment on attachment 683422 [details] [diff] [review]
Pref off.

[Triage Comment]
Approving for Aurora/Beta, such that FHR is not enabled at build/runtime for Desktop/Android.
Attachment #683422 - Flags: approval-mozilla-beta?
Attachment #683422 - Flags: approval-mozilla-beta+
Attachment #683422 - Flags: approval-mozilla-aurora?
Attachment #683422 - Flags: approval-mozilla-aurora+
Attachment #683424 - Flags: approval-mozilla-beta?
Attachment #683424 - Flags: approval-mozilla-beta+
Attachment #683424 - Flags: approval-mozilla-aurora?
Attachment #683424 - Flags: approval-mozilla-aurora+
Should this bug block bug 788894, actually?
Our tests have a dependency on Bug 792546, which is causing problems on mozilla-beta. RyanVM is working on uplift.
Depends on: 792546
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #93)
> Should this bug block bug 788894, actually?

Bug 788894 depends on Bug 808219, which is the biggest chunk of implementation for the core that backs both FHR and ADU ping. This bug is the "FHR" part.
(In reply to Richard Newman [:rnewman] from comment #96)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #93)
> > Should this bug block bug 788894, actually?
> 
> Bug 788894 depends on Bug 808219, which is the biggest chunk of
> implementation for the core that backs both FHR and ADU ping. This bug is
> the "FHR" part.

Hmm, OK, I was just wondering because AFAIK, the approval requests for all the dependencies of this tracking bug here have been granted solely on the argument that this work is needed for bug 788894, which is blocking-basecamp+.
All the uplifted bugs are required for B2G. There are some bugs blocking this one not required for B2G. They will not be uplifted, unless unavoidable.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #97)

> Hmm, OK, I was just wondering because AFAIK, the approval requests for all
> the dependencies of this tracking bug here have been granted solely on the
> argument that this work is needed for bug 788894, which is
> blocking-basecamp+.

The code for FHR and the B2G ADU ping is currently 100% shared, with the exception of Bug 809094, which was not nominated for approval (and hasn't even landed in m-c).

I didn't use this tracking bug when nominating bugs; I used the commit logs of our project branch. Any overlap is purely due to diligence in gps's bug tracking :D
https://hg.mozilla.org/releases/mozilla-beta/rev/6d7186a669f9

Should hopefully clear up xpcshell bustage on beta. If not, we should probably back out and get things working through try pushes.
Depends on: 823304
Try run for 12835f9a08ce is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=12835f9a08ce
Results (out of 311 total builds):
    success: 297
    warnings: 10
    failure: 4
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/rnewman@mozilla.com-12835f9a08ce
Depends on: 813287
Comment on attachment 666026 [details]
Overview of the Firefox Health Report (formerly MDP) Code

Setting this overview as obsolete since it no longer reflects the code.
Attachment #666026 - Attachment is obsolete: true
Depends on: 827602
Depends on: 827910
Depends on: 828101
Depends on: 828149
Depends on: 828654
Depends on: 828703
Depends on: 828829
Blocks: 829887
No longer depends on: 828703
No longer depends on: 764645
No longer depends on: 823304
Firefox Health Report has landed in mozilla-central and is enabled there. It should make it into Nightly in the next day or two. I'm closing this bug because I feel it has served its purpose to track the delivery of this feature.

The next major milestone for Firefox Health Report will be its transition into the Beta Channel. Bug 829887 tracks this effort.

Tree sheriffs: this merged in from services-central where all trees were green before merge. I highly doubt you will need to back out due to tree failures. If you do, please don't back out patches. Instead, remove all traces of MOZ_SERVICES_HEALTHREPORT from /browser/confvars.sh. If you encounter any other issues, please run things by mconnor and/or rnewman before you take action.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Summary: [meta] Add feature to submit anonymous product metrics to Mozilla → Initial landing of Firefox Health Report
Firefox Health Report made it into the 2013-01-12 Nightly.
Is it worth adding a "warning" directly in Options that says something like "No personally identifiable information is collected" or "We protect your privacy"?

Not sure how many users will bother clicking on "Learn More" and even less so bother reading through the fine print.
Blocks: 830145
No longer depends on: 801950, 803655, 828149, 828654
(In reply to bug.zilla from comment #106)
> Is it worth adding a ...

Suggestions for changes should go into new bugs, please.
Just to confirm, will this be disabled in Firefox 20 when it moves from beta to release?
To my knowledge it has never been enabled on Fx20, so I would assume that'll continue to be true.
Component: Metrics and Firefox Health Report → Client: Desktop
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla20 → Firefox 20
Flags: sec-review?(curtisk) → sec-review+
Thanks guys. Just a side note: Someone with access would need to update:
- The "How to stop Firefox from automatically making connections without my permission" page under https://support.mozilla.org/en-US/kb/how-stop-firefox-automatically-making-connections (and its other languages).

- The Mozillazine KB article it's based on, under http://kb.mozillazine.org/Connections_established_on_startup_-_Firefox

I tried the latter, but I don't have a Mozillazine KB account, and the related forum thread to request KB updates (http://forums.mozillazine.org/viewtopic.php?f=26&t=2657921) is locked, although its last post claims it has been unlocked.

P.S.: Also, other recently added reports which can be configured in the same Firefox configuration window as the new health report (telemetry, crash reports) are also missing in the above articles.
The thread for requesting MozillaZine KB accounts has been moved, I've left a comment at http://forums.mozillazine.org/viewtopic.php?f=11&t=2698481#p12892493 to adjust the link at that page. You can add a request there for the time being, or just explain what you want to get changed on the respective KB pages.

On a side note, MozillaZine isn't run by Mozilla, thus any comments on bugzilla here may not reach the people in charge of maintaining that KB.
user-doc-needed re Comment 110.
Keywords: user-doc-needed
Product: Firefox Health Report → Firefox Health Report Graveyard