Closed Bug 1291053 Opened 8 years ago Closed 6 years ago

Privacy sign-off of mach/build system metrics

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: kmoir)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

This bug serves to track a privacy sign-off of submitting mach/build system metrics to a server. I'll upload a proposed payload shortly...
Attached file mach build data
Here is a proposed data payload that will be generated/submitted after every `mach build` invocation.
David: I think you have enough in the attachment to start a privacy conversation! If not, needinfo me with what else you need.
Flags: needinfo?(dburns)
Mark, the patch attached to this bug is all the data that we would like to collect from people after they have built Firefox. Please can you help get the ball rolling on this so we can get the data collection happening ASAP?
Flags: needinfo?(dburns) → needinfo?(mcote)
When it comes to data compliance, we have a few criteria[1]. Can you help me out by answering a few questions?

1. Collect what we need (least amount of data)
Do we have a clear need/use for all the data we're collecting? Is there anything that could be eliminated with no real harm to the project's goals?

2. De-identify where we can (least identifiable form)
Is there any data in there that could be used to identify a user (other than the randomly generated unique client ID)? For example, email address, IP, etc. A cursory glance indicates no, but we should double check. (My interpretation of this principle is not that we can't differentiate users in our system, but rather that we can't identify a user as being a particular person. I'm going to verify this with Marshall.)

3. Delete when no longer necessary (least amount of time)
This one needs clear answers. How long are we keeping this data? Do users have the ability to remove their data? Do users have the ability to change their unique ID?

4. Store securely
This isn't too critical if we have sufficiently de-identified the data, but we should still be clear. Where are we storing this data?

5. Limit access
Again, the severity of this question depends on the answer to (2) above. Who will have access to the data, and what kind (read/write) of access will they have? How is access requested, and who can grant it?

6. Share appropriately
This is sort of an extension of (5) but has a focus on contributors and the public (note to self: clarify the difference between these sections): Will the data be available to non-employee contributors? How will the data be transmitted when shared? (Note: generally, encryption should be used.) Will the data be available publicly? If so, in raw form, or only in aggregated form?

Thanks!

[1] https://mana.mozilla.org/wiki/display/DATAPRACTICES/Platform+Data+Practices
Flags: needinfo?(mcote)
Flags: needinfo?(gps)
When it comes to data compliance, we have a few criteria[1]. Can you help me out by answering a few questions?

1. Collect what we need (least amount of data)
All of the data plays a critical role in making sure that we know what types of machines are in use. This also tells us when we need to look at pushing new hardware refreshes out to employees.

2. De-identify where we can (least identifiable form)
We are collecting "meta-data" about their machine and build metrics. We are not adding specific user details.

3. Delete when no longer necessary (least amount of time)
How long are we keeping this data? 6 months to 1 year.
Do users have the ability to remove their data? Yes, see https://wiki.mozilla.org/Telemetry/FAQ
Do users have the ability to change their unique ID? Not obvious from the Telemetry documentation.

4. Store securely
We are storing this in the Telemetry data pipeline.

5. Limit access
https://wiki.mozilla.org/Telemetry/FAQ has the answer to this.

6. Share appropriately
https://wiki.mozilla.org/Telemetry/FAQ has the answer.

Thanks!

[1] https://mana.mozilla.org/wiki/display/DATAPRACTICES/Platform+Data+Practices
Flags: needinfo?(gps)
Mark, is there anything else that we need to do to get this bug closed?
Flags: needinfo?(mcote)
Greg, from your point of view, are we still correct in what we have put here?
Flags: needinfo?(gps)
This seems pretty good to me. I'm glad you're leveraging an existing system rather than building your own!
Flags: needinfo?(mcote)
(In reply to David Burns :automatedtester from comment #7)
> Greg, from your point of view, are we correct in what we have put here still.
Yes.
Flags: needinfo?(gps)
Blocks: buildmetrics
In bug 1237610 gps brought up that we don't currently have a prompt mechanism--we only have an opt-in environment variable. Are there guidelines somewhere on what we'd need to do to enable this as opt-out? I guess a simple solution would be to prompt during `mach bootstrap`, which would not catch existing users until they re-ran bootstrap, but would give us an easy place to put a prompt, since there are already a pile of prompts during the bootstrap process.
I'm not actively working on this. Furthermore, it's unclear what state this bug is in. My recollection is I mostly handed things off to David to do the process work and Mark got involved because he was some kind of data collection steward (or something) for the former Developer Productivity org.
Assignee: gps → nobody
Status: ASSIGNED → NEW
I can drive this. The telemetry bits look solid, but if we're going to start collecting this data prior to the generic ingestion service, we have more work to do. In particular, we should firm up how we'll deal with the data on our custom collection server. Here's my first stab, with questions inline:

1. Collect what we need (least amount of data)
Can we itemize what we are planning to collect right now? Is there a schema?

2. De-identify where we can (least identifiable form)
Is providing an email address opt-in? Should it be opt-in separately from submitting the rest of the data?

3. Delete when no longer necessary (least amount of time)
I think our plan is to delete this data once the generic service is set up, or possibly to resubmit the data to the service and let the service worry about expiry. I don't anticipate our custom solution lasting beyond the summer (3 months).

4. Store securely

5. Limit access
The data will reside on a custom server that will only be accessible to a limited number of build peers, i.e. gps and ted. Does anyone else *need* access?

6. Share appropriately
The data won't be shared except in roll-up fashion. The possible exception is for anomalies found in the data that can be tied directly to a developer who has opted in with their email. We may follow up directly with that developer for more information or to help improve their individual experience.
Assignee: nobody → coop
Status: NEW → ASSIGNED
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I guess a simple solution would be to prompt during `mach bootstrap`, which would not catch existing users until they re-ran bootstrap, but would give us an easy place to put a prompt, since there are already a pile of prompts during the bootstrap process.
From discussion with gps, prompting during `mach bootstrap` seems to be the way to go. People run bootstrap frequently enough now (or should, at any rate) to pick up changes that we should catch most people quickly.
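The bootstrap-time prompt discussed above could be sketched roughly as follows. This is an illustration only, not mach's actual code: the function name `resolve_telemetry_choice`, the variable name in `OPT_IN_ENV_VAR`, and the prompt wording are all hypothetical, since the bug does not name the existing environment variable.

```python
# Hypothetical variable name; the bug does not say what the real one is called.
OPT_IN_ENV_VAR = "EXAMPLE_BUILD_TELEMETRY"


def resolve_telemetry_choice(environ, ask=input, default=False):
    """Decide whether build metrics should be submitted.

    An explicit environment setting always wins. Otherwise the user is
    asked once, which is what a prompt during bootstrap would do.
    """
    value = environ.get(OPT_IN_ENV_VAR)
    if value is not None:
        return value.strip().lower() in ("1", "true", "yes")

    answer = ask("Submit build system metrics to Mozilla? [y/n] ").strip().lower()
    if answer in ("y", "yes"):
        return True
    if answer in ("n", "no"):
        return False
    # Unrecognized or empty input falls back to the configured default.
    return default
```

Passing `default=True` would model an opt-out rollout and `default=False` an opt-in one. Which of the two is acceptable is the policy question this bug is trying to settle.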
We also now detect VCS info as part of configure. It would be trivial to look for mozilla.com emails in the Mercurial or Git config and do things based on that. Whether the user-defined VCS username is sufficient to imply permission to send data, I don't know. And obviously not all Mozilla employees use their mozilla.com emails for VCS. But it is certainly possible to leverage that VCS info in the build system itself.
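As a rough illustration of the check described above, the sketch below tests whether a configured VCS identity carries a mozilla.com address. The function name `is_mozilla_com_identity` is hypothetical, and the sketch assumes the identity string has already been read from the VCS configuration (for example Git's `user.email` or Mercurial's `ui.username`). It says nothing about whether such a match implies permission to send data, which remains an open question here.

```python
from email.utils import parseaddr


def is_mozilla_com_identity(vcs_identity):
    """Return True if the identity string contains a mozilla.com address.

    Accepts both a bare address ("jane@mozilla.com") and the
    "Name <address>" form commonly used for Mercurial usernames.
    """
    _name, address = parseaddr(vcs_identity or "")
    return address.lower().endswith("@mozilla.com")
```

For example, `is_mozilla_com_identity("Jane Doe <jane@mozilla.com>")` is true, while an address at any other domain, or a missing identity, is not.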
Punting this to Kim.
Assignee: coop → kmoir
Product: Core → Firefox Build System
needinfoing chutten as I understand he is the data steward for firefox telemetry. What is the signoff process to collect build telemetry data on an opt-in basis for Mozilla employees?
Flags: needinfo?(chutten)
That is a very good question that I do not have an answer to. I don't know under what agreements contributors abide when building or contributing to Firefox. I don't know that we have any current frameworks for asking for consent or informing users of, say, `mach` of their rights (though mach bootstrap's probably an excellent place for it). Also, to my knowledge we do not have a process by which to evaluate whether a given data collection via mach is acceptable (build durations: almost certainly acceptable. VCS-configured email addresses: maybe, since they'd show on patch files. directory names: probably not, as they could contain -anything-) I posit that we could reuse a lot of what we use for new Firefox data collection (https://wiki.mozilla.org/Firefox/Data_Collection). The category scheme and data collection review process seem easy enough to adapt. One thing I definitely know is that I would opt in to such a data collection mechanism and would like to subscribe to your newsletter where you publish results. :) So this is where I do what I do when I don't know what to do (in a Data Steward setting): ni?merwin for Trust, Security, Privacy, and Legal advice.
Flags: needinfo?(chutten) → needinfo?(merwin)
As a further question, what is the signoff process, and what are the ramifications, of using opt-out as the default and including non-Mozilla employees as well?
Kim, I'm not very familiar with 'mach.' From where can you download and install it? Can you tell me to what extent non-Mozilla staff use it? Is this mostly a tool for our staff that non-employees occasionally use?
Flags: needinfo?(merwin) → needinfo?(kmoir)
mach is a script that resides at the tip of Firefox code repositories such as mozilla-central. It is used to run local developer builds. For instance, you can run ./mach bootstrap to set up your environment, then ./mach build to invoke the build process, and run the resulting browser with ./mach run.
https://hg.mozilla.org/mozilla-central/file/tip/mach
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/mach
Yes, it is a tool for our staff that non-employees also use. People who contribute to Firefox who are not employees also use this tool. I don't have any metrics on how many non-Mozilla people use it; we need telemetry :-)
Flags: needinfo?(kmoir)
Flags: needinfo?(merwin)
As discussed in our meeting, collecting this data by default seems appropriate. We have two outstanding issues:
- When ready, Kim will attach documentation to this bug describing the data collection.
- Kim will propose disclosure language that will appear the first time a person uses 'mach.'
Once we have the documentation and disclosure language, I will review.
Flags: needinfo?(merwin)
Thanks Merwin. I will provide this. One question: you mentioned in the meeting that I should include a link to a longer explanation of the data we collect. Is there a standard repository where these sorts of policies are stored, or should I just store our policy in a location of our choosing?
Flags: needinfo?(merwin)
Removing ni for :merwin since I'm talking to :chutten in IRC about this issue.
Flags: needinfo?(merwin)
From chutten:
kmoir: I remember this back when it was a data collection review request :) 9:40 AM
(and before it was moco-conf) 9:41 AM
(or maybe it always was. Can't recall) 9:41 AM
kmoir: So, we have a few places we document data stuff 9:41 AM
<chutten> The first is in the source tree itself, via the restructured text (rst) files under toolkit/components/telemetry/docs
<chutten> It gets published over thisaway: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/
<kmoir> thanks, this is very helpful 9:43 AM
<chutten> The second is https://docs.telemetry.mozilla.org It is built from the files in https://github.com/mozilla/firefox-data-docs and may be a better fit 9:43 AM
The in-tree stuff is mostly documenting in-tree collection mechanisms
<kmoir> okay 9:44 AM
<chutten> kmoir: You could arguably make the case either (or both!) of the two ways for what you're hoping to collect, so just go with whichever you prefer I guess :D
In response to comment #21, this is the material we intend to collect:
- mach command run, time to complete, and sequence of commands
- Basic hardware info: CPU brand string, memory, type of disk
- Files changed through invocation (mach watchman)
- Configure flags: artifact build or not; debug vs opt build; sccache and icecream usage
- Exception code if applicable
- Persistent client ID to store mach invocations over time

One-time opt-out text for users when running mach:

The developer workflow team intends to start collecting data while running local builds. We will use this information to determine how to make builds faster and improve developer tooling. To learn more about the data we intend to collect read here (url tbd). If you have questions, please ask in #build in irc.mozilla.org. If you would like to opt out of data collection, select (N) at the prompt.

(Side thought: I wonder if we need to provide an option for them to opt out if they decide to do so later on. Probably I'm overthinking this.)
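One way a persistent client ID like the one listed above might work is sketched below. This is an assumption-laden illustration rather than the planned implementation: the function name `get_client_id`, the file name `telemetry_client_id`, and the idea of keeping it in a per-user state directory are all hypothetical.

```python
import uuid
from pathlib import Path


def get_client_id(state_dir):
    """Return a random client ID, creating and storing one on first use.

    The ID is a random UUID with no relation to the user or machine.
    """
    # Hypothetical file name, chosen only for this sketch.
    id_file = Path(state_dir) / "telemetry_client_id"
    try:
        # Re-parse the stored value so a corrupted file is replaced.
        return str(uuid.UUID(id_file.read_text().strip()))
    except (FileNotFoundError, ValueError):
        client_id = str(uuid.uuid4())
        id_file.parent.mkdir(parents=True, exist_ok=True)
        id_file.write_text(client_id)
        return client_id
```

Under this design, deleting the file gives the user a fresh ID on the next invocation, which would be one possible answer to the earlier question about whether users can change their unique ID.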
Flags: needinfo?(merwin)
This will likely come up again in the data review process, but it pays to think about it now: any commands or files or paths that you record could contain personal identifiers (the usual one is user names from homedirs), and so may need to be treated very carefully. Rooting paths inside the source dir may be enough (accounting for the case of ../../../someidentifier/someother/mozilla-central/), and bucketing unrecognized strings to an "other" bucket will help as well for cases when we know what strings we expect (like the mach command invocation)
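The two mitigations suggested above, rooting paths inside the source directory and bucketing unrecognized strings, could look something like the sketch below. The names `sanitize_path`, `bucket_command`, `KNOWN_COMMANDS`, and the placeholder string are hypothetical, and the command set lists only the three mach commands mentioned in this bug.

```python
import os.path

# Only the commands named in this bug; a real list would be longer.
KNOWN_COMMANDS = {"bootstrap", "build", "run"}

OMITTED = "<path omitted>"


def bucket_command(command):
    """Report known commands as-is and everything else as "other"."""
    return command if command in KNOWN_COMMANDS else "other"


def sanitize_path(path, topsrcdir):
    """Rewrite a path relative to the source directory.

    Paths that resolve outside the source directory, including ones that
    climb out via "..", are replaced by a placeholder so that home
    directory names and similar identifiers are never recorded.
    """
    resolved = os.path.normpath(os.path.join(topsrcdir, path))
    relative = os.path.relpath(resolved, topsrcdir)
    if relative == os.pardir or relative.startswith(os.pardir + os.sep):
        return OMITTED
    return relative
```

Normalizing before computing the relative path is what catches the `../../../someidentifier/someother/mozilla-central/` case: the result still begins with `..`, so it is dropped rather than reported.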
I'd shorten this disclosure language slightly: Mozilla collects data about local builds in order to make builds faster and improve developer tooling. To learn more about the data we intend to collect read here (url tbd). If you have questions, please ask in #build in irc.mozilla.org. If you would like to opt out of data collection, select (N) at the prompt.
Flags: needinfo?(merwin)
:chutten - how do we initiate the data review process? Ted has build telemetry schema patches up in bug 1461992 and Connor is working on the data ingestion pipeline side of things, so we are making progress.
Flags: needinfo?(chutten)
That refers to the Data Collection Review process documented here: https://wiki.mozilla.org/Firefox/Data_Collection Its forms are only 100% relevant for Firefox data collections, so there may be some questions we'll have to hand-wave our answers through. Ultimately it's there to answer questions like "Why are we even bothering?" and "How will we turn this data into decisions?" and as messages to future mozillians who are looking at this data collection without the benefit of our context.
Flags: needinfo?(chutten)
Questions from the data review process:

What questions will you answer with this data?
- What is the distribution of time to complete local builds, and what platforms/hardware are these run on?
- What sequence of mach commands are run? Are there mach commands that aren't used and can be deprecated?
- What files are changed during the local build invocation (mach watchman)?
- What percentage of local builds are artifact builds vs non-artifact builds?
- What percentage of builds are debug vs opt, and what distribution of platforms/hardware are these run on?
- What percentage of users use sccache or icecream?
- What is the distribution of operating systems/hardware that these builds run on?
- What exception codes are resulting from local builds?

Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
Establish baselines and measure how local builds are invoked so that we can better address the needs of developers and improve their local build experience.

What alternative methods did you consider to answer these questions? Why were they not sufficient?
We could have talked to people individually and watched them work, but this would have been very time consuming and does not work well for our very distributed team.

Can current instrumentation answer these questions?
No

List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
This is N/A since we are not collecting data with respect to Firefox usage.

How long will this data be collected? Choose one of the following:
I want this data to be collected for 6 months initially (potentially renewable).

What populations will you measure?
Firefox developers, both Moco employees and contributors

Which release channels?
N/A, not measuring Firefox usage

Which countries?
Countries where there are Firefox developers who do not opt out of collecting this data

Which locales?
Many

If this data collection is default on, what is the opt-out mechanism for users?
Users can opt out the next time they run mach

Please provide a general description of how you will analyze this data.
Analyze using tools in the general data ingestion pipeline

Where do you intend to share the results of your analysis?
Mozilla + community wide
Flags: needinfo?(chutten)
DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
I don't know. :kmoir, where will your dataset documentation be provided? (An example: https://github.com/mozilla/activity-stream/blob/master/docs/v2-system-addon/data_dictionary.md)

Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, at least at the beginning. :kmoir, will there be a mechanism that allows users to change their minds? It would seem to be required. (an env-var or mozconfig define should do)

If the request is for permanent data collection, is there someone who will monitor the data over time?
N/A, 6 months to start.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
:kmoir, please provide a list of measurements.

Is the data collection request for default-on or default-off?
Explicit opt-in required. (so, default off?)

Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
Unknown.

Is the data collection covered by the existing Firefox privacy notice?
N/A

Does there need to be a check-in in the future to determine whether to renew the data?
Yes.

---
Result: datareview-, see above for information needed.
Flags: needinfo?(chutten) → needinfo?(kmoir)
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
kmoir> Yes, see bug 1461992 and python/mozbuild/mozbuild/telemetry.py when the patches in the bug land.

Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, at least at the beginning. :kmoir, will there be a mechanism that allows users to change their minds? It would seem to be required. (an env-var or mozconfig define should do)
kmoir> I will add this requirement to the changes to mach bootstrap

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
:kmoir, please provide a list of measurements.
kmoir> Schema looks like this

        Required('command', description='The mach command that was invoked'): basestring,
        Required('argv', description=(
            'Full mach commandline. ' +
            'If the commandline contains absolute paths they will be sanitized.')): [basestring],
        Required('success', description='true if the command succeeded'): bool,
        Optional('exception', description=(
            'If a Python exception was encountered during the execution of the command, ' +
            'this value contains the result of calling `repr` on the exception object.')): basestring,
        Optional('file_types_changed', description=(
            'This array contains a list of objects with {ext, count} properties giving the count ' +
            'of files changed since the last invocation grouped by file type')): [
            {
                Required('ext', description='File extension'): basestring,
                Required('count', description='Count of changed files with this extension'): int,
            }
        ],
        Required('duration_ms', description='Command duration in milliseconds'): int,
        Required('build_opts', description='Selected build options'): {
            Optional('compiler', description='The compiler type in use (CC_TYPE)'):
                Any(*CompilerType.POSSIBLE_VALUES),
            Optional('artifact', description='true if --enable-artifact-builds'): bool,
            Optional('debug', description='true if build is debug (--enable-debug)'): bool,
            Optional('opt', description='true if build is optimized (--enable-optimize)'): bool,
            Optional('ccache', description='true if ccache is in use (--with-ccache)'): bool,
            Optional('sccache', description='true if ccache in use is sccache'): bool,
            Optional('icecream', description='true if icecream in use'): bool,
        },
        Required('system'): {
            # We don't need perfect granularity here.
            Required('os', description='Operating system'): Any('windows', 'macos', 'linux', 'other'),
            Optional('cpu_brand', description='CPU brand string from CPUID'): basestring,
            Optional('logical_cores', description='Number of logical CPU cores present'): int,
            Optional('physical_cores', description='Number of physical CPU cores present'): int,
            Optional('memory_gb', description='System memory in GB'): int,
            Optional('drive_is_ssd',
                     description='true if the source directory is on a solid-state disk'): bool,
            Optional('virtual_machine',
                     description='true if the OS appears to be running in a virtual machine'): bool,
        },
    })

Is the data collection request for default-on or default-off?
kmoir> explicit opt-in required. (so, default off?) <-- yes

Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
kmoir> Not at this time
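To make the `file_types_changed` field in the schema above concrete, here is a minimal sketch of how a list of changed paths might be reduced to `{ext, count}` objects. The function name `summarize_changed_files` is hypothetical, and whether the real field stores the extension with its leading dot is an assumption; the bug does not say.

```python
import os.path
from collections import Counter


def summarize_changed_files(paths):
    """Group changed files by extension and count each group.

    Only the extension is kept, so file and directory names never appear
    in the result. Files without an extension are grouped under "".
    """
    counts = Counter(os.path.splitext(path)[1] for path in paths)
    return [{"ext": ext, "count": count} for ext, count in sorted(counts.items())]
```

Given `["dom/base/a.cpp", "b.cpp", "c.h", "Makefile"]`, this yields one entry each for `""`, `".cpp"` (count 2), and `".h"`.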
Flags: needinfo?(kmoir) → needinfo?(chutten)
Thank you for the follow-up. The categories of the collections range from Category 1 to Category 2. datareview+ with the provided information.
Flags: needinfo?(chutten)
I've submitted a pull request with a request for assistance finding a formatting error in the Parquet schema here: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/191.
The formatting error is resolved and I am now awaiting review on the PR.
This can be closed; the PR in comment 36 has also been merged.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

The policy about Data Collection Reviews is that they should be public to the audience having their data collected. :mhentges, can this bug be made public? Or should we rerun the mach data collection review in a dedicated public bug?

Flags: needinfo?(mhentges)

This bug doesn't have any spicy information in it; I think it is fine to make it public.

Flags: needinfo?(mhentges)
Group: mozilla-employee-confidential