Bugzilla

David Burns :automatedtester

Reporter

Updated

•

8 years ago

Flags: needinfo?(gps)

Comment 5

•

8 years ago

When it comes to data compliance, we have a few criteria[1]. Can you help me out by answering a few questions?

1. Collect what we need (least amount of data)
All of the data plays a critical role in making sure that we can inform what type of machines are in use. This also informs us when we need to see about trying to push new hardware refreshes out to employees.

2. De-identify where we can (least identifiable form)
We are collecting “meta-data” about their machine and build metrics. We are not adding the specific user details.

3. Delete when no longer necessary (least amount of time)
How long are we keeping this data? 6 months - 1 year
Do users have the ability to remove their data? Yes, see https://wiki.mozilla.org/Telemetry/FAQ
Do users have the ability to change their unique ID? Not obvious from Telemetry documentation

4. Store securely
We are storing this in the Telemetry data pipeline

5. Limit access
https://wiki.mozilla.org/Telemetry/FAQ has the answer to this

6. Share appropriately
https://wiki.mozilla.org/Telemetry/FAQ has the answer

Thanks!

[1] https://mana.mozilla.org/wiki/display/DATAPRACTICES/Platform+Data+Practices

Flags: needinfo?(gps)

David Burns :automatedtester

Comment 6

•

8 years ago

Mark, is there anything else that we need to do to get this bug closed?

Flags: needinfo?(mcote)

David Burns :automatedtester

Comment 7

•

8 years ago

Greg, from your point of view, is what we have put here still correct?

Flags: needinfo?(gps)

Mark Côté [:mcote]

Comment 8

•

8 years ago

This seems pretty good to me. I'm glad you're leveraging an existing system rather than building your own!

Flags: needinfo?(mcote)

Reporter

Comment 9

•

8 years ago

(In reply to David Burns :automatedtester from comment #7)
> Greg, from your point of view, are we correct in what we have put here still.

Yes.

Flags: needinfo?(gps)

(not currently active) Ted Mielczarek

Reporter

Updated

•

8 years ago

Blocks: buildmetrics

Comment 10

•

7 years ago

In bug 1237610 gps brought up that we don't currently have a prompt mechanism--we only have an opt-in environment variable. Are there guidelines somewhere on what we'd need to do to enable this as opt-out? I guess a simple solution would be to prompt during `mach bootstrap`, which would not catch existing users until they re-ran bootstrap, but would give us an easy place to put a prompt, since there are already a pile of prompts during the bootstrap process.

Chris Cooper [:coop] (he/him)

Reporter

Comment 11

•

7 years ago

I'm not actively working on this. Furthermore, it's unclear what state this bug is in. My recollection is I mostly handed things off to David to do the process work and Mark got involved because he was some kind of data collection steward (or something) for the former Developer Productivity org.

Assignee: gps → nobody

Status: ASSIGNED → NEW

Comment 12

•

7 years ago

I can drive this. The telemetry bits look solid, but if we're going to start collecting this data prior to the generic ingestion service, we have more work to do. In particular, we should firm up how we'll deal with the data on our custom collection server. Here's my first stab, with questions inline:

1. Collect what we need (least amount of data)
Can we itemize what we are planning to collect right now? Is there a schema?

2. De-identify where we can (least identifiable form)
Is providing email address opt-in? Should it be opt-in separately from submitting the rest of the data?

3. Delete when no longer necessary (least amount of time)
I think our plan is to delete this data once the generic service is set up, or possibly to resubmit the data to the service and let the service worry about expiry. I don't anticipate our custom solution lasting beyond the summer (3 months).

4. Store securely
5. Limit access
The data will reside on a custom server that will only be accessible to a limited number of build peers, i.e. gps and ted. Does anyone else *need* access?

6. Share appropriately
The data won't be shared except in roll-up fashion. The possible exception is for anomalies found in the data that can be tied directly to a developer who has opted in with their email. We may follow up directly with that developer for more information or to help improve their individual experience.

Assignee: nobody → coop

Status: NEW → ASSIGNED

Chris Cooper [:coop] (he/him)

Comment 13

•

7 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10)
> I guess a
> simple solution would be to prompt during `mach bootstrap`, which would not
> catch existing users until they re-ran bootstrap, but would give us an easy
> place to put a prompt, since there are already a pile of prompts during the
> bootstrap process.

From discussion with gps, prompting during `mach bootstrap` seems to be the way to go. People run bootstrap frequently enough now (or should, at any rate) to pick up changes that we should catch most people quickly.
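For illustration only, here is a minimal sketch of what a one-time prompt during `mach bootstrap` might look like. The function name `prompt_telemetry_opt_in`, the prompt wording, and the default of "no" on an empty answer are all assumptions for this sketch; none of them come from the mach source.

```python
def prompt_telemetry_opt_in(ask=input):
    """Ask whether build telemetry may be sent; return True or False.

    `ask` is injectable so the prompt can be exercised without a terminal.
    An empty answer counts as "no"; an unrecognized answer re-prompts.
    (Hypothetical helper, not part of mach.)
    """
    while True:
        answer = ask("Would you like to submit build telemetry? (y/N) ")
        answer = answer.strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("", "n", "no"):
            return False
        print("Please answer 'y' or 'n'.")
```

The returned value would still need to be persisted somewhere so that bootstrap does not ask again on every run; where to store it is left open here.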

Chris Cooper [:coop] (he/him)

Reporter

Comment 14

•

7 years ago

We also now detect VCS info as part of configure. It would be trivial to look for mozilla.com emails in the Mercurial or Git config and do things based on that. Whether the user-defined VCS username is sufficient to imply permission to send data, I don't know. And obviously not all Mozilla employees use their mozilla.com emails for VCS. But it is certainly possible to leverage that VCS info in the build system itself.
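A rough sketch of the email check described above, assuming the VCS username has already been read from the Mercurial or Git config as a string of the form `Name <address>`. The function name `is_mozilla_com_address` is hypothetical, and the check says nothing about whether a matching address implies permission to send data.

```python
from email.utils import parseaddr


def is_mozilla_com_address(vcs_username):
    """Return True if a VCS username string carries a mozilla.com address.

    Hypothetical helper: it only inspects the string it is given and does
    not read any Mercurial or Git configuration itself.
    """
    _name, address = parseaddr(vcs_username)
    # Match the exact domain so that "notmozilla.com" does not pass.
    return address.lower().endswith("@mozilla.com")
```

As noted, not all Mozilla employees use their mozilla.com emails for VCS, so a check like this can only ever be a hint.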

Comment 15

•

7 years ago

Punting this to Kim.

Assignee: coop → kmoir

BMO Automation

Updated

•

7 years ago

Product: Core → Firefox Build System

Assignee

Comment 16

•

7 years ago

Needinfoing chutten, as I understand he is the data steward for Firefox telemetry. What is the signoff process to collect build telemetry data on an opt-in basis for Mozilla employees?

Flags: needinfo?(chutten)

Comment 17

•

7 years ago

That is a very good question that I do not have an answer to. I don't know under what agreements contributors abide when building or contributing to Firefox. I don't know that we have any current frameworks for asking for consent or informing users of, say, `mach` of their rights (though mach bootstrap's probably an excellent place for it). Also, to my knowledge we do not have a process by which to evaluate whether a given data collection via mach is acceptable (build durations: almost certainly acceptable. VCS-configured email addresses: maybe, since they'd show on patch files. directory names: probably not, as they could contain -anything-) I posit that we could reuse a lot of what we use for new Firefox data collection (https://wiki.mozilla.org/Firefox/Data_Collection). The category scheme and data collection review process seem easy enough to adapt. One thing I definitely know is that I would opt in to such a data collection mechanism and would like to subscribe to your newsletter where you publish results. :) So this is where I do what I do when I don't know what to do (in a Data Steward setting): ni?merwin for Trust, Security, Privacy, and Legal advice.

Flags: needinfo?(chutten) → needinfo?(merwin)

Assignee

Comment 18

•

7 years ago

As a further question, what are the signoff process and ramifications of using opt-out as a default, and of including non-Mozilla employees as well?

Merwin

Comment 19

•

7 years ago

Kim, I'm not very familiar with 'mach.' From where can you download and install it? Can you tell me to what extent non-Mozilla staff use it? Is this mostly a tool for our staff that non-employees occasionally use?

Flags: needinfo?(merwin) → needinfo?(kmoir)

Assignee

Comment 20

•

7 years ago

mach is a script that resides at the tip of Firefox code repositories such as mozilla-central. It is used to run local developer builds. For instance, you can run ./mach bootstrap to set up your environment, then ./mach build to invoke the build process, and run the resulting browser with ./mach run.

https://hg.mozilla.org/mozilla-central/file/tip/mach
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/mach

Yes, it is a tool for our staff that non-employees also use. People who contribute to Firefox who are not employees also use this tool. I don't have any metrics on how many non-Mozilla people use it; we need telemetry :-)

Flags: needinfo?(kmoir)

Assignee

Updated

•

7 years ago

Flags: needinfo?(merwin)

Merwin

Comment 21

•

7 years ago

As discussed in our meeting, collecting this data by default seems appropriate. We have two outstanding issues:

- When ready, Kim will attach documentation to this bug describing the data collection.
- Kim will propose disclosure language that will appear the first time a person uses 'mach.'

Once we have the documentation and disclosure language, I will review.

Flags: needinfo?(merwin)

Assignee

Comment 22

•

7 years ago

Thanks Merwin. I will provide this. One question: you mentioned in the meeting that I should include a link to a longer explanation of the data we collect. Is there a standard repository where these sorts of policies are stored, or should I just store our policy in a location of our choosing?

Flags: needinfo?(merwin)

Assignee

Comment 23

•

7 years ago

Removing ni for :merwin since I'm talking to :chutten in IRC about this issue.

Flags: needinfo?(merwin)

Assignee

Comment 24

•

7 years ago

From chutten:

kmoir: I remember this back when it was a data collection review request :) 9:40 AM
(and before it was moco-conf) 9:41 AM
(or maybe it always was. Can't recall) 9:41 AM
kmoir: So, we have a few places we document data stuff 9:41 AM
<chutten> The first is in the source tree itself, via the restructured text (rst) files under toolkit/components/telemetry/docs
<chutten> It gets published over thisaway: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/
<kmoir> thanks, this is very helpful 9:43 AM
<chutten> The second is https://docs.telemetry.mozilla.org It is built from the files in https://github.com/mozilla/firefox-data-docs and may be a better fit 9:43 AM
The in-tree stuff is mostly documenting in-tree collection mechanisms
<kmoir> okay 9:44 AM
<chutten> kmoir: You could arguably make the case either (or both!) of the two ways for what you're hoping to collect, so just go with whichever you prefer I guess :D

Comment 25

•

7 years ago

Link to the IRC conversation logs: https://mozilla.logbot.info/datapipeline/20180502#c14698010-c14698027

Assignee

Comment 26

•

7 years ago

In response to comment #21, this is the material we intend to collect:

- mach command run, time to complete and sequence of commands
- Basic hardware info - cpu brand string, memory, type of disk
- Files that were changed through invocation - mach watchman
- Configure flags:
  - Artifact build or not
  - Debug vs opt build
  - Sccache and icecream usage
- Exception code if applicable
- Persistent client id to store mach invocations over time

One time opt-out text for users when running mach:

The developer workflow team intends to start collecting data while running local builds. We will use this information to determine how to make builds faster and improve developer tooling. To learn more about the data we intend to collect read here (url tbd). If you have questions, please ask in #build in irc.mozilla.org. If you would like to opt out of data collection, select (N) at the prompt.

(side thought - I wonder if we need to provide an option for them to opt out if they decide to do so later on - probably I'm overthinking this)

Flags: needinfo?(merwin)

Comment 27

•

7 years ago

This will likely come up again in the data review process, but it pays to think about it now: any commands or files or paths that you record could contain personal identifiers (the usual one is user names from homedirs), and so may need to be treated very carefully. Rooting paths inside the source dir may be enough (accounting for the case of ../../../someidentifier/someother/mozilla-central/), and bucketing unrecognized strings to an "other" bucket will help as well for cases when we know what strings we expect (like the mach command invocation)
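The two mitigations above (rooting paths inside the source dir, and bucketing unrecognized strings) could be sketched roughly as follows. The names `sanitize_path`, `bucket_command`, and `KNOWN_COMMANDS`, as well as the `$topsrcdir` and `<path omitted>` placeholders, are made up for this sketch and are not taken from the mach source.

```python
import os

# Illustrative subset only; the real list of mach commands is much longer.
KNOWN_COMMANDS = {"bootstrap", "build", "run"}


def sanitize_path(path, topsrcdir):
    """Rewrite `path` relative to the source dir, or hide it entirely.

    abspath() collapses any ".." components first, so a path that climbs
    out of the source dir is detected rather than recorded.
    """
    top = os.path.abspath(topsrcdir)
    full = os.path.abspath(path)
    if full == top:
        return "$topsrcdir"
    if full.startswith(top + os.sep):
        relative = os.path.relpath(full, top).replace(os.sep, "/")
        return "$topsrcdir/" + relative
    # Anything outside the source dir may contain identifiers (home dirs).
    return "<path omitted>"


def bucket_command(command):
    """Keep expected command names; report everything else as 'other'."""
    return command if command in KNOWN_COMMANDS else "other"
```

With this approach a home directory name never leaves the machine, at the cost of losing all detail about paths outside the checkout.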

Merwin

Comment 28

•

6 years ago

I'd shorten this disclosure language slightly:

Mozilla collects data about local builds in order to make builds faster and improve developer tooling. To learn more about the data we intend to collect read here (url tbd). If you have questions, please ask in #build in irc.mozilla.org. If you would like to opt out of data collection, select (N) at the prompt.

Flags: needinfo?(merwin)

Assignee

Comment 29

•

6 years ago

:chutten - how do we initiate the data review process? Ted has build telemetry schema patches up in bug 1461992 and Connor is working on the data ingestion pipeline side of things, so we are making progress.

Flags: needinfo?(chutten)

Comment 30

•

6 years ago

That refers to the Data Collection Review process documented here: https://wiki.mozilla.org/Firefox/Data_Collection Its forms are only 100% relevant for Firefox data collections, so there may be some questions we'll have to hand-wave our answers through. Ultimately it's there to answer questions like "Why are we even bothering?" and "How will we turn this data into decisions?" and as messages to future mozillians who are looking at this data collection without the benefit of our context.

Flags: needinfo?(chutten)

Assignee

Comment 31

•

6 years ago

Questions from the data review process:

What questions will you answer with this data?
- What is the distribution of time to complete local builds, and what platforms/hardware are these run on?
- What sequence of mach commands are run?
- Are there mach commands that aren't used and can be deprecated?
- What files are changed during the local build invocation (mach watchman)?
- What percentage of local builds are artifact builds vs non-artifact builds?
- What percentage of builds are debug vs opt, and what distribution of platforms/hardware are these run on?
- What percentage of users use sccache or icecream?
- What is the distribution of operating systems/hardware that these builds run on?
- What exception codes are resulting from local builds?

Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements?
Establish baselines and measure how local builds are invoked so that we can better address the needs of developers and improve their local build experience.

What alternative methods did you consider to answer these questions? Why were they not sufficient?
We could have talked to people individually and watched them work, but this would have been very time consuming and does not work well for our very distributed team.

Can current instrumentation answer these questions?
No

List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.
This is N/A since we are not collecting data with respect to Firefox usage.

How long will this data be collected? Choose one of the following:
I want this data to be collected for 6 months initially (potentially renewable).

What populations will you measure?
Firefox developers, both MoCo employees and contributors

Which release channels?
N/A, not measuring Firefox usage

Which countries?
Countries where there are Firefox developers who do not opt out of collecting this data

Which locales?
Many

If this data collection is default on, what is the opt-out mechanism for users?
Users can opt out the next time they run mach

Please provide a general description of how you will analyze this data.
Analyze using tools in the general data ingestion pipeline

Where do you intend to share the results of your analysis?
Mozilla + community wide

Flags: needinfo?(chutten)

Comment 32

•

6 years ago

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
I don't know. :kmoir, where will your dataset documentation be provided? (An example: https://github.com/mozilla/activity-stream/blob/master/docs/v2-system-addon/data_dictionary.md)

Is there a control mechanism that allows the user to turn the data collection on and off?
Yes, at least at the beginning. :kmoir, will there be a mechanism that allows users to change their minds? It would seem to be required. (an env-var or mozconfig define should do)

If the request is for permanent data collection, is there someone who will monitor the data over time?
N/A, 6 months to start.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
:kmoir, please provide a list of measurements.

Is the data collection request for default-on or default-off?
Explicit opt-in required. (so, default off?)

Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
Unknown.

Is the data collection covered by the existing Firefox privacy notice?
N/A

Does there need to be a check-in in the future to determine whether to renew the data?
Yes.

---
Result: datareview-, see above for information needed.
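As a sketch of the "env-var" suggestion for letting users change their minds: a stored choice could be combined with an environment variable that always wins when set. The variable name `EXAMPLE_DISABLE_BUILD_TELEMETRY` and the function `telemetry_enabled` are hypothetical and chosen only for this illustration; the actual mechanism was still to be decided at this point in the bug.

```python
import os


def telemetry_enabled(stored_choice, environ=None):
    """Resolve the effective telemetry setting.

    `stored_choice` is whatever the user answered earlier (for example at
    a bootstrap prompt). A non-empty, non-"0" value in the hypothetical
    EXAMPLE_DISABLE_BUILD_TELEMETRY variable overrides it and disables
    collection.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("EXAMPLE_DISABLE_BUILD_TELEMETRY")
    if override is not None and override.strip() not in ("", "0"):
        return False
    return bool(stored_choice)
```

An override that can only disable collection keeps the behavior conservative: nothing in the environment can turn collection on for someone who declined.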

Flags: needinfo?(chutten) → needinfo?(kmoir)

Assignee

Comment 33

•

6 years ago

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
kmoir> Yes, see bug 1461992 and python/mozbuild/mozbuild/telemetry.py when the patches in the bug land.

Is there a control mechanism that allows the user to turn the data collection on and off? Yes, at least at the beginning. :kmoir, will there be a mechanism that allows users to change their minds? It would seem to be required. (an env-var or mozconfig define should do)
kmoir> I will add this requirement to the changes to mach bootstrap

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under? :kmoir, please provide a list of measurements.
kmoir> Schema looks like this

    Required('command', description='The mach command that was invoked'): basestring,
    Required('argv', description=(
        'Full mach commandline. ' +
        'If the commandline contains absolute paths they will be sanitized.')): [basestring],
    Required('success', description='true if the command succeeded'): bool,
    Optional('exception', description=(
        'If a Python exception was encountered during the execution of the command, ' +
        'this value contains the result of calling `repr` on the exception object.')): basestring,
    Optional('file_types_changed', description=(
        'This array contains a list of objects with {ext, count} properties giving the count ' +
        'of files changed since the last invocation grouped by file type')): [
        {
            Required('ext', description='File extension'): basestring,
            Required('count', description='Count of changed files with this extension'): int,
        }
    ],
    Required('duration_ms', description='Command duration in milliseconds'): int,
    Required('build_opts', description='Selected build options'): {
        Optional('compiler', description='The compiler type in use (CC_TYPE)'):
            Any(*CompilerType.POSSIBLE_VALUES),
        Optional('artifact', description='true if --enable-artifact-builds'): bool,
        Optional('debug', description='true if build is debug (--enable-debug)'): bool,
        Optional('opt', description='true if build is optimized (--enable-optimize)'): bool,
        Optional('ccache', description='true if ccache is in use (--with-ccache)'): bool,
        Optional('sccache', description='true if ccache in use is sccache'): bool,
        Optional('icecream', description='true if icecream in use'): bool,
    },
    Required('system'): {
        # We don't need perfect granularity here.
        Required('os', description='Operating system'): Any('windows', 'macos', 'linux', 'other'),
        Optional('cpu_brand', description='CPU brand string from CPUID'): basestring,
        Optional('logical_cores', description='Number of logical CPU cores present'): int,
        Optional('physical_cores', description='Number of physical CPU cores present'): int,
        Optional('memory_gb', description='System memory in GB'): int,
        Optional('drive_is_ssd',
                 description='true if the source directory is on a solid-state disk'): bool,
        Optional('virtual_machine',
                 description='true if the OS appears to be running in a virtual machine'): bool,
    },
    })

Is the data collection request for default-on or default-off?
kmoir> explicit opt-in required. (so, default off?) <-- yes

Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?
kmoir> Not at this time
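To make the `file_types_changed` field concrete, here is a small sketch of how a list of changed paths might be reduced to the `{ext, count}` objects the schema describes. The function name `summarize_file_types` is hypothetical, and whether the real implementation keeps the leading dot in `ext` is an assumption.

```python
import os
from collections import Counter


def summarize_file_types(changed_paths):
    """Group changed files by extension into {ext, count} objects.

    Only the extension is kept, so file and directory names (which could
    contain identifiers) never appear in the result. Files without an
    extension are grouped under the empty string.
    """
    counts = Counter(os.path.splitext(path)[1] for path in changed_paths)
    return [{"ext": ext, "count": n} for ext, n in sorted(counts.items())]
```

For example, the paths `a/b.cpp`, `c.cpp`, `d.py` and `Makefile` reduce to three entries: one for files with no extension, one for `.cpp` with a count of 2, and one for `.py`.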

Flags: needinfo?(kmoir) → needinfo?(chutten)

Connor Sheehan [:sheehan]

Comment 34

•

6 years ago

Thank you for the follow-up. The categories of the collections range from Category 1 to Category 2. datareview+ with the provided information.

Flags: needinfo?(chutten)

Comment 35

•

6 years ago

I've submitted a pull request with a request for assistance finding a formatting error in the Parquet schema here: https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/191.

See Also: → https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/191

Connor Sheehan [:sheehan]

Comment 36

•

6 years ago

The formatting error is resolved and I am now awaiting review on the PR.

Assignee

Comment 37

•

6 years ago

This can be closed, also the PR in comment 36 has been merged.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED