Closed Bug 1275482 Opened 9 years ago Closed 9 years ago

get a whitelist of public fields from supersearch

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lonnen, Assigned: adrian)

References

Details

Attachments

(1 file)

Before we actually start sending data, Privacy would like a white list of fields we will send along with descriptions of what each field contains. Adrian has pointed out:

> The list is visible here in the admin panel: https://crash-stats.mozilla.com/admin/supersearch-fields/
>
> Or via the API: https://crash-stats.mozilla.com/api/SuperSearchFields/
>
> You're only interested in fields with a namespace of "processed_crash" and a truthy is_returned value.
>
> Note that this list defines what we expose, not what we store.

This is a good start, but we need descriptions for each field in here.
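For illustration, the filter Adrian describes could be sketched as below. This is only a sketch: it assumes the SuperSearchFields response is a mapping from field name to a dict carrying `namespace` and `is_returned` keys, and the sample data and the function name `public_processed_crash_fields` are hypothetical, not taken from Socorro.

```python
def public_processed_crash_fields(fields):
    """Return the sorted names of fields exposed from the processed crash.

    `fields` is assumed to map a field name to its metadata dict.
    """
    return sorted(
        name
        for name, meta in fields.items()
        if meta.get("namespace") == "processed_crash" and meta.get("is_returned")
    )


# Hypothetical sample, shaped like the assumed API response.
sample_fields = {
    "signature": {"namespace": "processed_crash", "is_returned": True},
    "email": {"namespace": "processed_crash", "is_returned": False},
    "BuildID": {"namespace": "raw_crash", "is_returned": True},
}

print(public_processed_crash_fields(sample_fields))
```

In practice the mapping would come from the API URL above rather than from a hard-coded sample.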
Assignee: nobody → adrian
Beyond descriptions, it would be best to have a schema so that we can validate incoming data. mreid recommended it to me in an email exchange. Here is the relevant portion:

> If you like, you could develop a JSONSchema descriptor for this data and send a PR to https://github.com/mozilla-services/mozilla-pipeline-schemas - that'll be helpful for validating the JSON blobs when reading, and avoid some of the "extremely defensive programming" that's needed without a schema.

So I suppose we want both a human-readable set of descriptions and a JSON schema to satisfy this bug.
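To make the benefit of a schema concrete, here is a rough sketch of what reader-side validation could look like. It is a hand-rolled check that covers only the `type`, `required` and `properties` keywords, not a full JSON Schema validator, and the schema and documents are hypothetical examples rather than the real processed crash schema.

```python
# Mapping from JSON Schema type names to Python types (subset only).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
    "null": type(None),
}


def matches_type(value, type_name):
    """Check one value against one JSON Schema type name."""
    # bool is a subclass of int in Python, so exclude it from numeric types.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, JSON_TYPES[type_name])


def validation_errors(document, schema, path="$"):
    """Collect readable errors for the supported subset of keywords."""
    expected = schema.get("type")
    if expected is not None:
        names = [expected] if isinstance(expected, str) else list(expected)
        if not any(matches_type(document, name) for name in names):
            return [f"{path}: expected {' or '.join(names)}"]
    errors = []
    if isinstance(document, dict):
        for key in schema.get("required", []):
            if key not in document:
                errors.append(f"{path}.{key}: required field is missing")
        for key, subschema in schema.get("properties", {}).items():
            if key in document:
                errors.extend(
                    validation_errors(document[key], subschema, f"{path}.{key}")
                )
    return errors


# Hypothetical schema and documents, for illustration only.
example_schema = {
    "type": "object",
    "required": ["uuid"],
    "properties": {
        "uuid": {"type": "string"},
        "uptime": {"type": ["integer", "null"]},
    },
}

print(validation_errors({"uuid": "abc", "uptime": 12}, example_schema))
print(validation_errors({"uptime": "12"}, example_schema))
```

A real consumer would more likely rely on an existing JSON Schema library; the point here is only that the schema replaces ad hoc defensive checks with one declarative description.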
I'm gathering some documentation to work on this, putting it here for now.

metadata: https://developer.mozilla.org/en-US/docs/Crash_Data_Analysis
json_dump: https://bugzilla.mozilla.org/show_bug.cgi?id=573100
Note that :njn is also very much interested (and volunteering) to give each field a description. See https://bugzilla.mozilla.org/show_bug.cgi?id=1275799 That particular bug is about these being displayed on the report index page, but why not just add a description to each field in SuperSearchFields?
I have a JSON Schema that describes most of the processed crash: https://gist.github.com/adngdb/0f59e66aa9a057b1beeda6508e5bcaa5

This file will be used as the validation schema for Telemetry, and will be used as our white list on the Socorro side. Any field or subfield that is not listed in there will _not_ be sent to Telemetry. This work is required before we can send anything over to Telemetry.

It excludes: minidumps, memory_report, most of the fields containing processor metadata, all the sensitive fields (email, url, exploitability), and things that have never been used in Socorro.

Some fields are lacking a description; they are marked with "@@ TODO @@". I am unable to understand those, so I would like to get help. I am needinfo'ing some people here. Could you please help with:

1. filling the missing descriptions (ideally, by forking the gist);
2. verifying that I did not write anything stupid;
3. and verifying I did not forget anything that should really make it to Telemetry?

Thanks!
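As a sketch of the white list rule above (any field or subfield not listed in the schema is not sent), one possible approach is to walk the schema's `properties` recursively and drop everything else. This is an assumption about how such a filter could work, not the actual Socorro implementation; the function name `keep_whitelisted` and the sample schema and crash are hypothetical.

```python
def keep_whitelisted(crash, schema):
    """Return a copy of `crash` keeping only keys declared in `properties`.

    Nested objects are filtered against their own sub-schema, so an object
    whose sub-schema declares no properties ends up empty.
    """
    properties = schema.get("properties", {})
    result = {}
    for key, value in crash.items():
        if key not in properties:
            continue  # not listed in the schema, so it is dropped
        subschema = properties[key]
        if isinstance(value, dict):
            result[key] = keep_whitelisted(value, subschema)
        elif isinstance(value, list) and isinstance(subschema.get("items"), dict):
            # Filter objects inside arrays; leave scalar items untouched.
            result[key] = [
                keep_whitelisted(item, subschema["items"])
                if isinstance(item, dict)
                else item
                for item in value
            ]
        else:
            result[key] = value
    return result


# Hypothetical schema and crash, for illustration only.
example_schema = {
    "type": "object",
    "properties": {
        "signature": {"type": "string"},
        "json_dump": {
            "type": "object",
            "properties": {
                "crash_info": {
                    "type": "object",
                    "properties": {"type": {"type": "string"}},
                }
            },
        },
    },
}

example_crash = {
    "signature": "OOM | small",
    "email": "user@example.com",
    "json_dump": {
        "crash_info": {"type": "SIGSEGV", "address": "0x0"},
        "threads": [],
    },
}

print(keep_whitelisted(example_crash, example_schema))
```

Using the same file for both filtering and validation means the white list and the schema cannot drift apart.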
Flags: needinfo?(ted)
Flags: needinfo?(n.nethercote)
Flags: needinfo?(benjamin)
Can somebody describe to me what the goal is of sending data through telemetry? I imagine that it's to allow engineers to use spark and/or redash to perform interesting custom queries which aren't easily possible using just the supersearch API?

I'm concerned because it sounds like we're talking about only the processed crash, but in general people are going to want both the metadata and the processed crash together. I'd like us to send them through as a unit.

Why is memory_report excluded? It is intentionally anonymized and should be accessible to employees, and in fact could be quite helpful.

Does this include *all* the json_dumps (for additional_minidumps), or just the main one? I didn't see the other ones mentioned in this schema, but they can be very important.

I couldn't solve either of the TODOs, but I did add some comments: https://gist.github.com/bsmedberg/9e354c68724cd9bbaf67f6644ed9f0dd
Flags: needinfo?(benjamin)
I filled in a few TODOs when Adrian mentioned this on IRC.
Flags: needinfo?(ted)
I don't have anything to add beyond what Ted and Benjamin added.
Flags: needinfo?(n.nethercote)
Hey Benjamin -- I've had increasing numbers of requests to expose crash data for analysis via re:dash. Our ideas for the future of supersearch look a lot like re:dash anyhow, so it seems prudent to try it now. Further, platform has been asking questions that are difficult or impossible with super search that could be enabled with Spark. The fastest path to supporting most of these queries is to hook up the public subset of the data in the data warehouse. After consulting with privacy, they would like documentation of each field we will send before we transmit any data. We are starting with the processed crash because it is better documented and the basis of existing reports. We believe this will enable us to move more quickly and then parallelize some of the future development tasks likely to follow this. We can add the raw crash data in a future iteration.
@adrian -- Where did this leave off?
I need to add the few things that Benjamin mentioned are missing, like memory_report. I also need to test that the JSON schema file I have is valid, and after that I don't really know what the next steps are. Will finish that asap!
I believe this is a good start to share with privacy, and to get started on our processor parts. We might want to make some changes before sending actual data to Telemetry, notably if we want to use the opportunity to clean up that data (with better names and conventions for example) and prepare to merge the raw and processed crash into one document.
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/9cd658cb117571ca5e9d14a85039fa5ffb653835 Bug 1275482 - Added a processed crash JSON schema file. (#3382) r=lonnen
Adrian, Why is this not resolved? Is the file not ready?
Flags: needinfo?(adrian)
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(adrian)
Resolution: --- → FIXED