Bug 1275482 (Closed)
Opened 9 years ago · Closed 9 years ago
get a whitelist of public fields from supersearch
Categories: Socorro :: General (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: lonnen, Assigned: adrian
Attachments: 1 file
Before we actually start sending data, Privacy would like a white list of fields we will send along with descriptions of what each field contains.
Adrian has pointed out:
> The list is visible here in the admin panel: https://crash-stats.mozilla.com/admin/supersearch-fields/
>
> Or via the API: https://crash-stats.mozilla.com/api/SuperSearchFields/
>
> You're only interested in fields with a namespace of "processed_crash" and a truthy is_returned value.
>
> Note that this list defines what we expose, not what we store.
This is a good start, but we need descriptions for each field in here.
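The filtering rule Adrian describes can be sketched as a small Python function. This is an illustration only: the field entries below are made up, and the real SuperSearchFields API response may use a different shape.

```python
# Sketch of the rule quoted above: keep only SuperSearchFields entries in the
# "processed_crash" namespace whose is_returned value is truthy.
# The sample entries below are illustrative, not the real API payload.

def public_fields(supersearch_fields):
    """Given a SuperSearchFields-style dict keyed by field name,
    return the sorted names of fields that are publicly exposed."""
    return sorted(
        name
        for name, field in supersearch_fields.items()
        if field.get("namespace") == "processed_crash" and field.get("is_returned")
    )

# Hypothetical example entries:
fields = {
    "signature": {"namespace": "processed_crash", "is_returned": True},
    "email": {"namespace": "processed_crash", "is_returned": False},
    "dump": {"namespace": "raw_crash", "is_returned": True},
}
print(public_fields(fields))  # ['signature']
```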
Reporter | Updated•9 years ago
Assignee: nobody → adrian

Reporter | Comment 1•9 years ago
Beyond descriptions, it would be best to have a schema so that we can do validation of incoming data.
mreid recommended it to me in an email exchange. Here is the relevant portion:
> If you like, you could develop a JSONSchema descriptor for this data and send a PR to https://github.com/mozilla-services/mozilla-pipeline-schemas - that'll be helpful for validating the JSON blobs when reading, and avoid some of the "extremely defensive programming" that's needed without a schema.
So I suppose we want both a human readable set of descriptions and a JSON schema to satisfy this bug.
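To make the suggestion concrete, here is a minimal sketch of validating a crash blob against a JSON Schema using the third-party `jsonschema` package (the usual Python implementation). The schema and crash documents below are toy examples, not the real Socorro schema.

```python
# Hypothetical illustration of mreid's suggestion: with a JSON Schema, the
# pipeline can reject malformed blobs up front instead of relying on
# "extremely defensive programming" at read time.
# Requires the third-party package: pip install jsonschema
import jsonschema

# Toy schema; the real processed-crash schema is far larger.
CRASH_SCHEMA = {
    "type": "object",
    "properties": {
        "signature": {"type": "string"},
        "uptime": {"type": "integer", "minimum": 0},
    },
    "required": ["signature"],
}

good = {"signature": "OOM | small", "uptime": 120}
bad = {"uptime": -5}  # missing required "signature", negative uptime

jsonschema.validate(good, CRASH_SCHEMA)  # passes silently

try:
    jsonschema.validate(bad, CRASH_SCHEMA)
except jsonschema.ValidationError as exc:
    print("rejected:", exc.message)
```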
Assignee | Comment 2•9 years ago
I'm gathering some documentation to work on this; putting the links here for now.
metadata: https://developer.mozilla.org/en-US/docs/Crash_Data_Analysis
json_dump: https://bugzilla.mozilla.org/show_bug.cgi?id=573100
Comment 3•9 years ago
Note that :njn is also very much interested (and volunteering) to give each field a description.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1275799
That particular bug is about these being displayed on the report index page, but why not just add a description to each field in SuperSearchFields?
Assignee | Comment 4•9 years ago
I have a JSON Schema that describes most of the processed crash: https://gist.github.com/adngdb/0f59e66aa9a057b1beeda6508e5bcaa5
This file will be used as the validation schema for Telemetry, and will be used as our white list on the Socorro side. Any field or subfield that is not listed in there will _not_ be sent to Telemetry. This work is required before we can send anything over to Telemetry.
It excludes: minidumps, memory_report, most of the fields containing processor metadata, all the sensitive fields (email, url, exploitability), and things that have never been used in Socorro.
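The "schema as whitelist" behavior described above can be sketched as a recursive filter: any key (or subkey) absent from the schema's `properties` is dropped before the document is sent. This is a hypothetical sketch with a toy schema; the real schema lives in the gist linked above.

```python
# Sketch of the whitelisting described above: strip every field from a
# processed crash that is not listed in the schema before sending it on.
# Toy schema and crash; the real schema is much larger.

def whitelist(document, schema):
    """Return a copy of `document` containing only keys present in the
    schema's "properties", recursing into object-typed subfields."""
    allowed = schema.get("properties", {})
    out = {}
    for key, value in document.items():
        if key not in allowed:
            continue  # not listed -> never sent to Telemetry
        sub = allowed[key]
        if isinstance(value, dict) and sub.get("type") == "object":
            out[key] = whitelist(value, sub)  # filter subfields too
        else:
            out[key] = value
    return out

schema = {
    "type": "object",
    "properties": {
        "signature": {"type": "string"},
        "json_dump": {
            "type": "object",
            "properties": {"crash_type": {"type": "string"}},
        },
    },
}
crash = {
    "signature": "OOM | small",
    "email": "user@example.com",  # sensitive: excluded from the schema
    "json_dump": {"crash_type": "SIGSEGV", "tiny_block_size": 0},
}
print(whitelist(crash, schema))
# {'signature': 'OOM | small', 'json_dump': {'crash_type': 'SIGSEGV'}}
```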
Some fields are lacking a description; they are marked with "@@ TODO @@". I am unable to understand those, so I would like to get some help. I am needinfo'ing some people here. Could you please help with:
1. filling in the missing descriptions (ideally, by forking the gist);
2. verifying that I did not write anything stupid;
3. and verifying that I did not forget anything that should really make it to Telemetry?
Thanks!
Flags: needinfo?(ted)
Flags: needinfo?(n.nethercote)
Flags: needinfo?(benjamin)
Comment 5•9 years ago
Can somebody describe to me what the goal is of sending data through telemetry? I imagine that it's to allow engineers to use spark and/or redash to perform interesting custom queries which aren't easily possible using just the supersearch API?
I'm concerned because it sounds like we're talking about only the processed crash, but in general people are going to want both the metadata and the processed crash together. I'd like us to send them through as a unit.
Why is memory_report excluded? It is intentionally anonymized and should be accessible to employees, and in fact could be quite helpful.
Does this include *all* the json_dumps (for additional_minidumps), or just the main one? I didn't see the other ones mentioned in this schema, but they can be very important.
I couldn't solve either of the TODOs, but I did add some comments: https://gist.github.com/bsmedberg/9e354c68724cd9bbaf67f6644ed9f0dd
Flags: needinfo?(benjamin)
Comment 6•9 years ago
I filled in a few TODOs when Adrian mentioned this on IRC.
Flags: needinfo?(ted)
Comment 7•9 years ago
I don't have anything to add beyond what Ted and Benjamin added.
Flags: needinfo?(n.nethercote)
Reporter | Comment 8•9 years ago
Hey Benjamin --
I've had increasing numbers of requests to expose crash data for analysis via re:dash. Our ideas for the future of supersearch look a lot like re:dash anyhow, so it seems prudent to try it now. Further, platform has been asking questions that are difficult or impossible to answer with supersearch but that could be enabled with Spark.
The fastest path to supporting most of these queries is to hook up the public subset of the data in the data warehouse. After consulting with privacy, they would like documentation of each field we will send before we transmit any data.
We are starting with the processed crash because it is better documented and the basis of existing reports. We believe this will enable us to move more quickly and then parallelize some of the future development tasks likely to follow this. We can add the raw crash data in a future iteration.
Reporter | Comment 9•9 years ago
@adrian -- Where did this leave off?
Assignee | Comment 10•9 years ago
I need to add the few things that Benjamin mentioned are missing, like memory_report. I also need to verify that the JSON Schema file I have is valid. After that I don't really know what the next steps are. Will finish this asap!
Assignee | Comment 11•9 years ago
I believe this is a good start to share with privacy, and to get started on our processor parts.
We might want to make some changes before sending actual data to Telemetry, notably if we want to use the opportunity to clean up that data (with better names and conventions for example) and prepare to merge the raw and processed crash into one document.
Comment 12•9 years ago
Commit pushed to master at https://github.com/mozilla/socorro
https://github.com/mozilla/socorro/commit/9cd658cb117571ca5e9d14a85039fa5ffb653835
Bug 1275482 - Added a processed crash JSON schema file. (#3382)
r=lonnen
Comment 13•9 years ago
Adrian,
Why is this not resolved? Is the file not ready?
Flags: needinfo?(adrian)
Assignee | Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(adrian)
Resolution: --- → FIXED