Bug 1275482 (Closed)
Opened 9 years ago · Closed 9 years ago
get a whitelist of public fields from supersearch
Categories: Socorro :: General (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: lonnen, Assigned: adrian
Attachments: 1 file
Before we actually start sending data, Privacy would like a white list of fields we will send along with descriptions of what each field contains.
Adrian has pointed out:
> The list is visible here in the admin panel: https://crash-stats.mozilla.com/admin/supersearch-fields/
>
> Or via the API: https://crash-stats.mozilla.com/api/SuperSearchFields/
>
> You're only interested in fields with a namespace of "processed_crash" and a truthy is_returned value.
>
> Note that this list defines what we expose, not what we store.
This is a good start, but we need descriptions for each field in here.
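The filtering rule Adrian describes can be sketched as a small Python function. This is an illustration only: the field entries below are made up, and the real SuperSearchFields API response may use a different shape.

```python
# Sketch of the rule quoted above: keep only SuperSearchFields entries in the
# "processed_crash" namespace whose is_returned value is truthy.
# The sample entries below are illustrative, not the real API payload.

def public_fields(supersearch_fields):
    """Given a SuperSearchFields-style dict keyed by field name,
    return the sorted names of fields that are publicly exposed."""
    return sorted(
        name
        for name, field in supersearch_fields.items()
        if field.get("namespace") == "processed_crash" and field.get("is_returned")
    )

# Hypothetical example entries:
fields = {
    "signature": {"namespace": "processed_crash", "is_returned": True},
    "email": {"namespace": "processed_crash", "is_returned": False},
    "dump": {"namespace": "raw_crash", "is_returned": True},
}
print(public_fields(fields))  # ['signature']
```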
Reporter | Updated•9 years ago
Assignee: nobody → adrian

Reporter | Comment 1•9 years ago
Beyond descriptions, it would be best to have a schema so that we can do validation of incoming data.
mreid recommended it to me in an email exchange. Here is the relevant portion:
> If you like, you could develop a JSONSchema descriptor for this data and send a PR to https://github.com/mozilla-services/mozilla-pipeline-schemas - that'll be helpful for validating the JSON blobs when reading, and avoid some of the "extremely defensive programming" that's needed without a schema.
So I suppose we want both a human readable set of descriptions and a JSON schema to satisfy this bug.
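To make the suggestion concrete, here is a minimal sketch of validating a crash blob against a JSON Schema using the third-party `jsonschema` package (the usual Python implementation). The schema and crash documents below are toy examples, not the real Socorro schema.

```python
# Hypothetical illustration of mreid's suggestion: with a JSON Schema, the
# pipeline can reject malformed blobs up front instead of relying on
# "extremely defensive programming" at read time.
# Requires the third-party package: pip install jsonschema
import jsonschema

# Toy schema; the real processed-crash schema is far larger.
CRASH_SCHEMA = {
    "type": "object",
    "properties": {
        "signature": {"type": "string"},
        "uptime": {"type": "integer", "minimum": 0},
    },
    "required": ["signature"],
}

good = {"signature": "OOM | small", "uptime": 120}
bad = {"uptime": -5}  # missing required "signature", negative uptime

jsonschema.validate(good, CRASH_SCHEMA)  # passes silently

try:
    jsonschema.validate(bad, CRASH_SCHEMA)
except jsonschema.ValidationError as exc:
    print("rejected:", exc.message)
```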
Assignee | Comment 2•9 years ago
I'm gathering some documentation to work on this; putting the links here for now.
metadata: https://developer.mozilla.org/en-US/docs/Crash_Data_Analysis
json_dump: https://bugzilla.mozilla.org/show_bug.cgi?id=573100
Comment 3•9 years ago
Note that :njn is also very much interested (and volunteering) to give each field a description.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1275799
That particular bug is about these being displayed on the report index page, but why not just add a description to each field in SuperSearchFields?
Assignee | Comment 4•9 years ago
I have a JSON Schema that describes most of the processed crash: https://gist.github.com/adngdb/0f59e66aa9a057b1beeda6508e5bcaa5
This file will be used as the validation schema for Telemetry, and will be used as our white list on the Socorro side. Any field or subfield that is not listed in there will _not_ be sent to Telemetry. This work is required before we can send anything over to Telemetry.
It excludes: minidumps, memory_report, most of the fields containing processor metadata, all the sensitive fields (email, url, exploitability), and things that have never been used in Socorro.
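The "schema as whitelist" behavior described above can be sketched as a recursive filter: any key (or subkey) absent from the schema's `properties` is dropped before the document is sent. This is a hypothetical sketch with a toy schema; the real schema lives in the gist linked above.

```python
# Sketch of the whitelisting described above: strip every field from a
# processed crash that is not listed in the schema before sending it on.
# Toy schema and crash; the real schema is much larger.

def whitelist(document, schema):
    """Return a copy of `document` containing only keys present in the
    schema's "properties", recursing into object-typed subfields."""
    allowed = schema.get("properties", {})
    out = {}
    for key, value in document.items():
        if key not in allowed:
            continue  # not listed -> never sent to Telemetry
        sub = allowed[key]
        if isinstance(value, dict) and sub.get("type") == "object":
            out[key] = whitelist(value, sub)  # filter subfields too
        else:
            out[key] = value
    return out

schema = {
    "type": "object",
    "properties": {
        "signature": {"type": "string"},
        "json_dump": {
            "type": "object",
            "properties": {"crash_type": {"type": "string"}},
        },
    },
}
crash = {
    "signature": "OOM | small",
    "email": "user@example.com",  # sensitive: excluded from the schema
    "json_dump": {"crash_type": "SIGSEGV", "tiny_block_size": 0},
}
print(whitelist(crash, schema))
# {'signature': 'OOM | small', 'json_dump': {'crash_type': 'SIGSEGV'}}
```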
Some fields are lacking a description; they are marked with "@@ TODO @@". I am unable to understand those, so I would like to get some help. I am needinfo'ing some people here. Could you please help with:
1. filling in the missing descriptions (ideally, by forking the gist);
2. verifying that I did not write anything stupid;
3. and verifying that I did not forget anything that should really make it to Telemetry?
Thanks!
Flags: needinfo?(ted)
Flags: needinfo?(n.nethercote)
Flags: needinfo?(benjamin)
Comment 5•9 years ago
Can somebody describe to me what the goal is of sending data through telemetry? I imagine that it's to allow engineers to use spark and/or redash to perform interesting custom queries which aren't easily possible using just the supersearch API?
I'm concerned because it sounds like we're talking about only the processed crash, but in general people are going to want both the metadata and the processed crash together. I'd like us to send them through as a unit.
Why is memory_report excluded? It is intentionally anonymized and should be accessible to employees, and in fact could be quite helpful.
Does this include *all* the json_dumps (for additional_minidumps), or just the main one? I didn't see the other ones mentioned in this schema, but they can be very important.
I couldn't solve either of the TODOs, but I did add some comments: https://gist.github.com/bsmedberg/9e354c68724cd9bbaf67f6644ed9f0dd
Flags: needinfo?(benjamin)
Comment 6•9 years ago
I filled in a few TODOs when Adrian mentioned this on IRC.
Flags: needinfo?(ted)
Comment 7•9 years ago
I don't have anything to add beyond what Ted and Benjamin added.
Flags: needinfo?(n.nethercote)
Reporter | Comment 8•9 years ago
Hey Benjamin --
I've had increasing numbers of requests to expose crash data for analysis via re:dash. Our ideas for the future of supersearch look a lot like re:dash anyhow, so it seems prudent to try it now. Further, platform has been asking questions that are difficult or impossible to answer with supersearch but that could be enabled with Spark.
The fastest path to supporting most of these queries is to hook up the public subset of the data in the data warehouse. After consulting with privacy, they would like documentation of each field we will send before we transmit any data.
We are starting with the processed crash because it is better documented and the basis of existing reports. We believe this will enable us to move more quickly and then parallelize some of the future development tasks likely to follow this. We can add the raw crash data in a future iteration.
Reporter | Comment 9•9 years ago
@adrian -- Where did this leave off?
Assignee | Comment 10•9 years ago
I need to add the few things that Benjamin mentioned are missing, like memory_report. I also need to verify that the JSON Schema file I have is valid. After that I don't really know what the next steps are. Will finish this asap!
Assignee | Comment 11•9 years ago
I believe this is a good start to share with privacy, and to get started on our processor parts.
We might want to make some changes before sending actual data to Telemetry, notably if we want to use the opportunity to clean up that data (with better names and conventions for example) and prepare to merge the raw and processed crash into one document.
Comment 12•9 years ago
Commit pushed to master at https://github.com/mozilla/socorro
https://github.com/mozilla/socorro/commit/9cd658cb117571ca5e9d14a85039fa5ffb653835
Bug 1275482 - Added a processed crash JSON schema file. (#3382)
r=lonnen
Comment 13•9 years ago
Adrian,
Why is this not resolved? Is the file not ready?
Flags: needinfo?(adrian)
Assignee | Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(adrian)
Resolution: --- → FIXED