Closed Bug 858245 Opened 12 years ago Closed 12 years ago

[socorro-crashstats] Removal of personal identifiable information from public REST API

Categories

(Socorro :: Webapp, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: peterbe)

References

Details

All sensitive data should be removed from the data the API returns. For example, fields such as "email" should simply be removed to scrubbed to blank.
Commits pushed to master at https://github.com/mozilla/socorro-crashstats https://github.com/mozilla/socorro-crashstats/commit/983cce83937ceff68f8e5780fae7b0a911daec7b Bug 858245 - Added a scrubber app to use in the API. r=peterbe https://github.com/mozilla/socorro-crashstats/commit/9cbc1543e5e5e21a2303cd855e36eacea4b018f3 Merge pull request #328 from AdrianGaudebert/scrubber Bug 858245 - Added a scrubber app to use in the API.
Assignee: nobody → adrian
OS: Mac OS X → All
Priority: -- → P1
Hardware: x86 → All
1) It would be pretty valuable for logged-in/privileged people to have full access. Having a whitelist would also really help expose new fields in the standard UI more easily. 2) For everyone else, we should use a whitelist, not a blacklist. Also I'd suggest that if somebody asks for a field they aren't allowed, we should return a 403 response.
What are we returning in this API? We have a lot of fields, a whitelist seems pretty unwieldy. We don't have very many sensitive fields, and we don't add them all that often, so maybe a blacklist is okay?
I don't think a whitelist is privacy-preserving. We add new fields occasionally, and sometimes they are sensitive; I don't want to block client changes on a server blacklist. I have a list of all the *current* fields here: http://benjamin.smedbergs.us/blog/2013-04-12/graph-of-the-day-crash-report-metadata/ Of the fields we currently have, I believe the following are sensitive: * URL * Email * Android_Logcat * PluginContentURL But I'm also not sure about these: * Android_Fingerprint * Android_Display I really don't think that a whitelist is going to be that hard to maintain.
From a privacy POV, a whitelist is surely the better approach. From the POV of a developer who wants to get fast results from a speculative annotation, a blacklist is better. This is a hard topic. It would have been good to have some kind of separation of fields from the beginning, even on the client side, so that an annotation is already marked sensitive or not when it comes to Socorro and we could just expose all that aren't marked that way (or the reverse).
Oh, and Android_Display is just the display hardware name, and Android_Fingerprint is a mashup of some generic info on the used Android OS, see e.g. bp-3ce7d41f-59fc-4fa8-b864-f6ea02130507 for those fields. BTW, why is PluginContentURL something different from URL? Socorro doesn't use the former anywhere atm, if it's good info, we might want to do something with it... All in all, I really want to get away from having all kinds of info stuffed into App Notes and having everything that is somewhat more permanent have separate fields that we can run searches or even reports against in a good way. App Notes largely has become an excuse for not having raw JSON fields be public anywhere, and e.g. the GFX people stuff things only there because they can run scripts against the daily CSVs that way. I'd like to become cleaner there for sure.
It could be an option to do both. Eg.:: class CrashesPerAdu(SocorroMiddleware): blacklist = ( 'Email', 'URL', ) ... class ReportList(SocorroMiddleware): whitelist = ( 'field1', 'field2', 'field3', ) The code that applies the scrubber would simply figure out if it the model has a `blacklist` and if so, reduce. And if it has a `whitelist`, remove everything other than those listed. Would that make things easier?
Don't do both. Let's just do a whitelist.
In conclusion, we talked about this at the Socorro meeting. Writing down a white list is the right thing to do. Adrian, does that make sense?
Assignee: adrian → peterbe
Pull request here: https://github.com/mozilla/socorro-crashstats/pull/364 Every model has a list complete list of all fields as whitelists. Some very few models allow all fields. There's actually very little that gets filtered ultimately. I think there's actually only **two fields that get removed** at the time of writing.
After talking to :smedberg we concluded that the best way forward regarding the RawCrash model is to include RawCrash and whitelist all the fields that it might return. I'll show the fields to smedberg to add to the patch review. Also, if the data isn't JSON we shouldn't return anything at all. Only if it's JSON (where it can be whitelist cleaned) it should return on the public API.
Commit pushed to master at https://github.com/mozilla/socorro-crashstats https://github.com/mozilla/socorro-crashstats/commit/cddb33b0d52024d363cb4891a1320bbd712a0ada fixes bug 858245 - Removal of personal identifiable information from public REST API, r=rhelmer,AdrianGaudebert
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Depends on: 892029
You need to log in before you can comment on or make changes to this bug.