Closed Bug 932944 Opened 11 years ago Closed 11 years ago

Add processed_crash keys to the API whitelist

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: peterbe)

Details

(Whiteboard: [qa-])

Attachments

(3 files)

https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=0ecc6db0-44e1-45c7-92d1-c3e0e2131030

This is a hang report with an additional minidump. Additional minidumps should be listed from the ProcessedCrash API. There are some sub-fields which should only be available to privileged users (exploitability), but most are public.

I suspect this may be because of some API blacklist?

This makes dealing with hang reports impossible.
In this particular case, it's the upload_file_minidump_browser subtree, but it really can be any upload_file_minidump_* key.
Attached file log
Here is the log from /var/log/messages when it disconnects. This time it took 11 minutes to timeout, instead of what I previously remember which was 3-5 minutes.
oops, wrong bug
need to change the API whitelist to include the following keys:

upload_file_minidump_browser
upload_file_minidump_flash1
upload_file_minidump_flash2

:bsmedberg are there others?
Summary: Non-main minidump not showing up in api/ProcessedCrash → Add processed_crash keys to the API whitelist
Currently no, but it would be better to just add upload_file_minidump_*.

Also what does this mean for the exploitability key? Does the whitelist apply to subtrees like this?
Until the adoption of the json MDSW, there is no "sensitive" data stored in the processed_crash in HBase.  This whitelist in the UI is a secondary system for redaction, aware only of top level keys and literals matches (no wild cards).  Here in this bug, its behavior is shown to be conservative and somewhat overzealous.  This is not necessarily a bad thing.

A new flexible tree-aware blacklist redaction system is being deployed in the crash storage system.  That system is employed by the middleware which, in turn, is employed by the UI.   Asking the crash storage system for a processed crash gets in return, a redacted version of the crash.  The "sensitive" branch (anywhere is the tree) is pruned along with several other keys.  See Bug 930324 for details on this new system.

The new redaction system also introduces a unredacted middleware API for fetching an unsensored processed_crash.  This new system is not currently employed by the UI.  If it is, in the future, it will be an authenticated/authorized service only.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #6)
> Until the adoption of the json MDSW, there is no "sensitive" data stored in
> the processed_crash in HBase.

Are you sure? I'm pretty sure that email, URL, and the "exploitability" key are all in there.

  This whitelist in the UI is a secondary
> system for redaction, aware only of top level keys and literals matches (no
> wild cards).  Here in this bug, its behavior is shown to be conservative and
> somewhat overzealous.  This is not necessarily a bad thing.

No, I just want these things added to the API whitelist.
> No, I just want these things added to the API whitelist.

yes, this will be done

> Are you sure? I'm pretty sure that email, URL, and the "exploitability" key are all in there.

if they are there, then it is a bug.  The HBase code for saving processed crashes currently in production contains this code fragment:  

        def save_processed(self, processed_crash):
            @self._wrap_in_transaction
            def transaction(client, processed_crash=processed_crash):
                processed_crash = processed_crash.copy()
                # ...
                for k in self.config.forbidden_keys:
                   if k in processed_crash:
                        del processed_crash[k]

with the value of ```self.config.forbidden_keys``` set to:

    destination.storage0.forbidden_keys: ['email', 'url', 'user_id', 'exploitability']

ad hoc exploring pulling processed crashes directly from HBase, I cannot find evidence that these keys are present.  If you have a counter example, please let me know so that I may make sure the problem is corrected.
Those forbidden keys are all stored in our processed crashes in elasticsearch. Perhaps that's what Benjamin was referring to. 

However, I didn't know those keys where not in our processed crashes in HBase. That could potentially be a problem because our current backfilling app for elasticsearch is based on those processed crashes from HBase. We could be missing a lot of sensitive data that we want to have when backfilling.
> with the value of ```self.config.forbidden_keys``` set to:
> 
>     destination.storage0.forbidden_keys: ['email', 'url', 'user_id',
> 'exploitability']

That's surprising to me. I thought that we perhaps didn't include email/url in the hbase processed crash because they are plain duplicates of the information from the raw crash. But why are we blacklisting certain fields? In particular does this mean that hbase doesn't have the exploitability data at all? 

There are certainly other fields with PII, so it's not as if this blacklist is by itself a security measure.
> But why are we blacklisting certain fields?

I've traced the blacklisting of 'email', 'url', 'user_id' via a single 'sanitize' method to May of 2009, predating our adoption of HBase.  Removal of those keys from the processed crash goes further back, to at least February of 2009.  The comment of that commit that resulted in the removal of that info from the processed_crash was,"we're no longer allowed to save the provided email address".  I do not have records of who made the edict, but I do recall that it was considered an "emergency" issue resulting in immediate action.  

The 'sanitizing' of the processed crash has continued to this day, having been ported from the original old processor to the Processo2012 implementation.  Aside from adding 'exploitability' to the list of 'forbidden_keys', the redaction system of Bug 930310 is the first re-visitation of that code since it was first written.  Because the json MDSW puts sensitive information into the processed_crash, and we want that data retrievable, I made the decision to allow sensitive information into the processed_crash.  In this latest version, redaction happens at retrieval time rather than storage time.  

If we need to change this system, then we should act quickly, it is to be pushed to production this week.  We're poised to adopt the new json MDSW.  If the redaction methods from Bug 930310 are wrong/undesirable/insufficient then json MDSW is blocked.

> In particular does this mean that hbase doesn't have the exploitability data at all?

That is correct, 'exploitability' is not in HBase.
(In reply to K Lars Lohn [:lars] [:klohn] from comment #12)
> The comment of that commit that resulted in the removal
> of that info from the processed_crash was,"we're no longer allowed to save
> the provided email address".  I do not have records of who made the edict,
> but I do recall that it was considered an "emergency" issue resulting in
> immediate action.

That sounds like a privacy decision to me - do we need to have a privacy review on storing this data to HBase with jMDSW?
With the new system from bug 930310, the exploitability data would now be in hbase?

I presume that the last paragraph of comment 12 was a question for Laura? I don't remember the history in 2009 unless there were bug numbers associated with those commits, but we certainly are storing email addresses and URLs in hbase already (in the raw crash), so it's not like we're storing anything *more* now.
Flags: needinfo?(laura)
> With the new system from bug 930310, the exploitability data would now be in hbase?

yes, in the code associated with Bug 930310, 'exploitablity' will be saved in the processed_crash within HBase.  That key is removed (along with the others on the blacklist) from the return result at the time of a call to the middleware 'get_processed_crash' service.  There is a PR pending that adds a new service to return an unredacted processed crash complete with _all_ the sensitive fields.
Of course, I've again mis-pasted a bug number.  In each of my previous posts where I mention Bug 930310, I meant to have referred to Bug 930324.
Assignee: nobody → lars
Target Milestone: --- → 65
Attachment #825856 - Flags: review?(rhelmer) → review+
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/631a304e14adfe4de27eae9441572b7cdabeb6bf
Merge pull request #1643 from twobraids/whitelist

Fixes Bug 932944 - added multidump support to whitelist
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Bumping to verified as [qa-].
Status: RESOLVED → VERIFIED
Whiteboard: [qa-]
Discussed redaction with lars offline.
Flags: needinfo?(laura)
Flags: needinfo?(laura)
This was marked FIXED, but I still don't see the other minidumps when I load the URL from comment 0, nor a newer one I just tried to load: https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=e167ea3d-8732-4bae-a403-352e32131205
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
peterbe: I've verified that the underlying crash storage for HBase is delivering the correct data.  I've verified the fields in question are in the white list.   However, it doesn't appear that the additions to the whitelist have been honored.   Do you have suggestions as to where to look next?
Flags: needinfo?(peterbe)
(In reply to K Lars Lohn [:lars] [:klohn] from comment #21)
> peterbe: I've verified that the underlying crash storage for HBase is
> delivering the correct data.  I've verified the fields in question are in
> the white list.   However, it doesn't appear that the additions to the
> whitelist have been honored.   Do you have suggestions as to where to look
> next?

If it's not in https://github.com/mozilla/socorro/blob/master/webapp-django/crashstats/crashstats/models.py#L721 that would be the next thing to work on. 
This bug was kinda messy and some many chefs in the kitchen I ducked out of the conversation so I'm not entirely sure what's been going on.
Flags: needinfo?(peterbe)
Assignee: lars → peterbe
whitelist should contain:

upload_file_minidump_browser
upload_file_minidump_flash1
upload_file_minidump_flash2
json_dump
It really needs to be upload_file_minidump_<anythinghere>. If we need to special-case the whitelist for upload_file_minidump, we should do that.
"special-case"?? 
If, by that, you mean do support something like `upload_file_minidump_*` we'll have to do that then. 

At the moment, the white list filtering does not support anything smarter than a list of keys.
Is that what you mean by "special-case" (see comment above). That we need support for wildcards?
Flags: needinfo?(benjamin)
Yes. I'm not sure that generic wildcard support is worth it if we could hardcode this special key.
Flags: needinfo?(benjamin)
I noticed that those 4 were added to RawCrash too
https://github.com/mozilla/socorro/commit/14a2b9800ec9b2b82793a86b3664c326e1fda275
I'll change that to `upload_file_minidump_*` too. Right lars?
Flags: needinfo?(lars)
Attached file Github PR
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/a59342e8f6c3663e03671322bc23dbcf3d766e4e
fixes bug 932944 - Add processed_crash keys to the API whitelist, r=adrian
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Target Milestone: 65 → 72
Flags: needinfo?(lars)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: