Open Bug 1266099 Opened 8 years ago Updated 7 years ago

Store parts or all of the `json_dump` in Elasticsearch

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

People

(Reporter: adrian, Unassigned)

Details

We currently redact the json_dump field from our processed crashes before pushing them into Elasticsearch. We do that because that field is huuuuge and would cost us quite a lot of money in disk space in AWS. However, that field contains data that would be greatly beneficial to our users. We thus want to find a solution to have that useful data in Elasticsearch without increasing the disk space we use too much.

We have several options to achieve that.

1. Store the `json_dump` only for a reduced amount of time, for example, 2 weeks. That means users would be able to access it for recent crashes, but not for things older than 2 weeks. This solution is not trivial and will require some serious Elasticsearch-related work. 

2. Ted proposed today on IRC that we store the json_dump stripped of its "threads" and with just the top 10 frames of the "crashing_thread". We already have a class that removes the `json_dump`, so it wouldn't be hard to make one that removes parts of it instead. We need to see what is needed and what is not, and check whether the space savings are big enough that we can afford to do that.

Any other proposal is very welcome.
Clarification: `crashing_thread` is already limited to 10 frames, so simply removing the `threads` key should greatly reduce the size of the json_dump.
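
A minimal sketch of what option 2 could look like, assuming a processed crash is a plain dict. The name `strip_threads` is hypothetical and is not Socorro's existing redactor class; only the `json_dump` and `threads` keys come from the discussion above.

```python
import copy


def strip_threads(processed_crash):
    """Return a copy of the processed crash without json_dump['threads'].

    Hypothetical helper for illustration, not part of Socorro.
    """
    redacted = copy.deepcopy(processed_crash)
    json_dump = redacted.get("json_dump")
    if isinstance(json_dump, dict):
        # pop() with a default is a no-op when the key is already absent.
        json_dump.pop("threads", None)
    return redacted
```

Working on a copy leaves the original processed crash intact, so the full document could still be saved elsewhere while the trimmed one goes to Elasticsearch.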
I tested an arbitrary processed crash from about:crashes on my Linux machine:
```
$ curl -L 'https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=a998eb6c-2606-4e63-9a1b-08242160330&datatype=processed' > processed-crash.json
$ ls -l processed-crash.json 
-rw-rw-r-- 1 luser luser 140871 Apr 20 09:54 processed-crash.json

$ python -c "import json; j = json.load(open('processed-crash.json', 'r')); del j['json_dump']['threads']; json.dump(j, open('processed-crash-no-threads.json', 'w'), indent=0)"

$ ls -l processed-crash-no-threads.json
-rw-rw-r-- 1 luser luser 59479 Apr 20 14:16 processed-crash-no-threads.json
```

The resulting JSON with the `threads` key removed is less than half the size of the original in this case (59479 vs. 140871 bytes). I suspect the savings are likely to be higher on average, since some crashes have a lot of threads, and some threads have a lot of stack frames.
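
To check whether that holds on average, the same measurement could be run over a sample of processed crashes. This is only a sketch: `threads_share` is a hypothetical name, and it compares compact `json.dumps` output rather than file sizes on disk, so the numbers will not match the `ls -l` figures above exactly.

```python
import json


def threads_share(processed_crash):
    """Return the fraction of serialized size saved by dropping 'threads'."""
    json_dump = processed_crash.get("json_dump")
    if not isinstance(json_dump, dict):
        # Nothing to trim, so nothing is saved.
        return 0.0
    full_size = len(json.dumps(processed_crash))
    # Rebuild json_dump without 'threads', leaving the input untouched.
    trimmed_dump = {k: v for k, v in json_dump.items() if k != "threads"}
    trimmed = dict(processed_crash, json_dump=trimmed_dump)
    return 1.0 - len(json.dumps(trimmed)) / float(full_size)
```

Averaging this value over, say, a day of crashes would give a rough idea of the disk space involved before committing to either option.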
Can we just include a specific set of fields that we really care about searching on? In the short term I just care about those three specific memory-related fields.

Longer-term it might be nice to have the module list, but that may be solvable without putting the data into ES.
Obviously for your short-term use case with the memory info a targeted fix is warranted. I think that if we're going to bother putting any of the json_dump in ES we might as well put almost everything in, so that we don't wind up having to maintain a whitelist and continually adding things as people realize they want them.
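
The trade-off between the two approaches can be sketched as follows. Both function names are hypothetical, and the keys passed in the usage lines are placeholders rather than real `json_dump` field names.

```python
def keep_only(json_dump, wanted_keys):
    """Allowlist: keep just the named top-level keys."""
    return {k: v for k, v in json_dump.items() if k in wanted_keys}


def drop_keys(json_dump, unwanted_keys):
    """Blocklist: keep everything except the named top-level keys."""
    return {k: v for k, v in json_dump.items() if k not in unwanted_keys}


# Placeholder data: 'field_a' and 'field_b' stand in for real fields.
dump = {"threads": ["..."], "field_a": 1, "field_b": 2}
allowlisted = keep_only(dump, {"field_a"})
blocklisted = drop_keys(dump, {"threads"})
```

With the allowlist, `field_b` is lost until someone adds it to the list. With the blocklist it is indexed automatically, which is the maintenance argument made above.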
There are multiple things in there that would be interesting. The module list is among them (but it's pretty large). The registers are another thing I'd be interested in having search capability for (ideally, things like "crashes which have any register starting with 0xe5e5e5" would be interesting to look at, for example), and I recently got a question about being able to search for the thread_count.
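
As an illustration of the register use case, here is a sketch of the matching logic only, done in Python rather than as an Elasticsearch query. It assumes the registers are available as a mapping from register name to a hex string such as `"0xe5e5e5e5"`; that shape and the name `registers_with_prefix` are assumptions, not something confirmed in this bug.

```python
def registers_with_prefix(registers, prefix):
    """Return sorted names of registers whose value starts with prefix.

    Assumes values are hex strings like '0xe5e5e5e5'. The comparison is
    case-insensitive.
    """
    prefix = prefix.lower()
    return sorted(
        name
        for name, value in registers.items()
        if str(value).lower().startswith(prefix)
    )


# Made-up register values for illustration.
regs = {"eax": "0xe5e5e5e5", "ebx": "0x00000000", "ecx": "0xE5E5E5AB"}
matches = registers_with_prefix(regs, "0xe5e5e5")
```

If the register values were indexed as strings, a prefix-style search could express the same condition on the Elasticsearch side, though how to map those fields is left open here.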
Assignee: adrian → nobody