Open
Bug 1266099
Opened 9 years ago
Updated 8 years ago
Store parts or all of the `json_dump` in Elasticsearch
Categories
(Socorro :: Backend, task)
Tracking
(Not tracked)
NEW
People
(Reporter: adrian, Unassigned)
Details
We currently redact the json_dump field from our processed crashes before pushing them into Elasticsearch. We do that because that field is huuuuge and would cost us quite a lot of money in disk space on AWS. However, that field contains data that would be greatly beneficial to our users. We thus want to find a way to have that useful data in Elasticsearch without increasing the disk space we use too much.
We have several options to achieve that.
1. Store the `json_dump` only for a reduced amount of time, for example, 2 weeks. That means users would be able to access it for recent crashes, but not for things older than 2 weeks. This solution is not trivial and will require some serious Elasticsearch-related work.
2. Ted proposed today on IRC that we store the json_dump stripped of its "threads" and with just the top 10 frames of the "crashing_thread". We already have a class that removes the `json_dump`, so it wouldn't be hard to make one that removes parts of it instead. We need to work out which parts are needed and which are not, and check whether the space savings are big enough that we can afford doing that.
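A minimal sketch of what option 2 could look like, assuming a processed crash is a plain dict whose `json_dump` holds a `threads` list and a `crashing_thread` dict with a `frames` list. The function name `trim_json_dump` and the `frames` key are assumptions made for illustration; this is not the existing Socorro redactor class.

```python
import copy

MAX_FRAMES = 10  # frame cap mentioned in the proposal


def trim_json_dump(processed_crash, max_frames=MAX_FRAMES):
    """Return a copy of the crash whose json_dump has no "threads" key
    and at most `max_frames` frames in "crashing_thread".

    Hypothetical helper: the key layout is assumed, not taken from Socorro.
    """
    trimmed = copy.deepcopy(processed_crash)
    json_dump = trimmed.get("json_dump")
    if not isinstance(json_dump, dict):
        # Nothing to trim (field missing or already redacted).
        return trimmed

    # Drop the per-thread stacks, which make up most of the size.
    json_dump.pop("threads", None)

    # Keep only the top frames of the crashing thread.
    crashing_thread = json_dump.get("crashing_thread")
    if isinstance(crashing_thread, dict):
        frames = crashing_thread.get("frames")
        if isinstance(frames, list):
            crashing_thread["frames"] = frames[:max_frames]

    return trimmed
```

Working on a deep copy keeps the full crash intact for other storage backends, so only the document sent to Elasticsearch is reduced.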
Any other proposal is very welcome.
Comment 1•9 years ago
Clarification: `crashing_thread` is already limited to 10 frames, so simply removing the `threads` key should greatly reduce the size of the json_dump.
Comment 2•9 years ago
I tested an arbitrary processed crash from about:crashes on my Linux machine:
```
$ curl -L 'https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=a998eb6c-2606-4e63-9a1b-08242160330&datatype=processed' > processed-crash.json
$ ls -l processed-crash.json
-rw-rw-r-- 1 luser luser 140871 Apr 20 09:54 processed-crash.json
$ python -c "import json; j = json.load(open('processed-crash.json', 'rb')); del j['json_dump']['threads']; json.dump(j, open('processed-crash-no-threads.json', 'wb'), indent=0)"
$ ls -l processed-crash-no-threads.json
-rw-rw-r-- 1 luser luser 59479 Apr 20 14:16 processed-crash-no-threads.json
```
The resulting JSON with the `threads` key removed is less than half the size of the original in this case. I suspect the savings are likely to be higher on average, since some crashes have a lot of threads, and some threads have a lot of stack frames.
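To check that suspicion across more than one crash, the same measurement could be scripted. This is a rough sketch, assuming each processed crash has already been loaded into a dict; the function name is hypothetical. It serializes both versions the same way (compact JSON, UTF-8) so the two sizes are directly comparable.

```python
import json


def sizes_with_and_without_threads(processed_crash):
    """Return (original_size, reduced_size) in bytes of the serialized crash,
    where the reduced version has json_dump["threads"] removed.

    Hypothetical helper for estimating space savings.
    """
    original_size = len(json.dumps(processed_crash).encode("utf-8"))

    json_dump = processed_crash.get("json_dump")
    if not isinstance(json_dump, dict):
        # No json_dump to reduce, so both sizes are the same.
        return original_size, original_size

    # Build a reduced copy without mutating the input.
    reduced_dump = {k: v for k, v in json_dump.items() if k != "threads"}
    reduced_crash = dict(processed_crash, json_dump=reduced_dump)
    reduced_size = len(json.dumps(reduced_crash).encode("utf-8"))

    return original_size, reduced_size
```

Running this over a sample of crashes and summing both columns would give an estimate of the overall disk space impact.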
Comment 3•9 years ago
Can we just include a specific set of fields that we really care about searching on? In the short term I just care about those three specific memory-related fields.
Longer-term it might be nice to have the module list, but that may be solvable without putting the data into ES.
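For comparison with the "remove parts" approach, an include-list could be sketched as below. The three memory-related fields are not named in this bug, so the example uses `thread_count` and `crashing_thread`, keys that are mentioned elsewhere in the thread; the function name and the allowed-keys tuple are purely illustrative.

```python
def pick_fields(json_dump, allowed_keys):
    """Return a new dict holding only the allowed top-level keys of json_dump.

    Hypothetical helper: keys missing from json_dump are silently skipped.
    """
    return {key: json_dump[key] for key in allowed_keys if key in json_dump}


# Illustrative include-list, not a decided set of fields.
ALLOWED_KEYS = ("thread_count", "crashing_thread")

example_dump = {
    "thread_count": 42,
    "crashing_thread": {"frames": []},
    "threads": [{"frames": []}],
}
indexed = pick_fields(example_dump, ALLOWED_KEYS)
```

With this shape, adding a field later means editing the include-list, which is the maintenance cost being weighed in this discussion.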
Comment 4•9 years ago
Obviously for your short-term use case with the memory info a targeted fix is warranted. I think that if we're going to bother putting any of the json_dump in ES we might as well put almost everything in, so that we don't wind up having to maintain a whitelist and continually add things as people realize they want them.
Comment 5•9 years ago
There are multiple things that would be interesting in there: the module list is among them (but it's pretty large), and the registers are another thing I'd like to have search capability for (ideally, things like "crashes which have any register starting in 0xe5e5e5" would be interesting to look at, for example). I also recently got a question about being able to search for the thread_count.
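The register query described above can be illustrated outside Elasticsearch with a small filter. This is only a sketch, assuming the registers of a frame are available as a dict mapping register names to hex strings; that layout and the function name are assumptions. Values are normalized through `int` so that zero-padded 64-bit values such as `0x00000000e5e5e5e5` also match.

```python
def registers_with_prefix(registers, prefix="0xe5e5e5"):
    """Return the sorted names of registers whose value starts with `prefix`.

    Hypothetical helper: `registers` maps register names to hex strings.
    """
    prefix = prefix.lower()
    matches = []
    for name, value in registers.items():
        try:
            # Normalize: drop zero padding and force lowercase hex digits.
            normalized = hex(int(value, 16))
        except (TypeError, ValueError):
            # Skip values that are not hex strings.
            continue
        if normalized.startswith(prefix):
            matches.append(name)
    return sorted(matches)
```

Supporting the same query inside Elasticsearch would depend on how the register values end up being indexed, which this bug has not settled.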
Updated•8 years ago
Assignee: adrian → nobody