Open Bug 1266099 Opened 9 years ago Updated 8 years ago

Store parts or all of the `json_dump` in Elasticsearch

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: adrian, Unassigned)

Details

[DEACTIVATED] Adrian Gaudebert

Reporter

Description

9 years ago

We currently redact the `json_dump` field from our processed crashes before pushing them into Elasticsearch. We do that because that field is huge and would cost us quite a lot of money in disk space in AWS. However, that field contains data that would be greatly beneficial to our users. We thus want to find a way to have that useful data in Elasticsearch without increasing the disk space we use too much.

We have several options to achieve that.

1. Store the `json_dump` only for a reduced amount of time, for example, 2 weeks. That means users would be able to access it for recent crashes, but not for anything older than 2 weeks. This solution is not trivial and will require some serious Elasticsearch-related work.
2. Ted proposed today on IRC that we store the `json_dump` stripped of its "threads" and with just the top 10 frames of the "crashing_thread". We already have a class that removes the `json_dump`; it wouldn't be hard to make one that removes parts of it instead. We need to work out which parts are needed and which are not, and check whether the space savings are big enough that we can afford doing that.

Any other proposal is very welcome.
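To make option 2 concrete, here is a minimal sketch of removing only parts of the `json_dump` instead of the whole field. The function name `strip_json_dump` and its `drop_keys` parameter are hypothetical and are not taken from the existing redaction class; the sample crash is a made-up, heavily shortened structure.

```python
import copy


def strip_json_dump(processed_crash, drop_keys=("threads",)):
    """Return a copy of processed_crash with selected json_dump keys removed.

    Hypothetical helper: the name and signature are illustrative only.
    """
    slim = copy.deepcopy(processed_crash)
    json_dump = slim.get("json_dump")
    if isinstance(json_dump, dict):
        for key in drop_keys:
            # pop() with a default so a missing key is not an error.
            json_dump.pop(key, None)
    return slim


# Made-up, heavily shortened processed crash.
crash = {
    "uuid": "example",
    "json_dump": {
        "crashing_thread": {"frames": [{"frame": 0}]},
        "threads": [{"frames": [{"frame": 0}, {"frame": 1}]}],
    },
}

slim = strip_json_dump(crash)
print(sorted(slim["json_dump"]))  # ['crashing_thread']
```

The input is deep-copied so the original processed crash can still be saved in full elsewhere.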

(not currently active) Ted Mielczarek

Comment 1

9 years ago

Clarification: `crashing_thread` is already limited to 10 frames, so simply removing the `threads` key should greatly reduce the size of the json_dump.

(not currently active) Ted Mielczarek

Comment 2

9 years ago

I tested an arbitrary processed crash from about:crashes on my Linux machine:

```
$ curl -L 'https://crash-stats.mozilla.com/api/ProcessedCrash/?crash_id=a998eb6c-2606-4e63-9a1b-08242160330&datatype=processed' > processed-crash.json
$ ls -l processed-crash.json
-rw-rw-r-- 1 luser luser 140871 Apr 20 09:54 processed-crash.json
$ python -c "import json; j = json.load(open('processed-crash.json', 'rb')); del j['json_dump']['threads']; json.dump(j, open('processed-crash-no-threads.json', 'wb'), indent=0)"
$ ls -l processed-crash-no-threads.json
-rw-rw-r-- 1 luser luser 59479 Apr 20 14:16 processed-crash-no-threads.json
```

The resulting JSON with the `threads` key removed is less than half the size of the original in this case. I suspect the savings are likely to be higher on average, since some crashes have a lot of threads, and some threads have a lot of stack frames.
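To check that suspicion across more than one crash, the same measurement could be wrapped in a small function. This is only a sketch: `size_without_threads` is a hypothetical name, and it serializes both versions the same way so that formatting differences do not affect the comparison. For reference, the numbers above work out to roughly 42% of the original size (59479 / 140871).

```python
import json


def size_without_threads(processed_crash):
    """Return (original_bytes, stripped_bytes) for one processed crash.

    Hypothetical helper; sizes are those of compact UTF-8 encoded JSON.
    """
    original = len(json.dumps(processed_crash).encode("utf-8"))

    slim = dict(processed_crash)
    json_dump = processed_crash.get("json_dump")
    if isinstance(json_dump, dict):
        # Build a new dict so the caller's crash is left untouched.
        slim["json_dump"] = {k: v for k, v in json_dump.items() if k != "threads"}

    stripped = len(json.dumps(slim).encode("utf-8"))
    return original, stripped


# Made-up, heavily shortened processed crash.
sample = {
    "uuid": "example",
    "json_dump": {
        "crashing_thread": {"frames": [{"frame": 0}]},
        "threads": [{"frames": [{"frame": n} for n in range(50)]}],
    },
}

original, stripped = size_without_threads(sample)
print("kept {:.0%} of the original size".format(stripped / original))
```

Running this over a sample of real processed crashes, rather than the toy dictionary here, would give the average figure the comment speculates about.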

Benjamin Smedberg

Comment 3

9 years ago

Can we just include a specific set of fields that we really care about searching on? In the short term I just care about those three specific memory-related fields. Longer-term it might be nice to have the module list, but that may be solvable without putting the data into ES.
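A minimal sketch of that "specific set of fields" approach, assuming the wanted fields can be named by dotted paths into the `json_dump`. The helper name `pick_fields` and the example paths are hypothetical; the three memory-related fields are not named in this bug, so they are not reproduced here.

```python
def pick_fields(json_dump, paths):
    """Return {path: value} for each dotted path present in json_dump.

    Hypothetical helper; paths that do not resolve are silently skipped.
    """
    picked = {}
    for path in paths:
        node = json_dump
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                break
            node = node[part]
        else:
            # Only reached when every part of the path was found.
            picked[path] = node
    return picked


# Made-up json_dump; the keys are illustrative assumptions.
json_dump = {
    "thread_count": 12,
    "system_info": {"os": "Linux"},
    "threads": [{"frames": []}],
}

wanted = ("thread_count", "system_info.os", "missing.key")
print(pick_fields(json_dump, wanted))
# {'thread_count': 12, 'system_info.os': 'Linux'}
```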

(not currently active) Ted Mielczarek

Comment 4

9 years ago

Obviously for your short-term use case with the memory info a targeted fix is warranted. I think that if we're going to bother putting any of the `json_dump` in ES we might as well put almost everything in, so that we don't wind up having to maintain a whitelist and continually add things as people realize they want them.

Robert Kaiser

Comment 5

9 years ago

There are multiple things in there that would be interesting. The module list is among them (but it's pretty large). The registers are another thing I'd like to be able to search on (ideally, queries like "crashes which have any register starting with 0xe5e5e5" would be interesting to look at, for example). And I recently got a question about being able to search for the thread_count.
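As an illustration of the register query, here is a sketch of the same check done in plain Python on a single `json_dump`. It assumes the registers are stored as a name-to-hex-string mapping on the first frame of `crashing_thread`; that layout, the function name `registers_with_prefix`, and the sample values are assumptions for illustration.

```python
def registers_with_prefix(json_dump, prefix):
    """Return the crashing frame's registers whose value starts with prefix.

    Assumes registers live at crashing_thread.frames[0].registers as
    hex strings; this layout is an assumption, not confirmed here.
    """
    frames = json_dump.get("crashing_thread", {}).get("frames", [])
    if not frames:
        return {}
    registers = frames[0].get("registers", {})
    prefix = prefix.lower()
    return {
        name: value
        for name, value in registers.items()
        if value.lower().startswith(prefix)
    }


# Made-up register values for a 32-bit crash.
json_dump = {
    "crashing_thread": {
        "frames": [{"registers": {"eax": "0xe5e5e5e5", "ebx": "0x00000010"}}]
    }
}

print(registers_with_prefix(json_dump, "0xe5e5e5"))  # {'eax': '0xe5e5e5e5'}
```

A plain string prefix only works if register values are stored without leading zero padding, so zero-padded 64-bit values would need normalizing before indexing.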

[DEACTIVATED] Adrian Gaudebert

Reporter

Updated

8 years ago

Assignee: adrian → nobody

Categories

(Socorro :: Backend, task)

Security

(public)
