stop saving fields not in the schema to Elasticsearch
Categories
(Socorro :: Processor, task, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Attachments
(1 file)
The Elasticsearch crash storage builds a document using the schema in super_search_fields.py. Then, it adds any fields in the raw and processed crash data objects that aren't in the schema.
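A minimal sketch of the two behaviors, for illustration only. `SCHEMA_FIELDS`, `build_document`, and the `strict` flag are hypothetical names, and the schema shape here is a simplified assumption rather than the actual structure of super_search_fields.py:

```python
# Hypothetical, simplified stand-in for the schema in super_search_fields.py.
# Maps a field name to the crash data object ("namespace") it comes from.
SCHEMA_FIELDS = {
    "ProductName": "raw_crash",
    "Version": "raw_crash",
    "signature": "processed_crash",
}


def build_document(raw_crash, processed_crash, strict=False):
    """Build a document to index from raw and processed crash data.

    With strict=False, fields missing from the schema are copied over as
    well, which is the behavior this bug describes. With strict=True, only
    schema fields are kept.
    """
    sources = {"raw_crash": raw_crash, "processed_crash": processed_crash}
    doc = {"raw_crash": {}, "processed_crash": {}}

    # First pass: copy the fields the schema knows about.
    for name, namespace in SCHEMA_FIELDS.items():
        if name in sources[namespace]:
            doc[namespace][name] = sources[namespace][name]

    # Second pass: copy everything else, unless we are being strict.
    if not strict:
        for namespace, data in sources.items():
            for name, value in data.items():
                doc[namespace].setdefault(name, value)

    return doc
```

Calling `build_document(raw, processed, strict=True)` would drop any key that isn't listed in the schema, which is the change under consideration here.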
The advantage to doing it this way is that future us can search data in fields that aren't in the schema as issues come up. This has helped us significantly in a few circumstances over the years, but doesn't come up very often and we could work around it if we couldn't do this anymore.
The disadvantages are threefold:
First, it uses up space in Elasticsearch.
Second, when we do add new fields to the schema, there's all this existing data already, and it was often indexed differently from what we specify in the schema. We can't add the right bits to the document mapping because they conflict with bits already there.
Third, we're indexing a bunch of data we don't know anything about and that seems insane.
This bug covers figuring out whether we should continue this practice.
Comment 1 (Assignee) • 5 years ago
Making this a P3--it's not super urgent.
We have document sizes so we can see egregious non-schema field values. Further, if we make any changes, we can see the effects of those changes.
I think I want to look at how much additional space we're using in Elasticsearch. Probably best to compute the document size with everything and then with just fields in the schemas and spit out the delta. I can probably look at that locally and see whether it's worth pursuing further.
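One way the per-document delta could be computed, as a rough sketch. It assumes a flat document and uses the length of the compact JSON serialization as a proxy for size, which is not the same as what Elasticsearch actually uses on disk. The function names are hypothetical:

```python
import json


def document_size(doc):
    """Return the size in bytes of the document serialized as compact JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def restrict_to_schema(doc, schema_fields):
    """Return a copy of a flat document keeping only keys in schema_fields."""
    return {key: value for key, value in doc.items() if key in schema_fields}


def size_delta(doc, schema_fields):
    """Return (full_size, strict_size, delta) in bytes for one document."""
    full_size = document_size(doc)
    strict_size = document_size(restrict_to_schema(doc, schema_fields))
    return full_size, strict_size, full_size - strict_size
```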
After that, I want to figure out how I'd investigate issues if we didn't have this data in Elasticsearch. We'd have to pull raw and processed crash data from AWS S3. How much longer would that take? Does it cost more?
It's entirely possible the answer here is obvious, but it's worth verifying regardless.
Comment 2 (Assignee) • 4 years ago
I want to do a quick sanity check on this:
- Tweak the code so it builds an ES document that is restricted to things in the schema and emits that document's size as a gauge metric.
- Process 1000 Firefox nightly crashes locally.
- Run analysis on the "doc size" and "doc restricted to schema size":
  - Sum doc size
  - Sum restricted doc size
  - Sum of delta
  - Maximum delta
I think that should be straightforward to do and will be illuminating.
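The analysis step could look something like this sketch. It assumes the two gauge values have already been collected as `(doc_size, strict_doc_size)` pairs in bytes; `summarize` and the key names are hypothetical:

```python
def summarize(sizes):
    """Summarize (doc_size, strict_doc_size) pairs, all in bytes."""
    sizes = list(sizes)
    total_doc = sum(doc for doc, _ in sizes)
    total_strict = sum(strict for _, strict in sizes)
    # Per-document difference between the full and the restricted document.
    deltas = [doc - strict for doc, strict in sizes]
    return {
        "records": len(sizes),
        "total_doc_size": total_doc,
        "total_strict_size": total_strict,
        "total_delta": sum(deltas),
        "max_delta": max(deltas, default=0),
    }
```

For example, `summarize([(100, 80), (50, 50), (300, 100)])` gives a total delta of 220 and a maximum delta of 200.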
Grabbing this to do soon since it could result in a useful reduction in ES costs/usage.
Comment 3 (Assignee) • 4 years ago
I looked at 1000 Firefox crash reports and 100 Fenix crash reports. The latter is more likely to have keys that aren't in super search fields. The former is more likely to have "interesting things".
Key | Value |
---|---|
Total records | 1,100 |
Total original bytes | 18,116,819 |
Total strict bytes | 14,902,828 |
Delta | 3,213,991 (17.74%) |
Mean delta | 2,770 |
Max delta | 184,372 |
The total "savings" is about 17% if we restrict to what's in super search fields.
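The delta and percentage follow directly from the two totals in the table. A quick check in Python (variable names are just for illustration):

```python
total_original = 18_116_819  # bytes, "Total original bytes" from the table
total_strict = 14_902_828    # bytes, "Total strict bytes" from the table

delta = total_original - total_strict
percent_saved = 100 * delta / total_original

print(delta)                    # 3213991
print(round(percent_saved, 2))  # 17.74
```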
The interesting item is the max delta. There's one outlier in the documents which has a ton of data in it.
I can look at other slices of crash reports, but I feel like this is good enough for a go/no-go on the idea. I don't think this is a lot of work to implement if we decide to go with it.
Brian: What do you think? Is 17% worth it and/or helpful?
Comment 4 • 4 years ago
Purely from a cost perspective, needing 17% less storage in 6 months for Elasticsearch is not worth it.
Your other arguments in favor of doing it sound good though, especially if it's simple to implement.
Comment 5 (Assignee) • 4 years ago
I think it'd take me a day or two to figure out and test. I'll look into it soon.
Comment 6 (Assignee) • 4 years ago
I have no idea what I meant by comment #5, so I'm going to move forward with removing this functionality.
I think I can do everything in this bug, so I'm going to un-tracker-ify it.
Comment 9 (Assignee) • 4 years ago
This went out with bug #1644734. Marking as FIXED.