Closed Bug 1624345 Opened 5 years ago Closed 4 years ago

stop saving fields not in the schema to Elasticsearch

Categories: Socorro :: Processor, task, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: willkg, Assigned: willkg
Attachments: 1 file

The Elasticsearch crash storage builds a document using the schema in super_search_fields.py. Then, it adds any fields in the raw and processed crash data objects that aren't in the schema.
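To make the behavior concrete, here is a minimal sketch of the two ways a document could be built. It is not the Socorro implementation: the `SCHEMA` dict, the `build_document` function, and the `strict` flag are all hypothetical stand-ins, and the real schema in super_search_fields.py has a different shape.

```python
# Hypothetical stand-in for the schema: namespace -> allowed field names.
# The real schema lives in super_search_fields.py and is structured differently.
SCHEMA = {
    "raw_crash": {"ProductName", "Version"},
    "processed_crash": {"signature", "uuid"},
}


def build_document(raw_crash, processed_crash, schema=SCHEMA, strict=False):
    """Build a document from the raw and processed crash data.

    With strict=False, every field is copied, including fields that are
    not in the schema (the behavior this bug questions). With strict=True,
    only fields named in the schema are kept.
    """
    sources = {"raw_crash": raw_crash, "processed_crash": processed_crash}
    doc = {}
    for namespace, data in sources.items():
        allowed = schema.get(namespace, set())
        doc[namespace] = {
            key: value
            for key, value in data.items()
            if not strict or key in allowed
        }
    return doc
```

Calling `build_document(raw, processed, strict=True)` drops anything the schema doesn't name, while the default keeps every field.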

The advantage to doing it this way is that future us can search data in fields that aren't in the schema as issues come up. This has helped us significantly in a few circumstances over the years, but doesn't come up very often and we could work around it if we couldn't do this anymore.

The disadvantages are threefold:

First, it uses up space in Elasticsearch.

Second, when we do add new fields to the schema, existing data has already been indexed, often differently from how the schema specifies. We can't add the right bits to the document mapping because they conflict with bits already there.

Third, we're indexing a bunch of data we don't know anything about and that seems insane.

This bug covers figuring out whether we should continue this practice.

Making this a P3--it's not super urgent.

We have document sizes so we can see egregious non-schema field values. Further, if we make any changes, we can see the effects of those changes.

I think I want to look at how much additional space we're using in Elasticsearch. Probably best to compute the document size with everything and then with just fields in the schemas and spit out the delta. I can probably look at that locally and see whether it's worth pursuing further.

After that, I want to work out how I would investigate issues if we didn't have this data in Elasticsearch. We'd have to pull raw and processed crash data from AWS S3. How much longer would that take? Does it cost more?

It's entirely possible the answer here is obvious, but it's worth verifying regardless.

Priority: -- → P3

I want to do a quick sanity check on this:

  1. Tweak the code so it also builds an ES document restricted to fields in the schema and emits that document's size as a gauge metric.
  2. Process 1000 Firefox nightly crashes locally.
  3. Analyze the "doc size" and "doc restricted to schema size":
    1. Sum doc size
    2. Sum restricted doc size
    3. Sum of delta
    4. Maximum delta
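The analysis in step 3 can be sketched roughly as follows. This is only an illustration under assumptions: documents are treated as flat dicts, size is measured as the length of the compact JSON serialization, and `doc_size`, `restrict_to_schema`, and `summarize` are hypothetical names, not functions from the Socorro codebase.

```python
import json


def doc_size(doc):
    """Size in bytes of the document serialized as compact UTF-8 JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def restrict_to_schema(doc, schema_keys):
    """Return a copy of doc with only the keys named in the schema."""
    return {key: value for key, value in doc.items() if key in schema_keys}


def summarize(docs, schema_keys):
    """Sum full and restricted sizes, and track the largest per-doc delta."""
    total = strict_total = max_delta = 0
    for doc in docs:
        size = doc_size(doc)
        strict_size = doc_size(restrict_to_schema(doc, schema_keys))
        total += size
        strict_total += strict_size
        max_delta = max(max_delta, size - strict_size)
    return {
        "total_bytes": total,
        "strict_bytes": strict_total,
        "delta": total - strict_total,
        "max_delta": max_delta,
    }
```

Dividing `delta` by `total_bytes` gives the fraction of space taken up by non-schema fields.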

I think that should be straightforward to do and will be illuminating.

Grabbing this to do soon since it could result in a useful reduction in ES costs/usage.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: P3 → P2

I looked at 1000 Firefox crash reports and 100 Fenix crash reports. The latter is more likely to have keys that aren't in super search fields. The former is more likely to have "interesting things".

Key                    Value
Total records          1,100
Total original bytes   18,116,819
Total strict bytes     14,902,828
Delta                  3,213,991 (17.74%)
Mean delta             2,770
Max delta              184,372

The total "savings" is about 17% if we restrict to what's in super search fields.

The interesting item is the max delta. There's one outlier in the documents which has a ton of data in it.

I can look at other slices of crash reports, but I feel like this is good enough for a go/no-go on the idea. I don't think this is a lot of work to implement if we decide to go with it.

Brian: What do you think? Is 17% worth it and/or helpful?

Flags: needinfo?(bpitts)

Purely from a cost perspective, needing 17% less storage in 6 months for Elasticsearch is not worth it.

Your other arguments in favor of doing it sound good though, especially if it's simple to implement.

Flags: needinfo?(bpitts)

I think it'd take me a day or two to figure out and test. I'll look into it soon.

I have no idea what I meant by comment #5, so I'm going to move forward with removing this functionality.

I think I can do everything in this bug, so I'm going to un-tracker-ify it.

Summary: [tracker] stop saving fields not in the schema to Elasticsearch → stop saving fields not in the schema to Elasticsearch
Blocks: 1635508

This went out with bug #1644734. Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED