Investigate our Elasticsearch database continually requiring more nodes
Status: RESOLVED FIXED
Reported: 2 years ago
Last updated: 9 months ago

People

(Reporter: adrian, Unassigned)

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
In the last few months, Miles has had to add new nodes to our prod Elasticsearch cluster regularly. We cannot grow exponentially, especially since our data is not supposed to get much bigger over time.

This bug is to investigate if and what might be wrong with our current system. It might be a configuration issue, or a data issue, or a retention issue, who knows? Let's figure it out.
(Reporter)

Comment 1

2 years ago
I did a few checks. It is not a retention policy problem; our cron job does a good job of removing old indices very regularly.
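
As a side note, the retention check itself boils down to picking date-suffixed indices older than the retention window. A minimal sketch (the index naming scheme and retention period here are assumptions, not Socorro's actual values):

```python
from datetime import datetime, timedelta

def indices_to_delete(index_names, retention_weeks=26, now=None):
    """Pick date-suffixed indices older than the retention window.

    Assumes indices are named like "socorro_reports_YYYYMMDD"; the real
    naming scheme and retention period may differ.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(weeks=retention_weeks)
    stale = []
    for name in index_names:
        try:
            suffix = name.rsplit("_", 1)[1]
            created = datetime.strptime(suffix, "%Y%m%d")
        except (IndexError, ValueError):
            continue  # skip names that don't carry a date suffix
        if created < cutoff:
            stale.append(name)
    return stale
```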

However, by looking at the data, I found out that there's a new field in the raw crash called "StackTraces", which is sometimes very big. See for example: https://crash-stats.mozilla.com/rawdumps/71284857-19a8-4bd7-9ce9-d84f50170504.json (requires being logged in). 

I suspect that field is being stored as a string, and since we are still using ES 1.4, that string is analyzed and put in the field_data, thus consuming a lot of memory and causing our cluster to continually require more and more nodes. It would be interesting to know when that field first started being sent in the raw crash, to see if there's a correlation. 

I can think of two possible solutions at the moment:
1. change the Elasticsearch crash storage to remove that field from the raw_crash, or
2. change our configuration so that this field is not analyzed anymore. It will still be stored in ES, and thus still consume disk space, but (if I understand correctly) won't be put in field_data anymore, and won't consume memory.
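
For reference, option 2 would look roughly like this in an ES 1.x mapping. This is only a sketch; the index and field layout are illustrative, not Socorro's actual mapping:

```python
# ES 1.x mapping fragment (expressed as a Python dict, as you'd pass it
# to a put-mapping call). Setting "index": "not_analyzed" stores the
# string verbatim: it stays on disk, but is not tokenized, so it no
# longer inflates fielddata memory when the field is touched.
mapping = {
    "raw_crash": {
        "properties": {
            "StackTraces": {
                "type": "string",
                "index": "not_analyzed",  # ES 1.x syntax for option 2
            }
        }
    }
}
```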

Miles, what do you think about this? Does it sound likely to you? Am I missing something?
Flags: needinfo?(miles)
Comment 2

2 years ago
It's a bit hard for me to say. If this field is in fact new (where "new" means within the last <6 months), that would explain our need for increased capacity (as older indices start having that field).

Do we need this field to be analyzed? Do our users care if it goes away entirely?
I don't know the answers to those questions, but I think it would be worth testing an index where we don't have those fields and comparing JVM heap usage / fielddata size growth with previous indices.
Flags: needinfo?(miles)
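
One way to run that comparison is to pull per-index fielddata memory from the cluster's _stats/fielddata endpoint and diff the numbers across indices with and without the field. A small helper for digesting that response (the response shape matches the ES 1.x stats API; the helper name is mine):

```python
def fielddata_by_index(stats):
    """Map index name -> fielddata memory (bytes) from an Elasticsearch
    /_stats/fielddata response body (already parsed from JSON)."""
    return {
        name: data["total"]["fielddata"]["memory_size_in_bytes"]
        for name, data in stats["indices"].items()
    }
```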
(Reporter)

Comment 3

2 years ago
Benjamin, I'm failing to find context about the "StackTraces" field that now appears in the raw crash. Could you provide some pointers about it? When was it first added? What's its value in Socorro? Can we ignore it? Thanks!
Flags: needinfo?(benjamin)

Comment 4

2 years ago
I don't really understand why that's showing up in the crash-stats metadata: that's what we're sending in the crash ping. Unless the point is to compare the crash-stats data with the crash ping data? In any case I can't imagine any reason for this to be indexed in ES, but I'll defer to gsvelto.
Flags: needinfo?(benjamin) → needinfo?(gsvelto)
Comment 5

2 years ago
That field was added in bug 1280477 and it's used to communicate the stack trace of a crash back to Gecko so that it can be sent in the crash ping. I hadn't thought that putting it in the .extra file would have caused it to end up in the data consumed by Socorro; I apologize for the disruption it caused.

Anyway, on the Socorro side it's useless (in fact, it's completely redundant); you can simply drop it from any future crash submission and delete whatever instances of it are still in the database.
Flags: needinfo?(gsvelto)
(Reporter)

Comment 6

2 years ago
OK, thanks for the context! 

Miles, so that field started being sent about 7 months ago. I guess that correlates with the growth of our cluster. 

I'm going to make a patch to remove that field from the raw crash before storing in Elasticsearch.
Assignee: nobody → adrian

Comment 8

a year ago
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/2faa9aed3ba4ee747e843d12439ea7b7c7e259ff
Fixes bug 1362048 - Redacted away StackTraces from raw_crash in ES. (#3771)

* Fixes bug 1362048 - Redacted away StackTraces from raw_crash in ES.
The raw_crash nowadays contains a key called "StackTraces" that holds a pretty big string. Storing it in Elasticsearch has a pretty high cost in terms of disk space and memory. This adds a new redactor, specifically made for the raw_crash, that we want to use every time we send a crash to Elasticsearch.

* Better variable usage.
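
For readers without access to the commit, the core of the redaction approach can be sketched like this (the function name and exact field handling are illustrative, not the actual Socorro redactor class):

```python
# Fields we don't want indexed in Elasticsearch; StackTraces is the one
# identified in this bug, the list shape is an assumption.
RAW_CRASH_REDACTED_FIELDS = ("StackTraces",)

def redact_raw_crash(raw_crash):
    """Return a copy of the raw crash without the redacted fields.

    Works on a copy so the original raw_crash (still needed for primary
    storage) is left untouched.
    """
    redacted = dict(raw_crash)
    for field in RAW_CRASH_REDACTED_FIELDS:
        redacted.pop(field, None)
    return redacted
```

The point of redacting at the crash-storage boundary (rather than at submission time) is that the field keeps flowing to the stores that need it while never reaching Elasticsearch.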

Updated

a year ago
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED
(Reporter)

Comment 9

a year ago
While verifying this I found out there's another huge field related to Telemetry that we store: `TelemetryEnvironment`. 

Gabriele, I assume this is the same kind of field, intended to be sent to Telemetry only, but that ends up being sent to Socorro as well?
Status: RESOLVED → REOPENED
Flags: needinfo?(gsvelto)
Resolution: FIXED → ---

Comment 10

a year ago
No, quite the opposite. The telemetry environment is supposed to eventually become the canonical source for correlation information, with structured data about installed addons/plugins/graphics environment, etc.
Flags: needinfo?(gsvelto)
(Reporter)

Comment 11

a year ago
Note that the removal of the StackTraces field works on stage.
(Reporter)

Comment 12

a year ago
Benjamin, if I understand correctly, it is a bit like a dump? It contains a lot of data about the crash in a "raw" form, and Socorro should interpret that data in its processing step and put it into other fields of the processed crash at some point in the future? 

As it is right now, that field is a string in our systems, and is thus not convenient to query. Do you want us to expose it through Super Search? If not, I think it is fine to stop storing it in our Elasticsearch database for the time being, considering that it is quite big.

Comment 13

a year ago
It's JSON, so it should probably be parsed out as JSON and included in ES like that. The schema and fields are documented at https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html
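
A minimal sketch of what "parsed out as JSON" could look like in the processor, before the document is sent to ES (the function name is hypothetical, and the fall-back behavior on malformed JSON is an assumption):

```python
import json

def parse_telemetry_environment(raw_crash):
    """Replace the TelemetryEnvironment string with parsed JSON in place.

    If the value isn't valid JSON, the raw string is kept rather than
    dropping the crash.
    """
    value = raw_crash.get("TelemetryEnvironment")
    if isinstance(value, str):
        try:
            raw_crash["TelemetryEnvironment"] = json.loads(value)
        except ValueError:
            pass  # keep the unparsed string if it's malformed JSON
    return raw_crash
```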
Comment 14

a year ago
Yeah, as Benjamin said, TelemetryEnvironment predates my work by a long time and is meant to convey information about the environment in which a crash has happened. It's also used so that we can feed it back to the crash ping in case of main process crashes, but I think it might be relevant for Socorro too.

While it does contain some data duplicated from the other crash annotations, we can't just use it in their place, because the annotation is not added immediately at startup, so it's not guaranteed to be there for all startup crashes:

https://dxr.mozilla.org/mozilla-central/rev/3e166b6838931b3933ca274331f9e0e115af5cc0/toolkit/components/telemetry/TelemetryStartup.js#34
(Reporter)

Comment 15

a year ago
The TelemetryEnvironment field is very big, and takes a lot of disk space in our Elasticsearch database. We do not use it at the moment in Socorro, and I haven't heard of any plans to use it in the future. Can we clarify why we should be storing that field in our databases and what the value is for you? Do you care about the entire content of that field, or just a few bits?
Flags: needinfo?(gsvelto)
Flags: needinfo?(benjamin)

Comment 16

a year ago
The telemetry environment is intended to become the canonical record of correlation data, and it is the same across crash-stats and other telemetry pings. So our long-term plan is to get rid of the current addon and settings fields and use the telemetry environment as the source of truth.
Flags: needinfo?(benjamin)
Flags: needinfo?(gsvelto)
(Reporter)

Updated

10 months ago
Assignee: adrian → nobody
Switching this to the infra component.
Component: Backend → Infra
Given the scope of this bug, I'm marking this resolved. ~6 months ago, we discovered and resolved issues that caused malformed data to be submitted to Elasticsearch. After some code fixes, we managed to significantly reduce the fielddata size used by our indices, which was the primary motivator behind scaling up our Elasticsearch cluster.
Status: REOPENED → RESOLVED
Last Resolved: a year ago → 9 months ago
Resolution: --- → FIXED