Closed Bug 1362048 (Opened 7 years ago, Closed 6 years ago)

Investigate our Elasticsearch database continually requiring more nodes

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: adrian, Unassigned)

Details

Attachments

(1 file)

In the last few months, Miles has had to add new nodes to our prod Elasticsearch cluster on a regular basis. We cannot keep growing like this, especially since our data volume is not supposed to get much bigger over time.

This bug is to investigate whether something is wrong with our current system, and if so, what. It might be a configuration issue, a data issue, or a retention issue; who knows? Let's figure it out.

I did a few checks. It is not a retention policy problem: our cron jobs do a good job of removing old indices very regularly.
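For reference, this is the kind of quick check I mean (a sketch only; the cluster URL is a placeholder, and the actual retention is handled by our cron jobs):

import requests

ES_URL = "http://localhost:9200"   # placeholder for the prod cluster endpoint

# _cat/indices prints one line per index (name, doc count, store size, ...),
# which makes it easy to eyeball whether old indices are really being removed.
print(requests.get(ES_URL + "/_cat/indices?v").text)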

However, by looking at the data, I found out that there's a new field in the raw crash called "StackTraces", which is sometimes very big. See for example: https://crash-stats.mozilla.com/rawdumps/71284857-19a8-4bd7-9ce9-d84f50170504.json (requires being logged in).

I suspect that field is being stored as a string, and since we are still using ES 1.4, that string is analyzed and loaded into fielddata, thus consuming a lot of memory and causing our cluster to continually require more and more nodes. It would be interesting to know when that field first started being sent in the raw crash, to see if there's a correlation.

I can think of two possible solutions at the moment:
1. change the Elasticsearch crash storage to remove that field from the raw_crash, or
2. change our configuration so that this field is not analyzed anymore. It will still be stored in ES, and thus still consume disk space, but (if I understand correctly) it won't be loaded into fielddata anymore, and won't consume memory. See the mapping sketch below.
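For option 2, a minimal sketch of what the mapping change could look like on ES 1.x. Index name, doc type, and field path are placeholders, not our actual mapping, and this would only apply to newly created indices (an existing analyzed field can't be switched in place):

import json
import requests

ES_URL = "http://localhost:9200"   # placeholder for the cluster endpoint
INDEX = "socorro_example"          # hypothetical index name
DOC_TYPE = "crash_reports"         # hypothetical document type

# "index": "not_analyzed" keeps the value as a single term instead of
# tokenizing it, which is what blows up fielddata for big strings.
mapping = {
    DOC_TYPE: {
        "properties": {
            "StackTraces": {"type": "string", "index": "not_analyzed"}
        }
    }
}

resp = requests.put(
    ES_URL + "/" + INDEX + "/_mapping/" + DOC_TYPE,
    data=json.dumps(mapping),
)
print(resp.status_code, resp.json())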

Miles, what do you think about this? Does it sound likely to you? Am I missing something?
Flags: needinfo?(miles)
It's a bit hard for me to say. If this field is in fact new (where new is within the last <6 months), that would explain our need for increased capacity (as older indices start having that field).

Do we need this field to be analyzed? Do our users care if it goes away entirely?
I don't know the answers to those questions, but I think it would be worth testing an index where we don't have those fields and comparing JVM heap usage / fielddata size growth with previous indices.
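For what it's worth, that comparison could be as simple as something like this (a sketch; the index names and cluster URL are placeholders):

import requests

ES_URL = "http://localhost:9200"                              # placeholder
INDICES = ["socorro_with_field", "socorro_without_field"]     # hypothetical names

for index in INDICES:
    # Indices stats filtered to the fielddata metric.
    stats = requests.get(ES_URL + "/" + index + "/_stats/fielddata").json()
    fielddata = stats["indices"][index]["total"]["fielddata"]
    print(index, fielddata["memory_size_in_bytes"], "bytes of fielddata")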
Flags: needinfo?(miles)
Benjamin, I'm failing to find context about the "StackTraces" field that now appears in the raw crash. Could you provide some pointers about it? When was it first added? What's its value in Socorro? Can we ignore it? Thanks!
Flags: needinfo?(benjamin)
I don't really understand why that's showing up in the crash-stats metadata: that's what we're sending in the crash ping. Unless the point is to compare the crash-stats data with the crash ping data? In any case I can't imagine any reason for this to be indexed in ES, but I'll defer to gsvelto.
Flags: needinfo?(benjamin) → needinfo?(gsvelto)
That field was added in bug 1280477 and it's used to communicate the stack trace of a crash back to Gecko so that it can be sent in the crash ping. I hadn't thought that putting it in the .extra file would have caused it to end up in the data consumed by Socorro; I apologize for the disruption it caused.

Anyway, on the Socorro side it's useless (in fact it's completely redundant); you can simply drop it from any future crash submission and delete whatever instances of it are still in the database.
Flags: needinfo?(gsvelto)
OK, thanks for the context! 

Miles, so that field started being sent about 7 months ago; I guess that would correlate with our cluster size growing.

I'm going to make a patch to remove that field from the raw crash before storing in Elasticsearch.
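Something along these lines (a minimal sketch of the idea, not the final patch):

# Strip oversized keys that Socorro doesn't need before the crash is
# handed to the Elasticsearch crash storage.
FORBIDDEN_RAW_CRASH_KEYS = ("StackTraces",)

def redact_raw_crash(raw_crash):
    """Return a copy of the raw crash without the forbidden keys."""
    return {
        key: value
        for key, value in raw_crash.items()
        if key not in FORBIDDEN_RAW_CRASH_KEYS
    }

# Example: only the redacted document gets indexed.
raw_crash = {"ProductName": "Firefox", "StackTraces": "{ ...very large JSON... }"}
print(redact_raw_crash(raw_crash))   # {'ProductName': 'Firefox'}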
Assignee: nobody → adrian
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/2faa9aed3ba4ee747e843d12439ea7b7c7e259ff
Fixes bug 1362048 - Redacted away StackTraces from raw_crash in ES. (#3771)

* Fixes bug 1362048 - Redacted away StackTraces from raw_crash in ES.
The raw_crash nowadays contains a key called "StackTraces" that holds a pretty big string. Storing it in Elasticsearch has a pretty high cost in terms of disk space and memory. This thus adds a new redactor, specifically made for the raw_crash, that we want to use every time we send a crash to Elasticsearch.

* Better variable usage.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
While verifying this I found out there's another huge field related to Telemetry that we store: `TelemetryEnvironment`. 

Gabriele, I assume this is the same kind of field, intended to be sent to Telemetry only, but that ends up being sent to Socorro as well?
Status: RESOLVED → REOPENED
Flags: needinfo?(gsvelto)
Resolution: FIXED → ---
No, quite the opposite. The telemetry environment is supposed to eventually become the canonical source for correlation information, with structured data about installed addons/plugins/graphics environment, etc.
Flags: needinfo?(gsvelto)
Note that the removal of the StackTraces field works on stage.
Benjamin, if I understand correctly, it is a bit like a dump? It contains a lot of data about the crash in a "raw" form, and Socorro should interpret that data in its processing step and put it into other fields of the processed crash at some point in the future? 

As it is right now, that field is a string in our systems, and is thus not convenient to query. Do you want us to expose it through Super Search? If not, I think it is fine to stop storing it in our Elasticsearch database for the time being, considering that it is quite big.
It's JSON, so it should probably be parsed out as JSON and included in ES like that. The schema and fields are documented at https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/environment.html
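A rough sketch of what that could look like at processing time, assuming we simply reparse the string and leave malformed payloads alone (not actual Socorro code):

import json

def expand_telemetry_environment(raw_crash):
    """Replace the TelemetryEnvironment string with its parsed JSON value."""
    value = raw_crash.get("TelemetryEnvironment")
    if not isinstance(value, str):
        return raw_crash
    try:
        raw_crash["TelemetryEnvironment"] = json.loads(value)
    except ValueError:
        # Leave malformed payloads as-is rather than failing the whole crash.
        pass
    return raw_crash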
Yeah, as Benjamin said TelemetryEnvironment predates my work by a long time and is meant to convey information about the environment in which a crash has happened. It's also used so that we can feed it back to the crash ping in case of main process crashes but I think it might be relevant for Socorro too.

While it does contain some duplicate data compared to the other crash annotations, we can't just use it in their place, because the annotation is not added immediately at startup, so it's not guaranteed to be there for all startup crashes:

https://dxr.mozilla.org/mozilla-central/rev/3e166b6838931b3933ca274331f9e0e115af5cc0/toolkit/components/telemetry/TelemetryStartup.js#34
The TelemetryEnvironment field is very big, and takes a lot of disk space in our Elasticsearch database. We do not use it at the moment in Socorro, and I haven't heard of any plans to use it in the future. Can we clarify why we should be storing that field in our databases and what the value is for you? Do you care about the entire content of that field, or just a few bits?
Flags: needinfo?(gsvelto)
Flags: needinfo?(benjamin)
The telemetry environment is intended to become the canonical record of correlation data, and it is the same across crash-stats and other telemetry pings. So our long-term plan is to get rid of the current addon and settings fields and use the telemetry environment as the source of truth.
Flags: needinfo?(benjamin)
Flags: needinfo?(gsvelto)
Assignee: adrian → nobody
Switching this to the infra component.
Component: Backend → Infra
Given the scope of this bug, I'm marking this resolved. ~6 months ago, we discovered and resolved issues that caused malformed data to be submitted to Elasticsearch. After some code fixes, we managed to significantly reduce the fielddata size used by our indices, which was the primary motivator behind scaling up our Elasticsearch cluster.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED