(Moving this from https://github.com/mozilla/socorro-infra/issues/244)
Over the last couple of months, Elasticsearch shards have been running out of memory and failing on queries, so results come back some percentage off from what they should be. Related bugs: https://bugzilla.mozilla.org/show_bug.cgi?id=1288179 and https://bugzilla.mozilla.org/show_bug.cgi?id=1276690
Last week, JP added two nodes to the ES cluster and said he'd add monitoring for it, too. This bug covers adding monitoring to the ES cluster.
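As an illustration of how the partial results described above can be detected from the client side: Elasticsearch search responses include a `_shards` object with `total`, `successful`, and `failed` counts. The sketch below reads those counts from an already-parsed response. The function names are hypothetical, and this is not necessarily what the monitoring in this bug checks.

```python
def shard_failure_summary(search_response):
    """Summarize shard failures from a parsed Elasticsearch search response.

    Hypothetical helper: relies only on the standard `_shards` section
    (`total`, `failed`) that Elasticsearch includes in search responses.
    """
    shards = search_response.get("_shards", {})
    total = shards.get("total", 0)
    failed = shards.get("failed", 0)
    # Fraction of shards that did not answer; 0.0 when no shard info is present.
    ratio = failed / total if total else 0.0
    return {"total": total, "failed": failed, "ratio": ratio}


def is_partial_result(search_response):
    """Return True when at least one shard failed, i.e. results may be incomplete."""
    return shard_failure_summary(search_response)["failed"] > 0
```

A response reporting 2 failed shards out of 10 would give a ratio of 0.2, which roughly matches the "some percentage off" symptom if documents are spread evenly across shards (an assumption).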
Assigning JP since he said he's working on it now.
Assignee: nobody → jschneider
Marking infra bugs that are important to get fixed asap as P1.
Priority: -- → P1
To test the monitoring in stage, temporarily kill an ES node (e.g. the master) in the STAGE ES cluster. This page should then stop working: https://crash-stats.allizom.org/monitoring/healthcheck/
Additionally, I've set up monitoring in Datadog. It can be accessed via Dashboards --> All Dashboards --> Elasticsearch (right side, toward the bottom).
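A minimal sketch of scripting that check, assuming "stop working" means the healthcheck page either fails to respond or returns a non-200 status (the exact failure mode isn't stated here). The `healthcheck_ok` name and the `opener` parameter are hypothetical; `opener` exists only so the function can be exercised without network access.

```python
import urllib.request


def healthcheck_ok(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the healthcheck URL answers with HTTP 200.

    Assumption: a broken healthcheck shows up as a connection error,
    a timeout, or a non-200 status.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # urllib's URLError and HTTPError, as well as socket timeouts,
        # are all subclasses of OSError.
        return False
```

For the stage test, calling `healthcheck_ok("https://crash-stats.allizom.org/monitoring/healthcheck/")` before and after killing the node should flip from `True` to `False` if the assumption above holds.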
See Also: → bug 1298035
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1298035