SuperSearch partial downtime July 11 2017

RESOLVED FIXED

Status

Product: Socorro
Component: Webapp
Status: RESOLVED FIXED
Reported: 3 months ago
Modified: 3 months ago

People

(Reporter: peterbe, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 months ago
We currently have memory bloat in the Elasticsearch (ES) cluster on prod.
At the time of writing, 1 of 20 shards is failing.

See https://sentry.prod.mozaws.net/operations/socorro-prod/issues/622977/ (ConnectionErrors) and https://sentry.prod.mozaws.net/operations/socorro-prod/issues/344569/ (monitoring health check noticing some shards failing)
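
For illustration, a minimal sketch of how a partial shard failure can be spotted from the `_shards` section that Elasticsearch includes in search responses. The helper names and the example payload are hypothetical; this is not the Socorro health check itself.

```python
def shard_health(response):
    """Return (failed, total) from the `_shards` section of a search response.

    `response` is assumed to be the decoded JSON body of an Elasticsearch
    search request, which carries `total`, `successful` and `failed` counts.
    """
    shards = response.get("_shards", {})
    return shards.get("failed", 0), shards.get("total", 0)


def is_partially_failing(response):
    """True when some, but not all, shards failed to answer."""
    failed, total = shard_health(response)
    return 0 < failed < total


# Hypothetical payload mirroring the situation described above.
example = {
    "_shards": {"total": 20, "successful": 19, "failed": 1},
    "hits": {"total": 0, "hits": []},
}

if __name__ == "__main__":
    failed, total = shard_health(example)
    print(f"{failed} of {total} shards failing")
```

A check along these lines only reports what the response itself says, so a cached response would hide a change in shard state.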
(Reporter)

Comment 1

3 months ago
I've made a Status Message warning 
("The ElasticSearch cluster is currently partially failing. See Bug 1380128 Some SuperSearch reporting yields 1/20th too few crashes.")
note-to-self: https://crash-stats.mozilla.com/admin/status/
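
The "1/20th too few" figure in the status message follows from simple arithmetic, sketched below. It assumes crashes are spread evenly across shards, which is an assumption for illustration and not something measured here; the function names are hypothetical.

```python
from fractions import Fraction


def missing_fraction(failed, total):
    """Fraction of documents expected to be missing when `failed` shards are down."""
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("need 0 <= failed <= total and total > 0")
    return Fraction(failed, total)


def estimate_true_count(observed, failed, total):
    """Scale an observed count up to compensate for unavailable shards.

    Assumes documents are evenly distributed across shards.
    """
    successful = total - failed
    if successful <= 0:
        raise ValueError("no successful shards to extrapolate from")
    return observed * total / successful


if __name__ == "__main__":
    print(missing_fraction(1, 20))
    print(estimate_true_count(1900, 1, 20))
```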
(Reporter)

Comment 2

3 months ago
The healthcheck reports that it is working again: https://crash-stats.mozilla.com/monitoring/healthcheck/
(Reporter)

Comment 3

3 months ago
The status message is now disabled. The partial outage was announced on the stability list.

All is well again.
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → FIXED
(Reporter)

Comment 4

3 months ago
Interesting side note: the Custom search (which is how I can see the _shards counts via the webapp) seems to cache responses. I avoided the caching by adding another index to the default custom query.
Miles Crabill [:miles]

Comment 5

This is why we can't have nice things: https://pageshot.net/P33ar2CBs7bScZdM/app.datadoghq.com
(Reporter)

Comment 6

3 months ago
(In reply to Miles Crabill [:miles] from comment #5)
> This is why we can't have nice things:
> https://pageshot.net/P33ar2CBs7bScZdM/app.datadoghq.com

Part of me optimistically dreams that once we're on ES 5, we'll have much better control of the JVM heap, i.e. that ES handles that better for us.