Closed Bug 1380128 Opened 7 years ago Closed 7 years ago

SuperSearch partial downtime July 11 2017

Categories

(Socorro :: Webapp, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Unassigned)

Details

We currently have memory bloat in the ES cluster on prod.
At the time of writing, we have 1 of 20 shards failing. 

See https://sentry.prod.mozaws.net/operations/socorro-prod/issues/622977/ (ConnectionErrors) and https://sentry.prod.mozaws.net/operations/socorro-prod/issues/344569/ (monitoring health check noticing some shards failing)
I've posted a Status Message warning:
("The ElasticSearch cluster is currently partially failing. See Bug 1380128 Some SuperSearch reporting yields 1/20th too few crashes.")
note-to-self: https://crash-stats.mozilla.com/admin/status/
The healthcheck reports that everything is working again: https://crash-stats.mozilla.com/monitoring/healthcheck/
The status message is now disabled. The stability list was notified about the partial outage.

All is well again.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Interesting side note: the Custom search (which is how I can see the _shards counts via the webapp) seems to cache responses. I avoided the caching by adding another index to the default custom query.
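
For anyone reading the _shards counts by hand, here is a minimal sketch of how they could be summarized. It assumes the usual shape of an Elasticsearch search response, where `_shards` carries `total`, `successful` and `failed` counters. The function name and the sample response are made up for illustration and are not taken from the Socorro code.

```python
from typing import Any, Dict


def shard_failure_summary(response: Dict[str, Any]) -> str:
    """Summarize the `_shards` section of a parsed search response.

    Hypothetical helper: it only reads the `total` and `failed`
    counters and does not talk to a cluster.
    """
    shards = response["_shards"]
    total = shards["total"]
    failed = shards["failed"]
    if failed == 0:
        return f"all {total} shards responded"
    return f"{failed} of {total} shards failed; results may be incomplete"


# Made-up response body mirroring the 1-of-20 situation in this bug.
example = {
    "_shards": {"total": 20, "successful": 19, "failed": 1},
    "hits": {"total": 0, "hits": []},
}
print(shard_failure_summary(example))
```

A search can still return results when `failed` is non-zero, which is why the counts are worth checking explicitly rather than relying on an error being raised.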
(In reply to Miles Crabill [:miles] from comment #5)
> This is why we can't have nice things:
> https://pageshot.net/P33ar2CBs7bScZdM/app.datadoghq.com

Part of me optimistically dreams that once we're on ES 5, we'll have much better control of the JVM heap, i.e. that ES will manage it better for us.