Bug 963245 (Closed) — Opened 12 years ago, Closed 11 years ago

MDN: ElasticHttpError: Non-OK response returned (500):

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Unassigned)

Details

ElasticHttpError: Non-OK response returned (500): u'SearchPhaseExecutionException[Failed to execute phase [query], total failure; shardFailures {[_na_][mdnprod-main_index][0]: No active shards}{[_na_][mdnprod-main_index][1]: No active shards}{[_na_][mdnprod-main_index][2]: No active shards}{[_na_][mdnprod-main_index][3]: No active shards}{[_na_][mdnprod-main_index][4]: No active shards}]' https://errormill.mozilla.org/mdn/mdn/group/154108/
Assignee: server-ops-webops → bburton
We apologize for the disruption in service. Nagios alerted the MOC about the cluster health, and this was escalated to WebOps as your bug was being filed. The ElasticSearch cluster that provides services to MDN encountered a cluster-wide issue: it was trying to perform garbage collection after hitting its memory ceiling, in combination with a problem in a particular index belonging to another service. I'm investigating what caused the issues with that index, as well as what tuning and upgrades are available, since many OOM fixes landed in 0.90.x and later. https://developer.mozilla.org/en-US/search?q=string is now returning results as expected for me. Please let me know if I can answer any questions.
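The "No active shards" failures in the traceback correspond to a red cluster status, where primary shards are unallocated and every search fails. As an illustrative sketch only (the sample response values and the `triage` helper are invented, not taken from this bug), a cluster-health document from ElasticSearch's `_cluster/health` API could be interpreted like this:

```python
import json

# Hypothetical sample of a /_cluster/health response; the field names follow
# the ElasticSearch cluster-health API, but the values here are invented.
SAMPLE_HEALTH = json.loads("""
{
  "cluster_name": "example-cluster",
  "status": "red",
  "number_of_nodes": 3,
  "active_shards": 0,
  "unassigned_shards": 5
}
""")

def triage(health):
    """Map a cluster-health document to a short human-readable verdict.

    A "red" status means at least one primary shard is unallocated, which
    matches the "No active shards" failures seen in the traceback above.
    """
    status = health["status"]
    if status == "green":
        return "all primary and replica shards allocated"
    if status == "yellow":
        return "primaries allocated, some replicas missing"
    return "some primary shards unallocated; searches will fail"

print(triage(SAMPLE_HEALTH))
```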
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Luke Crouch [:groovecoder] from comment #2)
> This is back. :(
>
> https://errormill.mozilla.org/mdn/mdn/group/146983/

Still working on fixing the other index, which is causing some cluster instability. I will leave this open until it is fully resolved, but the mdn_prod index is now green.
See Also: → 963311
Cluster status is now green.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This occurred again at 12:44 pm PST. Solarce is currently investigating.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Today's issue has been resolved; the mdn index was only unavailable for a few minutes. Bug 963824 is tracking the plan for permanent resolution of the issue.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This looks to be related to bug 993671. :cyliang comments that the permanent fix is an ES upgrade. I know there is a bug tracking that, but off the top of my head I don't have context on it.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Fixed by :cyliang.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
ES cosseted and rebooted. The upgrade bug in question is bug 963824. Discussions (outside of Bugzilla) about ameliorating this issue in other ways have started.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Continued issues with 0.20.x GC pauses are affecting uptime, due to the shared nature and mixed use of the SCL3 cluster. We are moving forward with giving MDN its own production cluster; this is being tracked in bug 995457.
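For context on the GC pauses mentioned above: in 0.90-era ElasticSearch deployments, long stop-the-world pauses were commonly mitigated by bounding the JVM heap and locking it in RAM. The values below are a minimal illustrative sketch, not the actual SCL3 configuration:

```shell
# Illustrative only, not the real SCL3 settings. ES_HEAP_SIZE was the
# supported way to size the JVM heap in this era of ElasticSearch; keeping
# it at or below roughly half of physical RAM leaves room for the
# filesystem cache and shortens GC pauses.
export ES_HEAP_SIZE=4g

# In elasticsearch.yml, locking the heap in RAM prevents the JVM from
# being swapped out, which would make GC pauses dramatically worse:
#   bootstrap.mlockall: true
```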
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This happened again today. (For future reference, bugs 1000488, 1000489, 1000490, and 1000493 have the Nagios alert texts.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Passing to :cyliang, as she's been working on the final resolution.
Assignee: bburton → server-ops-webops
Severity: blocker → normal
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard