Bug 963245 (Closed) — Opened 12 years ago, Closed 11 years ago

MDN: ElasticHttpError: Non-OK response returned (500):

Categories

(Infrastructure & Operations Graveyard :: WebOps: Community Platform, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Unassigned)

Details

ElasticHttpError: Non-OK response returned (500): u'SearchPhaseExecutionException[Failed to execute phase [query], total failure; shardFailures {[_na_][mdnprod-main_index][0]: No active shards}{[_na_][mdnprod-main_index][1]: No active shards}{[_na_][mdnprod-main_index][2]: No active shards}{[_na_][mdnprod-main_index][3]: No active shards}{[_na_][mdnprod-main_index][4]: No active shards}]' https://errormill.mozilla.org/mdn/mdn/group/154108/
Assignee: server-ops-webops → bburton
We apologize for the disruption in service. Nagios alerted the MOC about the cluster health, and this was escalated to WebOps as your bug was being filed. The ElasticSearch cluster that provides services to MDN encountered a cluster-wide issue: it was trying to perform garbage collection after hitting its memory ceiling, in combination with a problem in a particular index belonging to another service. I'm investigating what caused the issues with that index, as well as what tuning and upgrades are available, since many OOM fixes landed in 0.90.x and later. https://developer.mozilla.org/en-US/search?q=string is now returning results as expected for me. Please let me know if I can answer any questions.
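The "No active shards" failures in the traceback correspond to a red cluster status, where primary shards are unallocated and every search fails. As an illustrative sketch only (the sample response values and the `triage` helper are invented, not taken from this bug), a cluster-health document from ElasticSearch's `_cluster/health` API could be interpreted like this:

```python
import json

# Hypothetical sample of a /_cluster/health response; the field names follow
# the ElasticSearch cluster-health API, but the values here are invented.
SAMPLE_HEALTH = json.loads("""
{
  "cluster_name": "example-cluster",
  "status": "red",
  "number_of_nodes": 3,
  "active_shards": 0,
  "unassigned_shards": 5
}
""")

def triage(health):
    """Map a cluster-health document to a short human-readable verdict.

    A "red" status means at least one primary shard is unallocated, which
    matches the "No active shards" failures seen in the traceback above.
    """
    status = health["status"]
    if status == "green":
        return "all primary and replica shards allocated"
    if status == "yellow":
        return "primaries allocated, some replicas missing"
    return "some primary shards unallocated; searches will fail"

print(triage(SAMPLE_HEALTH))
```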
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Luke Crouch [:groovecoder] from comment #2)
> This is back. :(
>
> https://errormill.mozilla.org/mdn/mdn/group/146983/

Still working on fixing the other index, which is causing some cluster instability. I will leave this open until it is fully resolved, but the mdn_prod index is now green.
See Also: → 963311
Cluster status is now green.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This occurred again at 12:44 pm PST. Solarce is currently investigating.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Today's issue has been resolved; the mdn index was only unavailable for a few minutes. Bug 963824 is tracking the plan for permanent resolution of the issue.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This looks to be related to bug 993671. :cyliang comments that the permanent fix is an ES upgrade. I know there is a bug tracking that, but off the top of my head I don't have context on it.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Fixed by :cyliang.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
ES cosseted and rebooted. The upgrade bug in question is bug 963824. Discussions (outside of Bugzilla) about ameliorating this issue in other ways have started.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Continued issues with 0.20.x GC pauses are affecting uptime, due to the shared nature and mixed use of the SCL3 cluster. We are moving forward with giving MDN its own production cluster; this is being tracked in bug 995457.
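For context on the GC pauses mentioned above: in 0.90-era ElasticSearch deployments, long stop-the-world pauses were commonly mitigated by bounding the JVM heap and locking it in RAM. The values below are a minimal illustrative sketch, not the actual SCL3 configuration:

```shell
# Illustrative only, not the real SCL3 settings. ES_HEAP_SIZE was the
# supported way to size the JVM heap in this era of ElasticSearch; keeping
# it at or below roughly half of physical RAM leaves room for the
# filesystem cache and shortens GC pauses.
export ES_HEAP_SIZE=4g

# In elasticsearch.yml, locking the heap in RAM prevents the JVM from
# being swapped out, which would make GC pauses dramatically worse:
#   bootstrap.mlockall: true
```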
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This happened again today. (For future reference, bugs 1000488, 1000489, 1000490, and 1000493 have the Nagios alert texts.)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Passing to :cyliang, as she's been working on the final resolution.
Assignee: bburton → server-ops-webops
Severity: blocker → normal
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard