elasticsearch8.webapp.phx1.mozilla.com is 503 status

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: Other
4 years ago
4 years ago

People

(Reporter: dgarvey, Assigned: jakem)

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/700] )

(Reporter)

Description

4 years ago
I rebooted the box for elasticsearch8.webapp.phx1.mozilla.com and noticed it is not in the cluster.

Updated

4 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/700]
(Reporter)

Comment 1

4 years ago
[root@elasticsearch8.webapp.phx1 ~]# curl http://localhost:9200/
{
  "ok" : true,
  "status" : 503,
  "name" : "elasticsearch8_phx1",
  "version" : {
    "number" : "0.90.10",
    "build_hash" : "0a5781f44876e8d1c30b6360628d59cb2a7a2bbb",
    "build_timestamp" : "2014-01-10T10:18:37Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}
[root@elasticsearch8.webapp.phx1 ~]# 


Also, it appears that elasticsearch7 has the same issue. I don't know how rebooting elasticsearch8 could have caused elasticsearch7 to drop out of the cluster.
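
For anyone wanting to script the check from Comment 1, here is a minimal Python sketch that reads the "status" field from the root endpoint's JSON body. The function names are hypothetical, and treating 200 as the only healthy value is an assumption; the sample body is trimmed from the curl output above.

```python
import json


def node_status(body: str) -> int:
    """Return the "status" field from the root endpoint's JSON body."""
    return int(json.loads(body)["status"])


def node_is_ready(body: str) -> bool:
    """Assumption: a node is treated as ready only when it reports status 200."""
    return node_status(body) == 200


# Trimmed copy of the response shown in Comment 1.
sample = '{"ok": true, "status": 503, "name": "elasticsearch8_phx1"}'
print(node_is_ready(sample))
```

Note that "ok" is true even though "status" is 503, so a check should look at "status" rather than "ok".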

Comment 2

4 years ago
elasticsearch7 looks like it doesn't know which node is the master of the cluster. The cluster may have been in split-brain mode, where elasticsearch7 thought that elasticsearch8 was the master, and now that elasticsearch8 has rebooted, elasticsearch7 thinks that it is alone in the world. If elasticsearch7 thinks that it is alone, it probably needs a reboot to re-ping for a master.

/var/log/elasticsearch/es_prod_phx1.log on elasticsearch7 shows:

[2014-08-07 05:15:13,961][WARN ][discovery.zen.ping.multicast] [elasticsearch7_phx1] received ping response ping_response{target [[elasticsearch8_phx1][Z4sE-xVNRhGhGpxkhVr_2w][inet[/10.8.81.149:9300]]], master [null], cluster_name[es_prod_phx1]} with no matching id [10160831]
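
A common guard against split-brain in Elasticsearch of this era is to require a quorum of master-eligible nodes via the `discovery.zen.minimum_master_nodes` setting. The sketch below computes the usual majority value. Whether this cluster set it, and how many master-eligible nodes it has, is not stated in this bug; the function name is hypothetical.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: more than half of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Assumption: a three-node cluster, used purely as an illustration.
print(minimum_master_nodes(3))  # prints 2
```

With two nodes the quorum is also two, so losing either node leaves the cluster without a master, which is why three nodes is the usual minimum for tolerating one failure.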

Updated

4 years ago
See Also: → bug 1050028
Duplicate of this bug: 1050028

Updated

4 years ago
Assignee: server-ops-webops → nmaul
(Assignee)

Comment 4

4 years ago
After a 'yum-wrapper upgrade' on elasticsearch7 and a reboot, it still can't find the cluster.

I'm switching this cluster from multicast to unicast host discovery. The downside is that each host has to be listed explicitly, but this is relatively easy to do in puppet. The upside is that, in my experience on other clusters, unicast is a good bit more reliable at discovering hosts. I'll update again if/when anything comes of this.
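
As a rough illustration of what the unicast switch involves, the Python sketch below renders the two zen discovery settings that would go into elasticsearch.yml. The helper name is hypothetical and the two-host list is an assumption for the example (a real list would name every node); the actual puppet change is not shown in this bug.

```python
def unicast_discovery_settings(hosts):
    """Render elasticsearch.yml lines that disable multicast and list unicast hosts."""
    if not hosts:
        raise ValueError("unicast discovery needs at least one host")
    quoted = ", ".join('"{}"'.format(h) for h in hosts)
    return "\n".join([
        "discovery.zen.ping.multicast.enabled: false",
        "discovery.zen.ping.unicast.hosts: [{}]".format(quoted),
    ])


# Assumption: only the two hosts named in this bug are listed here.
example_hosts = [
    "elasticsearch7.webapp.phx1.mozilla.com",
    "elasticsearch8.webapp.phx1.mozilla.com",
]
print(unicast_discovery_settings(example_hosts))
```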
(Assignee)

Comment 5

4 years ago
Okay, elasticsearch7 came up, joined the cluster, and has taken over several shards.

That gets us out of the emergency situation of having no redundancy (ES needs at least 3 nodes for that).

I've disabled puppet for 2 days on elasticsearch8 and turned the elasticsearch service off with chkconfig. This is to try to prevent it from coming back online and then disrupting things by failing randomly.
(Assignee)

Updated

4 years ago
Depends on: 1050013
(Meant to update this earlier.)

The original bug was 1050028, but I closed that one because this one has more information.
Since elasticsearch7.webapp.phx1 needed to be rebooted anyway, the following were updated:
Storage: P410i Slot: 0 [5.70] -> [6.40]
BIOS: 05/05/2011 -> 07/02/2013
(Assignee)

Comment 7

4 years ago
Back in service. This seems to be a problem caused by certain kernels and auditd. That work is happening in bug 1050013, so there is nothing more to do here.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED