Closed Bug 1050116 Opened 10 years ago Closed 10 years ago

elasticsearch8.webapp.phx1.mozilla.com is 503 status

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dgarvey, Assigned: nmaul)

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/700] )

I rebooted the box for elasticsearch8.webapp.phx1.mozilla.com and noticed it is not in the cluster.
[root@elasticsearch8.webapp.phx1 ~]# curl http://localhost:9200/
{
  "ok" : true,
  "status" : 503,
  "name" : "elasticsearch8_phx1",
  "version" : {
    "number" : "0.90.10",
    "build_hash" : "0a5781f44876e8d1c30b6360628d59cb2a7a2bbb",
    "build_timestamp" : "2014-01-10T10:18:37Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}
[root@elasticsearch8.webapp.phx1 ~]# 
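
The response above reports `"ok" : true` alongside `"status" : 503`, which is easy to misread as healthy. A minimal Python sketch of a check, assuming the body always has the shape shown above (the function name `node_is_ready` is hypothetical and not part of any Elasticsearch tooling), treats only a 200 status as ready:

```python
import json


def node_is_ready(body: str) -> bool:
    """Return True only when the root endpoint reports status 200.

    Assumes the body has the same shape as the curl output above; a
    missing "status" field is treated as not ready.
    """
    info = json.loads(body)
    return info.get("status") == 200


# Abbreviated copy of the response from elasticsearch8 above.
sample = '{"ok": true, "status": 503, "name": "elasticsearch8_phx1"}'
print(node_is_ready(sample))  # False
```

Keying on `status` rather than `ok` avoids the false positive seen here.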


Also, it appears that elasticsearch7 has the same issue. I don't know how rebooting #8 caused #7 to drop out of the cluster.
elasticsearch7 looks like it doesn't know who the master of the cluster is.  The cluster may have been in split-brain mode (where it thought that elasticsearch8 was master and now elasticsearch7 thinks that it is alone in the world).  If elasticsearch7 thinks that it is alone, it probably needs a reboot to re-ping for a master.

/var/log/elasticsearch/es_prod_phx1.log on elasticsearch7 shows:

[2014-08-07 05:15:13,961][WARN ][discovery.zen.ping.multicast] [elasticsearch7_phx1] received ping response ping_response{target [[elasticsearch8_phx1][Z4sE-xVNRhGhGpxkhVr_2w][inet[/10.8.81.149:9300]]], master [null], cluster_name[es_prod_phx1]} with no matching id [10160831]
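
To find other nodes that answer pings without knowing a master, the log could be scanned for ping responses whose `master` field is `null`. This is only a sketch: it assumes every such line follows the format of the one above, and the regular expression and function name are illustrative.

```python
import re

# Captures the target node name and the start of the master field
# from a multicast ping response line.
PING_RE = re.compile(
    r"ping_response\{target \[\[(?P<target>[^\]]+)\]"
    r".*?master \[(?P<master>[^\]]*)\]"
)


def masterless_targets(lines):
    """Yield target node names whose ping response reported master [null]."""
    for line in lines:
        match = PING_RE.search(line)
        if match and match.group("master") == "null":
            yield match.group("target")


# The log line quoted above, from elasticsearch7.
sample = (
    "[2014-08-07 05:15:13,961][WARN ][discovery.zen.ping.multicast] "
    "[elasticsearch7_phx1] received ping response ping_response{target "
    "[[elasticsearch8_phx1][Z4sE-xVNRhGhGpxkhVr_2w][inet[/10.8.81.149:9300]]], "
    "master [null], cluster_name[es_prod_phx1]} with no matching id [10160831]"
)
print(list(masterless_targets([sample])))  # ['elasticsearch8_phx1']
```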
Assignee: server-ops-webops → nmaul
After 'yum-wrapper upgrade' on ES7 and a reboot, it still can't find the cluster.

I'm switching this cluster from multicast to unicast host discovery. The downside is each host has to be listed, but this is relatively easy to do in puppet. The upside is, IME on other clusters, it seems to be a good bit more reliable at discovering hosts. I'll update again if/when anything comes of this.
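
The actual puppet change is not shown in this bug. As a rough sketch of what unicast discovery involves on Elasticsearch of this era, the helper below renders the two zen discovery settings for `elasticsearch.yml`. The function name is hypothetical, port 9300 is taken from the transport address in the log line above, and the host list contains only the two hosts named in this bug; a real list would include every node in the cluster.

```python
def unicast_discovery_settings(hosts, port=9300):
    """Render elasticsearch.yml lines that disable multicast and list hosts."""
    quoted = ", ".join(f'"{host}:{port}"' for host in hosts)
    return "\n".join([
        "discovery.zen.ping.multicast.enabled: false",
        f"discovery.zen.ping.unicast.hosts: [{quoted}]",
    ])


# Only the hosts mentioned in this bug; the full cluster list is an assumption
# left to whoever maintains the puppet manifest.
hosts = [
    "elasticsearch7.webapp.phx1.mozilla.com",
    "elasticsearch8.webapp.phx1.mozilla.com",
]
print(unicast_discovery_settings(hosts))
```

Generating the list from a single source of hosts is one way to keep every node's configuration identical, which is the "each host has to be listed" cost mentioned above.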
Okay, 7 came up and joined, and has taken over several shards.

That gets us out of the emergency situation of no redundancy (ES needs at least 3 nodes).
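
The three-node figure lines up with the usual majority-quorum rule for master election, which is the common guard against the split-brain scenario discussed earlier. The sketch below computes that quorum, the value typically given to `discovery.zen.minimum_master_nodes`; this bug does not say what the cluster was actually configured with, so treat it as an illustration only.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum for master election: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# Columns: node count, quorum, nodes that can be lost while keeping quorum.
for n in (2, 3, 5):
    quorum = minimum_master_nodes(n)
    print(n, quorum, n - quorum)
# 2 2 0
# 3 2 1
# 5 3 2
```

With two nodes the quorum is both of them, so losing either one leaves no majority; three is the smallest size that tolerates a single failure.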

I've disabled puppet for 2 days on ES8, as well as chkconfig off'd elasticsearch on it. This is to try to prevent it from coming back online and then disrupting things by failing randomly.
(meh meant to update earlier..)

The original bug was 1050028, but I closed that one because this one has more information.
Since elasticsearch7.webapp.phx1 needed to be rebooted anyway, the following were updated:
Storage: P410i Slot: 0 [5.70] -> [6.40]
BIOS: 05/05/2011 -> 07/02/2013
Back in service. Seems to be a problem caused by certain kernels and auditd? That work is happening in bug 1050013, nothing more to do here.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard