Closed Bug 797859 Opened 12 years ago Closed 11 years ago

jenkins times out connecting to ES an awful lot

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P1)

All
Other

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: bburton)

Details

(Whiteboard: [triaged 20121004])

Since September 25th or so, most of the sumo-master build test runs with Jenkins have failed with TimeoutErrors when trying to connect to whatever ES cluster Jenkins uses.

https://ci.mozilla.org/job/sumo-master/

Most of the yellow dots on the left side are timeout errors.

We've got the timeout set to 5 seconds. That should be plenty of time to establish a connection with ES. So we're thinking something is fishy with either Jenkins or with ES.

Can someone take a look at the ES cluster that Jenkins is using and see if it's functioning and/or needs a restart?
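One way to narrow this down is to time a single request against the cluster with the same 5-second budget. The sketch below is only an illustration, not what the sumo tests actually run: `probe_es` and its `opener` parameter are hypothetical names, and it uses only the Python standard library.

```python
import json
import socket
import time
import urllib.error
import urllib.request


def probe_es(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch /_cluster/health once; return (outcome, elapsed_seconds).

    outcome is the cluster status string on success, "timeout" when the
    request exceeds `timeout`, or "error" for any other failure.
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
        outcome = payload.get("status", "unknown")
    except urllib.error.URLError as exc:
        # Connect-phase timeouts arrive wrapped in URLError.
        outcome = "timeout" if isinstance(exc.reason, socket.timeout) else "error"
    except socket.timeout:
        # Read-phase timeouts are raised directly.
        outcome = "timeout"
    except (OSError, ValueError, AttributeError):
        # Other socket failures, or a body that is not a JSON object.
        outcome = "error"
    return outcome, time.monotonic() - started
```

Calling `probe_es("http://localhost:9200")` a few times from the Jenkins host (assuming ES listens on its default HTTP port there) would show whether the elapsed time is regularly close to the 5-second limit.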
IIRC :jason helped us troubleshoot some index problems on there once
-> major, this is disrupting our ability to deploy code.
Severity: normal → major
I am investigating
Assignee: server-ops-webops → bburton
Severity: major → normal
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: [triaged 20121004]
Adding Mike and Rehan to the cc: list.
Per my email to webdev@ and webqa@, I want to restart ES in the morning as a first step. As a long-term fix, we need to upgrade ES.
Group: infra
Whiteboard: [triaged 20121004] → [triaged 20121004][waiting][es restart]
ES was restarted at 10AM and looks happy after the restart. Let me know how a new build goes.

[root@jenkins1.dmz.phx1 ~]# curl -v http://localhost:9200/_cluster/health?pretty=true
* About to connect() to localhost port 9200 (#0)
*   Trying ::1... connected
* Connected to localhost (::1) port 9200 (#0)
> GET /_cluster/health?pretty=true HTTP/1.1
> User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
> Host: localhost:9200
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: application/json; charset=UTF-8
< Content-Length: 271
< 
{
  "cluster_name" : "jenkins",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 25,
  "active_shards" : 25,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
* Connection #0 to host localhost left intact
* Closing connection #0
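For anyone who wants to script the same check, here is a minimal sketch that reads a health response like the one above and decides whether the cluster "looks happy". The criteria (green status, no timeout, nothing initializing or unassigned) are an assumption about what happy means here, and `looks_happy` is a hypothetical helper name.

```python
import json

# Body of the health response quoted in this comment.
SAMPLE = """
{
  "cluster_name" : "jenkins",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 25,
  "active_shards" : 25,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
"""


def looks_happy(health):
    """Return True when the cluster is green and every shard is assigned."""
    return (
        health.get("status") == "green"
        and health.get("timed_out") is False
        and health.get("initializing_shards") == 0
        and health.get("unassigned_shards") == 0
    )


print(looks_happy(json.loads(SAMPLE)))
```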
There is now an ElasticSearch 0.20.x service available; just use the hostname 'jenkins-es20' in your tests.
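A rough sketch of how a test settings file might pick up the new hostname follows. The `ES_HOSTS` environment variable and the `es_hosts` helper are hypothetical names, and port 9200 is only the ES default HTTP port; neither the variable nor the port is stated in this bug.

```python
import os


def es_hosts(environ=os.environ, default="jenkins-es20:9200"):
    """Return the list of ES host:port strings the tests should use.

    Reads a comma-separated ES_HOSTS value (hypothetical variable name)
    and falls back to the new Jenkins ES 0.20.x service.
    """
    raw = environ.get("ES_HOSTS", default)
    return [host.strip() for host in raw.split(",") if host.strip()]
```

With no override set, `es_hosts({})` yields `["jenkins-es20:9200"]`, so a Jenkins job would hit the new service while a developer machine could still point at a local node.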
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [triaged 20121004][waiting][es restart] → [triaged 20121004]
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard