High variability of elasticsearch responsiveness in phx

Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Component: Server Operations
Reported: 6 years ago
Last modified: 3 years ago
People: jbalogh (Reporter), Unassigned

(Reporter)

Description

6 years ago
I'm running a couple of simple queries in a loop against elasticsearch while dumitru is upgrading the disks. The queries are essentially doing a count against an index.

elasticsearch returns the time a query took in its response object, and it claims these queries consistently take 1-3ms. But the total elapsed time that Python sees is around 60ms, often climbing above 130ms and sometimes hitting our 1 second socket timeout. The lowest elapsed time I've seen is 6ms.

When I run the same loop locally, the elapsed times are pretty much constant, varying by only about 3ms in either direction. I don't know whether we should start looking at elasticsearch or at the network in between, but we need to figure this out before we can start shipping elasticsearch.

This would also affect /monitor once elasticsearch goes on there in production.
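
For reference, here's a minimal sketch of the kind of loop I'm running. The index name and query below are placeholders, not the real ones, but it shows how I'm comparing the "took" time ES reports against the wall-clock time Python sees:

# Minimal sketch (placeholder index/query): compare elasticsearch's
# self-reported query time ("took", in ms) with the wall-clock time
# Python observes for the whole request.
import json
import time
import urllib.request

URL = "http://elasticsearch1:9200/some_index/_search"  # placeholder index
BODY = json.dumps({"size": 0, "query": {"match_all": {}}}).encode()

for _ in range(100):
    req = urllib.request.Request(URL, BODY, {"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=1) as resp:  # 1 second socket timeout
        took_ms = json.loads(resp.read()).get("took")
    elapsed_ms = (time.perf_counter() - start) * 1000
    print("took=%sms elapsed=%.1fms" % (took_ms, elapsed_ms))
    time.sleep(1)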
(Reporter)

Comment 1

6 years ago
Running this loop, which hits elasticsearch's stupid info page, usually responds in under 10ms, but sometimes it goes over 100ms, and I saw a few responses take over 3 seconds:

while true; do time curl -sI elasticsearch1:9200>/dev/null; sleep 1; done

Let me know what else I can do to help debug this.
Blocks: 681773

Comment 2

6 years ago

while true; do time curl -sI elasticsearch1:9200>/dev/null; sleep 1; done

Running the above locally on the machine (elasticsearch1.metrics.sjc1.mozilla.com) gave me a 3-5ms range.

@jbalogh: Are you running the "while" script locally on the machine or from some other machine inside SJC?

The 3-node cluster is indexing buildbot and twitter, which might cause some delay during peak times.
(Reporter)

Comment 3

6 years ago
This is for webheads in phx talking to elasticsearch1 in phx.
(Reporter)

Updated

6 years ago
Summary: High variability of elasticsearch responsiveness → High variability of elasticsearch responsiveness in phx

Comment 4

6 years ago

I don't suppose we have access to shell into those boxes, do we? What is the full hostname you are running this test against (elasticsearch1.???)? Can we hit the REST port from other machines, or is it only the webheads that can talk to it?

You mentioned in the description that you were doing this test while dumitru was upgrading disks. If the cluster was undergoing significant changes (i.e. nodes going up and down), I'd expect there to be several slow requests. Is this a problem during a normal steady state?
(Reporter)

Comment 5

6 years ago
I ran the script again and highlighted the highs and lows: http://pastebin.mozilla.org/1316544

It happens consistently, not just with nodes going up and down.

The IP is 10.8.81.212
(Reporter)

Comment 6

6 years ago
oremj bumped the max mem from 1G to 22G, so this may feel better now.

Comment 7

6 years ago

Wow. Yeah, having enough memory is an important piece. :) Anurag can provide some memory analysis information tomorrow that will help you guys make sure you have enough memory per node, but I expect that 22GB should be a good start. :)

Comment 8

6 years ago
Hopefully it won't require any more than that or we'll need to buy new hardware :-(

Jeff, can you run your script again to see if this fixed the latency? If we are still seeing issues, then it is probably zeus that is causing the lag.
(Reporter)

Comment 9

6 years ago
The output looks pretty much the same: http://pastebin.mozilla.org/1319574.

Also, an average of 50ms is pretty slow. I'd like to run the code locally to see if that's what we can really expect, or if there's something in the middle causing trouble.

Comment 10

6 years ago

If 50ms is slow, what is your desired response curve, perhaps expressed as median, 95th, and 99th percentiles?

oremj, does zeus offer any performance statistics that break down the time spent in zeus vs the backend?
(Reporter)

Comment 11

6 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #10)
> If 50ms is slow, what is your desired response curve maybe as median, 95th
> and 99th percentiles?

Totally just making stuff up, but I'd like to see median: 10ms, 95%: <20ms, 99%: <50ms.
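
For turning one of those pastebin runs into those numbers, here's a quick sketch (it assumes the elapsed times, one number in milliseconds per line, have been saved to a file; the filename is a placeholder):

# Quick sketch: summarize a file of elapsed times (one number in ms per line)
# as median / 95th / 99th percentile. The filename is a placeholder.
import math
import statistics

with open("timings.txt") as f:
    times = sorted(float(line) for line in f if line.strip())

def percentile(data, pct):
    # Nearest-rank percentile of an already-sorted list.
    rank = max(1, math.ceil(pct / 100.0 * len(data)))
    return data[rank - 1]

print("median: %.1fms" % statistics.median(times))
print("95th:   %.1fms" % percentile(times, 95))
print("99th:   %.1fms" % percentile(times, 99))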

Comment 12

6 years ago

Okay. So the most interesting thing is that the response time measured by ES and the observed response time when connecting locally are both < 10ms. It sounds like zeus is definitely the first place we should look.

Almost all of the requests in your list are 40ms or greater. There are only a handful under 40ms, and all of those are very quick, 10ms or lower. That suggests to me an arbitrary, near-constant overhead.

Comment 13

6 years ago

Zeus in phx has known performance problems right now, so I wouldn't be surprised at all if that is causing the extra latency. To confirm, we could run this script directly against an ES server.

Comment 14

6 years ago

(In reply to Jeremy Orem [oremj@mozilla.com] from comment #13)
> Zeus in phx has known performance problems right now,

Tracking bug for that?  ETA?
(Reporter)

Comment 15

6 years ago
We ran the test from a box connecting directly to ES.

Under Python, ES reports a query time of 1-3ms, but the whole request takes 43-45ms.

Using curl on the command line, ES reports the same query time, but the whole request takes 8-9ms.

It sounds like the latency I'm seeing is down to Python now. Thanks, everyone!
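
For what it's worth, a common source of that kind of fixed per-request gap in Python is connection handling rather than ES itself. A hypothetical way to check (assuming the requests library is available; the URL is a placeholder) is to compare a fresh connection per request against a reused keep-alive session:

# Hypothetical check: is the Python-side overhead per-connection?
# Compares a new connection per request vs. a reused keep-alive session.
# Assumes the "requests" library; the URL is a placeholder.
import time
import requests

URL = "http://elasticsearch1:9200/_search?size=0"

def avg_ms(fn, n=50):
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) * 1000 / n

fresh = avg_ms(lambda: requests.get(URL))    # new connection every time
session = requests.Session()
reused = avg_ms(lambda: session.get(URL))    # connection reused across requests

print("fresh connection: %.1f ms/request" % fresh)
print("reused session:   %.1f ms/request" % reused)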
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard