Closed Bug 1315983 Opened 8 years ago Closed 3 years ago

es_indexing_timeout ignored: always uses 60 seconds

Categories

(Webtools Graveyard :: DXR, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: fubar, Unassigned, NeedInfo)

Details

      No description provided.
Randomly, and entirely too often, builds are failing with the following:

Traceback (most recent call last):
  File "/home/jenkins/dxr/dxr/build.py", line 577, in index_chunk
    index_file(tree, tree_indexers, path, es, index)
  File "/home/jenkins/dxr/dxr/build.py", line 548, in index_file
    es.bulk(chunk, index=index, doc_type=LINE)
  File "/home/jenkins/venv/local/lib/python2.7/site-packages/pyelasticsearch/client.py", line 93, in decorate
    return func(*args, query_params=query_params, **kwargs)
  File "/home/jenkins/venv/local/lib/python2.7/site-packages/pyelasticsearch/client.py", line 448, in bulk
    query_params=query_params)
  File "/home/jenkins/venv/local/lib/python2.7/site-packages/pyelasticsearch/client.py", line 281, in send_request
    raise exc.info
ReadTimeoutError: HTTPConnectionPool(host='node47.bunker.scl3.mozilla.com', port=9200): Read timed out. (read timeout=60)

Logging on the ES cluster, at least with version 1.4.4, is of no help; client connection logging, even at trace level, literally amounts to 'a client at [IP] connected to [port]'.
The default timeout for Urllib3HttpConnection is 10s, so that isn't it at first glance. However, on ConnectionError (urllib3/es seem to lump a number of read and connect timeouts together) the connection is supposed to be put on hold for 'dead_timeout' seconds and retried, and that is 60s by default on es.ConnectionPool().

Additionally, es.Transport has a retry_on_timeout arg (default False) to decide whether or not a timeout should trigger a retry on a different node.

Lastly, I *think* es.bulk() has a timeout arg, but I can't tell what the default value is.

Erik, is there a way that we can enable retry_on_timeout and/or increase dead_timeout?
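
For reference, here's a rough sketch of where those knobs live if you build an elasticsearch-py client directly (not DXR's actual code; DXR goes through pyelasticsearch, and the values below are purely illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ['node47.bunker.scl3.mozilla.com:9200'],
    timeout=120,            # per-request read timeout, in seconds
    retry_on_timeout=True,  # retry a timed-out request on another node
    max_retries=3,          # how many other nodes to try before giving up
    dead_timeout=60,        # seconds a failed node sits out before being retried
)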
Flags: needinfo?(erik)
Flags: needinfo?(kev)
I just dug into the ES library code a bit. It'd been awhile.

retry_on_timeout is always True; it's hard-coded that way in pyelasticsearch.

dead_timeout is just the number of seconds a node that was determined to be unresponsive sits in its "penalty box" before becoming eligible for future connections; nothing hangs around and blocks on it.
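
In other words (a toy illustration, not the real ConnectionPool code), the penalty box is just a per-node timestamp:

import time

class PenaltyBox(object):
    def __init__(self, dead_timeout=60):
        self.dead_timeout = dead_timeout
        self.dead_until = {}              # node -> time it becomes eligible again

    def mark_dead(self, node):
        self.dead_until[node] = time.time() + self.dead_timeout

    def is_usable(self, node):
        return time.time() >= self.dead_until.get(node, 0)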

Digging further into urllib3 now, since that's where the error directly comes from...
The exception is *ultimately* being thrown from urllib3.connectionpool...urlopen().
elasticsearch.transport.Transport.perform_request() gets its timeout value from, oddly enough, a request_timeout *URL query param*. Bizarre. [Ed: At this point, I hoped the problem could be blamed on my overlooking this oddness. But hold onto your hats, folks!] So urllib3.connectionpool.HTTPConnectionPool.urlopen() is receiving no timeout value and going with its default, which has a couple of fallbacks but in this case takes the passed-in timeout object from elasticsearch.connection.http_urllib3.Urllib3HttpConnection.__init__().

Interlude:
How I hate the elasticsearch lib. There's such incestuous passing-along of kwargs that it's nearly impossible to determine where anything comes from; it effectively turns any high-level kwargs into a set of global variables, from which innumerable constructors peel bits off as they worm their way down the call chain. It's one reason I stuck it out with pyelasticsearch for so long before finally delegating at least the transport layer. Maybe I should have persisted.

Anyway:
Urllib3HttpConnection.__init__() makes no mention of a timeout arg, of course, but takes it as one of its big bucket of kwargs, which it passes along to its base class, Connection, which peels off one called "timeout". So let's see what timeout kwarg gets passed to Urllib3HttpConnection.__init__(). Well, Urllib3HttpConnection is instantiated by Transport.set_connections(), which of course just passes it a big bucket of kwargs, which in turn comes from an instance var that was socked away at some time previous. Let's see if I can guess when that was.

Ah, of course: it's the kwargs that were passed to Transport on *its* construction, to which my ElasticSearch constructor actually does pass a value (default 60, hmm) in the right place. That, in turn, is instantiated by DXR's index_and_deploy_tree(), which passes in es_indexing_timeout (also default 60) from the config. That sure enough looks like 120 to me (https://github.com/mozilla-platform-ops/dxr-infra/blob/master/dxr.config#L12). So I have no idea what's going on. I'll have to try to reproduce this locally and troubleshoot it that way.
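
To make that relay a bit more concrete, here's a stripped-down toy model of the kwargs plumbing described above (invented classes that only mimic the shape of the real ones):

class Connection(object):
    def __init__(self, timeout=10, **kwargs):
        self.timeout = timeout            # 'timeout' finally gets peeled off here

class Urllib3HttpConnection(Connection):
    def __init__(self, **kwargs):         # no mention of 'timeout' at this level
        super(Urllib3HttpConnection, self).__init__(**kwargs)

class Transport(object):
    def __init__(self, hosts, **kwargs):
        self.kwargs = kwargs              # socked away for later
        self.set_connections(hosts)

    def set_connections(self, hosts):
        self.connections = [Urllib3HttpConnection(**self.kwargs) for _ in hosts]

class ElasticSearch(object):
    def __init__(self, urls, timeout=60, **kwargs):
        self.transport = Transport(urls, timeout=timeout, **kwargs)

# ElasticSearch(urls, timeout=120) should end up as Connection.timeout == 120.
# Any layer that drops or shadows the kwarg silently reverts to a default,
# and nothing complains, which is why this is so hard to audit.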
Flags: needinfo?(erik)
Wow. In my limited poking in the code I could tell it was a maze of twisty little passages, all alike, but clearly it was worse than I might have imagined. 

I had started to think that we were seeing these as a precursor to the Seamicro chassis failing, as its catastrophic failure mode is the backplane and network going boom, but we had another timeout this morning. :-(

Trace logging had been re-enabled so I'll go look...  oh wait. The connection failure overnight (~13hrs ago) was to node76, but it hasn't logged anything since I added node80 22hrs ago. Well, I suppose that makes sense since it's a connection timeout, but I find it unlikely that node76 hasn't done ANYTHING in the last day.
Since this scuttles a fair number of large, time-consuming builds and the actual cause is so elusive, I think it's worth trying to just catch and retry these timeouts. That way, we're also armored against utterly failed requests, in addition to ones that just take a while.
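
If we go that route, here's a rough sketch of the shape it might take around the bulk call in build.py (names and retry counts are made up; the exception class is the one from the traceback above):

import time
from urllib3.exceptions import ReadTimeoutError

MAX_ATTEMPTS = 3  # arbitrary

def bulk_with_retry(es, chunk, index, doc_type):
    """Send a bulk request, retrying read timeouts with a crude backoff."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return es.bulk(chunk, index=index, doc_type=doc_type)
        except ReadTimeoutError:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(2 ** attempt)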
Actually, it should already be retrying on failure, given our configuration: https://github.com/mozilla-platform-ops/dxr-infra/blob/b65c085270e3c6fbf1c1fb053ffd99e28a5e4a35/dxr.config#L13. elasticsearch-py's http_urllib3.py's perform_request catches urllib3's ReadTimeoutError and raises ConnectionTimeout. But it's ReadTimeoutError we see in the logs, not ConnectionTimeout, so apparently it's not getting caught. That bears chasing.
Never mind. elasticsearch-py's http_urllib3.py's perform_request catches ReadTimeoutError and wraps it in a ConnectionTimeout. pyelasticsearch's send_request() then pulls the inner exception back out and re-raises it. So it catches one type and raises another. Everything is as expected. My reading of the code suggests that it is indeed retrying properly, but it's hitting its max retries and even those are failing.
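
As a toy model of that wrap-and-unwrap dance (invented bodies; only the names mirror the real code):

from urllib3.exceptions import ReadTimeoutError

class ConnectionTimeout(Exception):        # stand-in for elasticsearch-py's class
    def __init__(self, info):
        super(ConnectionTimeout, self).__init__(info)
        self.info = info                   # the original exception rides along

def perform_request():                     # elasticsearch-py side: wrap it
    try:
        raise ReadTimeoutError(None, '/_bulk', 'Read timed out. (read timeout=60)')
    except ReadTimeoutError as exc:
        raise ConnectionTimeout(exc)

def send_request():                        # pyelasticsearch side: unwrap, re-raise
    try:
        perform_request()
    except ConnectionTimeout as exc:
        raise exc.info                     # what shows up in our logs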
Summary: ReadTimeoutError connecting to ES cluster → es_indexing_timeout ignored: always uses 60 seconds
In an effort to help elasticsearch-py a little, I've updated dxr.config to use a load balancer address for ES instead of the individual host list. This should also help a bit when patching time comes around and we have to restart the client machines.

DXR is no longer available. Searchfox is now replacing it.
See meta bug 1669906 & https://groups.google.com/g/mozilla.dev.platform/c/jDRjrq3l-CY for more details.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Product: Webtools → Webtools Graveyard