681773 - [traceback] TimeoutError: Request timed out after 1.000000 seconds

Stephen Donner [:stephend] Not actively reading bugmail

Reporter

Description

•

13 years ago

Queries are (seemingly) randomly timing out at times, after a 1-second threshold (https://builder.addons.mozilla.org/search/?q=selection&type=a&copies=2 hit it once, but they vary):

Traceback (most recent call last):

  File "/data/amo_python/www/builder/flightdeck/vendor/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/data/amo_python/www/builder/flightdeck/apps/search/views.py", line 46, in search
    facets['everyone_total'] = len(qs)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 253, in __len__
    return len(self._do_search())

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 356, in _do_search
    hits = self.raw()

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 374, in raw
    hits = es.search(qs, settings.ES_INDEX, self.type._meta.db_table)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 806, in search
    return self._query_call("_search", body, indexes, doc_types, **query_params)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 246, in _query_call
    response = self._send_request('GET', path, body, querystring_args)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 205, in _send_request
    response = self.connection.execute(request)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/connection_http.py", line 167, in _client_call
    return getattr(conn.client, attr)(*args, **kwargs)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/connection_http.py", line 59, in execute
    response = self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri, body=request.body, headers=request.headers)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/urllib3/connectionpool.py", line 286, in urlopen
    raise TimeoutError("Request timed out after %f seconds" % self.timeout)

TimeoutError: Request timed out after 1.000000 seconds

Sean McArthur [:seanmonstar]

Updated

•

13 years ago

Assignee: nobody → smcarthur

Priority: -- → P1

Target Milestone: --- → Builder 0.9.10

Sean McArthur [:seanmonstar]

Comment 1

•

13 years ago

We've only recently started see this on production, and don't see it on stage. Why would that be? Is something holding up the production cluster?

Or should we just up the timeout to something higher? I cringe at search taking several seconds to return results, though.

Stephen Donner [:stephend] Not actively reading bugmail

Reporter

Comment 2

•

13 years ago

I'm betting this is dependent on, or at least heavily related to, bug 683270.

Depends on: 683270

Daniel Buchner [:dbuc]

Updated

•

13 years ago

Target Milestone: Builder 0.9.10 → Builder 0.9.11

Stephen Donner [:stephend] Not actively reading bugmail

Reporter

Comment 3

•

13 years ago

Has anyone seen this, since bug 683270 was fixed?  I haven't seen any recent tracebacks, and I've also run non-volatile tests on search in production, without any session issues/timeouts.

Daniel Buchner [:dbuc]

Comment 4

•

13 years ago

I have not observed this bug in prod since the fix went in last week. Sean, have you seen this?

Sean McArthur [:seanmonstar]

Comment 5

•

13 years ago

I have not. Perhaps the increase in memory from 2GB to 22GB was the fix.

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Stephen Donner [:stephend] Not actively reading bugmail

Reporter

Comment 6

•

13 years ago

Verified FIXED; thanks, all!

Status: RESOLVED → VERIFIED

Daniel Buchner [:dbuc]

Comment 7

•

13 years ago

This is again happening on prod, please advise.

Assignee: smcarthur → aschaar

Status: VERIFIED → REOPENED

Resolution: FIXED → ---

Target Milestone: Builder 0.9.11 → Future

Sean McArthur [:seanmonstar]

Comment 8

•

13 years ago

I don't believe we're using it wrong. I'd say that if anything, it's something IT could look at.

Andy McKay says that TimeoutError's are the most common thing they see on AMO also.

Arron Schaar [:arron]

Assignee

Comment 10

•

13 years ago

Sounds like we need to increase the timeout to reduce the errors, right?

If we do increase it, what is a good timeout?  5, 10, 30 seconds?

Sean McArthur [:seanmonstar]

Comment 11

•

13 years ago

You really want to wait 30 seconds for the search page to load (or possible still error?)

In looking at the statsd graph. pretty much all successful calls are less than 100ms. My thoughts are that if it's going to error, error sooner rather than later.

Somtimes timeouts just happen.

Jeremy Orem [:oremj]

Comment 12

•

13 years ago

I agree with Sean. If we let the requests sit around for a long time we'll just end up hitting max clients on the web servers.

Daniel Buchner [:dbuc]

Comment 13

•

13 years ago

The issue is not whether or not we want the search page to error. The question is, what is a reasonable timeout. With a high degree of absolution I can say 1 second is way too low. Let's do a minimum of 10 seconds, instrument and log the various steps in the rendering of the search route, and try to understand just how far off the apparent mean of 100ms (per Sean's investigation) we are when these timeouts are thrown.

The worst user flow, is one where they see a 500 error because something was going to take 1.5 seconds and instead of serving the page, we bork them.

Daniel Buchner [:dbuc]

Comment 14

•

13 years ago

Ignore terribly errant use of the word "absolution", that was absolutely horrible ;)

Wil Clouser [:clouserw]

Comment 15

•

13 years ago

I agree we should be upping our logging and investigating why anything is taking over 100ms.

I disagree with upping the timeout, fwiw.

Andy McKay

Comment 16

•

13 years ago

The timeout is 5 seconds on AMO and we get fails regularly. Don't think there's much value in upping the timeout. Is there any pattern on searches that are slow?

Arron Schaar [:arron]

Assignee

Comment 17

•

13 years ago

Sean, 

Where is the statd page that shows the query times and the types of queries that take longer?

Sean McArthur [:seanmonstar]

Comment 18

•

13 years ago

The statsd graph doesn't show what the queries were, it simply logs how long each took. To see each query, you'd need to comb the logs as well.

https://graphite-sjc.mozilla.org/render/?width=586&height=308&from=-7days&title=7%20days&target=stats.timers.builder.search.mean

Arron Schaar [:arron]

Assignee

Comment 19

•

13 years ago

I would like to add a timeout to the ES call and then make the request timeout a little longer.  

That way we can isolate the ES timeouts, catch them, log them, and find out if there are specific queries that take longer than others.  This way we can prevent the high level ugly 500 error to the user.

Sean McArthur [:seanmonstar]

Comment 20

•

13 years ago

Does ES have a separate timeout setting that can be set? Can it be set per-index?

Jeff Balogh (:jbalogh)

Comment 21

•

13 years ago

(In reply to Arron Schaar [:arron] from comment #19)
> I would like to add a timeout to the ES call and then make the request
> timeout a little longer.  

That's the current behavior.

Arron Schaar [:arron]

Assignee

Comment 22

•

13 years ago

Thanks, Jeff.  I went back and read this bug from the top.  I was under the impression that the timeout was for the entire web request.  My mistake.

Sean mentioned that AMO has a ES timeout of 5s and that these errors are fairly common.  Do we know why they happen?  What happens if we retry the query after a short duration? 

Jeremy:  How close are we to the max clients on the hardware side?

Arron Schaar [:arron]

Assignee

Comment 23

•

13 years ago

I updated the search query code path to retry if a timeout exception is thrown.  I'm also logging the error and query params.  If the query times out twice in a row, then I raise the exception and throw a 500.

Hopefully this will reduce the number of 500s that get returned to the client.  One other possible improvement might be to wait 50-250ms before the retry.

This is basically a band-aid to improve user experience until we figure out why ES is timing out.

Arron Schaar [:arron]

Assignee

Comment 24

•

13 years ago

Retry code is in place and we increased the timeout to 5s.   We haven't seen a timeout traceback in a few weeks.

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Stephen Donner [:stephend] Not actively reading bugmail

Reporter

Comment 25

•

12 years ago

Verified; I haven't seen this in a long time, either.

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: addons.mozilla.org → addons.mozilla.org Graveyard