Closed Bug 681773 Opened 13 years ago Closed 13 years ago

[traceback] TimeoutError: Request timed out after 1.000000 seconds

Categories

(addons.mozilla.org Graveyard :: Add-on Builder, defect, P1)

defect

Tracking

(Not tracked)

VERIFIED FIXED
Future

People

(Reporter: stephend, Assigned: aschaar)

References

()

Details

Queries are (seemingly) randomly timing out at times, after a 1-second threshold (https://builder.addons.mozilla.org/search/?q=selection&type=a&copies=2 hit it once, but they vary):

Traceback (most recent call last):

  File "/data/amo_python/www/builder/flightdeck/vendor/lib/python/django/core/handlers/base.py", line 111, in get_response
    response = callback(request, *callback_args, **callback_kwargs)

  File "/data/amo_python/www/builder/flightdeck/apps/search/views.py", line 46, in search
    facets['everyone_total'] = len(qs)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 253, in __len__
    return len(self._do_search())

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 356, in _do_search
    hits = self.raw()

  File "/data/amo_python/www/builder/flightdeck/vendor/src/elasticutils/elasticutils/__init__.py", line 374, in raw
    hits = es.search(qs, settings.ES_INDEX, self.type._meta.db_table)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 806, in search
    return self._query_call("_search", body, indexes, doc_types, **query_params)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 246, in _query_call
    response = self._send_request('GET', path, body, querystring_args)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/es.py", line 205, in _send_request
    response = self.connection.execute(request)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/connection_http.py", line 167, in _client_call
    return getattr(conn.client, attr)(*args, **kwargs)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/connection_http.py", line 59, in execute
    response = self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri, body=request.body, headers=request.headers)

  File "/data/amo_python/www/builder/flightdeck/vendor/src/pyes/pyes/urllib3/connectionpool.py", line 286, in urlopen
    raise TimeoutError("Request timed out after %f seconds" % self.timeout)

TimeoutError: Request timed out after 1.000000 seconds
Assignee: nobody → smcarthur
Priority: -- → P1
Target Milestone: --- → Builder 0.9.10
We've only recently started see this on production, and don't see it on stage. Why would that be? Is something holding up the production cluster?

Or should we just up the timeout to something higher? I cringe at search taking several seconds to return results, though.
I'm betting this is dependent on, or at least heavily related to, bug 683270.
Depends on: 683270
Target Milestone: Builder 0.9.10 → Builder 0.9.11
Has anyone seen this, since bug 683270 was fixed?  I haven't seen any recent tracebacks, and I've also run non-volatile tests on search in production, without any session issues/timeouts.
I have not observed this bug in prod since the fix went in last week. Sean, have you seen this?
I have not. Perhaps the increase in memory from 2GB to 22GB was the fix.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Verified FIXED; thanks, all!
Status: RESOLVED → VERIFIED
This is again happening on prod, please advise.
Assignee: smcarthur → aschaar
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Target Milestone: Builder 0.9.11 → Future
I don't believe we're using it wrong. I'd say that if anything, it's something IT could look at.

Andy McKay says that TimeoutError's are the most common thing they see on AMO also.
Sounds like we need to increase the timeout to reduce the errors, right?

If we do increase it, what is a good timeout?  5, 10, 30 seconds?
You really want to wait 30 seconds for the search page to load (or possible still error?)

In looking at the statsd graph. pretty much all successful calls are less than 100ms. My thoughts are that if it's going to error, error sooner rather than later.

Somtimes timeouts just happen.
I agree with Sean. If we let the requests sit around for a long time we'll just end up hitting max clients on the web servers.
The issue is not whether or not we want the search page to error. The question is, what is a reasonable timeout. With a high degree of absolution I can say 1 second is way too low. Let's do a minimum of 10 seconds, instrument and log the various steps in the rendering of the search route, and try to understand just how far off the apparent mean of 100ms (per Sean's investigation) we are when these timeouts are thrown.

The worst user flow, is one where they see a 500 error because something was going to take 1.5 seconds and instead of serving the page, we bork them.
Ignore terribly errant use of the word "absolution", that was absolutely horrible ;)
I agree we should be upping our logging and investigating why anything is taking over 100ms.

I disagree with upping the timeout, fwiw.
The timeout is 5 seconds on AMO and we get fails regularly. Don't think there's much value in upping the timeout. Is there any pattern on searches that are slow?
Sean, 

Where is the statd page that shows the query times and the types of queries that take longer?
The statsd graph doesn't show what the queries were, it simply logs how long each took. To see each query, you'd need to comb the logs as well.

https://graphite-sjc.mozilla.org/render/?width=586&height=308&from=-7days&title=7%20days&target=stats.timers.builder.search.mean
I would like to add a timeout to the ES call and then make the request timeout a little longer.  

That way we can isolate the ES timeouts, catch them, log them, and find out if there are specific queries that take longer than others.  This way we can prevent the high level ugly 500 error to the user.
Does ES have a separate timeout setting that can be set? Can it be set per-index?
(In reply to Arron Schaar [:arron] from comment #19)
> I would like to add a timeout to the ES call and then make the request
> timeout a little longer.  

That's the current behavior.
Thanks, Jeff.  I went back and read this bug from the top.  I was under the impression that the timeout was for the entire web request.  My mistake.

Sean mentioned that AMO has a ES timeout of 5s and that these errors are fairly common.  Do we know why they happen?  What happens if we retry the query after a short duration? 

Jeremy:  How close are we to the max clients on the hardware side?
I updated the search query code path to retry if a timeout exception is thrown.  I'm also logging the error and query params.  If the query times out twice in a row, then I raise the exception and throw a 500.

Hopefully this will reduce the number of 500s that get returned to the client.  One other possible improvement might be to wait 50-250ms before the retry.

This is basically a band-aid to improve user experience until we figure out why ES is timing out.
Retry code is in place and we increased the timeout to 5s.   We haven't seen a timeout traceback in a few weeks.
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Verified; I haven't seen this in a long time, either.
Status: RESOLVED → VERIFIED
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.