[traceback] TimeoutError: Request timed out after 5.000000 seconds

RESOLVED FIXED in 6.1.6

Status

Product: addons.mozilla.org Graveyard
Component: Public Pages
Platform: --
Severity: major
Status: RESOLVED FIXED
Reported: 6 years ago
Last modified: 2 years ago

People

(Reporter: krupa, Assigned: aphadke)

Tracking

Version: unspecified
Target Milestone: 6.1.6

(Reporter)

Description

6 years ago
steps to reproduce:
1. Load https://addons.allizom.org/z/en-US/firefox/es/extensions/?sort=name

reproducible? sometimes. Usually, the page loads fine on reload.

traceback details:
Traceback (most recent call last):

 File "/data/amo_python/www/preview/zamboni/vendor/src/django/django/core/handlers/base.py", line 111, in get_response
   response = callback(request, *callback_args, **callback_kwargs)

 File "/data/amo_python/www/preview/zamboni/vendor/src/django-mobility/mobility/decorators.py", line 23, in wrapper
   return f(request, *args, **kw)

 File "/data/amo_python/www/preview/zamboni/apps/browse/views.py", line 193, in es_extensions
   addons = amo.utils.paginate(request, qs)

 File "/data/amo_python/www/preview/zamboni/apps/amo/utils.py", line 109, in paginate
   paginated = p.page(page)

 File "/data/amo_python/www/preview/zamboni/vendor/src/django/django/core/paginator.py", line 38, in page
   number = self.validate_number(number)

 File "/data/amo_python/www/preview/zamboni/vendor/src/django/django/core/paginator.py", line 29, in validate_number
   if number > self.num_pages:

 File "/data/amo_python/www/preview/zamboni/vendor/src/django/django/core/paginator.py", line 61, in _get_num_pages
   if self.count == 0 and not self.allow_empty_first_page:

 File "/data/amo_python/www/preview/zamboni/vendor/src/django/django/core/paginator.py", line 49, in _get_count
   self._count = self.object_list.count()

 File "/data/amo_python/www/preview/zamboni/apps/amo/search.py", line 78, in count
   num = self._do_search().count

 File "/data/amo_python/www/preview/zamboni/apps/amo/search.py", line 133, in _do_search
   hits = es.search(qs, settings.ES_INDEX, self.type._meta.app_label)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/es.py", line 841, in search
   return self._query_call("_search", body, indexes, doc_types, **query_params)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/es.py", line 251, in _query_call
   response = self._send_request('GET', path, body, querystring_args)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/es.py", line 208, in _send_request
   response = self.connection.execute(request)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/connection_http.py", line 167, in _client_call
   return getattr(conn.client, attr)(*args, **kwargs)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/connection_http.py", line 59, in execute
   response = self.client.urlopen(Method._VALUES_TO_NAMES[request.method], uri, body=request.body, headers=request.headers)

 File "/data/amo_python/www/preview/zamboni/vendor/lib/python/pyes/urllib3/connectionpool.py", line 286, in urlopen
   raise TimeoutError("Request timed out after %f seconds" % self.timeout)

TimeoutError: Request timed out after 5.000000 seconds
Dave: can you tell us anything about this?

We were getting this error previously when the box couldn't access ES at all (firewall), but if it works on reload, that may not be the case here.
Assignee: nobody → dd
I can only guess at theories:

1. Perhaps a single web-head is having issues? This isn't likely, and oremj can quickly confirm or deny it by running a curl against Elasticsearch (see the sketch after this comment).
2. It's a 5-second timeout. Is there a possibility that Elasticsearch's frontend can get blocked? It shouldn't be getting that much load, so this is alarming, but do we need to put more boxes up in the staging environment and load-balance them? I'm not sure that would fix the issue entirely.

Anurag, any ideas about #2? What would cause ES to not respond within 5 seconds?
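
A minimal sketch of the check in #1, assuming Python 2's standard library and the stage host named later in this bug (comment #8); oremj would run the equivalent curl from each web-head:

# Probe the ES node's cluster-health endpoint from a web-head to rule
# out a per-box connectivity problem.
import urllib2

URL = 'http://elasticsearch1.webapp.phx1.mozilla.com:9200/_cluster/health'
try:
    print urllib2.urlopen(URL, timeout=5).read()
except Exception, e:
    print 'ES unreachable: %s' % e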
(Assignee)

Comment 3

6 years ago
from the logs:
[2011-06-27 11:48:45,167][DEBUG][action.search.type       ] [Jude the Entropic Man] [preview-amo][3], node[wQKqYug3RCmniboFHzqOuw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@256ffe4d]
org.elasticsearch.search.query.QueryPhaseExecutionException: [preview-amo][3]: query[ConstantScore(*:*)],from[0],size[20],sort[<custom:"name": org.elasticsearch.index.field.data.strings.StringFieldDataType$1@75680c1>]: Query Failed [Failed to execute main query]
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:215)
	at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:222)
	at org.elasticsearch.search.action.SearchServiceTransportAction.sendExecuteQuery(SearchServiceTransportAction.java:134)
	at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.sendExecuteFirstPhase(TransportSearchQueryThenFetchAction.java:76)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.performFirstPhase(TransportSearchTypeAction.java:192)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.access$000(TransportSearchTypeAction.java:75)
	at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.run(TransportSearchTypeAction.java:151)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
Caused by: java.io.IOException: Can't sort on string types with more than one value per doc, or more than one token per field
	at org.elasticsearch.index.field.data.strings.StringOrdValFieldDataComparator.setNextReader(StringOrdValFieldDataComparator.java:119)
	at org.apache.lucene.search.TopFieldCollector$OneComparatorNonScoringCollector.setNextReader(TopFieldCollector.java:95)
	at org.elasticsearch.common.lucene.search.FilteredCollector.setNextReader(FilteredCollector.java:58)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:516)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:177)
	at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:143)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:426)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:349)
	at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:209)
	... 9 more


It seems we are running a query that's not supported by ES. I need to discuss this with the devs; what channel should I pop into?
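
The usual fix for that exception is to sort on an untokenized copy of the field, so the sort field has exactly one token per document. A hypothetical mapping sketch (field names illustrative; this is not AMO's actual mapping):

# Keep an analyzed 'name' for full-text search and a not_analyzed
# 'name.sort' sub-field for sorting; sorting on the sub-field avoids
# the "more than one token per field" error above.
addon_mapping = {
    'name': {
        'type': 'multi_field',
        'fields': {
            'name': {'type': 'string', 'index': 'analyzed'},
            'sort': {'type': 'string', 'index': 'not_analyzed'},
        },
    },
}
# Queries would then sort on 'name.sort' instead of 'name'.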
(In reply to comment #3)
> It seems we are running a query that's not supported by ES.

Sorting by name is from https://addons.allizom.org/z/en-US/firefox/es/extensions/?sort=name and that works fine (except when it times out).

The other page I've seen timeouts on is https://addons.allizom.org/z/en-US/firefox/i/.

Most of the time those pages are fine, but sometimes they time out. Like Dave said, this is alarming since there's very little load coming from preview.

> I need to discuss this with the devs; what channel should I pop into?

I'm in #metrics and #amo.
(Assignee)

Comment 5

6 years ago
I don't see anything in the logs. Can you guys send me the approximate time when you got the timeout?
(In reply to comment #5)
> I don't see anything in the logs. Can you guys send me the approximate time
> when you got the timeout?

Three timeouts today:

2011-06-27 12:54
2011-06-27 09:06
2011-06-27 11:28
(Assignee)

Comment 7

6 years ago
This is all I see in the logs, and nothing looks unusual, so I'm not sure what might be happening here. Are you guys hitting this via the VIP or the server directly?

[2011-06-27 09:05:53,006][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25138] took [5ms]/[1.3h], reclaimed [16.6mb], leaving [503.4mb] used, max [1.6gb]
[2011-06-27 09:06:06,592][DEBUG][indices.memory           ] [Jude the Entropic Man] marking shard [input][3] as active
[2011-06-27 09:06:06,593][DEBUG][indices.memory           ] [Jude the Entropic Man] recalculating shard indexing buffer (reason=shards became active/inactive), total is [153.3mb] with [3] active shards, each shard set to [51.1mb]
[2011-06-27 09:06:12,264][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25141] took [188ms]/[1.3h], reclaimed [16.5mb], leaving [504.6mb] used, max [1.6gb]
[2011-06-27 09:06:13,264][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25148] took [4ms]/[1.3h], reclaimed [16.6mb], leaving [504.7mb] used, max [1.6gb]
[2011-06-27 09:06:36,593][DEBUG][indices.memory           ] [Jude the Entropic Man] marking shard [input][2] as active
[2011-06-27 09:06:36,593][DEBUG][indices.memory           ] [Jude the Entropic Man] recalculating shard indexing buffer (reason=shards became active/inactive), total is [153.3mb] with [4] active shards, each shard set to [38.3mb]
[2011-06-27 09:06:47,294][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25149] took [17ms]/[1.3h], reclaimed [14.8mb], leaving [506.4mb] used, max [1.6gb]
[2011-06-27 09:06:59,313][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25150] took [230ms]/[1.3h], reclaimed [15mb], leaving [508mb] used, max [1.6gb]
[2011-06-27 09:07:00,313][DEBUG][monitor.jvm              ] [Jude the Entropic Man] [gc][Copy][25151] took [96ms]/[1.3h], reclaimed [16.4mb], leaving [508.2mb] used, max [1.6gb]
Hitting directly:

['elasticsearch1.webapp.phx1.mozilla.com:9200']
(Reporter)

Updated

6 years ago
Blocks: 654034
We need to solve this ASAP; it's going to block our push on Thursday. What do you need from us?
Severity: normal → blocker
(Assignee)

Comment 10

6 years ago
@clouserw:
Try hitting the server mentioned in comment #8 in prod.

For staging:
Add the timeout argument and set it to 60 seconds (http://www.elasticsearch.org/guide/reference/api/search/request-body.html). I don't see any timeout errors in the ES logs (see comment #7), so I'm not sure what the issue is here. A larger timeout might help us see some messages in the logs.

BTW, what happens when the timeout occurs? Does the page render with the ES query returning blank, or does the page not render at all? Could this be a VIP issue?
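
The request-body timeout aphadke mentions lives inside the search body itself; a minimal sketch, assuming a match-all query like the one in the log from comment #3 (values illustrative):

# Server-side bound from the guide linked above: ES stops after the
# given time and returns the hits it has, flagging 'timed_out',
# instead of letting the client hang until its own 5-second limit.
query = {
    'query': {'match_all': {}},
    'sort': [{'name': 'asc'}],
    'timeout': '60s',
}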
(In reply to comment #10)
> @clouserw:
> Try hitting the server mentioned in comment #8 in prod.

What does "try hitting" mean?  If it's in prod, oremj will have to do it.

> For staging:
> Add the timeout argument and set it to 60 seconds
> (http://www.elasticsearch.org/guide/reference/api/search/request-body.html).
> I don't see any timeout errors in the ES logs (see comment #7), so I'm not
> sure what the issue is here. A larger timeout might help us see some
> messages in the logs.

Dave, can you tell us what to add for this?  I think last time we tried to add the setting we had to monkeypatch a bunch of stuff in the library also (it was just for testing), but I forgot the details.

> BTW, what happens when the timeout occurs? Does the page render with the ES
> query returning blank, or does the page not render at all? Could this be a
> VIP issue?

There is a traceback and the app loads the 500 page.  I have no idea about VIP - oremj can help with infra stuff.
(Assignee)

Comment 12

6 years ago
@clouserw/@oremj: please use ['elasticsearch1.webapp.phx1.mozilla.com:9200'] once we go to production

wrt timeout errors: @oremj / @davedash: please let me know your replies so we can narrow this thing down.

thx.
We're reverting this page from using ES so this won't be a blocker.  If I can help with anything though, please let me know.
Severity: blocker → major
aphadke - there's no VIP; we're hitting the ES box directly.

clouserw - we should catch the errors as a fallback and show a video of potch singing AC/DC.
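
Joking aside, a sketch of the catch-and-fall-back idea (hypothetical helper; the TimeoutError import path is taken from the traceback in the description):

# Hypothetical fallback wrapper: if the ES query times out, serve the
# plain database queryset instead of raising a 500.
from pyes.urllib3.connectionpool import TimeoutError

def search_with_fallback(es_qs, db_qs):
    try:
        return list(es_qs)
    except TimeoutError:
        return list(db_qs)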

As far as the timeout goes... when we use ES we need to set timeout there:

https://github.com/aparo/pyes/blob/master/pyes/es.py#L100

I can build that into elasticutils... it could read a setting called ES_TIMEOUT.  I'll do that, and update zamboni accordingly.
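
A minimal sketch of what that could look like (the real elasticutils change may differ; the ES_HOSTS name is assumed from the host lists quoted in this bug):

# Build the pyes client from Django settings, honoring an optional
# ES_TIMEOUT; pyes itself defaults to 5 seconds, which matches the
# 5.000000-second timeouts in the tracebacks.
from django.conf import settings
from pyes import ES

def get_es():
    return ES(settings.ES_HOSTS,
              timeout=getattr(settings, 'ES_TIMEOUT', 5))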
(In reply to comment #14)
> I can build that into elasticutils... it could read a setting called
> ES_TIMEOUT.  I'll do that, and update zamboni accordingly.

Yes please. Timeouts much greater than 100ms are not acceptable for production.
(Assignee)

Comment 16

6 years ago
@jeff: this is only to narrow down the problem...
(Assignee)

Comment 17

6 years ago
wrt comment #12, that was a bad copy-paste:

Stage server:
['elasticsearch1.webapp.phx1.mozilla.com:9200']

Prod server:
elasticsearch1.metrics.sjc1.mozilla.com:9200

sorry for the confusion.
Okay that's here:

$YOUR-VENDOR-REPO-NAME-HERE/commit/bbe3db0

oremj, can you set:

ES_TIMEOUT=60

in settings_local.py

for stage... hopefully krupa's constant hammering on PAMO will result in some errors.

jbalogh, are the query times graphed in statsd?
(In reply to comment #18)
> jbalogh, are the query times graphed in statsd?

Of course! https://graphite-sjc.mozilla.org/render/?width=586&height=308&_salt=1309290020.65&target=stats.timers.pamo.search.lower&target=stats.timers.pamo.search.mean&target=stats.timers.pamo.search.upper_90&from=-4hours

It's not too useful since there's very little traffic.
Can we get any stats out of the VM or VM host that is running ES? Maybe the VM is not getting enough resources?
Who is running the VM? I thought that was all metrics.
Target Milestone: 6.1.4 → 6.1.5
We just hit a timeout after 60 seconds at [Tue, 5 Jul 2011 10:37:41 -0700 (PDT)].

(In reply to comment #20)
> Can we get any stats out of the VM or VM host that is running ES? Maybe the
> VM is not getting enough resources?

I think that was the conclusion after talking to aphadke in #metrics.
Assignee: dd → aphadke
(Assignee)

Comment 23

6 years ago
@jbalogh:
I have created a new 2-node Elasticsearch cluster for you guys. Please change the URLs to point at:
http://node16.generic.metrics.sjc1.mozilla.com:9200/

This cluster is a lot beefier than the current one you are using. IO/memory/disk constraints, if any, should no longer be a problem on this new instance.
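
In settings terms that would be something like (ES_HOSTS name assumed, as above):

# settings_local.py sketch: point zamboni at the new 2-node cluster.
ES_HOSTS = ['node16.generic.metrics.sjc1.mozilla.com:9200']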
Target Milestone: 6.1.5 → 6.1.6
Connected to the new cluster in bug 669465. Thanks aphadke!
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: addons.mozilla.org → addons.mozilla.org Graveyard