Closed
Bug 712069
Opened 13 years ago
Closed 12 years ago
Retry failed elastic live indexing somehow
Categories
(support.mozilla.org :: Search, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
2012.7
People
(Reporter: willkg, Assigned: willkg)
References
Details
(Whiteboard: u=dev c=search p=2)
Occasionally when Jenkins runs, it dies when indexing fixtures. For example:

    Traceback (most recent call last):
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/utils/unittest/case.py", line 339, in run
        testMethod()
      File "/var/lib/jenkins/jobs/sumo-master/workspace/../workspace/apps/questions/tests/test_es.py", line 74, in test_question_one_answer_deleted
        question.save()
      File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 101, in save
        super(Question, self).save(*args, **kwargs)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 460, in save
        self.save_base(using=using, force_insert=force_insert, force_update=force_update)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 570, in save_base
        created=(not record_exists), raw=raw, using=using)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/dispatch/dispatcher.py", line 172, in send
        response = receiver(signal=self, sender=sender, **named)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 263, in update_question_in_index
        index_questions.delay([instance.id])
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 324, in delay
        return self.apply_async(args, kwargs)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 422, in apply_async
        return self.apply(args, kwargs, task_id=task_id)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 583, in apply
        retval = trace.execute()
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 76, in execute
        retval = self._trace()
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 86, in _trace
        propagate=self.propagate)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 34, in trace
        return cls(states.SUCCESS, retval=fun(*args, **kwargs))
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 227, in __call__
        return self.run(*args, **kwargs)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/app/__init__.py", line 141, in run
        return fun(*args, **kwargs)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/tasks.py", line 97, in index_questions
        es_search.index_doc(es_search.extract_question(q))
      File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/es_search.py", line 108, in index_doc
        id=doc['id'], bulk=bulk, force_insert=force_insert)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 718, in index
        return self._send_request(request_method, path, doc, querystring_args)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 223, in _send_request
        raise_if_error(response.status, decoded)
      File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/convert_errors.py", line 77, in raise_if_error
        raise pyes.exceptions.ElasticSearchException(error, status, result)
    ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]

This can probably happen whenever ES goes on holiday. The problem here is that we're adding incremental index updating to Kitsune on post_save and pre_delete. If one of those indexing tasks fails like the above, then that document in the index is stale. This affects search results.

We need to handle this better. Maybe toss failed index tasks back in the queue to try again later? Maybe keep a list and go through it every 20 minutes, retrying the failed index tasks?
Assignee
Updated•13 years ago
Summary: deal with failed indexing → deal with failed indexing for elastic search
Comment 1•13 years ago
Simply retrying the task once won't suffice. The test failures here are caused by ES not responding 7 times in a row: https://ci.mozilla.org/job/sumo-master/1255/testReport/junit/junit/.
Updated•12 years ago
Summary: deal with failed indexing for elastic search → Retry failed elastic live indexing somehow
Assignee
Comment 2•12 years ago
I think we need to figure out what to do about this sooner rather than later, or we risk serving results from stale indexes. When incremental indexing fails, I think the celery task should put the indexing task into a queue somewhere. Then every 20 minutes, a cronjob checks the queue and retries everything that failed. Successes get nixed from the queue; failures stay in it.
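The queue-and-retry scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Kitsune code; the names RetryQueue, record_failure, and retry_all are hypothetical:

```python
class RetryQueue:
    """Holds ids of documents whose last index attempt failed."""

    def __init__(self):
        self._pending = set()

    def record_failure(self, doc_id):
        # Called from the indexing task's error handler.
        self._pending.add(doc_id)

    def retry_all(self, index_fn):
        """Re-run indexing for every queued doc (e.g. from a cronjob).

        Successes are dropped from the queue; failures stay for the
        next pass. Returns the ids that are still failing.
        """
        still_failing = set()
        for doc_id in sorted(self._pending):
            try:
                index_fn(doc_id)
            except Exception:
                still_failing.add(doc_id)
        self._pending = still_failing
        return still_failing
```

A cronjob running every 20 minutes would just call `retry_all` with the real indexing function; in production the pending set would need to live somewhere durable (the database, not process memory).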
Whiteboard: u=dev c=search s=2012.3 p=
Updated•12 years ago
Whiteboard: u=dev c=search s=2012.3 p= → u=dev c=search s=2012.3 p=2
Assignee
Comment 4•12 years ago
The ES we connect to for Jenkins seems much better these days. Also, we raised ES_INDEXING_TIMEOUT to 30 seconds. I looked at what happens when things fail: as near as I can tell, when an indexing celery task fails, we should get an email, and we haven't gotten any emails since live indexing was turned on in production. Given all that, I think we can either lower the priority on this one or push it off altogether. In other words, I think we should bump this out of the sprint.
Comment 5•12 years ago
Making this a P3 with a high probability of getting punted.
Priority: P1 → P3
Comment 6•12 years ago
Is this a candidate for WONTFIX?
Whiteboard: u=dev c=search s=2012.3 p=2 → u=dev c=search s= p=2
Assignee
Comment 7•12 years ago
I think this is something we're going to want to address in some way. The problem this creates is that if live indexing fails, the document it failed on stays stale in the index, and with the current system it won't get updated until we happen to do another reindexing. We don't have this problem with Sphinx because our Sphinx code reindexes every 20 minutes. So I think we should address this, but since we haven't seen a live indexing failure so far, I'm comfortable pushing it off while we have other more pressing issues.
Assignee
Comment 8•12 years ago
Putting this one back in the queue! We had a few hours a few weeks ago where all the indexing was failing, and that leaves us with a bunch of stale documents in the index. This should get figured out. A couple of possibilities:

1. celery can retry tasks that fail. Definitely worth doing that.

2. For tasks that keep failing through some non-trivial period, we need a fallback plan. Maybe these are so few that we can reindex them by hand with an admin page? Maybe add non-destructive reindexing support?
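Possibility 1 maps onto celery's built-in task retry support. A minimal sketch in the style of a modern celery API, not the celery 2.x API the traceback shows; the task body and the query_questions helper are placeholders based on the traceback in the description, not the actual Kitsune code:

```python
from celery import Celery
from pyes.exceptions import ElasticSearchException

app = Celery("kitsune")  # hypothetical wiring; real broker config lives elsewhere

@app.task(bind=True, max_retries=5, default_retry_delay=60)
def index_questions(self, ids):
    try:
        for q in query_questions(ids):  # placeholder for the real lookup
            es_search.index_doc(es_search.extract_question(q))
    except ElasticSearchException as exc:
        # Hand the failure back to celery: the task is re-queued with a
        # delay and gives up after max_retries attempts.
        raise self.retry(exc=exc)
```

This covers transient ES hiccups; anything that exhausts max_retries still needs the fallback plan from possibility 2.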
Assignee: nobody → willkg
Whiteboard: u=dev c=search s= p=2 → u=dev c=search p=2
Target Milestone: --- → 2012.6
Assignee
Updated•12 years ago
Priority: P3 → P1
Assignee
Comment 9•12 years ago
(Just some data I want to capture.) Failure periods for elastic search live indexing over the last 3 months:

1/31/2012 11:45am to 6:45pm (6h)
2/29/2012 12:34am to 3:48am (3h)
2/29/2012 9:26am to 10:19am (< 1h)
3/24/2012 9:57am to 10:08am (< 1h)
Updated•12 years ago
Target Milestone: 2012.6 → 2012.7
Assignee
Comment 10•12 years ago
For now, I'm going to wrap the tasks in a logarithmic retry decorator. That's item 1 in comment #8. Then we can see if that's good enough. I'm balking at handling every possible ES outage in code right now; I'm not entirely sure it's needed, and if we hit that problem, we've probably got other problems to deal with, too.
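A retry decorator along these lines might look like the following sketch. Note this implements exponential backoff (doubling the delay between attempts), a common stand-in for what the comment calls a "logarithmic" retry; the name retry_with_backoff is hypothetical and this is not the actual implementation that landed:

```python
import functools
import time

def retry_with_backoff(attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a flaky callable, doubling the wait after each failure.

    Re-raises the last exception once all attempts are exhausted. The
    sleep argument is injectable so tests don't have to actually wait.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise
                    sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

Wrapping the indexing tasks with something like this absorbs short ES outages; multi-hour outages like the ones in comment 9 still exhaust the retries and would need the separate fallback queue.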
Assignee | ||
Comment 11•12 years ago
Landed in master: https://github.com/mozilla/kitsune/commit/b3445cbb5251741fcdb5fc25c6422abff28950a3

Pushed to stage and production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED