Closed Bug 712069 Opened 13 years ago Closed 12 years ago

Retry failed elastic live indexing somehow

Categories

(support.mozilla.org :: Search, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
2012.7

People

(Reporter: willkg, Assigned: willkg)

References

Details

(Whiteboard: u=dev c=search p=2)

Occasionally when Jenkins runs, it dies when indexing fixtures.  For example:


Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/utils/unittest/case.py", line 339, in run
    testMethod()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/../workspace/apps/questions/tests/test_es.py", line 74, in test_question_one_answer_deleted
    question.save()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 101, in save
    super(Question, self).save(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 460, in save
    self.save_base(using=using, force_insert=force_insert, force_update=force_update)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 570, in save_base
    created=(not record_exists), raw=raw, using=using)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/dispatch/dispatcher.py", line 172, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 263, in update_question_in_index
    index_questions.delay([instance.id])
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 324, in delay
    return self.apply_async(args, kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 422, in apply_async
    return self.apply(args, kwargs, task_id=task_id)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 583, in apply
    retval = trace.execute()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 76, in execute
    retval = self._trace()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 86, in _trace
    propagate=self.propagate)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 34, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 227, in __call__
    return self.run(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/app/__init__.py", line 141, in run
    return fun(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/tasks.py", line 97, in index_questions
    es_search.index_doc(es_search.extract_question(q))
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/es_search.py", line 108, in index_doc
    id=doc['id'], bulk=bulk, force_insert=force_insert)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 718, in index
    return self._send_request(request_method, path, doc, querystring_args)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 223, in _send_request
    raise_if_error(response.status, decoded)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/convert_errors.py", line 77, in raise_if_error
    raise pyes.exceptions.ElasticSearchException(error, status, result)
ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]


This can probably happen whenever ES goes on holiday.  The problem here is that we're adding incremental index updating to Kitsune on post_save and pre_delete.  If one of those indexing tasks fails like the one above, then that document is stale in the index, which affects search results.

We need to handle this better.  Maybe tossing failed index tasks back in the queue to try again later?  Maybe keeping a list and going through them every 20 minutes to try updating failed index tasks again?
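For reference, the hook involved is roughly this shape (simplified from the traceback above; the real code lives in apps/questions/models.py and apps/questions/tasks.py, and the import paths here are approximations):

# Simplified sketch of the live-indexing hook the traceback goes through.
from django.db.models.signals import post_save

from questions.models import Question         # app model, per the traceback
from questions.tasks import index_questions    # celery task, per the traceback


def update_question_in_index(sender, instance, **kwargs):
    # Runs on every save. If the celery task raises (e.g. ES rejects the
    # request as above), the document for this question stays stale in the
    # index until the next full reindex.
    index_questions.delay([instance.id])


post_save.connect(update_question_in_index, sender=Question)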
Summary: deal with failed indexing → deal with failed indexing for elastic search
Simply retrying the task once won't suffice. The test failures here are caused by ES not responding 7 times in a row: https://ci.mozilla.org/job/sumo-master/1255/testReport/junit/junit/.
Summary: deal with failed indexing for elastic search → Retry failed elastic live indexing somehow
I think we need to figure out what to do about this sooner rather than later, or we'll potentially be stuck with stale indexes.

I think when incremental indexing fails, the celery task should put the indexing task into a queue somewhere. Then every 20 minutes, we have a cronjob check the queue and re-try all the things that failed. Successes get nixed from the queue. Failures stay in the queue.
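A minimal sketch of that idea, assuming a hypothetical FailedIndexTask model and a cron hook (neither exists yet; the names are made up):

# Hypothetical sketch: queue up failed indexing work in the db, retry from cron.
from django.db import models


class FailedIndexTask(models.Model):
    doc_type = models.CharField(max_length=30)   # e.g. 'question'
    doc_id = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)


def retry_failed_index_tasks():
    # Run from cron every ~20 minutes.
    for failed in FailedIndexTask.objects.all():
        try:
            # reindex_doc() stands in for the per-doctype indexing call,
            # e.g. es_search.index_doc(es_search.extract_question(q)).
            reindex_doc(failed.doc_type, failed.doc_id)
        except Exception:
            continue          # still failing: leave it in the queue
        failed.delete()       # success: nix it from the queue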
Whiteboard: u=dev c=search s=2012.3 p=
Bumping important ES bugs up to P1.
Priority: -- → P1
Whiteboard: u=dev c=search s=2012.3 p= → u=dev c=search s=2012.3 p=2
The ES we connect to for Jenkins seems much better these days.

Also, we raised the ES_INDEXING_TIMEOUT to 30 seconds.

I looked at what happens when things fail. As near as I can tell, when an indexing celery task fails, we should get an email. We haven't gotten any emails since live indexing was turned on in production.

Given all that, I think we can either lower the priority on this one or push it off altogether. In other words, I think we should bump this out of the sprint.
Making this a P3 with a high probability of getting punted.
Priority: P1 → P3
Is this a candidate for WONTFIX?
Whiteboard: u=dev c=search s=2012.3 p=2 → u=dev c=search s= p=2
I think this is something we're going to want to address in some way. The problem is that if live indexing fails, the document it failed for goes stale in the index, and with the current system it won't get updated until we happen to do another full reindexing.

We don't have this problem with Sphinx because our Sphinx code reindexes every 20 minutes.

So, I think we should address this, but it's something I'm comfortable with pushing off while we have other more pressing issues since we haven't seen a live indexing failure so far.
Putting this one back in the queue! A few weeks ago we had a few hours where all live indexing was failing, which leaves us with a bunch of stale documents out there now.

This should get figured out.

A couple of possibilities:

1. Celery can retry tasks that fail. Definitely worth doing that (see the sketch after this list).

2. For tasks that keep failing over some non-trivial period, we need some fallback plan. Maybe these are so few that we can reindex them by hand from an admin page? Maybe add non-destructive reindexing support?
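For possibility 1, the vendored celery supports something along these lines. The retry count and delay are placeholders, and the task body is paraphrased from the traceback rather than copied from apps/questions/tasks.py:

# Sketch of possibility 1: lean on celery's built-in retry support.
from celery.task import task
from pyes.exceptions import ElasticSearchException

from questions import es_search
from questions.models import Question


@task(max_retries=3, default_retry_delay=60)
def index_questions(ids, **kw):
    try:
        for q in Question.objects.filter(id__in=ids):
            es_search.index_doc(es_search.extract_question(q))
    except ElasticSearchException as exc:
        # Toss the task back on the queue and try again in a minute.
        index_questions.retry(args=[ids], kwargs=kw, exc=exc)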
Assignee: nobody → willkg
Whiteboard: u=dev c=search s= p=2 → u=dev c=search p=2
Target Milestone: --- → 2012.6
Priority: P3 → P1
(Just some data I want to capture.)

Failure periods for elastic search live indexing over the last 3 months:

1/31/2012  11:45am to 6:45pm    6h
2/29/2012  12:34am to 3:48am    3h
2/29/2012   9:26am to 10:19am   < 1h
3/24/2012   9:57am to 10:08am   < 1h
Target Milestone: 2012.6 → 2012.7
For now, I'm going to wrap the tasks in a logarithmic retry decorator. That's item 1 in comment #8.

Then we can see if that's good enough. I'm balking at dealing with the complexity in code for handling all possible ES outages right now. I'm not entirely sure it's needed, plus if we have that problem, then we've probably got other problems to deal with, too.
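Such a decorator could take roughly this shape (illustrative only; this is not the patch that actually landed, which is linked in the next comment):

# Rough sketch of a retry decorator whose delay grows slowly between attempts.
import functools
import math
import time

from pyes.exceptions import ElasticSearchException


def es_retry(attempts=5, base_delay=1.0):
    # Retry the wrapped callable on ES errors, sleeping a bit longer each time.
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fun(*args, **kwargs)
                except ElasticSearchException:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * math.log(attempt + 2))
        return wrapper
    return decorator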
Landed in master in https://github.com/mozilla/kitsune/commit/b3445cbb5251741fcdb5fc25c6422abff28950a3

Pushed to stage and production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED