Closed Bug 712069 Opened 13 years ago Closed 12 years ago

Retry failed elastic live indexing somehow

Categories

(support.mozilla.org :: Search, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
2012.7

People

(Reporter: willkg, Assigned: willkg)

References

Details

(Whiteboard: u=dev c=search p=2)

Occasionally when Jenkins runs, it dies when indexing fixtures.  For example:


Traceback (most recent call last):
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/utils/unittest/case.py", line 339, in run
    testMethod()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/../workspace/apps/questions/tests/test_es.py", line 74, in test_question_one_answer_deleted
    question.save()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 101, in save
    super(Question, self).save(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 460, in save
    self.save_base(using=using, force_insert=force_insert, force_update=force_update)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/db/models/base.py", line 570, in save_base
    created=(not record_exists), raw=raw, using=using)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/django/django/dispatch/dispatcher.py", line 172, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/models.py", line 263, in update_question_in_index
    index_questions.delay([instance.id])
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 324, in delay
    return self.apply_async(args, kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 422, in apply_async
    return self.apply(args, kwargs, task_id=task_id)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 583, in apply
    retval = trace.execute()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 76, in execute
    retval = self._trace()
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 86, in _trace
    propagate=self.propagate)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/execute/trace.py", line 34, in trace
    return cls(states.SUCCESS, retval=fun(*args, **kwargs))
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/task/base.py", line 227, in __call__
    return self.run(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/src/celery/celery/app/__init__.py", line 141, in run
    return fun(*args, **kwargs)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/tasks.py", line 97, in index_questions
    es_search.index_doc(es_search.extract_question(q))
  File "/var/lib/jenkins/jobs/sumo-master/workspace/apps/questions/es_search.py", line 108, in index_doc
    id=doc['id'], bulk=bulk, force_insert=force_insert)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 718, in index
    return self._send_request(request_method, path, doc, querystring_args)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/es.py", line 223, in _send_request
    raise_if_error(response.status, decoded)
  File "/var/lib/jenkins/jobs/sumo-master/workspace/vendor/packages/pyes/pyes/convert_errors.py", line 77, in raise_if_error
    raise pyes.exceptions.ElasticSearchException(error, status, result)
ElasticSearchException: RejectedExecutionException[Rejected execution after waiting 120 ms for task [class org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1] to be executed.]


This can probably happen whenever ES goes on holiday.  The problem here is that we're adding incremental index updating to Kitsune on post_save and pre_delete.  If one of those indexing tasks fails like the one above, then that document is stale in the index, which affects search results.

We need to handle this better.  Maybe tossing failed index tasks back in the queue to try again later?  Maybe keeping a list and going through them every 20 minutes to try updating failed index tasks again?
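For reference, the hook involved is roughly this shape (simplified from the traceback above; the real code lives in apps/questions/models.py and apps/questions/tasks.py, and the import paths here are approximations):

# Simplified sketch of the live-indexing hook the traceback goes through.
from django.db.models.signals import post_save

from questions.models import Question         # app model, per the traceback
from questions.tasks import index_questions    # celery task, per the traceback


def update_question_in_index(sender, instance, **kwargs):
    # Runs on every save. If the celery task raises (e.g. ES rejects the
    # request as above), the document for this question stays stale in the
    # index until the next full reindex.
    index_questions.delay([instance.id])


post_save.connect(update_question_in_index, sender=Question)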
Summary: deal with failed indexing → deal with failed indexing for elastic search
Simply retrying the task once won't suffice. The test failures here are caused by ES not responding 7 times in a row: https://ci.mozilla.org/job/sumo-master/1255/testReport/junit/junit/.
Summary: deal with failed indexing for elastic search → Retry failed elastic live indexing somehow
I think we need to figure out what to do about this sooner rather than later, or we'll potentially be stuck with stale indexes.

I think when incremental indexing fails, the celery task should put the indexing task into a queue somewhere. Then every 20 minutes, we have a cronjob check the queue and re-try all the things that failed. Successes get nixed from the queue. Failures stay in the queue.
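A minimal sketch of that idea, assuming a hypothetical FailedIndexTask model and a cron hook (neither exists yet; the names are made up):

# Hypothetical sketch: queue up failed indexing work in the db, retry from cron.
from django.db import models


class FailedIndexTask(models.Model):
    doc_type = models.CharField(max_length=30)   # e.g. 'question'
    doc_id = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)


def retry_failed_index_tasks():
    # Run from cron every ~20 minutes.
    for failed in FailedIndexTask.objects.all():
        try:
            # reindex_doc() stands in for the per-doctype indexing call,
            # e.g. es_search.index_doc(es_search.extract_question(q)).
            reindex_doc(failed.doc_type, failed.doc_id)
        except Exception:
            continue          # still failing: leave it in the queue
        failed.delete()       # success: nix it from the queue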
Whiteboard: u=dev c=search s=2012.3 p=
Bumping important ES bugs up to P1.
Priority: -- → P1
Whiteboard: u=dev c=search s=2012.3 p= → u=dev c=search s=2012.3 p=2
The ES we connect to for Jenkins seems much better these days.

Also, we raised the ES_INDEXING_TIMEOUT to 30 seconds.

I looked at what happens when things fail. As near as I can tell, when an indexing celery task fails, we should get an email. We haven't gotten any emails since live indexing was turned on in production.

Given all that, I think we can either lower the priority on this one or push it off altogether. In other words, I think we should bump this out of the sprint.
Making this a P3 with a high probability of getting punted.
Priority: P1 → P3
Is this a candidate for WONTFIX?
Whiteboard: u=dev c=search s=2012.3 p=2 → u=dev c=search s= p=2
I think this is something we're going to want to address in some way. The problem is that if live indexing fails, the document it failed for goes stale in the index, and with the current system it won't get updated until we happen to do another full reindexing.

We don't have this problem with Sphinx because our Sphinx code reindexes every 20 minutes.

So, I think we should address this, but it's something I'm comfortable with pushing off while we have other more pressing issues since we haven't seen a live indexing failure so far.
Putting this one back in the queue! A few weeks ago we had a few hours where all live indexing was failing, which leaves us with a bunch of stale documents out there now.

This should get figured out.

A couple of possibilities:

1. Celery can retry tasks that fail. Definitely worth doing that (see the sketch after this list).

2. For tasks that keep failing over some non-trivial period, we need some fallback plan. Maybe these are so few that we can reindex them by hand from an admin page? Maybe add non-destructive reindexing support?
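For possibility 1, the vendored celery supports something along these lines. The retry count and delay are placeholders, and the task body is paraphrased from the traceback rather than copied from apps/questions/tasks.py:

# Sketch of possibility 1: lean on celery's built-in retry support.
from celery.task import task
from pyes.exceptions import ElasticSearchException

from questions import es_search
from questions.models import Question


@task(max_retries=3, default_retry_delay=60)
def index_questions(ids, **kw):
    try:
        for q in Question.objects.filter(id__in=ids):
            es_search.index_doc(es_search.extract_question(q))
    except ElasticSearchException as exc:
        # Toss the task back on the queue and try again in a minute.
        index_questions.retry(args=[ids], kwargs=kw, exc=exc)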
Assignee: nobody → willkg
Whiteboard: u=dev c=search s= p=2 → u=dev c=search p=2
Target Milestone: --- → 2012.6
Priority: P3 → P1
(Just some data I want to capture.)

Failure periods for elastic search live indexing over the last 3 months:

1/31/2012  11:45am to 6:45pm    6h
2/29/2012  12:34am to 3:48am    3h
2/29/2012   9:26am to 10:19am   < 1h
3/24/2012   9:57am to 10:08am   < 1h
Target Milestone: 2012.6 → 2012.7
For now, I'm going to wrap the tasks in a logarithmic retry decorator. That's item 1 in comment #8.

Then we can see if that's good enough. I'm balking at dealing with the complexity in code for handling all possible ES outages right now. I'm not entirely sure it's needed, plus if we have that problem, then we've probably got other problems to deal with, too.
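Such a decorator could take roughly this shape (illustrative only; this is not the patch that actually landed, which is linked in the next comment):

# Rough sketch of a retry decorator whose delay grows slowly between attempts.
import functools
import math
import time

from pyes.exceptions import ElasticSearchException


def es_retry(attempts=5, base_delay=1.0):
    # Retry the wrapped callable on ES errors, sleeping a bit longer each time.
    def decorator(fun):
        @functools.wraps(fun)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fun(*args, **kwargs)
                except ElasticSearchException:
                    if attempt == attempts - 1:
                        raise
                    time.sleep(base_delay * math.log(attempt + 2))
        return wrapper
    return decorator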
Landed in master in https://github.com/mozilla/kitsune/commit/b3445cbb5251741fcdb5fc25c6422abff28950a3

Pushed to stage and production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED