Closed Bug 720935 Opened 12 years ago Closed 12 years ago

Parallelize elastic indexing

Categories

(support.mozilla.org :: Search, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
2012.10

People

(Reporter: erik, Assigned: willkg)

References

Details

(Whiteboard: u=dev c=search p=2)

We'll have to split this up into multiple celery tasks to index in a reasonable amount of time if we want to do all the wiki rendering needed by bug 710469. Python is already the bottleneck during indexing (on my laptop), so we could probably go to at least a few processes without wrecking our DB servers. How to predict this? Load graphs?

Celery doesn't seem to have any per-task concurrency controls built in, but we could make some out of thingies that start more tasks from the end of finishing tasks. (We don't want to fill all 32 workers with long-running indexing tasks and starve everything else out.)
Target Milestone: --- → 2012Q1
Celery does have a rate limit. If you experimentally found that each task took about 1 minute, you could set the rate limit to 4/min, to essentially let 4 run in parallel.
Of course, how often, if ever, will we run the full index in production once we're past the testing phase?
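A rough sketch of the arithmetic behind the rate-limit suggestion above: the average number of tasks in flight is roughly the start rate multiplied by how long each task runs. The function name is made up for illustration. Note that Celery writes this limit as `rate_limit="4/m"` on a task, and that Celery applies the limit per worker instance rather than globally, so the real ceiling depends on how many workers pick up the task.

```python
def expected_concurrency(tasks_per_minute, task_minutes):
    """Estimate the average number of tasks running at once.

    Hypothetical helper: average in flight = start rate * task duration.
    """
    return tasks_per_minute * task_minutes


# 4 tasks started per minute, each taking about 1 minute:
# roughly 4 run in parallel.
print(expected_concurrency(4, 1.0))

# The estimate only holds if the duration guess holds. If tasks actually
# take 3 minutes, the same limit lets about 12 pile up.
print(expected_concurrency(4, 3.0))
```

So a rate limit only approximates a concurrency cap, and it drifts whenever task duration drifts.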
Adding to third sprint so it doesn't fall off the radar. We can discuss during planning or before.
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: u=dev c=search s=2012.3 p=
It'll probably be a 1/(ln x) falloff as our use of ES stabilizes. Any time we change the mapping, we'll generally need to do a full reindex. We might not always have to delete the old index (and thus have downtime), if the new code is backward-compatible. We'd need to tweak the reindexing code to support such an "incremental overwrite" mode. Parallelization becomes important when we have a downtime-needing reindex that takes forever single-threaded.
For future reference, a single-threaded index run in production...

* affects the ES cluster graphs to an undetectable degree
* raises mysql_innodb_active_transactions and mysql_innodb_current_transactions by maybe 2 on one of the slaves. The rest of the slaves appear unaffected. (I don't know offhand if Django keeps the same connection open for the whole job, but that would explain it.)

Digging deeper into the (questionably) affected slave (https://ganglia.mozilla.org/phx1/?r=day&c=SUMO+Databases&h=support4.db.phx1.mozilla.com&mc=4), I see no obvious bumps in any of the graphs beginning at the time reindexing started (11:17 Pacific today).

The upshot: if we decide parallelizing is worthwhile, we don't have to be afraid of overloading anything.
The other thing to consider is that we could pretty straightforwardly have 2 indexes—one old and one new—and build the new while the old putts along. Then we can atomically switch indexes using ES aliases. This would all, of course, be a lot easier with continuous deployment.
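For reference, a minimal sketch of what the alias switch could look like. Elasticsearch's `_aliases` endpoint takes a list of actions and applies them as one atomic operation, so searches never see a moment with no index behind the alias. The alias and index names below are invented for illustration, and this only builds the request body; sending it (a POST to `/_aliases`) is left out.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for an atomic alias switch from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the app always searches "sumo", which points at
# whichever dated index is currently live.
body = alias_swap_actions("sumo", "sumo_old", "sumo_new")
print(json.dumps(body, indent=2))
```

The new index would be fully built before this body is sent, and the old one can be deleted afterwards.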
Whiteboard: u=dev c=search s=2012.3 p= → u=dev c=search s=2012.4 p=
Whiteboard: u=dev c=search s=2012.4 p= → u=dev c=search s=2012.4 p=2
We've done a few things in the last week that improve indexing a bunch. My vote is we push this off until we need it.

Ipso facto, we should remove it from the 2012.4 sprint.
We may need to revisit this when we add the wiki->html rendering. Otherwise, we don't need this complexity at the moment, with indexing taking less than twenty-some minutes right now.
Whiteboard: u=dev c=search s=2012.4 p=2 → u=dev c=search s= p=2
I'm throwing this in the 2012.10 sprint. One of the things that's likely to happen when Matt starts doing search work is that we might have to reindex more often. Given that, making indexing take less time will save us a bunch of time in the long run.

The gist of it is to change things so that get_indexable() returns the min and max id of the list to search and push the "should i index/unindex this?" handling code to the index() and unindex() methods. After doing that, we can use ComposedList() to figure out the chunks and create 3 or 4 celery tasks that take a chunk, expand that into a list of ids by model, and then index those ids.

The only thing I don't know what to do about offhand is the admin progress bar. I think I might have each task report progress in a separate redis thing and show multiple progress bars.

Pretty sure it's between a small and a medium task in size, so I'll make it a 2 pointer.
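A minimal sketch of the chunking idea described above, assuming each model reports a (min id, max id) pair. It does not use ComposedList() or the real get_indexable(); the helper names, model names, and id ranges are all made up for illustration. Ids in a range that don't exist or shouldn't be indexed would be skipped later by index()/unindex().

```python
def chunk_id_range(min_id, max_id, chunk_size):
    """Yield inclusive (start, end) id ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield (start, end)
        start = end + 1


def split_among_tasks(chunks, num_tasks):
    """Deal chunks round-robin into at most num_tasks non-empty batches."""
    batches = [[] for _ in range(num_tasks)]
    for i, chunk in enumerate(chunks):
        batches[i % num_tasks].append(chunk)
    return [batch for batch in batches if batch]


# Hypothetical (min_id, max_id) per model.
indexable = {"question": (1, 2500), "document": (1, 900)}

chunks = [
    (model, start, end)
    for model, (lo, hi) in indexable.items()
    for start, end in chunk_id_range(lo, hi, 1000)
]

# Each batch would become the argument of one celery task.
batches = split_among_tasks(chunks, 3)
for batch in batches:
    print(batch)
```

Round-robin dealing keeps the batches roughly even in chunk count, though not necessarily in work, since some id ranges may be sparse.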
Assignee: nobody → willkg
Whiteboard: u=dev c=search s= p=2 → u=dev c=search p=2
Target Milestone: 2012Q1 → 2012.10
Making this a P1. It's important that we get this done.
Priority: -- → P1
I'm almost done. Parallel indexing works, but I need to make it clearer on the admin page that an indexing batch has been kicked off, and make it less likely that we accidentally kick off two indexing batches at the same time. I've got code for the latter already, but it's flaky and the gaps are too big. If celery is behind, it's too easy to screw things up. So I want to rewrite it to be more reliable.

Anyhow, it's getting there. Probably needs another day of work.
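One common way to guard against two batches starting at once is an atomic "add if absent" on a shared key. The sketch below is not the code that landed; it uses an in-memory stand-in for the shared store, and every name in it is hypothetical. In a real deployment the store would need to be something all web and celery processes share (Django's `cache.add()` has the same add-if-absent return semantics).

```python
import threading


class InMemoryStore:
    """Stand-in for a shared store with an atomic add-if-absent operation."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, value):
        """Store value only if key is absent. Return True if it was stored."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)


REINDEX_KEY = "reindex-in-progress"  # hypothetical key name


def try_start_reindex(store, batch_id):
    """Claim the reindex slot. Return False if a batch is already running."""
    return store.add(REINDEX_KEY, batch_id)


def finish_reindex(store):
    """Release the slot once the last task of the batch is done."""
    store.delete(REINDEX_KEY)


store = InMemoryStore()
print(try_start_reindex(store, "batch-1"))  # first claim succeeds
print(try_start_reindex(store, "batch-2"))  # second claim is refused
finish_reindex(store)
```

Because the check and the write happen in one step, there is no gap between "is a batch running?" and "mark a batch as running" for a second request to slip through.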
Priority: P1 → --
Rrr... somehow I nixed the priority when I commented last. I'm not sure why that sort of thing keeps happening, but it's irritating.
Priority: -- → P1
Landed in master in bb8348f19b6a3d8009bd6f54e323e4db0dca7a41 .
Pushed to production.

Reindexing went from around 50 minutes to around 20 minutes with the changes. That's pretty cool, plus it'll scale better as the number of questions continues to increase.

Marking as FIXED.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED