720935 - Parallelize elastic indexing

Reporter

Description

•

12 years ago

We'll have to split this up into multiple celery tasks to index in a reasonable amount of time if we want to do all the wiki rendering needed by bug 710469. Python is already the bottleneck during indexing (on my laptop), so we could probably go to at least a few processes without wrecking our DB servers. How to predict this? Load graphs?

Celery doesn't seem to have any per-task concurrency controls built in, but we could make some out of thingies that start more tasks from the end of finishing tasks. (We don't want to fill all 32 workers with long-running indexing tasks and starve everything else out.)
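A minimal sketch of the "start more tasks from the end of finishing tasks" idea, in plain Python with threads standing in for celery workers. The names `run_capped` and `index_chunk` are hypothetical, not kitsune or celery APIs. Only `max_parallel` chains are started, and each chain picks up its next chunk only after finishing the current one, so the number of chunks in flight never exceeds the cap.

```python
import queue
import threading


def run_capped(chunks, index_chunk, max_parallel=4):
    """Process chunks with at most max_parallel in flight at once.

    Threads stand in for celery workers (an assumption for this sketch).
    Returns the finished chunks and the peak number seen running together.
    """
    pending = queue.Queue()
    for chunk in chunks:
        pending.put(chunk)

    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "finished": []}

    def chain():
        while True:
            try:
                chunk = pending.get_nowait()
            except queue.Empty:
                return  # nothing left, so this chain ends
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                index_chunk(chunk)
            finally:
                with lock:
                    state["running"] -= 1
                    state["finished"].append(chunk)

    threads = [threading.Thread(target=chain) for _ in range(max_parallel)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["finished"], state["peak"]
```

With real celery tasks the same shape would mean seeding a few tasks and having each one enqueue its successor as its last step; that part is not shown here.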

Erik Rose [:erik][:erikrose]

Reporter

Updated

•

12 years ago

Target Milestone: --- → 2012Q1

James Socol [:jsocol, :james]

Comment 1

•

12 years ago

Celery does have a rate limit. If you experimentally found that each task took about 1 minute, you could set the rate limit to 4/min, to essentially let 4 run in parallel.
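The arithmetic behind that suggestion, as a rough sketch: average tasks in flight is about start rate times task duration. The function names are made up for illustration. It is also worth checking the celery docs before relying on this, since the rate limit is documented as applying per worker instance rather than globally.

```python
def expected_parallelism(tasks_per_minute, minutes_per_task):
    """Average number of tasks in flight for a given start rate and duration."""
    return tasks_per_minute * minutes_per_task


def tasks_per_minute_for(target_parallel, minutes_per_task):
    """Start rate needed to keep roughly target_parallel tasks in flight."""
    return target_parallel / minutes_per_task


# The case from the comment: 1-minute tasks started at 4 per minute.
in_flight = expected_parallelism(4, 1)

# If tasks turn out to take 2 minutes, the rate has to be halved
# to keep the same 4 in flight.
slower_rate = tasks_per_minute_for(4, 2)
```

This only holds on average; if task durations vary a lot, the actual number running at once will drift above and below the target.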

James Socol [:jsocol, :james]

Comment 2

•

12 years ago

Of course, how often, if ever, will we run the full index in production once we're past the testing phase?

Ricky Rosario [:rrosario, :r1cky]

Comment 3

•

12 years ago

Adding to third sprint so it doesn't fall off the radar. We can discuss during planning or before.

OS: Mac OS X → All

Hardware: x86 → All

Whiteboard: u=dev c=search s=2012.3 p=

Erik Rose [:erik][:erikrose]

Reporter

Comment 4

•

12 years ago

It'll probably be a 1/(ln x) falloff as our use of ES stabilizes. Any time we change the mapping, we'll generally need to do a full reindex. We might not always have to delete the old index (and thus have downtime), if the new code is backward-compatible. We'd need to tweak the reindexing code to support such an "incremental overwrite" mode. Parallelization becomes important when we have a downtime-needing reindex that takes forever single-threaded.

Erik Rose [:erik][:erikrose]

Reporter

Comment 5

•

12 years ago

For future reference, a single-threaded index run in production...

* affects the ES cluster graphs to an undetectable degree
* causes an increase of maybe 2 in my_sql_innodb_active_transactions and mysql_innodb_current_transactions on one of the slaves. The rest of the slaves appear unaffected. (I don't know offhand if Django keeps the same connection open for the whole job, but that would explain it.)

Digging deeper into the (questionably) affected slave (https://ganglia.mozilla.org/phx1/?r=day&c=SUMO+Databases&h=support4.db.phx1.mozilla.com&mc=4), I see no obvious bumps in any of the graphs beginning at the time reindexing started (11:17 Pacific today).

The upshot: if we decide parallelizing is worthwhile, we don't have to be afraid of overloading anything.

Erik Rose [:erik][:erikrose]

Reporter

Comment 6

•

12 years ago

The other thing to consider is that we could pretty straightforwardly have 2 indexes—one old and one new—and build the new while the old putts along. Then we can atomically switch indexes using ES aliases. This would all, of course, be a lot easier with continuous deployment.
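A sketch of what the alias switch could look like. The request body follows the Elasticsearch `_aliases` endpoint, which applies all listed actions in one atomic step; the alias and index names below are hypothetical, and building the new index is not shown.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that repoints alias in one step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: searches read from the "sumo" alias while
# "sumo_v2" is built in the background, then the alias is repointed.
body = json.dumps(alias_swap_actions("sumo", "sumo_v1", "sumo_v2"))
```

Search code would query the alias rather than a concrete index name, so the switch needs no code change or restart on the reading side.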

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Updated

•

12 years ago

Blocks: 722359

Ricky Rosario [:rrosario, :r1cky]

Updated

•

12 years ago

Whiteboard: u=dev c=search s=2012.3 p= → u=dev c=search s=2012.4 p=

Ricky Rosario [:rrosario, :r1cky]

Updated

•

12 years ago

Whiteboard: u=dev c=search s=2012.4 p= → u=dev c=search s=2012.4 p=2

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 7

•

12 years ago

We've done a few things in the last week that improve indexing a bunch. My vote is we push this off until we need it.

Ipso facto, we should remove it from the 2012.4 sprint.

Ricky Rosario [:rrosario, :r1cky]

Comment 8

•

12 years ago

We may need to revisit when we add the wiki->html. Otherwise, we don't need this complexity at the moment, with indexing taking less than twenty-some minutes right now.

Whiteboard: u=dev c=search s=2012.4 p=2 → u=dev c=search s= p=2

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 9

•

12 years ago

I'm throwing this in the 2012.10 sprint. One of the things that's likely to happen when Matt starts doing search work is that we might have to reindex more often. Given that, making indexing take less time will save us a bunch of time in the long run.

The gist of it is to change things so that get_indexable() returns the min and max id of the list to search and push the "should I index/unindex this?" handling code to the index() and unindex() methods. After doing that, we can use ComposedList() to figure out the chunks and create 3 or 4 celery tasks that take a chunk, expand that into a list of ids by model, and then index those ids.
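A simplified sketch of the chunking step, assuming a single model and an inclusive id range. `chunk_id_range` is a hypothetical helper; it does not use ComposedList() and leaves out the per-model expansion. Ids inside a chunk may not exist or may not be indexable, which is why the filtering would live in index() and unindex().

```python
def chunk_id_range(min_id, max_id, num_chunks):
    """Split the inclusive range [min_id, max_id] into (start, end) pairs.

    Returns at most num_chunks pairs of roughly equal size.
    """
    if max_id < min_id or num_chunks < 1:
        return []
    total = max_id - min_id + 1
    size = -(-total // num_chunks)  # ceiling division
    return [
        (start, min(start + size - 1, max_id))
        for start in range(min_id, max_id + 1, size)
    ]


# Each pair would become the argument of one indexing task.
chunks = chunk_id_range(1, 10, 4)
```

Passing only the two endpoints keeps each task message small, and the task can expand its range into ids when it runs.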

The only thing I don't know what to do about offhand is the admin progress bar. I think I might have each task report progress in a separate redis thing and show multiple progress bars.

Pretty sure it's between a small and a medium task in size, so I'll make it a 2 pointer.

Assignee: nobody → willkg

Whiteboard: u=dev c=search s= p=2 → u=dev c=search p=2

Target Milestone: 2012Q1 → 2012.10

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 10

•

12 years ago

Making this a P1. It's important that we get this done.

Priority: -- → P1

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 11

•

12 years ago

I'm almost done. Parallel indexing works; however, I need to adjust it to make it clearer from the admin page that an indexing batch has been kicked off, and to make it less likely that we accidentally kick off two indexing batches at the same time. I've got code for the latter already, but it's flaky and the gaps are too big. If celery is behind, it's too easy to screw things up. So I want to rewrite it to be more reliable.

Anyhow, it's getting there. Probably needs another day of work.
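One way the guard against two simultaneous batches could be shaped, as a sketch only and not the code in the pull request. `BatchLock` is a hypothetical in-memory stand-in; a real version would need an atomic set-if-absent in a store shared by the web and celery processes. The expiry is there so that a batch that dies does not block reindexing forever.

```python
import time


class BatchLock:
    """In-memory stand-in for a lock kept in a shared store."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._holder = None
        self._expires_at = 0.0

    def acquire(self, batch_id):
        """Claim the lock unless another batch holds an unexpired claim."""
        now = self._clock()
        if self._holder is not None and now < self._expires_at:
            return False
        self._holder = batch_id
        self._expires_at = now + self._ttl
        return True

    def release(self, batch_id):
        """Release the lock, but only if batch_id is the current holder."""
        if self._holder == batch_id:
            self._holder = None
```

The admin view would call acquire() before queuing any tasks and refuse to start if it returns False; the last task to finish would call release().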

Priority: P1 → --

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 12

•

12 years ago

Rrr... somehow I nixed the priority when I commented last. I'm not sure why that sort of thing keeps happening, but it's irritating.

Priority: -- → P1

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 13

•

12 years ago

In a pull request: https://github.com/mozilla/kitsune/pull/623

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 14

•

12 years ago

Landed in master in bb8348f19b6a3d8009bd6f54e323e4db0dca7a41 .

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 15

•

12 years ago

Oops. In github url-speak, that's https://github.com/mozilla/kitsune/commit/bb8348f19b6a3d8009bd6f54e323e4db0dca7a41

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 16

•

12 years ago

Pushed to production.

Reindexing went from around 50 minutes to around 20 minutes with the changes. That's pretty cool, plus it'll scale better as the number of questions continues to increase.

Marking as FIXED.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Categories

(support.mozilla.org :: Search, defect, P1)

Tracking

(Not tracked)

People

(Reporter: erik, Assigned: willkg)

References

Details

(Whiteboard: u=dev c=search p=2)

Security

(public)
