Closed Bug 825031 Opened 12 years ago Closed 11 years ago

change indexing task chunk size

Categories

(support.mozilla.org :: Search, defect, P4)

Tracking

(Not tracked)

RESOLVED FIXED
2013Q1

People

(Reporter: willkg, Assigned: willkg)

Details

(Whiteboard: u=sumo-team c=search p= s=)

Currently the indexing task chunk size is 50,000, which means everything that needs to be indexed gets split into chunks of 50,000 items. Right now that creates 6 celery tasks for reindexing everything.

Recently, we've upgraded mysql and elastic, so it probably makes sense to revisit the chunk size.

If we changed it to 40,000, that'd be 7 tasks: 1 kb, 1 forums, 5 questions.

If we changed it to 30,000, that'd be 9 tasks: 1 kb, 1 forums, 7 questions.
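
For illustration, here's a quick sketch of the arithmetic. The per-doctype item counts are made-up placeholders chosen to be consistent with the task counts above, not real production numbers:

import math

# Hypothetical per-doctype document counts (placeholders, not real data).
DOC_COUNTS = {"kb": 20000, "forums": 25000, "questions": 190000}

def task_count(chunk_size):
    # One celery task per chunk, so each doctype contributes
    # ceil(items / chunk_size) tasks.
    return sum(math.ceil(n / chunk_size) for n in DOC_COUNTS.values())

for size in (50000, 40000, 30000):
    print(size, task_count(size))
# -> 50000 6, 40000 7, 30000 9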
I can't tell the upsides/downsides of this or the time requirements. Would this parallelize indexing and bring the total indexing time down? Could someone elaborate?

This is an engineering thing.

Generally speaking, as long as the number of tasks is lower than the number of available celery workers, all the tasks will run at once; each task will be smaller, so indexing will take less time overall.

It's an easy change to make in the code. It's probably a half-day project because we'll need to test it out in production and probably want to see whether the index-time numbers change.
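
To make the parallelism concrete, here's a hypothetical sketch of the dispatch pattern (not the actual kitsune code; index_chunk, reindex, and CHUNK_SIZE are invented names):

from celery import shared_task

CHUNK_SIZE = 50000  # the current value; this bug proposes lowering it

def chunked(ids, size):
    # Yield successive size-item slices of ids.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

@shared_task
def index_chunk(doc_ids):
    # Stand-in for the real per-chunk work: pull the rows from mysql,
    # serialize them, and bulk-index into ES.
    pass

def reindex(all_ids):
    # One task per chunk. With enough free workers, all chunks run at
    # once, so smaller chunks mean a shorter wall-clock reindex.
    for chunk in chunked(list(all_ids), CHUNK_SIZE):
        index_chunk.delay(chunk)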
Thanks for the info, Will. Adding this to the next sprint.
Priority: -- → P4
Whiteboard: u=sumo-team c=search p= s=2013.2
Target Milestone: --- → 2013Q1
In pull request: https://github.com/mozilla/kitsune/pull/1049
Assignee: nobody → willkg
Decided to try 20k for now. Landed in master in https://github.com/mozilla/kitsune/commit/fbfc5a10c6b777f7aadb91c7e3788343aed93071

Next:

1. deploy to staging
2. run non-destructive reindex on staging (which might make staging sad because it doesn't have a good ES cluster. we might want to make the chunk size configurable, but that's a project for another bug; there's a sketch of that at the end of this comment.)
3. deploy to production
4. run non-destructive reindex on production while watching ganglia

If ganglia doesn't show a massive spike and the indexing times are lower than the previous reindexing, then we should stay at 20k. Otherwise, we'll adjust as necessary.
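
As a sketch of the "make the chunk size configurable" idea from step 2 (hypothetical code; ES_INDEX_CHUNK_SIZE is an invented setting name, not something that exists in kitsune):

from django.conf import settings

# Fall back to the new default when the setting isn't defined, so staging
# could override it with a smaller value its weaker ES cluster can handle.
CHUNK_SIZE = getattr(settings, "ES_INDEX_CHUNK_SIZE", 20000)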
Pushed it to stage and reindexed there. During the reindexing, phx1 got very sad and lonely and all kinds of problems ensued. It had done enough indexing that I decided it was worth pushing to prod and testing there, so I did.

Previous indexing pass (50k chunk size) had 7 tasks and took 24 minutes. This indexing pass (20k chunk size) had 13 tasks and took 18 minutes. It's possible it took longer than it normally will because phx1 was still recovering.

I watched the mysql and ES graphs and they look fine. The biggest surge was on the mysql network graph, but since all the tasks execute in parallel, the number of tasks shouldn't affect network load.

Ergo, we're going to leave it at 20k.

As a side note, when we update elasticutils to a version that uses pyelasticsearch, I suspect we'll see some performance wins there.

Also, as a clarification, this work ONLY affects indexing from the admin. It doesn't affect indexing from cron jobs or normal site activity.

Marking as FIXED.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Taking this out of the 2013.2 sprint since it's done now.
Whiteboard: u=sumo-team c=search p= s=2013.2 → u=sumo-team c=search p= s=