Closed Bug 767949 Opened 13 years ago Closed 12 years ago

[tracker] Search needs minor TLC to stay in good shape

Categories

(Participation Infrastructure :: Phonebook, defect)

defect
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: glob, Unassigned)

Details

Searching for anything always returns zero results, e.g. https://mozillians.org/en-US/search?q=glob should find me.
I can confirm this. This is not happening on stage. https://mozillians.allizom.org/en-US/search?q=glob
Both stage and prod should be at the same tip: commit hash 1c9d0a5230 - https://github.com/mozilla/mozillians/tags https://mozillians.allizom.org/media/revision_info.txt Chris, is commit 1c9d0a5230 the one you see as the latest tip for prod? If not, what is the latest tip/tag used?
Depends on: 768008
Filed bug 768008 to look at the ES server health.
OK, so we've found out a bit more:

A search for "james" brings up me.
A search for "aakash" or "glob" doesn't bring up anyone. (Both should bring up one person.)
A search for "byron" doesn't bring up anyone (should bring up glob).
A search for "bmo" should bring up several people, but brings up edmorley.
A search for all unvouched users works.
A search for all users with photos works.

So... something isn't getting indexed, or a mapping is wrong, or it isn't getting searched.

UserProfile.search: https://github.com/mozilla/mozillians/blob/2012-06-21/apps/users/models.py#L188
UserProfile index tasks: https://github.com/mozilla/mozillians/blob/2012-06-21/apps/users/models.py#L250
/search view: https://github.com/mozilla/mozillians/blob/2012-06-21/apps/phonebook/views.py#L149

Help us, WillKG, you're our only hope! (Seriously I don't know enough about ElasticUtils to debug this quickly.)
OS: Mac OS X → All
Hardware: x86 → All
A search for 'j' seems to bring up lots of people on both prod and -dev.
Summary: search always returns no results → Search not returning expected results
Matt mentioned getting an error email with a TimeoutError of 1 second so I looked at that first. Looks like Mozillians is using elasticutils default timeout which is 1 second for both querying and indexing. For comparison, with SUMO we use 5 seconds for querying and 30 seconds for indexing. This is a likely candidate for causing the problems you're seeing in this bug. Where do index task errors go? One thing you could do is have someone in IT reindex. There's a cron job called "index_all_profiles". I think if you run that in production, it'll try to reindex everything. It'll use a 1 second time out for indexing, though, so it's possible that a bunch of those indexing tasks will error out. But it looks like it's non-destructive (i.e. it doesn't wipe the index and then reindex), so I think even if a bunch error out, your index will be less stale and that's good.
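To make the timeout comparison concrete, here is a minimal sketch of keeping separate timeouts for querying and indexing. The 1 second default and the 5 and 30 second values are the figures quoted above; the names `DEFAULT_TIMEOUT`, `TIMEOUTS` and `timeout_for` are hypothetical and are not elasticutils or Mozillians settings.

```python
# Hypothetical illustration only: these names are not elasticutils settings.
DEFAULT_TIMEOUT = 1  # seconds; the elasticutils default described above

# Per-operation timeouts in seconds, using the SUMO values quoted above.
TIMEOUTS = {
    "query": 5,
    "index": 30,
}


def timeout_for(operation):
    """Return the timeout for an operation, falling back to the default."""
    return TIMEOUTS.get(operation, DEFAULT_TIMEOUT)


if __name__ == "__main__":
    for op in ("query", "index", "delete"):
        print(op, timeout_for(op))
```

The point of the split is that indexing a batch of documents can reasonably take much longer than answering a single query, so one shared short timeout tends to penalize indexing first.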
Target Milestone: --- → 2012-06-27
I opened up bug #768230 with IT to <insert verb here> the celery logs for Mozillians in production.
My thoughts
===========

I looked at the logs that I got from bug #768230. The log starts in February 2012. There are 25,868 indexing tasks of which 455 failed with a TimeoutError since the log started. If you just look at June 2012, there are 12,506 indexing tasks of which 324 failed with a TimeoutError. That's not a huge percentage, but it's still not good.

Fixing this is actually kind of hard because mozillians uses the SearchMixin in elasticutils. So you have to make the changes in elasticutils and then pull the new version into mozillians. We worked around these issues in SUMO because we have a separate branch. I haven't had a chance to fix it in elasticutils proper, yet, but hope to soon. Fixing it here is probably a couple of days of work.

That covers the TimeoutError.

Searching for ircname probably doesn't work well since I'm pretty sure the field is getting analyzed, but it's doing a term query rather than a text query. Pretty sure that's a mismatch and problematic. I don't see any tests for it, so I can't tell for sure.

I think it's bad that the index mappings aren't declared explicitly. It's letting ES infer everything and use the defaults. For example, it's using the default string analyzers which I think is set in the elasticsearch cluster configuration. I have no idea what it is set to. In SUMO we explicitly set it to "snowball". The analyzer handles parsing, stop words, ...

Where to go from here
=====================

We should get IT to set up a cronjob that kicks off in the wee hours of the morning that reindexes all the profiles.

    ./manage.py cron index_all_profiles

What that does is splits the entire list of profiles into chunks of 150 and sends each chunk to an indexing task. If a task fails with a TimeoutError, it'll try again the next night. I think this will alleviate the indexing problems "good enuf" for now.

We should write up bugs covering the above problems so that we have them in a queue to work on.
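The chunking described above can be sketched roughly as follows. This is not the actual `index_all_profiles` code: `chunked`, `index_chunk` and `index_all` are hypothetical names, and `index_chunk` is a stand-in that only reports the chunk size instead of queueing a celery task.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_chunk(profile_ids):
    """Stand-in for the real indexing task; here it only reports the size."""
    return len(profile_ids)


def index_all(profile_ids, chunk_size=150):
    """Dispatch one (hypothetical) indexing task per chunk; return the sizes."""
    return [index_chunk(chunk) for chunk in chunked(profile_ids, chunk_size)]


if __name__ == "__main__":
    sizes = index_all(list(range(400)))
    print(sizes)  # [150, 150, 100]
```

Because each chunk is handled independently, a timeout in one chunk leaves the others unaffected, which is why a nightly rerun can pick up whatever failed the night before.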
Beyond that, Rob Hudson and I are working on making elasticutils better based on our experiences with AMO and SUMO. That work will help Mozillians in the future, but will need someone to spend some time reworking search in Mozillians to use new versions of elasticutils.
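For reference, the failure percentages implied by the log counts quoted above work out as follows. The counts come from this comment; the helper name `failure_rate` is only for illustration.

```python
def failure_rate(failed, total):
    """Return the share of failed tasks as a percentage."""
    return 100.0 * failed / total


# Counts quoted from the celery logs in this comment.
all_time = failure_rate(455, 25868)   # since the log started (February 2012)
june = failure_rate(324, 12506)       # June 2012 only

if __name__ == "__main__":
    print("since log start: %.2f%%" % all_time)  # 1.76%
    print("June 2012:       %.2f%%" % june)      # 2.59%
```

So the June rate is noticeably higher than the rate over the whole log, even though both are in the low single digits.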
Also, depending on the search requirements, I think this is a good candidate for using django-haystack rather than elasticutils. I haven't seen anything so far that suggests this won't work with django-haystack. Switching this over to django-haystack is probably an easier project than fixing issues in elasticutils and getting mozillians to work with new elasticutils versions.
Created bug #768531 to look into using django-haystack instead. Created bug #768536 to change the ES timeouts to something reasonable. Created bug #768539 to look at changing the analyzer for ircname, lastname and firstname. Created bug #768541 to create a cronjob that reindexes everything every night. I think that covers everything here.
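The suspected term-versus-text mismatch on analyzed fields can be illustrated with a toy model. This is not ElasticSearch or elasticutils code: the analyzer here is an assumption (lowercase and split on whitespace), real analyzers do more, and all function names and sample values are hypothetical.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (real analyzers do more)."""
    return text.lower().split()


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, value in docs.items():
        for token in analyze(value):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, term):
    """Look the term up exactly as given, with no analysis of the query."""
    return index.get(term, set())


def text_query(index, text):
    """Analyze the query the same way as the field, then look up each token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits


if __name__ == "__main__":
    index = build_index({1: "WillKG", 2: "glob"})
    print(term_query(index, "WillKG"))  # set(): the indexed token is "willkg"
    print(text_query(index, "WillKG"))  # {1}
```

In this model the unanalyzed term lookup only matches when the query happens to equal the stored token exactly, which would be consistent with some names being found and others not.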
I ran all the searches listed in comment #4 and they all look fine to me now. Also, I have another theory as to why search sucked: I bet no one has reindexed all the data in production in a while. So any changes that were made to the data or how it was structured probably never made it to the index. Also also, it's probably the case that there are thousands of mozillians right now. As that number grows, we'll hit a point where running a full reindexing in a cronjob is a terrible idea. Going forward, someone should monitor how many mozillians there are and the celery job queue. Maybe it makes sense to just draw a line and say, "Once we hit 50K mozillians, we should change how this works."
I'm going to make all the bugs in comment 10 blockers here and call this a tracker for fixing search in general, but for now, thanks to Will, things are back up and running, as far as I can tell. Thanks so much, Will!
Depends on: 768531, 768536, 768539
Summary: Search not returning expected results → [tracker] Search needs minor TLC to stay in good shape
Target Milestone: 2012-06-27 → ---
We've done quite a bit of work on ElasticSearch and mozillians over the last 6 months. We closed the bugs blocking this one (and other search-related ones), and search is now working nicely. I'm closing this one as resolved fixed. Thanks for the useful info everyone!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
TLC has been administered - thanks glob! QA verified, prod is healthy
Status: RESOLVED → VERIFIED