Last Comment Bug 400986 - Optimize search performance
: Optimize search performance
Status: RESOLVED FIXED
:
Product: addons.mozilla.org Graveyard
Classification: Graveyard
Component: Public Pages (show other bugs)
: unspecified
: All All
: -- critical
: 3.4.4
Assigned To: Chris Pollett
:
Mentors:
Depends on: 401251 433741 438370
Blocks: 401849 404664 411650
  Show dependency treegraph
 
Reported: 2007-10-24 10:46 PDT by Michael Morgan [:morgamic]
Modified: 2016-02-04 14:51 PST (History)
11 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
patch to optimize search (18.02 KB, patch)
2008-06-09 18:39 PDT, Chris Pollett
no flags Details | Diff | Review
revised patch implementing clouserw's suggestions (18.92 KB, patch)
2008-06-10 16:26 PDT, Chris Pollett
laura: review-
Details | Diff | Review
attempt at addressing laura's concerns (18.04 KB, patch)
2008-06-10 22:50 PDT, Chris Pollett
wclouser: review+
laura: review+
Details | Diff | Review

Description Michael Morgan [:morgamic] 2007-10-24 10:46:41 PDT
Search has been destroying the database lately during peak traffic hours (esp during release) and we need to evaluate our MySQL indexing and see if there are ways to optimize our indexing.  Also any wins in the code itself would be great here.

Laura and Fred can you take a look at the search script and see what you can find for next Wed?
Comment 1 Fred Wenzel [:wenzel] 2007-10-24 14:18:47 PDT
Okay, when looking at the current code, I want to make several remarks/suggestions for improvement:

- We should discuss if we want to combine this optimization with algorithm improvements that we wanted to do anyway, so that the search results get better. Otherwise, we may need to start doing the optimization over again when we change the algorithm later. Since currently we just run a simple binary match over three fields, I suggest we at least do some improvements, like rating matches in the title higher.

- We do a lot of JOINs. Joins are slow, so maybe we should split up the single query into three pieces, one for each field searched (title, summary, description). However, merging the results may not be trivial.

- If I am not mistaken, we return an entire result set (IDs only) rather than only one page. While the app can handle a big array of integers, having the DB return more rows than necessary in the first place should be avoided. Problem: We need the final result count even so, in order to provide accurate page counts.

- The localized_string column (which is what is searched for localized strings, no matter which one of the three fields we are talking about) is currently not indexed, because it is of type TEXT which is too big for a usable index. In order to change this, we should investigate MySQL's FULLTEXT index (and search!) capabilities (also performance-wise).
Advantage: This would give us a nice floating point rating (relevance measure) that we could sort by (which is *leagues* better than a binary rating). And, it would certainly be faster/less hard on the DB than the current string-level matching.
Disadvantages: FULLTEXT indexes are not supported by the current InnoDB table type. We'd need to use MyISAM. If we require InnoDB (do we?), we can't do this. Also, I am not sure how this will behave performance-wise, but as I said, my uneducated guess is it'll be easier on the DB.


Thoughts?
Comment 2 Laura Thomson :laura 2007-10-26 07:08:16 PDT
First line of attack is to look at the logs and do some forensics.  Let's see 
1.  If Apache stayed up during the peak period for 2.0.0.8.  I noticed when we had downtime that it looked like server downtime rather than db downtime.  
2.  If mysql stayed up.  If so, we should look at the slow query logs to see what was taking all the time.
3.  We should also look in the apache error logs and see what else, if anything, that gives us.

Comment 3 Laura Thomson :laura 2007-10-26 08:25:02 PDT
Apache logs from 21st onwards are on mpt-vpn in /data/logs.  Anyone know if the older ones get archived anywhere, or are they just gone?  (What's the retention policy?)

Also, where would we find the mysql production logs? I'd like to run mysql_explain_log or similar.  The dates for those are not critical as I'm guessing the search patterns don't vary wildly (although this assumption may prove to be incorrect, it at least gives us a reasonable starting point).  Explaining the logs will help us to see which common queries are not using indexes (or the right indexes).  We may also be able to tweak the query cache.


Comment 4 Wil Clouser [:clouserw] 2007-10-26 09:18:24 PDT
Adding Jeremy to the CC - he can help you find any logs you need.
Comment 5 Laura Thomson :laura 2008-01-10 09:28:52 PST
The current status of this is that it's to be done after the PHP5 migration
(which I don't believe we have a bug for).
Comment 6 Darren VanBuren 2008-03-04 18:30:15 PST
we do have a bug for PHP 5 migration: bug 394193
Comment 7 Michael Morgan [:morgamic] 2008-05-20 17:11:32 PDT
Hey Laura, is there a plan for where we're going?
Comment 8 Basil Hashem [:baz] 2008-05-20 17:53:22 PDT
I also have some specific thoughts on the algorithm that we use to rank. We need to consider and weigh:
- Name title match
- Popularity (weekly downloads & total downloads)
- Rating
- Description match

so that we get some reasonable results. Will need to be tuned for sure.

We also need to consider if we want to do mispellings, soundex, binary operators, "-text", etc... Probably worthy of a discussion soon.

E.g. right now devs are gaming search by including info in their description about compatibility with other add-ons and get higher rankings than they should including showing up above direct name hits (e.g. https://addons.mozilla.org/en-US/firefox/addon/4908).
Comment 9 Fred Wenzel [:wenzel] 2008-05-21 02:08:41 PDT
(In reply to comment #8)
> I also have some specific thoughts on the algorithm that we use to rank. We
> need to consider and weigh:
> - Name title match
> - Popularity (weekly downloads & total downloads)
> - Rating
> - Description match
> 
> so that we get some reasonable results. Will need to be tuned for sure.

In order not to reinvent the wheel we may need to use some existing search engine library (Lucene? Sphinx?), hopefully that'll work for us and be tweakable enough.

> We also need to consider if we want to do mispellings, soundex, binary
> operators, "-text", etc... Probably worthy of a discussion soon.

As a second step, that'd be great.

> E.g. right now devs are gaming search by including info in their description
> about compatibility with other add-ons and get higher rankings than they should
> including showing up above direct name hits (e.g.
> https://addons.mozilla.org/en-US/firefox/addon/4908).

I noticed that too. There are quite a bunch of add-ons that show up for searches that don't concern their add-ons for this reason. So that the "real results" don't drown in a pool, this bug is very important.
Comment 10 Fred Wenzel [:wenzel] 2008-05-21 02:09:35 PDT
> So that the "real results" don't drown in a pool, this bug is very important.

For extra credit, insert "of fake add-ons" where appropriate.
Comment 11 Laura Thomson :laura 2008-05-21 07:06:19 PDT
With Sphinx (and the MySQL built in FTS) the weighting it has built in for ranking lets you weight one search *term* more highly than another.  To implement the type of ranking Basil is talking about we'll have to work something out ourselves.  I'd suggest:
query for exact name hits and return those
query for name or description match, order by relevance, popularity, rating

https://bugzilla.mozilla.org/show_bug.cgi?id=378657 is a bit easier because we can just run the first query.  It doesn't require FTS.

(Posting this later comment over there as well.)
Comment 12 Chris Pollett 2008-06-09 18:39:29 PDT
Created attachment 324381 [details] [diff] [review]
patch to optimize search

The provided patch is designed to speed up search as well as implement full text searching. There are a couple of things that need to be done in order to test my code. I have added to remora.sql commands to create two new tables which will be used as materialized views. These are: text_search_view and versions_summary_view.

(I also deleted the create table for t_shirt_requests since that is no longer used for anything). 

I added to remora-test-data.sql the commands to initially populate these tables if you are using the test database. 

The script in bin, update-search-views.php, should be set up as a cron job to run every 20 minutes or so. It takes a 3-4 seconds to run. It essentially recomputes the whole views each time. In my testing the queries to do this were faster than trying to incrementally update these tables. (figuring out what has changed is slow). Once we have triggers (Mysql 5) though we should use them rather than a cron job.

Full text search by default works only on strings of 4 or more characters.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-fine-tuning.html
describes how this can be tweaked to shorter lengths.

My machine at home runs mysql 5, so I am not sure what will happen when my code is run on mysql 4. I suspect our code for 3.4.3 died before because we were trying to use subselect and mysql 4 doesn't like these much. My code avoids any subselects and limits the amounts of joins that need to be computed so should be reasoably fast. 

On my machine at home using the materialized view sped things up by between 5-20 times depending on what kind of searching was being done. I tested my code with the advanced search patch (Bug 372841) to test all of the additional bells and whistles of the search component. However, the patch I am posting here does not have any of the advanced search UI in it.
Comment 13 Wil Clouser [:clouserw] 2008-06-10 07:52:29 PDT
Comment on attachment 324381 [details] [diff] [review]
patch to optimize search

This feels much faster than what we have, nice work. (note: I don't have tests to back up my feeling)

Two things:

1) The default charset of the views needs to be utf8

2) Searching is restricted by locale.  Is this a feature?  My intuition says people who are searching for something don't necessarily want it restricted by the locale they are browsing in.  I'd be happy with a checkbox in the advanced search that would search all locales though if we wanted to default to just the current one.
Comment 14 Fred Wenzel [:wenzel] 2008-06-10 08:10:27 PDT
A note about your "most_recent_version" table: You just use the last created version, that however may not be the "last version" in our sense: It could be incomplete for example.
Comment 15 Chris Pollett 2008-06-10 09:41:27 PDT
Comment 13:
(1) I can set the default character set for the views to UTF-8. 
(2) Restricting by locale was to speed-up search, I could add a checkbox to advanced search for this

I will try to do this and addresss wenzel's stuff today.
Comment 16 Michael Morgan [:morgamic] 2008-06-10 09:45:33 PDT
We should get this on the test cluster and get comparative results between this and last prod update.
Comment 17 Wil Clouser [:clouserw] 2008-06-10 12:32:37 PDT
stephend can do a functional test on this if we land tonight.  Let's aim for that.
Comment 18 Chris Pollett 2008-06-10 16:26:12 PDT
Created attachment 324535 [details] [diff] [review]
revised patch implementing clouserw's suggestions

wenzel and I agreed on irc that the way I had it in the patch was consistent with the way we were doing things before. This new patch implements both of clouserw suggestions. It should not effect the default query (which restricts to current locale) and as the only other change was to make the tables use utf-8. I now add a parameter to allow one to search all locales. (Would need to connect this to advanced search at some point)
Comment 19 Laura Thomson :laura 2008-06-10 20:58:56 PDT
Comment on attachment 324535 [details] [diff] [review]
revised patch implementing clouserw's suggestions

Couple of things.

1.(isset($_SERVER['HTTP_HOST'])) 
is not a secure test.  Let's get the update script out of the document tree altogether.

2. The tshirt request stuff shouldn't be in this patch.

3.  most_recent_version and versions_summary_view should be InnoDB, not MyISAM.

4.  Let's not name these as views.  Shortly we'll have MySQL 5 and we'll actually have views, and I think this is really confusing maintainability wise.

5.  Let's get this on the load test cluster once you fix these nits.

6.  Why is the patch feedback textarea in bugzilla so small?  ;)
Comment 20 Wil Clouser [:clouserw] 2008-06-10 21:22:59 PDT
> 1.(isset($_SERVER['HTTP_HOST'])) 
> is not a secure test.  Let's get the update script out of the document tree
> altogether.

All of our other scripts use that test.  I'd rather not block changing them on this bug since we're trying to get it out asap.
Comment 21 Chris Pollett 2008-06-10 22:50:12 PDT
Created attachment 324581 [details] [diff] [review]
attempt at addressing laura's concerns

I passed on 1 per clouserw advice (I copied that part of the code from an earlier script).

2. I removed the t-shirt stuff from this patch.

3. I tried to switch these to InnoDB but InnoDB doesn't support text search. I guess our earlier conversation worrying about truncate is moot as to do text search we need these tables to be MyIsam tables.

4. I renamed these tables avoiding the word view.

5. Hopefully, will know the results tomorrow.

6. I agree it's too small.
Comment 22 Michael Morgan [:morgamic] 2008-06-10 22:53:01 PDT
(In reply to comment #20)
> > 1.(isset($_SERVER['HTTP_HOST'])) 
> > is not a secure test.  Let's get the update script out of the document tree
> > altogether.
> 
> All of our other scripts use that test.  I'd rather not block changing them on
> this bug since we're trying to get it out asap.
> 
Doesn't really matter, /bin is way outside the docroot anyway.
Comment 23 Laura Thomson :laura 2008-06-11 07:31:33 PDT
Some general followup: thanks for addressing those points Chris.

Re 1, I'm glad it's outside the document tree.  You can seed HTTP_HOST pretty easily with whatever value (or none) you like.  We should look at where else we use it in the code.

Re 3, I saw you had the main table with the FTI as MyISAM, but the other tables don't have a FTI and shouldn't need it - doesn't really matter.  You're right about the truncate though. 

I'm really excited to see how this performs in load test!  Great job Chris.
Comment 24 Chris Pollett 2008-06-12 17:56:40 PDT
checked into r15329
Comment 25 Stephen Donner [:stephend] 2008-08-08 15:26:56 PDT
oremj: we're no longer killing the DB, right?

Note You need to log in before you can comment on or make changes to this bug.