Closed Bug 447933 Opened 12 years ago Closed 11 years ago
Flat search for "better" doesn't find "better gmail" or "better gmail 2" add-ons
Sadly, I don't know if this is a regression; I hope this isn't a duplicate, either. A search for "better" doesn't yield either the "Better Gmail" or "Better Gmail 2" add-ons: https://addons.mozilla.org/en-US/firefox/search?q=better However, an explicit search for "better gmail" yields both: https://addons.mozilla.org/en-US/firefox/search?q=better+gmail&cat=all (I have added this testcase to my Selenium test suite.)
Chris could you take a look at this today?
Assignee: nobody → cpollett
sure. I'll take a look at it.
Status: NEW → ASSIGNED
I think this happens because "better" is a MySQL full-text stopword, so a search on it doesn't return anything.
Correct. "Better" is not indexed, so you can't find it. However, other words *containing* the stopword (composites such as "BetterGReader") will be found, which is the reason we are seeing 7 results here, all of which contain such composites. A more obvious example is searching for "the", which will mostly return *Themes*, as "theme" contains "the". However, "the" itself is a stopword, so it won't be indexed or matched.
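To make the behaviour described in the comment above concrete, here is a toy model in Python. It is only a sketch of the symptom as reported in this bug, not MySQL's actual matching algorithm; the stopword subset, the add-on list, and the function names are all assumptions made for illustration.

```python
# Toy model of the reported behaviour; NOT how MySQL is implemented.
# STOPWORDS is a tiny illustrative subset, not MySQL's real default list.
STOPWORDS = {"better", "the"}


def build_index(names):
    """Map each non-stopword token to the set of add-on names containing it."""
    index = {}
    for name in names:
        for token in name.lower().split():
            if token in STOPWORDS:
                continue  # stopwords never reach the index
            index.setdefault(token, set()).add(name)
    return index


def search(index, term):
    """Return names that have an indexed token containing `term`."""
    term = term.lower()
    hits = set()
    for token, names in index.items():
        if term in token:  # substring match, as described in the comment
            hits |= names
    return sorted(hits)


# Hypothetical data set built from names mentioned in this bug.
ADDONS = ["Better Gmail", "Better Gmail 2", "BetterGReader"]
INDEX = build_index(ADDONS)
```

In this model, `search(INDEX, "better")` only reaches the composite "BetterGReader", because the standalone token "better" was dropped at indexing time, while `search(INDEX, "gmail")` still finds both Better Gmail add-ons.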
I looked at the mysql sources. The default stopword list is English only. http://www.ranks.nl/stopwords/ has suggested stopword lists for other languages. You can only have one stopword list at a time. So we can give a warning when someone is using one of the default stopwords but the behaviour might seem a little funky if the locale is not English.
Why funky? If a word is not on the list, it will end up in the index, so for example the German articles "der", "die", and "das" will get indexed, and thus there's no need to warn users searching for them. I guess if "better" means something very special in another language, they'd be confused if the message said it's a common word, but we need to warn them either way.
OK, not really a comment for this bug, but... let's fix this temporarily if we can. We should really consider a more robust search solution (does one exist?) that allows us to manage separate stoplists and gives us flexibility in our ranking: spelling suggestions, special boolean searches, etc.
We may want to investigate http://www.sphinxsearch.com/features.html
The proposed patch prints a message like the following for each stop-word among the search terms: "better" as a frequently occurring word is not indexed except as a substring of longer words. The list of stop-words in the patch is a hard-coded array in search_controller, because the default MySQL stop-words are compiled into the MySQL executable and are not directly accessible by querying the DB. For the reasons mentioned in previous comments, English stop-words are used for all locales: MySQL supports only one stop-word file, and by default it uses English words.
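The approach in the comment above can be sketched roughly as follows. This is a Python illustration under assumptions, not the attached patch itself: the stopword set is a small stand-in for the hard-coded array, and `stopword_warnings` is a hypothetical name.

```python
# Illustrative stand-in for the hard-coded stop-word array in the patch.
# The real default list is much longer; this subset is an assumption.
DEFAULT_STOPWORDS = {"better", "the", "brief"}

WARNING_TEMPLATE = (
    '"{term}" as a frequently occurring word is not indexed '
    "except as a substring of longer words."
)


def stopword_warnings(query, stopwords=DEFAULT_STOPWORDS):
    """Return one warning per distinct stop-word among the search terms."""
    warnings = []
    seen = set()
    for term in query.lower().split():
        if term in stopwords and term not in seen:
            seen.add(term)  # warn only once per stop-word
            warnings.append(WARNING_TEMPLATE.format(term=term))
    return warnings
```

For the query `better gmail` this yields a single warning about "better"; a query without stop-words yields an empty list.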
Attachment #331584 - Flags: review?(fwenzel)
I'm sorry to say, just warning people not to use a list of words isn't going to be a good idea, even temporarily. I just posted bug 456206 a moment ago (thanks for duping; would never have guessed this one) and the problem there is that there is an extension named simply "Brief" which is apparently also a stop-word. Thus, it's impossible to search for this extension by name at all. (and a user can't install by name from FF3's Get Add-ons) A better solution is going to be needed here.
From bug 457952 comment 9: "Short term remedy: We strip stop words from the queries before executing them."
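A minimal sketch of that short-term remedy, again in Python and under assumptions (the stopword subset and the name `strip_stopwords` are hypothetical, not taken from the AMO code):

```python
# Illustrative subset only; MySQL's real default stopword list is longer.
STOPWORDS = {"better", "brief", "the"}


def strip_stopwords(query, stopwords=STOPWORDS):
    """Drop stop-words from a query string before it is sent to the DB."""
    kept = [term for term in query.split() if term.lower() not in stopwords]
    return " ".join(kept)
```

With this, `strip_stopwords("better gmail")` leaves `gmail`, which the index can match. A query consisting only of a stop-word, such as `Brief`, is reduced to an empty string, so this remedy alone would not make an add-on named "Brief" findable by name.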
That seems reasonable. One could probably also just get rid of stopwords altogether within the mysql config file without increasing DB load that much.
Comment on attachment 331584: patch to warn when someone is using stop words. Hi, Chris! Just r-ing this, not because your code is bad but because according to the discussion here, warning people about stop words is not the way to go.
Attachment #331584 - Flags: review?(fwenzel) → review-
(In reply to comment #15)
> That seems reasonable. One could probably also just get rid of stopwords
> altogether within the mysql config file without increasing DB load that much.
What concerns me is that this is a server-wide setting: *Every* project using these DBs and possibly wanting to employ full text search will need to live with an empty stop word list then. CCing xb95: Mark, are the DB servers AMO runs on dedicated to the project, or do they carry other projects' DBs as well? In the latter case, we'd at the very least need to be aware of the possible side effects for other projects.
The AMO databases are dedicated, they serve no other purpose (well, besides SAMO/VAMO stuff). If you want to remove stopwords, we can do that. I see no reason not to.
Thanks. We'll plan this on the webdev side, then move forward with IT (server restart, rebuilding indexes, probably asks for a maintenance window).
This might be the better solution. Otherwise, we have to maintain a list of mysql stopwords to ignore in the amo code, which strikes me as somewhat brittle.
"Tab mix plus" (no quotes) finds the add-on as expected now, as the first result, as we removed stop words. To my confusion, "better" (n.q.) still does not find "better gmail". Searching for "better gmail" (n.q.) however, finds it as the first result in the set. What is possible is that somebody searched for "better" recently so the result is still cached. We'll need to check again in a little while. If it still doesn't show up, it could be that more than 50% of all add-ons use "better" somewhere in their descriptions (cf. bug 458110 comment 13), leading to it still not being indexed, though I find that quite unlikely.
(In reply to comment #21)
> What is possible is that somebody searched for "better" recently so the result
> is still cached. We'll need to check again in a little while.
Yup: Cache has been flushed, searching for "better" (no quotes) now returns Better Gmail and Better Gmail 2 on top of the result set. Crowd, please cheer. :) Chris and Mark: Thank you for your support in this issue! Happily marking FIXED.
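The other possibility raised in comment #21, that a word appearing in too many rows is still ignored, turned out not to apply here, but the rule itself can be sketched in a few lines of Python. This is a simplified model, assuming whitespace tokenization and a 50% cutoff as discussed in that comment; the exact boundary condition, the sample descriptions, and the name `too_common` are assumptions.

```python
def too_common(term, descriptions, threshold=0.5):
    """Return True if `term` occurs in at least `threshold` of the rows.

    Simplified model of the "50% rule" mentioned in the comment above;
    whether the boundary is inclusive is an assumption of this sketch.
    """
    if not descriptions:
        return False
    term = term.lower()
    matching = sum(1 for text in descriptions if term in text.lower().split())
    return matching / len(descriptions) >= threshold


# Hypothetical add-on descriptions, for illustration only.
DESCRIPTIONS = [
    "a better inbox for gmail",
    "better tab handling",
    "better feed reading",
    "blocks ads",
]
```

Here "better" appears in three of four rows, so `too_common("better", DESCRIPTIONS)` is true, while "gmail" appears in only one row and is not filtered. On the real AMO data, the fix in this bug shows that "better" stays well under the cutoff.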
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I've also re-run my Selenium tests, and confirmed Fred's results, as well as that we haven't regressed performance (at least by Selenium's timings): we're still averaging 24 seconds for the Search testcase (http://svn.mozilla.org/addons/trunk/site/app/tests/search.html) on https://wiki.mozilla.org/QA/Tools/Selenium/AMO_Automation, over 20 runs. Verified FIXED, as I don't think there's anything left to do?
Status: RESOLVED → VERIFIED
*cheers* Just tested my particular case (bug 456206/comment 12) and searching for "Brief" now finds Brief as the first result. :) Is there some way we can track the possible performance hit over the next week or so, in addition to tests?
(In reply to comment #24)
> Is there some way we can track the possible performance hit over the next week
> or so, in addition to tests?
I don't think this is necessary: A comparison of the search table size before and after did show an increase, however, it is still only in the single-digit (!) megabyte dimension, which is very small, in fact much smaller than I expected. Monitoring performance is more important when the queries change (bug 446122), as that can have a considerable impact on how expensive search is, and thus how well it performs.
Product: addons.mozilla.org → addons.mozilla.org Graveyard