User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19

Steps to reproduce:
Search e-mail for 'wedding'

Actual results:
Many emails were found, most of which contained 'weds'.

Expected results:
'weds' should not have been found. There should be some way to override this 'clever' search. I've seen other systems use +word or "word". Neither stops 'weds' from being found.
You are correct; we have no way to disable the Porter stemming algorithm. Disabling it would need to be done as a filtering post-pass, since the database stores the stemmed word. Using the plus sign might work out better, since we/SQLite already use quoting to indicate phrases, and although quoting a single word is unambiguous, phrases would not be.

In the specific example, I assume the issue is that "weds" in the sense of "Alice weds Bob tomorrow" is accurately related to wedding (Wiktionary: "wed (third-person singular simple present weds, present participle wedding)"), but people also use "Weds." as an abbreviation for Wednesday, so the results for "wedding" get swamped by completely unrelated things?
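A rough sketch of what such a filtering post-pass could look like, purely for illustration. This is not Thunderbird code: the function names `contains_exact_word` and `post_filter` and the (id, body) tuples are hypothetical, and the candidates are assumed to be whatever the stemmed index already returned.

```python
import re

def contains_exact_word(text, word):
    """Return True if `word` occurs in `text` as a whole word (case-insensitive)."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def post_filter(candidates, required_words):
    """Keep only candidates whose body contains every required word verbatim."""
    return [
        (msg_id, body)
        for msg_id, body in candidates
        if all(contains_exact_word(body, w) for w in required_words)
    ]

# Hypothetical results of a stemmed search for "wedding" (all stem to "wed").
candidates = [
    (1, "The wedding is on Saturday."),
    (2, "See you Weds. at noon."),
    (3, "Alice weds Bob tomorrow."),
]

exact = post_filter(candidates, ["wedding"])
print([msg_id for msg_id, _ in exact])  # prints [1]
```

The filter has to re-read the message text for every candidate, which is the extra cost of doing this outside the index.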
Yes, you are correct, my mails are full of people using weds for Wednesday. I had not really clicked that 'Weds' could also be a stem variant of Wedding.
Trying to find a message recommending some 'accountants' was pretty tricky as the stemmer took that all the way down to 'account'
Somehow worse is the fact that longer, more precise matches like "we submitted" (quoted in the search box) match things like:

submitted
submit
@submit
-submit

The stemming behavior should be disabled by quoting.
(In reply to M Lopez-Ibanez from comment #5)
> The stemming behavior should be disabled by quoting.

The problem is that under the current implementation, that information is already gone in the inverted index. There is no "submitted" in the inverted index, only "submit".

Amending my previous statements in comment 1, possible solutions that don't require implementing our own fulltext-search back-end for SQLite would be to:

- Try and have the tokenizer also emit an un-stemmed token. Unfortunately, this would break phrase searching, so it's not a great fix.
- Populate a second full-text-search table that does not do stemming but does do case-folding/etc. So then "submitted" would be in the inverted index. The downside to this is that it would probably double the disk space used (well, a 50% increase for global-messages-db.sqlite), although I think FTS3/FTS4 may have had some enhancements where you don't need to store the body text in-place anymore, so that might be only a 25% increase.

Important note: I don't hack on Thunderbird anymore at all, just including these possibilities if someone else wanted to pick up the bug.
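To make the second-table option concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the local SQLite build has FTS4 enabled, and it uses SQLite's stock `porter` and `simple` tokenizers as stand-ins; the table names `msgs_stemmed` and `msgs_exact` are made up for the example and are not the actual Thunderbird (gloda) schema or tokenizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One index with Porter stemming, one with only case-folding (no stemming).
conn.execute("CREATE VIRTUAL TABLE msgs_stemmed USING fts4(body, tokenize=porter)")
conn.execute("CREATE VIRTUAL TABLE msgs_exact USING fts4(body, tokenize=simple)")

messages = [
    (1, "The wedding is on Saturday."),
    (2, "See you Weds. at noon."),
    (3, "We submitted the report."),
    (4, "Please submit the form."),
]
for rowid, body in messages:
    # Every message is indexed twice, which is where the extra disk space goes.
    conn.execute("INSERT INTO msgs_stemmed(rowid, body) VALUES (?, ?)", (rowid, body))
    conn.execute("INSERT INTO msgs_exact(rowid, body) VALUES (?, ?)", (rowid, body))

def search(table, query):
    """Run a full-text query; `table` must be one of the two fixed names above."""
    if table not in ("msgs_stemmed", "msgs_exact"):
        raise ValueError("unknown table")
    sql = f"SELECT rowid FROM {table} WHERE body MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]

stemmed_hits = search("msgs_stemmed", "wedding")   # stem "wed" also matches "Weds."
exact_hits = search("msgs_exact", "wedding")       # only the literal word
phrase_hits = search("msgs_exact", '"we submitted"')
print(stemmed_hits, exact_hits, phrase_hits)
```

A quoted or "+word" query could then be routed to the un-stemmed table while ordinary queries keep using the stemmed one.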
(In reply to Andrew Sutherland (:asuth) from comment #6)
> (In reply to M Lopez-Ibanez from comment #5)
> > The stemming behavior should be disabled by quoting.
>
> The problem is that under the current implementation, that information is
> already gone in the inverted index. There is no "submitted" in the inverted
> index, only "submit".

Speaking from ignorance, so feel free to tell me that this simply cannot work: why not just post-process the results to remove invalid matches? I see Thunderbird helpfully highlighting the matched word, so perhaps before doing that it can check that what is matched actually matches exactly what was requested.

This is a bit inefficient, since the exact search could be more efficient than the partial search, plus there is the added overhead of detecting and removing inexact matches. But perhaps it is not actually noticeably slower. I don't find the global search especially slow (with a 107M global-messages-db.sqlite), but it is annoying not to be able to find something that I know is there.
Great minds think alike :) What you propose can be done and is the option I described in comment 1. I just wanted to enumerate some additional options that could avoid a post-filtering pass. Especially in cases where we issue a LIMIT to the results, post-filtering could run into trouble where the stem of the word is extremely common, like for the example in comment 1.
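A small sketch of why LIMIT and post-filtering interact badly, and of one possible mitigation (pulling candidates in batches until enough exact matches are found). The data, the batch size, and the name `exact_matches_with_limit` are invented for illustration; this is not how the global search is implemented.

```python
import re
from itertools import islice

def exact_matches_with_limit(candidate_iter, word, limit, batch_size=50):
    """Consume stemmed candidates in batches until `limit` exact matches are found."""
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    results = []
    while len(results) < limit:
        batch = list(islice(candidate_iter, batch_size))
        if not batch:
            break  # the stemmed result set is exhausted
        for msg_id, body in batch:
            if pattern.search(body):
                results.append(msg_id)
                if len(results) == limit:
                    break
    return results

# Hypothetical stemmed hits for "wedding": mostly "Weds." noise, one real match.
stemmed_hits = [(i, "See you Weds. at noon.") for i in range(1, 150)]
stemmed_hits.append((150, "Wedding invitation attached."))

# Naive approach: apply LIMIT 50 first, then filter. The real match is lost.
naive = [m for m, body in stemmed_hits[:50] if re.search(r"\bwedding\b", body, re.I)]

# Batched approach: keep fetching until the limit is satisfied or hits run out.
batched = exact_matches_with_limit(iter(stemmed_hits), "wedding", limit=1)
print(naive, batched)  # prints [] [150]
```

The batched variant finds the message, but only after scanning 150 candidates, which is the cost being described for very common stems.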
Quoting definitely does not work. This functionality defeats the purpose of "searching" if it's giving me things I'm not looking for! I would propose that this is not an "enhancement" request. This goes beyond the basic functionality of searching (i.e. returning matches for exact string "needle") in a way that reduces the quality of the search results ("needle OR hay OR stack OR haystack"). - RG>
It affects me too. Search just doesn't work for me. I can't find emails with "@reviews.co.uk" or "reviews.co.uk" because it finds emails with the word "review" in them....
Yes, this is also a big problem for me. When looking for domains or email addresses, Thunderbird returns too many results that match only parts of my search; in most cases I need an exact match.
This NEEDS correcting - just because a bad method has been implemented is not a reason to justify it. I was searching for "Messaging" - but I get a million results for "Messages". This happens every time. USABILITY IS KING. There are other algorithms for search - even if it's slow but returns what we want, it's VASTLY BETTER than what happens now. Just give us a "strict match" checkbox, and let us save that as a preference.
It's also missing some text. For instance, Exchange replies to meetings with messages having *~*~*~*~*~ in them. The current version of Thunderbird won't find these. For example:

When: Friday, November 14, 2014 3:30 PM-4:30 PM. (UTC-05:00) Eastern Time (US & Canada)
Where: Conference Room 2 WPB

*~*~*~*~*~*~*~*~*~*

Meeting notes go here
The official help pages of Mozilla https://support.mozilla.org/en-US/kb/global-search say: 'Search for the phrase "new Thunderbird pages". Results should include messages that contain the entire phrase' This, as this bug report says, doesn't work.
A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.
This issue even affects searches for specific websites. For example, I searched for the website alternate.de and got matches for emails which contain the word "alternatively". This feature should be disabled for terms which aren't words. I would still prefer a solution where you can disable the stemming feature completely.
As a workaround, search seems to be better with this plugin: https://addons.mozilla.org/en-US/thunderbird/addon/gmailui/ This is no solution though.
I hope this will be fixed soon too. I use the search a lot for non-English and nonexistent words to find important documents. It has become a very time-consuming task since this smart search was enabled a few years ago. The Thunderbird versions before that saved me a lot of time. IMHO, it would be nice if we could disable stemming altogether; if I want both "account" and "accounting", I will just search for "account". Or make it optional if people really need it.
As a workaround you can always use the old-style search (Ctrl+Shift+F). It's not as fast, because it's not indexed, but there's no stemming.
(In reply to Piotr Szymkowski from comment #11) Yes, for me it is a big problem. Search for domains or email addresses Thunderbird returns too many results, with some parts of my research, in most cases, I need an exact match. @Piotr Szymkowski Yes, I work in the wedding sector: http://www.noemiwedding.com/ A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.
A search for "interns" provides results for "internal" (even when followed by "and external") and "international" (which can swamp results) (though curiously, "internship" is not found).
And trying to exclude some of these unwanted terms, with -[unwanted term], doesn't exclude them.
I'm also trying to search for all e-mails containing the exact literal "hp.com", but Thunderbird finds all mails with "com" (i.e. all mails containing a dot-com address :-O)... and I can't search for just "hp" (no results... probably too short?). This is a huge limitation of the global search engine that has hurt me many times.