have global search support exact matching (disabling stemming). ex: Searching for 'wedding' finds 'weds'

NEW
Unassigned

Status

Thunderbird
Search
--
enhancement
6 years ago
3 months ago

People

(Reporter: worms2, Unassigned)

Tracking

12 Branch

(Reporter)

Description

6 years ago
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19

Steps to reproduce:

Search e-mail for 'wedding'


Actual results:

Many emails were found, most of which contained 'weds'.


Expected results:

weds should not have been found.

There should be some way to over-ride this 'clever' search. I've seen other systems use +word, or "word". Neither stop finding 'weds'.

Comment 1

You are correct; we have no way to disable the Porter stemming algorithm.  Exact matching would need to be performed as a post-filtering pass, since the database stores only the stemmed word.  Using the plus sign might work out better, since we/SQLite already use quoting to indicate phrases; although quoting a single word is unambiguous, quoting a phrase would not be.

In the specific example, I assume the problem is this: "weds" in the sense of "Alice weds Bob tomorrow" is legitimately related to wedding (wiktionary: "wed (third-person singular simple present weds, present participle wedding)"), but people also use "Weds." as an abbreviation for Wednesday, so the results for "wedding" get swamped by completely unrelated messages?
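
The collision described above can be reproduced outside Thunderbird. The following is a minimal sketch, assuming a Python build whose bundled SQLite has FTS4 enabled; SQLite's built-in `porter` tokenizer stands in for whatever tokenizer Thunderbird actually uses, and the table name, helper name, and sample messages are made up for illustration.

```python
import sqlite3

def stemmed_matches(query, bodies):
    """Return the bodies that a Porter-stemmed FTS4 index matches for `query`."""
    conn = sqlite3.connect(":memory:")
    # Assumption: SQLite's stock "porter" tokenizer, not Thunderbird's own.
    conn.execute("CREATE VIRTUAL TABLE messages USING fts4(body, tokenize=porter)")
    conn.executemany("INSERT INTO messages(body) VALUES (?)", [(b,) for b in bodies])
    rows = conn.execute(
        "SELECT body FROM messages WHERE body MATCH ? ORDER BY docid", (query,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

# Hypothetical sample messages.
bodies = [
    "Photos from the wedding are attached",
    "Can we meet Weds. afternoon instead?",
    "The budget review is on Thursday",
]

# Both "wedding" and "Weds" are indexed under the same stem, so the
# Wednesday message comes back for a "wedding" query.
print(stemmed_matches("wedding", bodies))
```

Because the query term is stemmed the same way as the indexed text, the index alone cannot tell the two words apart.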
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: x86_64 → All
Summary: Searching for 'wedding' finds 'weds' → have global search support exact matching (disabling stemming). ex: Searching for 'wedding' finds 'weds'
(Reporter)

Comment 2

6 years ago
Yes, you are correct, my mails are full of people using weds for Wednesday. 

I had not really clicked that 'Weds' could also be a stem variant of Wedding.

Updated

6 years ago
Duplicate of this bug: 760042

Comment 4

5 years ago
Trying to find a message recommending some 'accountants' was pretty tricky as the stemmer took that all the way down to 'account'

Comment 5

4 years ago
Somewhat worse is the fact that longer, more precise queries like "we submitted" (quoted in the search box) match things like:

submitted
submit
@submit
-submit

The stemming behavior should be disabled by quoting.

Comment 6

(In reply to M Lopez-Ibanez from comment #5)
> The stemming behavior should be disabled by quoting.

The problem is that under the current implementation, that information is already gone in the inverted index.  There is no "submitted" in the inverted index, only "submit".

Amending my previous statements in comment 1, possible solutions that don't require implementing our own fulltext-search back-end for SQLite would be to:

- Try and have the tokenizer also emit an un-stemmed token.  Unfortunately, this would break phrase searching, so it's not a great fix.

- Populate a second full-text-search table that does not do stemming but does do case-folding/etc.  So then "submitted" would be in the inverted index.  The downside to this is that it would probably double the disk space used (well, a 50% increase for global-messages-db.sqlite), although I think FTS3/FTS4 may have had some enhancements where you don't need to store the body text in-place anymore, so that might be only a 25% increase.
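
The second option above can be sketched as follows. This is only an illustration, assuming SQLite with FTS4 available from Python; the table names `stemmed` and `unstemmed` and both helper functions are hypothetical, and SQLite's `simple` tokenizer (ASCII case-folding, no stemming) stands in for the non-stemming tokenizer.

```python
import sqlite3

def build_index(bodies):
    """Index every message twice: once stemmed, once with case-folding only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE stemmed USING fts4(body, tokenize=porter)")
    conn.execute("CREATE VIRTUAL TABLE unstemmed USING fts4(body, tokenize=simple)")
    for docid, body in enumerate(bodies, start=1):
        for table in ("stemmed", "unstemmed"):
            conn.execute(
                f"INSERT INTO {table}(docid, body) VALUES (?, ?)", (docid, body)
            )
    return conn

def search(conn, query, exact=False):
    """Return matching docids; exact=True consults the unstemmed table."""
    table = "unstemmed" if exact else "stemmed"
    rows = conn.execute(
        f"SELECT docid FROM {table} WHERE body MATCH ? ORDER BY docid", (query,)
    )
    return [row[0] for row in rows]

# Hypothetical sample messages.
conn = build_index([
    "We submitted the paper last night",
    "Please submit the form",
    "We submit reports weekly",
])

print(search(conn, '"we submitted"'))              # stemmed phrase search
print(search(conn, '"we submitted"', exact=True))  # exact phrase search
```

Storing the body text in both tables is what drives the disk-space cost mentioned above; FTS4's `content=` option is one way to avoid keeping a second copy of the text, though whether it fits Thunderbird's schema is not established here.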


Important note: I don't hack on Thunderbird anymore at all, just including these possibilities if someone else wanted to pick up the bug.

Comment 7

4 years ago
(In reply to Andrew Sutherland (:asuth) from comment #6)
> (In reply to M Lopez-Ibanez from comment #5)
> > The stemming behavior should be disabled by quoting.
> 
> The problem is that under the current implementation, that information is
> already gone in the inverted index.  There is no "submitted" in the inverted
> index, only "submit".

Speaking from ignorance, so feel free to tell me that this simply cannot work: why not just post-process the results to remove invalid matches? I see Thunderbird helpfully highlighting the matched word, so perhaps before doing that it can check that what is matched is exactly what was requested. This is a bit inefficient, since an exact search could be more efficient than the partial search, plus there is the added overhead of detecting and removing inexact matches. But perhaps it is not actually noticeably slower. I don't find the global search especially slow (with a 107M global-messages-db.sqlite), but it is annoying to not be able to find something that I know is there.

Comment 8

Great minds think alike :)  What you propose can be done and is the option I described in comment 1.  I just wanted to enumerate some additional options that could avoid a post-filtering pass.  Especially in cases where we issue a LIMIT on the results, post-filtering could run into trouble when the stem of the word is extremely common, as in the example in comment 1.
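
The LIMIT concern can be illustrated with a small sketch. Everything here is hypothetical: `stemmed_hits` stands in for the rows a stemmed index would return, list slicing stands in for SQL LIMIT, and the function names are invented for this example.

```python
import re

def exact_word_filter(term, candidates):
    """Keep only the candidates that contain `term` as a whole word."""
    pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
    return [body for body in candidates if pattern.search(body)]

def limit_then_filter(term, stemmed_hits, limit):
    """Naive approach: apply the LIMIT first, then post-filter."""
    return exact_word_filter(term, stemmed_hits[:limit])

def filter_until_filled(term, stemmed_hits, limit, batch=50):
    """Keep pulling batches of stemmed hits until `limit` exact matches are found."""
    results = []
    for start in range(0, len(stemmed_hits), batch):
        results.extend(exact_word_filter(term, stemmed_hits[start:start + batch]))
        if len(results) >= limit:
            break
    return results[:limit]

# A very common stem: 100 "Weds." messages ahead of the one real match.
stemmed_hits = ["See you Weds."] * 100 + ["The wedding is in June"]

print(limit_then_filter("wedding", stemmed_hits, limit=10))
print(filter_until_filled("wedding", stemmed_hits, limit=10))
```

With the limit applied first, the only exact match is never examined; the batched variant finds it, at the cost of scanning far more rows than the limit suggests.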

Updated

4 years ago
Duplicate of this bug: 815630

Comment 10

4 years ago
Quoting definitely does not work. This functionality defeats the purpose of "searching" if it's giving me things I'm not looking for! 

I would propose that this is not an "enhancement" request. This goes beyond the basic functionality of searching (i.e. returning matches for exact string "needle") in a way that reduces the quality of the search results ("needle OR hay OR stack OR haystack").

- RG>

Comment 11

4 years ago
It affects me too.

Search just doesn't work for me.

I can't find emails containing

"@reviews.co.uk"
"reviews.co.uk"

because it finds emails with the word "review" in them....

Comment 12

4 years ago
Yes, this is also a big problem for me. When I look for domains or email addresses, Thunderbird returns too many results that match only parts of my search; in most cases I need an exact match.

Comment 13

4 years ago
This NEEDS correcting  - just because a bad method has been implemented is not a reason to justify it.
I was searching for "Messaging" - but I get a million results for "Messages"
This happens every time. 
USABILITY IS KING
There are other algorithms for search - even if it's slow but returns what we want it's VASTLY BETTER than what happens now. 
Just give a checkbox "strict match", and let us save that as a preference.

Updated

3 years ago
Duplicate of this bug: 1036277

Comment 15

3 years ago
It's also missing some text. For instance, Exchange replies to meetings with messages having *~*~*~*~*~ in them. The current version of Thunderbird won't find these. For example:


When: Friday, November 14, 2014 3:30 PM-4:30 PM. (UTC-05:00) Eastern Time (US & Canada)
Where: Conference Room 2 WPB

*~*~*~*~*~*~*~*~*~*
Meeting notes go here

Comment 16

3 years ago
The official help pages of Mozilla https://support.mozilla.org/en-US/kb/global-search say:

'Search for the phrase "new Thunderbird pages". Results should include messages that contain the entire phrase'

This, as this bug report says, doesn't work.

Comment 17

3 years ago
A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.

Comment 18

2 years ago
This issue even affects searches for specific websites.
For example, I searched for the website alternate.de and got matches for emails that contain the word "alternatively".

This feature should be disabled for terms which aren't words. I still would prefer a solution where you can disable the stem feature completely.

Comment 21

a year ago
As a workaround, search seems to work better with this add-on:
https://addons.mozilla.org/en-US/thunderbird/addon/gmailui/
This is no solution though.

Comment 22

a year ago
I hope this will be fixed soon too. I use the search a lot to find important documents by searching for non-English and non-existent words. This has become very time-consuming since this smart search was enabled a few years ago. The Thunderbird versions before that saved me a lot of time.

Imho, it would be nice if we could disable stemming altogether; if I want both account and accounting, I will just search for account. Or make it optional if people really need it.

Comment 23

a year ago
As a workaround you can always use the old style search (Ctrl+Shift+F). It's not as fast as it's not indexed, but there's no stemming.

Comment 25

10 months ago
(In reply to Piotr Szymkowski from comment #11)

Yes, for me it is a big problem. When searching for domains or email addresses, Thunderbird returns too many results that match only parts of my search; in most cases I need an exact match.

@Piotr Szymkowski
Yes, I work in the wedding sector: http://www.noemiwedding.com/

A work-around is to use the classical search "Ctrl+Shift+F", but it is limited to one account.

Comment 26

9 months ago
A search for "interns" provides results for "internal" (even when followed by "and external") and "international" (which can swamp results) (though curiously, "internship" is not found).

Comment 27

4 months ago
And trying to exclude some of these unwanted terms, with -[unwanted term], doesn't exclude them.

Comment 28

3 months ago
I'm also trying to search for all e-mails containing the exact literal "hp.com", but Thunderbird finds all mails with "com" (i.e.: all mails containing a dot com address :-O)... and I can't search for just "hp" (no results... probably too short?).

This is a huge limitation of the global search engine that has hurt me many times.