Open Bug 523183 Opened 15 years ago Updated 5 years ago

[faceted search] gloda fulltext search does not match partial terms (by default), eats short terms potentially causing misleading search failures

Categories

(Thunderbird :: Search, defect)

defect
Not set
major

Tracking

(Not tracked)

People

(Reporter: ovidiu.grigorescu, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [gs?])

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.4pre) Gecko/20090915 Thunderbird/3.0b4
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.5pre) Gecko/20091016 Shredder/3.0pre

TB3 beta 4 as well as current nightly over the same old TB2 profile,
[pop, rss, news, several of each, thousands of msgs, 2-year TB2 profile or so]
Initial indexing was done with the 3.0pre above, or could have been 20091014
- maybe related to bug 523173, same profile etc., see some errors there

Searching for "bug" gives different, much fewer results than searching for the bugzilla daemon address.
Searching for "bugz", "bugzi", "bugzil" gives /no results/, while "bugzilla" gives /245/? (there are 2900+ anyway)

seems to be the issue in Bug 383895 Comment #24, point 2)
Blocks: Migration223
Summary: Search fails to find parts of a word but the whole word, none or wrong results (upgrading 2-3) → [faceted search] fails to find parts of a word but the whole word, none or wrong results (upgrading 2-3)
Bug 383895, comment #24, point 2):
> the search behaviour of "Search all messages" is not clear at all:
> someone yet needs to explain to me why
- "susann" will find "Susanne",
- while searching for "susan" won't

Question: What's the intended behaviour?
How much and which part of a word do I need to type to get a match?
Actually, I'm surprised that we don't just use "contains word" for the search, i.o.w. *word* (where * matches any number of any letters.)
> Question: What's the intended behaviour?
> How much and which part of a word do I need to type to get a match?

Ping?
OS: Windows Vista → All
Hardware: x86 → All
The CJK tokenizer landing also got us support for searching for "bug*", which matches "bug" and all words that have "bug" as a prefix.  When typing into the search box, do not put it in quotes though, or it won't work.

I think I responded to Thomas' questions out-of-band in a support e-mail thread, but for anyone else who comes along to the bug: the current behavior is intended in the sense that technical limitations make it hard to do much better.  In the case where no results are returned it's feasible for us to automatically escalate the search or at least provide UI to try and escalate it, but there are still certain limitations of the FTS3 engine that we can't yet work around.  (We don't know all of the words in the database or statistics about how frequently they are seen, etc.  The only magic we have is the single suffix wildcard trick.)
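For readers unfamiliar with the "single suffix wildcard trick" mentioned above, here is a toy sketch (not gloda's actual code) of what automatically escalating an exact-term search to a prefix search could look like. The index shape and function names are hypothetical, purely for illustration:

```python
# Toy sketch of search escalation: when an exact-term lookup returns
# nothing, retry the query as a prefix ("term*") search against the
# indexed vocabulary -- the one wildcard FTS3 supports.

def search(index, term):
    """`index` maps indexed terms to sets of message ids (hypothetical)."""
    exact = index.get(term, set())
    if exact:
        return exact
    # Escalation: union the document sets of every indexed term that has
    # `term` as a prefix, which is what FTS3 does for a trailing '*'.
    hits = set()
    for t, docs in index.items():
        if t.startswith(term):
            hits |= docs
    return hits

index = {"bugzilla": {1, 2}, "bugs": {3}}
search(index, "bugzi")  # no exact term, escalates to prefix -> {1, 2}
```

The escalation here is unconditional on failure; real UI would presumably want to surface that the search was broadened, per the discussion below.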
Another detail which is likely to confuse users:

STR
1. Have lots of bugmail which contains this string: :)
Thomas D. <bugzilla2007@...> changed: ...
2. "Search all messages" for the following:
a) Thomas D.
b) "Thomas D." (with quotes)

Actual Result
a) finds nothing
b) finds loads of things, including "Thomas Düllm a nn" (without spaces)

Expected Result
a) should definitely find me, as it's the exact String that's in the mail
b) should not find my full name, but only "Thomas D." (with a dot)

Questions:
- Is the dot (.) a placeholder in the search? If yes, where is this documented?
- Is the dot (.) just ignored? If yes, where is this documented?
- What will average user expect when doing above searches?
(In reply to comment #4)
> 2 "Search all messages" for the following:
> a) Thomas D.
> b) "Thomas D." (with quotes)
> 
> Actual Result
> a) finds nothing
> b) finds loads of things, including "Thomas Düllm a nn" (without spaces)

'Thomas' tokenizes to 'thomas'. 'D.' tokenizes to nothing.

Case #a is searching for 'thomas' AND (nothing).  The intersection of anything with the empty set is the empty set.

Case #b is searching for just 'thomas' because the use of quotes causes us to search for a single phrase which ends up being a single word because the tokenizer eats D.

I agree that case #a is quite confusing.  I'll file a specific bug on that now.
I should note that I also agree that case B is likely to produce some confusion in the user, but is a much less severe failure mode since it returns more results than the user wanted rather than none at all.  (And it doesn't really have a super easy fix...)
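The tokenization behavior Andrew describes can be sketched in a few lines. This is a toy approximation (not the real gloda/FTS3 tokenizer, and ignoring the CJK and stemming logic): lowercase, split on non-alphanumeric characters, and silently drop tokens below the minimum length, so "D." yields only the one-letter token "d", which is eaten:

```python
import re

MIN_TOKEN_LEN = 2  # per comment 8 below, the current minimum is 2 letters

def tokenize(text, min_len=MIN_TOKEN_LEN):
    """Toy tokenizer: lowercase, split on non-alphanumerics, drop short tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= min_len]

tokenize("Thomas D.")    # -> ['thomas']  ('d' is below the minimum and eaten)
tokenize('"Thomas D."')  # -> ['thomas']  (quoting does not save the 'd')
```

This is why case #a degenerates to `thomas AND (nothing)` and case #b to a one-word phrase.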
Thanks Andrew, that was helpful, especially because we can expect improvements from the bug you promised in comment 5.

Andrew:
- what's the shortest token? 2 letters? 3 letters?
- does that mean words with fewer letters are not indexed in the database at all?

(in reply to comment 3)
> The only magic we have is the single suffix wildcard trick.
- does that mean that when I search for Thomas, you actually search for Thomas?, i. e. Thomas with or without exactly ONE extra character?

Along with the new bug Andrew promised in comment 5, the following should also be fixed (which is similar yet slightly different as it involves a placeholder)
c1) Thomas -> correctly finds msgs containing token "thomas"
c2) Thomas D*
-> actually finds nothing
-> expected: should find msgs with token "thomas", as in c1) (without 1-letter tokens, unfortunately we can't do any intersection here between Thomas and any tokens starting with the letter d, as we'd lose "Thomas D." from results.)


>'Thomas' tokenizes to 'thomas'. 'D.' tokenizes to nothing.
> Case #a is searching for 'thomas' AND (nothing).  The intersection of anything
> and the empty set which nets us the empty set.
> I agree that case #a is quite confusing.  I'll file a specific bug on that 
> now.

Request:
Even after fixing the above, don't we need some indication when we ignore search strings because they are too short for a token?
As a matter of fact, after the fix, we'll NOT be searching for the user's search input, Thomas D. or Thomas D*, but only for Thomas. In the results title, where we now have "Search for Thomas AND D.", after the fix we should display something else:
* Search for Thomas -AND D.- (crossed out, or light grey, or both, or maybe just this:)
* Search for Thomas
(In reply to comment #7)
> - what's the shortest token? 2 letters? 3 letters?

2, although I could see raising that to 3 letters.  (The rationale being that, especially since we don't have stop-word support, most of the time we will be indexing useless common things like in/of/to/at.  Proper stop-words would moot those specific issues.  The question is how much is left and whether what is left is useful in searching or just weird artifacts from HTML-to-text or tokenization involving punctuation/ASCII art.)

> - does that mean words with less letters are not indexed in the database at
> all?

Correct, single letters are not indexed.

> > The only magic we have is the single suffix wildcard trick.
> - does that mean that when I search for Thomas, you actually search for
> Thomas?, i. e. Thomas with or without exactly ONE extra character?

No.  If you search for 'Thomas' then that is what we search for.  If you search for 'thomas*' then we will match any term that is prefixed with 'thomas'.
 
> c2) Thomas D*
> -> actually finds nothing
> -> expected: should find msgs with token "thomas", as in c1), (without 1-letter
> tokens, unfortunately we can't do any intersection here between Thomas and any
> tokens starting with the letter d, as we'd loose "Thomas D." from results.)

Interesting.  Although the user's intent in this case is likely different from the A case (if they know what the '*' does), I think the solution for the bug (bug 549594) is the same.  Namely, actually processing 'd*' is likely to overwhelm the system without contributing meaningfully to the results.

(I do think if we had a usable post-filtering mechanism that it would address this case as well as a few of your other cases, but we don't really have a promising technical solution within the reach of our current manpower.)
 
> Request:
> Even after fixing the above, don't we need some indication when we ignore
> search strings because they are too short for a token?

Indeed we do!  Thanks to the way the message searching logic is structured, this will already happen as a result of the fix for the A case tracked on bug 549594.
(In reply to comment #8)
> (In reply to comment #7)
> > - what's the shortest token? 2 letters? 3 letters?
> 2, although I could see raising that to 3 letters.

I'm not sure raising the tokenizer limit to 3 would be a good idea. Actually, my question was in the opposite direction (as I'll explain below).

> (The rationale being that,
> especially since we don't have stop-word support, most of the time we will be
> indexing useless common things like in/of/to/at.

I'm wondering if these are really useless, or if we are creating another problem by eating them, in fact even by eating one-letter tokens.

What happens when user wants to search for things like these?
c2) Thomas D*
c3) "Thomas D*"

Especially in the case of c3), the user can rightly expect that we'll only match such tokens of "Thomas" that are immediately followed by a word starting with the letter D (followed by no or any letters). Although "Thomas" is clearly the more unique and therefore significant token, searching exactly for "Thomas D*" will usually return fewer and therefore more relevant results (e.g. I won't find Thomas Anderson, Thomas Pearson, Thomas Petterson, etc.).

Without one-letter tokens, we don't have much choice other than ignoring D* in both c2 and c3 (which will be implemented by bug 549594). Technically, we should match "Thomas D" as well, but from my understanding we can't do this as we don't have the one-letter "D" token in the database (pls correct me if I'm wrong, I'm not a search expression or gloda expert).
But at least for anything with two or more letters, like "Thomas Dü*", we can currently execute the user's search correctly as we should, by definition.

If we raised the threshold to 3-letter tokens, we'd be manipulating even more potential searches and making them less precise and effective than they should be. Note that the user will always expect that any of "Thomas D*", "Thomas Dü*" etc. should return *fewer* (better) results than just searching for "Thomas".

Scenario: E.g. the user is looking for a contact with a known first name (Thomas), but can't remember the complete / correct spelling of the last name (is it _Düll_mann, _Dill_mann, or _Duell_mann?). Or the user wants a list of all people with first name Thomas and a last name starting with D, to find the right one.


> Proper stop-words would moot those specific issues.
If we had stop words, how do I search for exact strings that contain them?
What if for some reason I need to find all mails that contain the exact string "Thomas in *", e.g. to retrieve a series of mails from a globetrotter who visited several countries?

> > - does that mean words with less letters are not indexed in the database at
> > all?
> Correct, single letters are not indexed.

I'm wondering if they should be, as explained above.

> > > The only magic we have is the single suffix wildcard trick.
> > - does that mean that when I search for Thomas, you actually search for
> > Thomas?, i. e. Thomas with or without exactly ONE extra character?
> No.  If you search for 'Thomas' then that is what we search for.

Hmmm, but still you aren't always searching exactly only what I type in:

m1) Thoma -> finds Thomas
m2) Josefin -> finds Josefine (however, that version of the name is nowhere in the source text of the mail, only in my address book; in the source, it's Josephine)
m3) test22 -> does NOT find test222

Can someone enlighten me how this magic works, especially m2? Are we secretly searching address book display names, deriving their email address, and then including that address into the actual search?


> (bug 549594) is the same.  Namely, actually processing 'd*' is likely to
> overwhelm the system without contributing meaningfully to the results.

I disagree, as shown above. It's not d* itself, but d* in exact string search "thomas d*" (with quotes), which can very well help to narrow down.

> (I do think if we had a usable post-filtering mechanism that it would address
> this case as well as a few of your other cases

I'm not sure what you mean here... what kind of post-filtering mechanism are you thinking of?
sidenote:  Bug 523443

> > Even after fixing the above, don't we need some indication when we ignore
> > search strings because they are too short for a token?
> Indeed we do!  Thanks to the way the message searching logic is structured,
> this will already happen as a result of the fix for the A case tracked on bug
> 549594.

Thanks for opening that bug, and for patient, informative and cooperative responses.
Depends on: 549594
SQLite FTS3 uses an inverted index for searching.  This is stored as a list of terms.  For each term, there is a list of documents in which the term occurs (as well as the offsets of each occurrence).

When we search for multiple terms, FTS3 gets the list of documents in which each term occurs and intersects those lists.  If a phrase is used (ex: "foo bar"), the intersection additionally requires that the offsets of the words are adjacent.  The more documents in the list, the more disk accesses and intersection work SQLite has to go through.  When we use a wildcard, FTS3 finds all of the terms that possess the prefix and uses all of their document lists.

The reason to increase the minimum token size and avoid indexing stop-words is that a document list mentioning every document contributes little in the way of search refinement but carries a high cost.

Take the example you used, "Thomas in".  The list of documents referenced for "in" is likely to include almost every document in the database.  Although the list includes enough offset information to actually reduce the result set, we would be doing a ton of extra processing.

My mention of post-filtering is the idea that if you typed in "Thomas in" we would only ask the FTS3 engine for "Thomas", and then post-process those results outside of FTS3 to only include results where we see that string.

I think you can then see why "d*" would be really expensive too; we would be considering the document list of every word beginning with the letter d.
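The inverted-index mechanics described above (term-to-document lists, intersection for AND, offset adjacency for phrases) can be sketched in plain Python. This is a toy model, not SQLite's FTS3 internals; all names are illustrative:

```python
from collections import defaultdict

def build_index(docs):
    """Toy inverted index: term -> {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_and(index, terms):
    # AND query: intersect the document lists of all terms.
    doclists = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doclists) if doclists else set()

def search_phrase(index, terms):
    # Phrase query: like AND, but offsets must be adjacent
    # (term i at position p+i for some start position p).
    result = set()
    for doc in search_and(index, terms):
        positions = [set(index[t][doc]) for t in terms]
        if any(all((p + i) in positions[i] for i in range(len(terms)))
               for p in positions[0]):
            result.add(doc)
    return result

docs = {1: "thomas duellmann wrote a patch",
        2: "duellmann replied to thomas"}
index = build_index(docs)
search_and(index, ["thomas", "duellmann"])     # -> {1, 2}
search_phrase(index, ["thomas", "duellmann"])  # -> {1} (adjacent only)
```

The "post-filtering" idea from comment 10 would sit after `search_and`: fetch the candidate documents for the cheap term only, then scan their text for the full string outside the index.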
(In reply to comment #9)
> Hmmm, but still you aren't always searching exactly only what I type in:
> 
> m1) Thoma -> finds Thomas
> m2) Josefin -> finds Josefine (however, that version of the name is nowhere in
> the source text of the mail, only in my address book; in the source, it's
> Josephine)
> m3) test22 -> does NOT find test222

The tokenizer includes 'porter stemmer' logic for ASCII strings.  It's an algorithm for English words that tries to find the roots of words.  It apparently thinks Thomas might be a plural of Thoma, and likewise that Josefin is the root of Josefine.  If you searched for Josefines you would probably also get a match.

I think Wikipedia has a pretty good description of the Porter stemming algorithm.  (It's probably also worth checking out their description/links for inverted indexing.)
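To make the m1/m2 surprises from comment 9 concrete, here is a deliberately tiny caricature of the suffix-stripping idea behind the Porter stemmer. The real algorithm has five phases of rules; these two rules are just enough to reproduce the observed behavior and are not the actual stemmer:

```python
def toy_stem(word):
    """Toy suffix stripper (NOT the real Porter algorithm)."""
    word = word.lower()
    if word.endswith("sses"):
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]      # treat a trailing 's' as a plural marker
    if word.endswith("e"):
        word = word[:-1]      # drop a silent final 'e'
    return word

toy_stem("Thomas")    # -> 'thoma'   (so 'Thoma' and 'Thomas' collide)
toy_stem("Josefine")  # -> 'josefin'
toy_stem("Josefin")   # -> 'josefin' (same stem, hence m2's surprise match)
```

Since both the indexed terms and the query terms are stemmed, any two words that reduce to the same stem match each other, which is exactly why 'Josefin' finds 'Josefine' even though the shorter form never appears in the mail.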
Summary: [faceted search] fails to find parts of a word but the whole word, none or wrong results (upgrading 2-3) → [faceted search] gloda fulltext search does not match partial terms (by default), eats short terms potentially causing misleading search failures
No longer depends on: 536874
multi-part question:
1. is this a pain point for global search users?  (and is this cited on gsfn?)  If one is a heavy search user, I would think it would be.
2. do we have a mechanism for asking people? (other than gsfn)
3. should this be prioritized for TB 3.3?
Whiteboard: [gs?]
(In reply to Wayne Mery (:wsmwk) from comment #14)
> multi part question:
> 1. is this a pain point for global search users?  (and is this cited on
> gsfn?)  If one is a heavy search user, I would think that it would be.

there might be some among http://getsatisfaction.com/mozilla_messaging/tags/search  -- I started looking but there is so much cruft there


> 2. do we have a mechanism for asking people? (other than gsfn)
> 3. should this be prioritized for TB 3.3?
OK, and what about logic like MO? Because the Filter in TB is great... Global search is useless... sorry.

https://i.imgur.com/lyQU4P9.png
See Also: → 832757