Open Bug 544580 Opened 14 years ago Updated 16 days ago

gloda tokenizer could probably do a better job of indexing numbers

Categories

(Thunderbird :: Search, defect)

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [gloda key][tokenizer key])

Attachments

(1 file)

Here's the deal on how this works right now:

- The copy stemmer takes at most the first 3 characters and last 3 characters from a string involving digits.  So "123456789" is emitted as "123789".  As long as the user queries on "123456789" they will get what they expect, but they will also get any other results that look like "123*789".

- We stop on punctuation.  So "100,000,000" gets emitted as "100", "000", "000".  Also, "1-555-555-5555" gets emitted as "1", "555", "555", "5555".  While phone number detection can be handled at a higher level, unless we start forcibly intercepting queries and shunting them to higher-level searches (rather than hinting with autocomplete), this can still result in some ridiculously expensive queries.
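
To make the two behaviors above concrete, here is a rough Python model of them. It is only a sketch of what is described in this comment, not the actual tokenizer code; the names `copy_stem` and `tokenize_current` and the `KEEP` constant are made up for illustration, and real stemming of ordinary words is left out.

```python
import re

KEEP = 3  # characters kept from each end of a digit-bearing token (per the description above)


def copy_stem(token, keep=KEEP):
    """Hypothetical model of the copy stemmer: keep the first and last
    `keep` characters of any token that contains a digit."""
    if any(ch.isdigit() for ch in token) and len(token) > 2 * keep:
        return token[:keep] + token[-keep:]
    return token  # short tokens and digit-free tokens pass through unchanged here


def tokenize_current(text):
    """Hypothetical model of the current behavior: every character that is
    not an ASCII letter or digit ends the token, then each token is stemmed."""
    return [copy_stem(t) for t in re.split(r"[^0-9A-Za-z]+", text) if t]


print(copy_stem("123456789"))              # 123789
print(copy_stem("123000789"))              # 123789 (collides with the line above)
print(tokenize_current("100,000,000"))     # ['100', '000', '000']
print(tokenize_current("1-555-555-5555"))  # ['1', '555', '555', '5555']
```

The collision in the second line is the "123*789" over-matching, and the very common tokens such as "000" and "555" are what make the resulting queries expensive.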

The most likely solution to this problem would be to do the following:

- Increase the copy stemmer constant so that it at least goes up to a full phone number with 4 digit extension.

- Further specialize the state machine so that it distinguishes between ASCII digits and ASCII letters and treats limited punctuation between digits as invisible.  For example ",.-" would probably be good candidates.

I think the driving motivation here would be to constrain the potential set of matches rather than guarantee the user gets the right match.  For example, apart from the ambiguity between US (1,000) and European (1.000) style delimiters, the user probably wouldn't care about the number of cents involved if they were searching for a dollar amount.  However, from a search space perspective, interpreting "1,122.00" as "1122" and "00" is going to completely flood us with bogus search results for the "00" token.  Interpreting it as "112200" is much saner, and could still turn out okay for the user if promoted to a wildcard search.
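
A minimal Python sketch of the proposal, under stated assumptions: the function names are hypothetical, `PROPOSED_KEEP = 8` is just one value that would leave a phone number plus a 4-digit extension (15 digits) untruncated, and the real change would live in the tokenizer's state machine rather than in a regular expression.

```python
import re

PROPOSED_KEEP = 8  # assumed value: 2 * 8 = 16 characters survive untruncated


def copy_stem(token, keep=PROPOSED_KEEP):
    """Hypothetical copy stemmer with a larger constant."""
    if any(ch.isdigit() for ch in token) and len(token) > 2 * keep:
        return token[:keep] + token[-keep:]
    return token


def tokenize_proposed(text):
    """Drop ',', '.' and '-' only when they sit directly between two ASCII
    digits, then split on the remaining non-alphanumeric characters."""
    joined = re.sub(r"(?<=[0-9])[,.\-](?=[0-9])", "", text)
    return [copy_stem(t) for t in re.split(r"[^0-9A-Za-z]+", joined) if t]


print(tokenize_proposed("1,122.00"))        # ['112200']
print(tokenize_proposed("1-555-555-5555"))  # ['15555555555']
print(tokenize_proposed("costs 1,000."))    # ['costs', '1000']
```

Punctuation that is not flanked by digits on both sides, such as the trailing period in the last example, still ends the token as before.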
Whiteboard: [gloda key][tokenizer key]
The thing I duped in was a request for finding version numbers like "3.6.4".  Same general problem, although in that case we likely do not want to elide the punctuation.
I was searching today for my email announcement of Thunderbird 3.1.7 using the string "3.1.7" and "3 1 7".  Neither search returned any results, even though I've located emails (using gmail) with that string.

I think it makes sense to index "3.1.7" as a word and make it searchable.
Flags: wanted-thunderbird+
(exactly what I mentioned to protz on IRC yesterday)

Protz, will this be helped by Bug 681754 ("gloda fts3 tokenizer would greatly benefit from stopword support")?

And how is this related to Bug 549594 ("GlodaMsgSearcher needs to avoid generating clauses that the tokenizer will eat")?


(In reply to John Hopkins (:jhopkins) from comment #3)
> I was searching today for my email announcement of Thunderbird 3.1.7 using
> the string "3.1.7" and "3 1 7".  Neither search returned any results, even
> though I've located emails (using gmail) with that string.
> 
> I think it makes sense to index "3.1.7" as a word and make it searchable.
This bug is specifically about improving the tokenizer so that it emits better tokens when indexing numbers. However, if we are to take any action regarding this bug or bug 681754, then we had better make sure we fix bug 549594 first, otherwise the situation will become a real mess.

The problem you're having is that the part of gloda that builds the query doesn't behave exactly like the tokenizer, so gloda thinks that search terms it passes to the SQLite search will yield valid results, while in fact, they're not valid tokens, so there's no chance they'll yield any results.

To make sure Gloda only issues valid search terms, we need to run the query through the tokenizer first, and this is bug 549594.
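
Roughly, the idea is something like the following Python sketch. Everything in it is illustrative: `tokenize` is a simplified stand-in for the index-time tokenizer, `build_match_terms` is a made-up name, and the real query builder is not written in Python.

```python
import re


def tokenize(text):
    """Simplified stand-in for the index-time tokenizer:
    lowercase, then split on anything that is not an ASCII letter or digit."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_match_terms(user_query):
    """Only emit search terms that the tokenizer itself would have produced."""
    terms = []
    for word in user_query.split():
        tokens = tokenize(word)
        if not tokens:
            continue  # the tokenizer eats this word entirely, so it can never match
        if len(tokens) == 1:
            terms.append(tokens[0])
        else:
            # The word splits into several tokens: search for them as a phrase.
            terms.append('"' + " ".join(tokens) + '"')
    return terms


print(build_match_terms("Thunderbird 3.1.7 --"))  # ['thunderbird', '"3 1 7"']
```

Because the query side and the index side share one tokenizer here, a term like "3.1.7" is never passed to the search as a single token that cannot exist in the index.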

I'm updating the dependencies to reflect my comment.
Depends on: 549594
Depends on: 752844
No longer depends on: 752844

I don't know if I am experiencing this bug but I have an e-mail with words in the subject Electrolux 5303918344, and if I search 5303918344 the email is not found. OTOH if I search Electrolux the e-mail is found. I'm using Thunderbird 68.8.0.

Severity: normal → S3

(In reply to Ray Satiro from comment #11)

> I don't know if I am experiencing this bug but I have an e-mail with words in the subject Electrolux 5303918344, and if I search 5303918344 the email is not found. OTOH if I search Electrolux the e-mail is found. I'm using Thunderbird 68.8.0.

I created a draft with 5303918344 in the body and another with 5303918345 in the subject - both are found.

Attached image screenshot

(In reply to Wayne Mery (:wsmwk) from comment #14)

> (In reply to Ray Satiro from comment #11)
> > I don't know if I am experiencing this bug but I have an e-mail with words in the subject Electrolux 5303918344, and if I search 5303918344 the email is not found. OTOH if I search Electrolux the e-mail is found. I'm using Thunderbird 68.8.0.
> 
> I created a draft with 5303918344 in the body and another with 5303918345 in the subject - both are found.

It might be something specific to the e-mail but I don't know what it could be. Both words are part of link text and not just regular text, see attached screenshot. I tried in Thunderbird 115.8.1 (64-bit) and now the e-mail does not show in the search results when I search for either word. Other e-mails with the same search terms show.

When I search specifically the local folder that contains the e-mail for body content then either search term will show the e-mail.

I deleted global-messages-db.sqlite (about 70MB), started Thunderbird, waited for the database to be rebuilt (watched Activity Monitor and checked that it rebuilt to the same size) and then searched again from the top box that searches everything. The results are the same, the e-mail is not returned in the search results.
