gloda tokenizer could probably do a better job of indexing numbers

Status: NEW (Unassigned)
Reported: 9 years ago
Modified: 7 months ago
Reporter: asuth
Depends on: 1 bug; Blocks: 1 bug
Bug Flags: wanted-thunderbird +
Firefox Tracking Flags: (Not tracked)
Whiteboard: [gloda key][tokenizer key]

(Reporter)

Description

9 years ago
Here's the deal on how this works right now:

- The copy stemmer keeps at most the first 3 and last 3 characters of a string involving digits.  So "123456789" is emitted as "123789".  As long as the user queries on "123456789" they will get what they expect, but they will also get any other results that look like "123*789".
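The truncation described above can be sketched as follows. This is an illustration only, not the actual stemmer (which lives in the fts3 tokenizer's C code); the function name and the `keep` parameter are made up for this sketch.

```python
def copy_stem(token: str, keep: int = 3) -> str:
    """Collapse a long digit-bearing token to its first `keep` and
    last `keep` characters, as the copy stemmer is described to do."""
    if len(token) <= 2 * keep:
        return token
    return token[:keep] + token[-keep:]

# "123456789" collapses to "123789" -- and so does "123000789",
# which is exactly the collision problem described above.
```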

- We stop on punctuation.  So "100,000,000" gets emitted as "100", "000", "000".  Also, "1-555-555-5555" gets emitted as "1", "555", "555", "5555".  While phone number detection can be handled at a higher level, unless we start forcibly intercepting queries and shunting them to higher-level searches (rather than hinting with autocomplete), this can still result in some ridiculously expensive queries.
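The punctuation-splitting behavior amounts to something like the following sketch (hypothetical code, not the real tokenizer):

```python
import re

def naive_tokens(text: str):
    """Split on anything that is not an ASCII letter or digit,
    mirroring the 'stop on punctuation' behavior described."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]
```

Note how a phone number fans out into four short, high-frequency tokens, which is what makes the resulting queries expensive.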

The most likely solution to this problem would be to do the following:

- Increase the copy stemmer constant so that it at least covers a full phone number with a 4-digit extension.

- Further specialize the state machine so that it distinguishes between ASCII digits and ASCII letters and treats limited punctuation between digits as invisible.  For example, ",.-" would probably be good candidates.
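A minimal sketch of the second proposal, expressed as a pre-pass rather than a state machine (the real fix would go in the tokenizer's state machine itself; this is illustrative only):

```python
import re

def digit_aware_tokens(text: str):
    """Delete ',', '.', '-' when they sit between two digits,
    then split on remaining punctuation as before."""
    joined = re.sub(r"(?<=\d)[,.\-](?=\d)", "", text)
    return [t for t in re.split(r"[^0-9A-Za-z]+", joined) if t]
```

Under this rule "1,122.00" becomes the single token "112200" and a phone number becomes one long digit run instead of four short ones.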

I think the driving motivation here would be to constrain the potential set of matches rather than guarantee the user gets the right match.  For example, apart from the ambiguity between US (1,000) and European (1.000) style delimiters, the user probably wouldn't care about the number of cents involved if they were searching for a dollar amount.  However, from a search space perspective, interpreting "1,122.00" as "1122" and "00" is going to completely flood us with bogus search results from the "00" token.  Interpreting it as "112200" is much saner, and could still turn out okay for the user if promoted to a wildcard search.
(Reporter)

Updated

9 years ago
Whiteboard: [gloda key][tokenizer key]
(Reporter)

Updated

8 years ago
Duplicate of this bug: 563780
(Reporter)

Comment 2

8 years ago
The thing I duped in was a request for finding version numbers like "3.6.4".  Same general problem, although in that case we likely do not want to elide the punctuation.
I was searching today for my email announcement of Thunderbird 3.1.7 using the strings "3.1.7" and "3 1 7".  Neither search returned any results, even though I've located emails (using gmail) with that string.

I think it makes sense to index "3.1.7" as a word and make it searchable.
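One way to index "3.1.7" as a single word, sketched as a standalone token pattern (hypothetical; the real change would go in the fts3 tokenizer): recognize dotted digit runs before falling back to plain alphanumeric runs.

```python
import re

# Dotted digit runs like "3.1.7" win over the plain alphanumeric
# fallback because alternation tries them first.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)+|[0-9A-Za-z]+")

def version_aware_tokens(text: str):
    """Emit version-like numbers as single tokens instead of
    splitting (or eliding) on the dots."""
    return TOKEN_RE.findall(text)
```

With this rule a query for "3.1.7" can match exactly, instead of degenerating into three one-digit tokens.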
Flags: wanted-thunderbird+

Updated

7 years ago
Blocks: 541349

Comment 4

7 years ago
(exactly what I mentioned to protz on IRC yesterday)

Protz, will this be helped by Bug 681754 (gloda fts3 tokenizer would greatly benefit from stopword support)?

And how is this related to Bug 549594 (GlodaMsgSearcher needs to avoid generating clauses that the tokenizer will eat)?


(In reply to John Hopkins (:jhopkins) from comment #3)
> I was searching today for my email announcement of Thunderbird 3.1.7 using
> the string "3.1.7" and "3 1 7".  Neither search returned any results, even
> though I've located emails (using gmail) with that string.
> 
> I think it makes sense to index "3.1.7" as a word and make it searchable.
This bug is specifically about improving the tokenizer so that it emits better tokens when indexing numbers. However, if we are to take any action on this bug or bug 681754, then we'd better make sure we fix bug 549594 first; otherwise the situation will become a real mess.

The problem you're having is that the part of gloda that builds the query doesn't behave exactly like the tokenizer, so gloda assumes the search terms it passes to the SQLite search are valid tokens when in fact they are not, which means there's no chance they'll yield any results.

To make sure Gloda only issues valid search terms, we need to run the query through the tokenizer first, and this is bug 549594.
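The mismatch and the proposed fix can be shown with a toy model (all names here are invented for illustration; the naive splitter stands in for the real tokenizer):

```python
import re

def tokenize(text: str):
    """Stand-in for the indexer's tokenizer: split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def index_terms(body: str):
    return set(tokenize(body))

def search_raw(index, raw_term: str) -> bool:
    # Current behavior: the raw query term is compared against the index,
    # but "3.1.7" was never indexed as a token, so it can never match.
    return raw_term in index

def search_via_tokenizer(index, raw_term: str) -> bool:
    # Bug 549594's fix: run the query through the same tokenizer first,
    # so only terms the indexer could have produced are issued.
    return all(t in index for t in tokenize(raw_term))

idx = index_terms("announcing Thunderbird 3.1.7 today")
```

The raw search for "3.1.7" finds nothing, while the tokenizer-normalized search matches via the "3", "1", "7" tokens that were actually indexed.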

I'm updating the dependencies to reflect my comment.
Depends on: 549594

Updated

5 years ago
Duplicate of this bug: 956651

Updated

4 years ago
Duplicate of this bug: 1067960

Updated

3 years ago
Duplicate of this bug: 1210714

Updated

a year ago
Duplicate of this bug: 1350540

Updated

7 months ago
Duplicate of this bug: 834621

Updated

7 months ago
Depends on: 752844

Updated

7 months ago
No longer depends on: 752844