gloda tokenizer could probably do a better job of indexing numbers

Status: NEW (Unassigned)
Reported: 9 years ago
Modified: 7 months ago
Reporter: asuth
Depends on: 1 bug; Blocks: 1 bug
Bug Flags: wanted-thunderbird +
Firefox Tracking Flags: (Not tracked)
Whiteboard: [gloda key][tokenizer key]

(Reporter)

Description

9 years ago
Here's the deal on how this works right now:

- The copy stemmer keeps at most the first 3 and last 3 characters of a string involving digits.  So "123456789" is emitted as "123789".  As long as the user queries on "123456789" they will get what they expect, but they will also get any other results that look like "123*789".
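The truncation described above can be sketched as follows. This is an illustration only, not the actual stemmer (which lives in the fts3 tokenizer's C code); the function name and the `keep` parameter are made up for this sketch.

```python
def copy_stem(token: str, keep: int = 3) -> str:
    """Collapse a long digit-bearing token to its first `keep` and
    last `keep` characters, as the copy stemmer is described to do."""
    if len(token) <= 2 * keep:
        return token
    return token[:keep] + token[-keep:]

# "123456789" collapses to "123789" -- and so does "123000789",
# which is exactly the collision problem described above.
```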

- We stop on punctuation.  So "100,000,000" gets emitted as "100", "000", "000".  Also, "1-555-555-5555" gets emitted as "1", "555", "555", "5555".  While phone number detection can be handled at a higher level, unless we start forcibly intercepting queries and shunting them to higher-level searches (rather than hinting with autocomplete), this can still result in some ridiculously expensive queries.
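The punctuation-splitting behavior amounts to something like the following sketch (hypothetical code, not the real tokenizer):

```python
import re

def naive_tokens(text: str):
    """Split on anything that is not an ASCII letter or digit,
    mirroring the 'stop on punctuation' behavior described."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]
```

Note how a phone number fans out into four short, high-frequency tokens, which is what makes the resulting queries expensive.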

The most likely solution to this problem would be to do the following:

- Increase the copy stemmer constant so that it at least covers a full phone number with a 4-digit extension.

- Further specialize the state machine so that it distinguishes between ASCII digits and ASCII letters and treats limited punctuation between digits as invisible.  For example, ",.-" would probably be good candidates.
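A minimal sketch of the second proposal, expressed as a pre-pass rather than a state machine (the real fix would go in the tokenizer's state machine itself; this is illustrative only):

```python
import re

def digit_aware_tokens(text: str):
    """Delete ',', '.', '-' when they sit between two digits,
    then split on remaining punctuation as before."""
    joined = re.sub(r"(?<=\d)[,.\-](?=\d)", "", text)
    return [t for t in re.split(r"[^0-9A-Za-z]+", joined) if t]
```

Under this rule "1,122.00" becomes the single token "112200" and a phone number becomes one long digit run instead of four short ones.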

I think the driving motivation here would be to constrain the potential set of matches rather than guarantee the user gets the right match.  For example, apart from the ambiguity between US (1,000) and European (1.000) style delimiters, the user probably wouldn't care about the number of cents involved if they were searching for a dollar amount.  However, from a search space perspective, interpreting "1,122.00" as "1122" and "00" is going to completely flood us with bogus search results from the "00" token.  Interpreting it as "112200" is much saner, and could still turn out okay for the user if promoted to a wildcard search.
(Reporter)

Updated

9 years ago
Whiteboard: [gloda key][tokenizer key]
(Reporter)

Updated

8 years ago
Duplicate of this bug: 563780
(Reporter)

Comment 2

8 years ago
The thing I duped in was a request for finding version numbers like "3.6.4".  Same general problem, although in that case we likely do not want to elide the punctuation.
I was searching today for my email announcement of Thunderbird 3.1.7 using the strings "3.1.7" and "3 1 7".  Neither search returned any results, even though I've located emails (using gmail) with that string.

I think it makes sense to index "3.1.7" as a word and make it searchable.
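One way to index "3.1.7" as a single word, sketched as a standalone token pattern (hypothetical; the real change would go in the fts3 tokenizer): recognize dotted digit runs before falling back to plain alphanumeric runs.

```python
import re

# Dotted digit runs like "3.1.7" win over the plain alphanumeric
# fallback because alternation tries them first.
TOKEN_RE = re.compile(r"\d+(?:\.\d+)+|[0-9A-Za-z]+")

def version_aware_tokens(text: str):
    """Emit version-like numbers as single tokens instead of
    splitting (or eliding) on the dots."""
    return TOKEN_RE.findall(text)
```

With this rule a query for "3.1.7" can match exactly, instead of degenerating into three one-digit tokens.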
Flags: wanted-thunderbird+

Updated

7 years ago
Blocks: 541349

Comment 4

7 years ago
(exactly what I mentioned to protz on IRC yesterday)

Protz, will this be helped by Bug 681754 (gloda fts3 tokenizer would greatly benefit from stopword support)?

And how is this related to Bug 549594 (GlodaMsgSearcher needs to avoid generating clauses that the tokenizer will eat)?


(In reply to John Hopkins (:jhopkins) from comment #3)
> I was searching today for my email announcement of Thunderbird 3.1.7 using
> the string "3.1.7" and "3 1 7".  Neither search returned any results, even
> though I've located emails (using gmail) with that string.
> 
> I think it makes sense to index "3.1.7" as a word and make it searchable.
This bug is specifically about improving the tokenizer so that it emits better tokens when indexing numbers. However, if we are to take any action on this bug or bug 681754, then we'd better make sure we fix bug 549594 first; otherwise the situation will become a real mess.

The problem you're having is that the part of gloda that builds the query doesn't behave exactly like the tokenizer, so gloda assumes the search terms it passes to the SQLite search are valid tokens when in fact they are not, which means there's no chance they'll yield any results.

To make sure Gloda only issues valid search terms, we need to run the query through the tokenizer first, and this is bug 549594.
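The mismatch and the proposed fix can be shown with a toy model (all names here are invented for illustration; the naive splitter stands in for the real tokenizer):

```python
import re

def tokenize(text: str):
    """Stand-in for the indexer's tokenizer: split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]

def index_terms(body: str):
    return set(tokenize(body))

def search_raw(index, raw_term: str) -> bool:
    # Current behavior: the raw query term is compared against the index,
    # but "3.1.7" was never indexed as a token, so it can never match.
    return raw_term in index

def search_via_tokenizer(index, raw_term: str) -> bool:
    # Bug 549594's fix: run the query through the same tokenizer first,
    # so only terms the indexer could have produced are issued.
    return all(t in index for t in tokenize(raw_term))

idx = index_terms("announcing Thunderbird 3.1.7 today")
```

The raw search for "3.1.7" finds nothing, while the tokenizer-normalized search matches via the "3", "1", "7" tokens that were actually indexed.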

I'm updating the dependencies to reflect my comment.
Depends on: 549594

Updated

5 years ago
Duplicate of this bug: 956651

Updated

4 years ago
Duplicate of this bug: 1067960

Updated

3 years ago
Duplicate of this bug: 1210714

Updated

a year ago
Duplicate of this bug: 1350540

Updated

7 months ago
Duplicate of this bug: 834621

Updated

7 months ago
Depends on: 752844

Updated

7 months ago
No longer depends on: 752844