Closed Bug 1559811 Opened 6 years ago Closed 6 years ago

Tokenisation used by Gloda will find "way" or "18" when looking for "2-way" or "B18".

Categories

(Thunderbird :: Search, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: evichlaettstar, Unassigned)

Details

Attachments

(10 files, 2 obsolete files)

User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Steps to reproduce:

  1. I search for the term '2-way'.
  2. I search for the term 'B18'.

Actual results:

  1. Search result shows all e-mails containing the term 'way' and the e-mails containing the term '2-way'.
  2. Search result shows all e-mails containing the term '18' and the e-mails containing the term 'B18'.

Expected results:

  1. Search result should show only e-mails containing the term '2-way'.
  2. Search result should show only e-mails containing the term 'B18'.

That's the global search, right, the one from the box with <Ctrl+K>? Global search is not literal, it tokenises, also for verb forms, so if you look for "look" you'll find "looked", "looks", "looking", etc.

(In reply to Jorg K (GMT+2) from comment #1)

That's the global search, right, the one from the box with <Ctrl+K>? Global search is not literal, it tokenises, also for verb forms, so if you look for "look" you'll find "looked", "looks", "looking", etc.

The problem described by me concerns all searches in TB.
Tokenising does not concern the delineated problem.
The problem rather seems to be that the search engine truncates the search term if the search term contains letters and numbers.

Let's see whether 2-way and B18 are found when doing a non-global "regular" body search.

As I said, "regular" search, right-click on folder, "Search Messages...", same as Ctrl+Shift+F or using the quick filter bar (QFB) does not tokenise. Looking for "2-way" or "B18" in the body, I only find the message generated from the previous comment, not messages only containing "way" or "18".

I think I'll close this bug since it works as designed.

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Summary: No search filtering of search terms consisting of letters and numbers. → Tokenisation used by Gloda will find "way" or "18" when looking for "2-way" or "B18".

(In reply to Jorg K (GMT+2) from comment #4)

As I said, "regular" search, right-click on folder, "Search Messages...", same as Ctrl+Shift+F or using the quick filter bar (QFB) does not tokenise. Looking for "2-way" or "B18" in the body, I only find the message generated from the previous comment, not messages only containing "way" or "18".

I think I'll close this bug since it works as designed.

You are right if you try to reproduce it with a few simple test e-mails. However, the problem does exist. I am now trying to find out the constraints, maybe a minimum number of e-mails within a folder, or characters per e-mail, or something else.

btw:
I already rebuilt the global-messages-db.sqlite. Problem still exists.

Please understand that TB has TWO search methods:

  1. "regular" search, right-click on folder, "Search Messages...", same as Ctrl+Shift+F or using the quick filter bar (QFB)
  2. Global search, the so-called Gloda.

"Regular" search doesn't tokenise and will match strings exactly, so looking for "B18" will NOT find "18".

Gloda search, which uses global-messages-db.sqlite, does tokenise. You can rebuild the database as many times as you like, that won't change how it works. "B18" will be tokenised somehow and will find "18". One reason why we do it that way is given in comment #1. Further reading:
https://developer.mozilla.org/en-US/docs/Mozilla/Thunderbird/gloda#Full-text_search
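
A minimal Python sketch may help illustrate the difference between the two search methods. It is not Gloda's actual tokeniser: `exact_match`, `toy_tokenise` and `tokenised_match` are hypothetical names, the split into runs of letters and runs of digits is an assumption, and so is the rule that a single matching token counts as a hit. Case handling in the exact match is left out.

```python
import re


def exact_match(term, body):
    """'Regular' search as described above: plain substring match, no tokenising."""
    return term in body


def toy_tokenise(text):
    """Hypothetical tokeniser: lower-case, then split into runs of letters or digits."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())


def tokenised_match(term, body):
    """Assumed rule: a hit if any token of the search term is a token of the body."""
    body_tokens = set(toy_tokenise(body))
    return any(tok in body_tokens for tok in toy_tokenise(term))


if __name__ == "__main__":
    print(toy_tokenise("2-way"))                       # tokens of the search term
    print(exact_match("2-way", "the only way"))        # substring search
    print(tokenised_match("2-way", "the only way"))    # token search
```

Under these assumptions the substring search rejects a body that contains only "way", while the token search accepts it because "2-way" has been split into "2" and "way". A substring search for "way" also hits inside longer words, which is why an address such as "parkwaycc" can match.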

https://developer.mozilla.org/en-US/docs/Mozilla/Thunderbird/gloda#Full-text_search
Thank you for the link. Unfortunately I don't really understand this specialist material. There are too many technical terms that I cannot translate into my native language, German.

If I'm looking for "B18",
Ctrl+Shift+F or QFB finds "18" in my test folder, but it finds only "B18" in the sent folder. Strange.
Now I copied all e-mails from test folder to test folder 2 and test folder 3. It finds only "B18". Same e-mails are in all four folders. That is strange.
Ctrl+K finds only "B18" but only in the sent folder, not in the three test folders. That is strange too.

If I'm looking for "2-way",
Ctrl+Shift+F or QFB finds "way". That's bad.
Ctrl+K finds "way". That's bad.

Attached image 2-way.png

Quick search for 2-way only finds 4 messages from this bug.

Attached image way.png

Looking for "way" finds many more. So I don't understand why this is behaving differently from yours.

(In reply to Jorg K (GMT+2) from comment #8)

Quick search for 2-way only finds 4 messages from this bug.

Maybe you find only 4 messages because the other messages contain only the word "way". If the other messages contained "way" and, somewhere else, "2", you would get more hits.

OK, my bugmail folder has 129252 messages. Looking for "2-way" in the body finds six, looking for "way" finds 15549, also matches on "neil@parkwaycc.co.uk", and looking for "2 way" (with a space) finds 14776 messages.

So in your theory "2-way" should have found those 14776 messages?

Just for fun, since I'm one of the developers here, I added some debug here where the search string is matched against the body's content:
https://searchfox.org/comm-central/rev/630f951ef8efd45af34ef07382851a4ab3184d6c/mailnews/base/search/src/nsMsgSearchTerm.cpp#1002

Using QFB, I see "2-way" being compared. If I enter "2 way" with the quotes into the search box, it's actually looking for "2 way" and not "2" and "way" separately. Just as an aside, did you know that "2|way" works for "2" or "way"?

"Search Messages", Ctrl+Shift+F, has some other input rules, there you don't need to quote the string to get an exact match.
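
The QFB input rules mentioned above can be sketched roughly as follows. This is an illustration in Python, not the Thunderbird code: `parse_query` and `matches` are hypothetical names, and treating unquoted words separated by spaces as terms that must all be present is an assumption. A "|" inside quotes is not handled.

```python
def parse_query(query):
    """Return a list of alternatives; each is a list of terms that must all match."""
    alternatives = []
    for part in query.split("|"):          # "|" separates alternatives (OR)
        part = part.strip()
        if not part:
            continue                       # ignore empty alternatives
        if len(part) >= 2 and part[0] == '"' and part[-1] == '"':
            alternatives.append([part[1:-1]])   # quoted: one exact phrase
        else:
            alternatives.append(part.split())   # unquoted: assumed separate terms
    return alternatives


def matches(query, body):
    """A body matches if, for at least one alternative, every term is a substring."""
    return any(all(term in body for term in terms)
               for terms in parse_query(query))


if __name__ == "__main__":
    for q in ['2-way', '"2 way"', '2 way', '2|way']:
        print(q, parse_query(q))
```

With this sketch, "2-way" stays a single term because it contains neither a space nor "|", so a body containing only "way" does not match it.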

So I really don't know how your statement from comment #7

Ctrl+Shift+F or QFB finds "18" in my test folder, but it finds only "B18" in the sent folder. Strange.
Ctrl+Shift+F or QFB finds "way". That's bad. [when looking for 2-way]
can be true.

I'm happy for you to attach your test folder, maybe zipped up, then I can take a further look.

Attached file Testordner.zip (obsolete) —

Test folder containing plain text of obviously more than 6 e-mails. But TB shows only 6 e-mails.

Attached file Testordner 3.zip (obsolete) —

Test folder containing 6 e-mails.

(In reply to Jorg K (GMT+2) from comment #11)

So in your theory "2-way" should have found those 14776 messages?

Yes, because that is the behaviour of my TB - just in some folders, as I presume now.

Using QFB, I see "2-way" being compared, if I enter "2 way" with the quotes into the search box, it's actually looking for "2 way" and not "2" and "way" separately.

But in some of my folders it's looking for "2" and "way" separately.

did you know that "2|way" works for "2" or "way".

This is new for me, but does not solve my problem.

"Search Messages", Ctrl+Shift+F, has some other input rules, there you don't need to quote the string to get an exact match.

So I really don't know how your statement from comment #7

Ctrl+Shift+F or QFB finds "18" in my test folder, but it finds only "B18" in the sent folder. Strange.
Ctrl+Shift+F or QFB finds "way". That's bad. [when looking for 2-way]
can be true.

I'm happy for you to attach your test folder, maybe zipped up, then I can take a further look.

I attached 2 zipped folders "testordner.zip" and "testordner 3.zip", wherein TB shows me the same 6 e-mails. But both zip archives have different sizes. I looked in the plain text of archive "testordner.zip" and see the text of more than 6 e-mails. What's going on in my TB 60.7.1?

You may want to repair those folders, right-click, Properties, "Repair Folder".

Testordner has 19 messages, "B18" isn't found, "18" is found various times, "2-way" is found in one message. Testordner_3 has three messages, "B18" is found once.

Overall, I can't see any problem.

Attached file Test Folder A.zip

Test folder A for search filtering

Attachment #9072707 - Attachment is obsolete: true
Attachment #9072708 - Attachment is obsolete: true
Attached file Test Folder B.zip

Test folder B for search filtering

(In reply to Jorg K (GMT+2) from comment #15)

You may want to repair those folders, right-click, Properties, "Repair Folder".

Testordner has 19 messages, "B18" isn't found, "18" is found various times, "2-way" is found in one message. Testordner_3 has three messages, "B18" is found once.

Overall, I can't see any problem.

My apologies, my test setup was not correct.
Unfortunately it is now impossible to export the "Testordner" again. I tried it several times with Add-on ImportExportTools, no way. All other folders can be exported. Weird.

Next try.
I created Test Folder A and moved all incoming test e-mails into it.
I created Test Folder B and copied all e-mails from Test Folder A into it.
As you can see in my 6 screenshots, the search filtering results differ between the two test folders. Can you reproduce this behaviour?

I really don't have more time to invest in this. As I said before, most likely your folders need repair. Furthermore, I don't trust ImportExportTools; it's better just to grab the folder file off the file system.

Since you gave me the folders without the .msf file, adding them to my profile rebuilds the index, which is what a repair would do.

I see this:
Folder A has six messages, and folder B has 12 messages, they are all doubled-up.

"B18" will hit "Testmail 2", once in A and twice in B. "2-way" hits "Noch ein Test 1", again, once in A and twice in B. I think this is the desired outcome.

OK. Deleting all msf-files didn't help. So I deleted all messages in TB's local IMAP folders. Afterwards I reloaded all messages from the IMAP server. Now the search filters work as you described.

Sorry for misattributing my problem. Now I know that I apparently have to empty all my local IMAP folders from time to time in order to be able to do correct searches. Thanks for your effort and for leading me to the solution of my search filter problem.
