Closed Bug 529824 Opened 16 years ago Closed 16 years ago

tokenization bug breaks global search query strings ending in non-ASCII characters

Categories

(Thunderbird :: Search, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(blocking-thunderbird3.0 .1+, thunderbird3.0 .1-fixed)

VERIFIED FIXED
Thunderbird 3.1a1
Tracking Status
blocking-thunderbird3.0 --- .1+
thunderbird3.0 --- .1-fixed

People

(Reporter: chrodos, Assigned: asuth)

References

Details

(Keywords: testcase)

Attachments

(5 files, 1 obsolete file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Build Identifier: version 2.0.0.23 (20090812)

When I try to search for a Greek word using the search input field on the toolbar I get no results. This happens when I select the Entire Message option. When I search using the subject option or any other option I get the expected results. The search functionality works just fine when I search for an English word.

Reproducible: Always

Steps to Reproduce:
1. Select the Entire Message option from the search input box on the toolbar
2. Type a Greek word that exists in any of your emails
3. You will not get any results
Can you post an email as an attachment that should have been matched?
(In reply to comment #2)
> Created an attachment (id=413332) [details]
> An email for test purposes

Try to search for the word "δοκιμαστικό". It is in the message body.
This seems to be true for TB 2.x. In TB 3.x this problem is fixed by bug #472764. I think this bug should be closed as WONTFIX: 2.x is a security-fix and maintenance branch, so it's very unlikely that whatever fixed this bug ends up backported to the 2.x branch.
Wontfix, as it works in 3.0 and 2.x is a stability and security release branch. 3.0rc1 is coming out next week, so it won't be long before it works.
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
Citing bug 472764 unfortunately doesn't mean it works. :) I tried earlier this morning, and again just now. The sample message and the bugmail message are not found with either the entire message filter or the message body filter.
Status: RESOLVED → UNCONFIRMED
Resolution: WONTFIX → ---
i'm using rc1 build 2 en-us. the messages I expected to be found all have gloda ids
Status: UNCONFIRMED → NEW
Ever confirmed: true
Version: unspecified → 3.0
(In reply to comment #7)
> i'm using rc1 build 2 en-us.
> the messages I expected to be found all have gloda ids

This is a duplicate of the "can't find mixed case in non-English" bug, as the word starts with a capital letter.
Christos word "δοκιμαστικό" has capital letter?
(In reply to comment #10)
> Christos word "δοκιμαστικό" has capital letter?

No, the word "δοκιμαστικο" is all lower-case letters; "Δοκιμαστικο" is with the first letter capitalized.
When I change the font encoding to iso-8859-7 it works. I think it has to do with the encoding.
Using the 'search all messages' option in the Gloda search bar, entering "bakalářské práce" (without the quotes) yields no result, though it should: this exact phrase is part of the subject. If I change the character encoding from the default ISO-8859-2 in View -> Character encoding to something else and re-run the search, the outcome is the same: no results.
Keywords: testcase
I've created a simple test case that demonstrates the problem well. How to recreate it:
1) Compose a new mail in Central European (ISO-8859-2) encoding consisting of word pairs: one word with diacritics (non-ASCII characters) and the other one without.
2) Send the mail to yourself.
3) Perform a Gloda search (option: search all messages) on these words, one at a time.
4) Words with diacritics won't be found; words without them (only standard ASCII characters) will be.
One more note: the above example is not so clear-cut. The word "bludička" (a Czech word) is also found by Gloda. What's going on? I'll try other words and character combinations over the weekend to find out. Please tell me if you need any other piece of information to move this bug forward! Gloda really rocks on ASCII words, but if it misses national vocabulary it's quite useless...
Attached file: Czech Wikipedia article - as HTML (obsolete)
Attachment #417248 - Attachment is obsolete: true
Flags: blocking-thunderbird3.1?
Blocks: 534080
I've found out that with UTF-8 encoded e-mails the success rate when searching for non-ASCII words in them is a bit higher, about 50%. Since it was quite difficult to conclusively tell when the Gloda search would find a word or not, I've prepared an e-mail created from an article on Czech Wikipedia: http://cs.wikipedia.org/wiki/Tom%C3%A1%C5%A1_Garrigue_Masaryk . I've simply copied part of that article into the message body using "paste without formatting" and sent that e-mail to myself.

When performing a Gloda search for any words containing just standard characters, the above e-mail is found every time. When searching for words with Czech diacritics, the results were as follows (some examples I've tried):

Words found:
Slovácko
"Tomáš Jan"
krátce
bouřlivém
revolučním
moravském
Slovácku
Hanačka
národnosti

Words not found:
Moravské
rodině
zaměstnanců
žijící
"Tomáš
Tomáš
Kropáčková
německé
negramotný

I'd add that I'm using an English Windows XP SP3 system with Czech regional and language settings.

Are any of the developers able to replicate this problem? Or am I speaking just to myself? If the problem is as severe for non-English users of Gloda as I think it is, then it's a bit perplexing that everyone is so quiet...

P.S.: I suggest renaming this bug so it more closely reflects the nature of the problem (i.e. it's not just Greek words and encoding).
Attachment #417247 - Attachment mime type: message/rfc822 → text/plain
Thank you for all the examples!

What input method are you using to search for the messages? Are you:
a) Just copying the text from the displayed e-mail into the search box?
b) Typing the characters in, using a sequence of keypresses for the diacritics? (For example, in some word processors, if I am writing in Spanish and I want an "e" with an accent, I can type a single quote and then an e and I get an accented e.)
c) Typing the characters in with a single (potentially shifted) keypress?

My concern is that the input method may result in two distinct unicode characters that look like a single unicode character visually but are not identical byte-wise to what is in the message.

For example, the byte sequence as you pasted "Moravské" above is identical to the message you attached and should be found by the search logic. If you were to copy and paste it from bugzilla, I would expect the gloda search would find the message...

So I guess that raises the question of whether the "Words not found" from above are pasted from what you typed into the search box or from the document itself. The search box is the preferable choice, although I could see various pieces of software trying to normalize that.

In terms of re-titling, we have existing bugs on our lack of case-folding and whatnot, but I'm not sure where they are right now, so it's not super critical, but feel free... this bug is primarily invaluable in terms of all the examples it provides. (I want to add the examples to our unit tests but need to better understand why things are failing for you in cases where they should not fail.)
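The input-method concern in the comment above can be illustrated with a minimal Python sketch. It is not Thunderbird or gloda code; the variable and function names are made up for illustration. It shows that a precomposed "é" and an "e" followed by a combining acute accent look the same but differ byte-wise, and that Unicode normalization (here NFC) makes them compare equal.

```python
import unicodedata

# "é" as one code point (U+00E9), as a single-keypress layout would likely produce it.
precomposed = "\u00e9"
# "e" followed by COMBINING ACUTE ACCENT (U+0301), as a dead-key sequence might produce it.
decomposed = "e\u0301"


def same_after_nfc(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two strings render alike but are different byte sequences in UTF-8.
precomposed_bytes = precomposed.encode("utf-8")   # 2 bytes
decomposed_bytes = decomposed.encode("utf-8")     # 3 bytes
```

A byte-wise comparison of the two forms fails, while `same_after_nfc(precomposed, decomposed)` succeeds, which is why a search that compares raw bytes could in principle miss text typed with a different input method.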
Andrew, I see what you are driving at. I've tried to search again for the words above, now typed into the Gloda bar in different ways:
1) copy & paste the word from the Wikipedia article into the search bar
2) copy & paste the word from my Bugzilla Comment #19
3) copy & paste the word from the body of the e-mail
4) manually typed in using the special diacritics character keys 'ěščřžýáíé' under keys F1-F12 (single key press)
5) manually typed in using only non-diacritics keys combined with the '´' or 'ˇ' symbol next to the backspace key (combination of two key presses, sometimes with shift).

Unfortunately, none of the input methods described above performed differently. I tried all of the words listed in my Comment 19.
Karel,

Thank you for trying and documenting the many different methods; it is very useful information to me. I will add a representative subset of your examples to our unit tests, see what happens, and try to chase things down from there. Unfortunately it will likely be approximately a week before I am able to dedicate effort to investigating the problem more thoroughly (including modifying the unit tests).

I'm not sure what your level of development experience is, but the unit test we currently use for checking these things is:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/test/unit/test_intl.js

The unit test logic has a deficiency where test_intl_fulltextsearch assumes there is only ever one phrase in intlPhrases... if one were to modify the test without modifying the logic, one would want to replace the existing phrase (and search phrases in intlSearchPhrases.)
Re-prioritized for 3.0.1 and did the tests. Turns out there is an off-by-one error in the multi-byte UTF-8 case when it is the last character in the string being tokenized.

In plain English, if the last character in a string being tokenized is non-ASCII, we will lop a byte off. Since query strings are really short strings that get tokenized, they are highly vulnerable. In contrast, body strings are less vulnerable, especially since they can have punctuation or whitespace shielding them. This leads to a difference in tokenization between the two cases and a resulting inability to find things.

This makes a lot of sense when you look at the words not found in comment 19: they all end in multi-byte UTF-8 encoded characters.

In terms of needing to blow away the gloda database, there should be no need for the body, as noted above. Unfortunately, for the author and recipient cases there is a greater chance of the bug having wormed its way into being stored on disk, resulting in an inconsistency between that disk representation and the query. I think we have to accept this for now and make sure that we tell people for whom it is a huge issue that they should blow away their gloda database. Our autocompletion on contacts should provide an extra mitigating factor.

I expect the next major change to tokenization will provide for better case/accent-folding for non-ASCII characters, in which case we might blow away the database based on the user's locale, or at least consider it.
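The failure mode described in the comment above can be modeled with a small Python sketch. This is not the actual fts3_porter code and does not reproduce the real patch; `tokenize`, its `buggy` flag, `found`, and the sample body are hypothetical names and data chosen for illustration. The sketch only assumes the symptom as stated: when the final character of the input is multi-byte UTF-8, the last byte of the final token is lost. It does no case folding or stemming.

```python
def is_delimiter(byte):
    """ASCII bytes that are not letters or digits separate tokens."""
    return byte < 0x80 and not chr(byte).isalnum()


def utf8_char_length(lead):
    """Length in bytes of the UTF-8 character whose first byte is `lead`."""
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    return 1  # ASCII (or a stray continuation byte)


def tokenize(text, buggy=False):
    """Split UTF-8 text into byte tokens; `buggy=True` models the reported symptom."""
    data = text.encode("utf-8")
    n = len(data)
    tokens = []
    i = 0
    while i < n:
        while i < n and is_delimiter(data[i]):
            i += 1  # skip whitespace and punctuation
        start = i
        while i < n and not is_delimiter(data[i]):
            i += utf8_char_length(data[i])  # advance one whole character
        end = i
        # Modeled off-by-one: a multi-byte character at the very end of the
        # input loses its final byte.
        if buggy and end == n and end > start and data[n - 1] >= 0x80:
            end -= 1
        if end > start:
            tokens.append(data[start:end])
    return tokens


def found(query, body, buggy):
    """True if every query token also occurs among the body tokens."""
    body_tokens = set(tokenize(body, buggy))
    return all(tok in body_tokens for tok in tokenize(query, buggy))


# The body ends in punctuation, so its last word is "shielded" from the bug.
body = "Moravské Slovácko, německé rodině."
```

With this model, `found("Slovácko", body, buggy=True)` succeeds because the query ends in an ASCII letter, while `found("Moravské", body, buggy=True)` fails because the query token is cut to `b"Moravsk\xc3"` and no longer matches the intact token stored for the body. With `buggy=False` both queries match.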
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Attachment #418228 - Flags: superreview?(bienvenu)
Attachment #418228 - Flags: review?(bienvenu)
blocking-thunderbird3.0: --- → ?
Summary: Cannot search for a Greek word in the Entire message option → tokenization bug breaks global search query strings ending in non-ASCII characters
Comment on attachment 418228 [details] [diff] [review]
v1 do not have an off-by-one bug for utf8, do test more.

One nit:

- * Test that i18n goes through das pipes acceptably. Currently this means:
- * - Subject, Body, and Attachment names are properly indexed.

Since you've dropped the "Currently this means:" part of the comment, the "- Check" and "- That" look a bit odd.

Is this something we want to get back into the core sqlite code, i.e., did the fts3_porter code come from there?
Attachment #418228 - Flags: superreview?(bienvenu)
Attachment #418228 - Flags: superreview+
Attachment #418228 - Flags: review?(bienvenu)
Attachment #418228 - Flags: review+
(In reply to comment #25)
> Is this something we want to get back into the core sqlite code, i.e., did the
> fts3_porter code come from there?

The bug is our own and was introduced with the introduction of our CJK support. The stock sqlite3 fts3_porter code is blissfully ignorant of all things UTF8 and just happens to handle it thanks to UTF8 not ever introducing fake nul characters and never exposing its tokenized output when it manipulates the UTF8 string in blatantly incorrect ways.

pushed to trunk with nit addressed:
http://hg.mozilla.org/comm-central/rev/eb656ca0b708
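The remark above about UTF8 never introducing fake nul characters follows from the encoding itself: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of them can be mistaken for NUL or any other ASCII byte. A minimal Python check, using words from this bug as sample data (the function name is hypothetical):

```python
def multibyte_bytes_all_high(text):
    """True if every byte of every multi-byte UTF-8 character in `text` is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "δοκιμαστικό bakalářské práce"
sample_bytes = sample.encode("utf-8")
```

For this sample, `multibyte_bytes_all_high(sample)` holds and `sample_bytes` contains no zero byte, which is why byte-oriented C string code that treats bytes >= 0x80 as opaque token characters can pass UTF-8 through unharmed.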
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.1a1
Attachment #418228 - Flags: approval-thunderbird3.0.1?
No longer blocks: 534080
Attachment #418228 - Flags: approval-thunderbird3.0.1? → approval-thunderbird3.0.1+
blocking-thunderbird3.0: ? → .1+
Flags: blocking-thunderbird3.1? → blocking-thunderbird3.1+
V. Fixed with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.7) Gecko/20100107 Shredder/3.0.1pre
Status: RESOLVED → VERIFIED