Closed Bug 529824 Opened 16 years ago Closed 16 years ago

tokenization bug breaks global search query strings ending in non-ASCII characters

Tracking

(blocking-thunderbird3.0 .1+, thunderbird3.0 .1-fixed)

Status:

VERIFIED FIXED

Milestone:

Thunderbird 3.1a1

Tracking Flags:

blocking-thunderbird3.0: .1+
thunderbird3.0: .1-fixed

People

(Reporter: chrodos, Assigned: asuth)

References

Details

(Keywords: testcase)

Attachments

(5 files, 1 obsolete file)

An email for test purposes (16 years ago, Christos R., 817 bytes, message/rfc822)
Sample e-mail demonstrating the problem (16 years ago, Karel Koubek, 3.24 KB, message/rfc822)
Another test case e-mail (16 years ago, Karel Koubek, 1.44 KB, message/rfc822)
Czech Wikipedia article - as plain text (16 years ago, Karel Koubek, 8.51 KB, text/plain)
Czech Wikipedia article - as HTML (16 years ago, Karel Koubek, 53.17 KB, message/rfc822)
v1 do not have an off-by-one bug for utf8, do test more. (16 years ago, Andrew Sutherland [:asuth] (he/him), 11.44 KB, patch; Bienvenu: review+, Bienvenu: superreview+, standard8: approval-thunderbird3.0.1+)

Christos R.

Reporter

Description

•

16 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Build Identifier: version 2.0.0.23 (20090812)

When I try to search for a Greek word using the search input field on the toolbar, I get no results. This happens when I select the Entire Message option. When I search using the Subject option or any other option, I get the expected results. The search works just fine when I search for an English word.

Reproducible: Always

Steps to Reproduce:
1. Select the Entire Message option from the search input box on the toolbar.
2. Type a Greek word that exists in any of your emails.
3. You will not get any results.

Joshua Cranmer [:jcranmer]

Comment 1

•

16 years ago

Can you post an email as an attachment that should have been matched?

Christos R.

Reporter

Comment 2

•

16 years ago

Attached file An email for test purposes

Christos R.

Reporter

Comment 3

•

16 years ago

(In reply to comment #2)
> Created an attachment (id=413332)
> An email for test purposes

Try searching for the word "δοκιμαστικό". It is in the message body.

[:Aureliano Buendía]

Comment 4

•

16 years ago

It seems true with TB 2.x. In TB 3.x this problem is fixed by bug #472764. I think this bug should be closed as WONTFIX: 2.x is a security-fix and maintenance branch, so it is very unlikely that the fix will be backported to it.

Ludovic Hirlimann [:Usul]

Comment 5

•

16 years ago

Wontfix, as it works in 3.0 and 2.x is a stability and security release branch. 3.0rc1 is coming out next week, so it won't be long before it works.

Status: UNCONFIRMED → RESOLVED

Closed: 16 years ago

Resolution: --- → WONTFIX

Wayne Mery (:wsmwk)

Comment 6

•

16 years ago

Citing bug 472764 unfortunately doesn't mean it works. :) I tried earlier this morning, and again just now. The sample message and the bugmail message are not found with either the entire message or the message body filter.

Status: RESOLVED → UNCONFIRMED

Resolution: WONTFIX → ---

Wayne Mery (:wsmwk)

Comment 7

•

16 years ago

I'm using rc1 build 2 en-US. The messages I expected to be found all have gloda ids.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Version: unspecified → 3.0

Ludovic Hirlimann [:Usul]

Comment 8

•

16 years ago

(In reply to comment #7)
> i'm using rc1 build 2 en-us.
> the messages I expected to be found all have gloda ids

This is a duplicate of the bug about not finding mixed-case words in non-English text, as the word starts with a capital letter.

Ludovic Hirlimann [:Usul]

Comment 9

•

16 years ago

bug 525537

[:Aureliano Buendía]

Comment 10

•

16 years ago

Christos, does the word "δοκιμαστικό" have a capital letter?

Christos R.

Reporter

Comment 11

•

16 years ago

(In reply to comment #10)
> Christos word "δοκιμαστικό" has capital letter?

No, the word "δοκιμαστικο" has only lower-case letters. "Δοκιμαστικο" has the first letter capitalized.

Christos R.

Reporter

Comment 12

•

16 years ago

When I change the font encoding to iso-8859-7 it works. I think it has to do with the encoding.

Karel Koubek

Comment 13

•

16 years ago

Attached file Sample e-mail demonstrating the problem

Karel Koubek

Comment 14

•

16 years ago

Using the 'search all messages' option in the Gloda search bar, entering "bakalářské práce" (sans the quotes) yields no results, and it should, since this exact phrase is part of the subject. If I change the character encoding from the default ISO-8859-2 in View -> Character encoding to something else and re-perform the search, the outcome is the same: no results.

Ludovic Hirlimann [:Usul]

Updated

•

16 years ago

Keywords: testcase

Karel Koubek

Comment 15

•

16 years ago

Attached file Another test case e-mail

I've created a simple test case that demonstrates the problem well. How to recreate it:
1) Compose a new mail in Central European (ISO-8859-2) encoding consisting of word pairs: one word with diacritics (non-ASCII characters) and the other one without.
2) Send the mail to yourself.
3) Perform a Gloda search (option: search all messages) on these words, one at a time.
4) Words with diacritics won't be found; words without them (only standard ASCII characters) will be.

Karel Koubek

Comment 16

•

16 years ago

One more note: the above example is not so clear-cut. The word "bludička" (a Czech word) is also found by Gloda. What's going on? I'll try other words and character combinations over the weekend to find out. Please tell me if you need any other piece of information to move this bug forward! Gloda really rocks on ASCII words, but if it misses national vocabulary it's quite useless...

Karel Koubek

Comment 17

•

16 years ago

Attached file Czech Wikipedia article - as plain text

Karel Koubek

Comment 18

•

16 years ago

Attached file Czech Wikipedia article - as HTML (obsolete)

Karel Koubek

Updated

•

16 years ago

Attachment #417248 - Attachment is obsolete: true

Ludovic Hirlimann [:Usul]

Updated

•

16 years ago

Flags: blocking-thunderbird3.1?

Ludovic Hirlimann [:Usul]

Updated

•

16 years ago

Blocks: 534080

Karel Koubek

Comment 19

•

16 years ago

I've found that with UTF-8 encoded e-mails the success rate when searching for non-ASCII words is a bit higher, about 50%. Since it was quite difficult to tell conclusively when the Gloda search would find a word, I've prepared an e-mail from an article on Czech Wikipedia: http://cs.wikipedia.org/wiki/Tom%C3%A1%C5%A1_Garrigue_Masaryk . I simply copied part of that article into the message body using "paste without formatting" and sent the e-mail to myself.

When performing a Gloda search for words containing only standard characters, the above e-mail is found every time. When searching for words with Czech diacritics, the results were as follows (some examples I've tried):

Words found: Slovácko "Tomáš Jan" krátce bouřlivém revolučním moravském Slovácku Hanačka národnosti

Words not found: Moravské rodině zaměstnanců žijící "Tomáš Tomáš Kropáčková německé negramotný

I'd add that I'm using an English Windows XP SP3 system with Czech regional and language settings.

Are any of the developers able to replicate this problem? Or am I just talking to myself? If the problem is as severe for non-English users of Gloda as I think it is, then it's a bit perplexing that everyone is so quiet...

P.S.: I suggest renaming this bug so it more closely reflects the nature of the problem (i.e. it's not just Greek words and encoding).

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

Attachment #417247 - Attachment mime type: message/rfc822 → text/plain

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 20

•

16 years ago

Thank you for all the examples!

What input method are you using to search for the messages? Are you:
a) Just copying the text from the displayed e-mail into the search box?
b) Typing the characters in, using a sequence of keypresses for the diacritics? (For example, in some word processors, if I am writing in Spanish and I want an "e" with an accent, I can type a single quote and then an e and get an accented e.)
c) Typing the characters in with a single (potentially shifted) keypress?

My concern is that the input method may produce two distinct Unicode characters that look like a single character visually but are not byte-wise identical to what is in the message. For example, the byte sequence of "Moravské" as you pasted it above is identical to the one in the message you attached and should be found by the search logic. If you were to copy and paste it from Bugzilla, I would expect the gloda search to find the message...

So I guess that raises the question of whether the "Words not found" above were pasted from what you typed into the search box or from the document itself. The search box is the preferable choice, although I could see various pieces of software trying to normalize that.

In terms of re-titling, we have existing bugs on our lack of case-folding and whatnot, but I'm not sure where they are right now, so it's not super critical, but feel free... This bug is primarily invaluable in terms of all the examples it provides. (I want to add the examples to our unit tests but need to better understand why things are failing for you in cases where they should not fail.)
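To illustrate the concern about input methods, here is a minimal Python sketch (not part of the gloda code) showing how one visible word can have two different byte sequences. The variable and function names are made up for this illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# "Moravské" with a precomposed 'é' (U+00E9), as a single keypress would give.
composed = "Moravsk\u00e9"
# The same visible word as 'e' followed by a combining acute accent (U+0301),
# which some input methods may produce instead.
decomposed = "Moravske\u0301"


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex."""
    return text.encode("utf-8").hex(" ")


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(utf8_hex(composed))                    # ends in c3 a9
    print(utf8_hex(decomposed))                  # ends in 65 cc 81
    print(composed == decomposed)                # byte-wise comparison fails
    print(same_after_nfc(composed, decomposed))  # equal once normalized
```

A search that compares raw bytes would treat these two spellings as different words unless both sides are normalized first.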

Karel Koubek

Comment 21

•

16 years ago

Andrew, I see what are you driving at - I've tried to search again for the words above, now typed into the Gloda bar in different ways: 1) copy & paste the word from the Wikipedia article into the search bar 2) copy & paste the word from my Bugzilla Comment #19 3) copy & paste the word from the body of the e-mail 4) manually typed in using the special diacritics characters keys 'ěščřžýáíé' under keys F1-F12 (single key press) 5) manually typed in only using non-diacritics keys combined with the '´' or 'ˇ' symbol next to backspace key (combination of two key presses, sometimes with shift). Unfortunately, none of the input methods described above performed differently. I tried all of the words listed in my Comment 19.

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 22

•

16 years ago

Karel, Thank you for trying and documenting the many different methods; it is very useful information to me. I will add a representative subset of your examples to our unit tests and see what happens and try and chase things down from there. Unfortunately it will likely be approximately a week before I am able to dedicate effort to investigating the problem more thoroughly (including modifying the unit tests). I'm not sure what your level of development experience is, but the unit test we currently use for checking these things is: http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/test/unit/test_intl.js The unit test logic has a deficiency where test_intl_fulltextsearch assumes there is only ever one phrase in intlPhrases... if one were to modify the test without modifying the logic, one would want to replace the existing phrase (and search phrases in intlSearchPhrases.)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 23

•

16 years ago

Attached patch v1 do not have an off-by-one bug for utf8, do test more.

Re-prioritized for 3.0.1 and did the tests. Turns out there is an off-by-one error in the multi-byte UTF-8 case when it is the last character in the string being tokenized.

In plain English, if the last character in a string being tokenized is non-ASCII, we will lop a byte off. Since query strings are really short strings that get tokenized, they are highly vulnerable. In contrast, body strings are less vulnerable, especially since they can have punctuation or whitespace shielding them. This leads to a difference in tokenization between the two cases and a resulting inability to find things. This makes a lot of sense when you look at the words not found in comment 19: they all end in multi-byte UTF-8 encoded characters.

In terms of needing to blow away the gloda database, there should be no need for the body, as noted above. Unfortunately, for the author and recipient cases there is a greater chance of the bug having wormed its way into being stored on disk, resulting in an inconsistency between the on-disk representation and the query. I think we have to accept this for now and make sure that we tell people for whom it is a huge issue that they should blow away their gloda database. Our autocompletion on contacts should provide an extra mitigating factor.

I expect the next major change to tokenization will provide better case/accent-folding for non-ASCII characters, in which case we might blow away the database based on the user's locale, or at least consider it.
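The effect described above can be sketched with a toy byte-level tokenizer in Python. This is not the patched fts3_porter code: the `simulate_bug` flag merely imitates the described symptom (one byte lost when the input ends in a multi-byte UTF-8 character), and all names here are hypothetical.

```python
from typing import List


def is_delimiter(byte: int) -> bool:
    """Treat ASCII bytes that are not letters or digits as token separators."""
    return byte < 0x80 and not chr(byte).isalnum()


def tokenize(data: bytes, simulate_bug: bool = False) -> List[bytes]:
    """Split UTF-8 bytes into tokens at ASCII delimiters.

    With simulate_bug=True, a token that runs to the very end of the input
    and ends in a non-ASCII byte loses its final byte, imitating the
    off-by-one symptom described in the comment.
    """
    tokens = []
    i, n = 0, len(data)
    while i < n:
        while i < n and is_delimiter(data[i]):
            i += 1                      # skip leading delimiters
        start = i
        while i < n and not is_delimiter(data[i]):
            i += 1                      # consume token bytes
        end = i
        if simulate_bug and end == n and end > start and data[end - 1] >= 0x80:
            end -= 1                    # simulated off-by-one at end of input
        if end > start:
            tokens.append(data[start:end])
    return tokens


if __name__ == "__main__":
    body = "v rodině.".encode("utf-8")   # trailing '.' shields the word
    query = "rodině".encode("utf-8")     # query ends in a multi-byte character
    print(tokenize(body, simulate_bug=True))
    print(tokenize(query, simulate_bug=True))
    print(tokenize("Slovácko".encode("utf-8"), simulate_bug=True))
```

In this sketch the body token for "rodině" stays intact because punctuation follows it, while the query token is one byte short, so the two never match. A query such as "Slovácko", which ends in an ASCII letter, is unaffected, which is consistent with the found and not-found lists in comment 19.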

Assignee: nobody → bugmail

Status: NEW → ASSIGNED

Attachment #418228 - Flags: superreview?(bienvenu)

Attachment #418228 - Flags: review?(bienvenu)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

blocking-thunderbird3.0: --- → ?

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

Summary: Cannot search for a Greek word in the Entire message option → tokenization bug breaks global search query strings ending in non-ASCII characters

David :Bienvenu

Comment 25

•

16 years ago

Comment on attachment 418228: v1 do not have an off-by-one bug for utf8, do test more.

One nit:

- * Test that i18n goes through das pipes acceptably. Currently this means:
- * - Subject, Body, and Attachment names are properly indexed.

Since you've dropped the "Currently this means:" part of the comment, the - Check and - That look a bit odd.

Is this something we want to get back into the core sqlite code, i.e., did the fts3_porter code come from there?

Attachment #418228 - Flags: superreview?(bienvenu)

Attachment #418228 - Flags: superreview+

Attachment #418228 - Flags: review?(bienvenu)

Attachment #418228 - Flags: review+

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 26

•

16 years ago

(In reply to comment #25)
> Is this something we want to get back into the core sqlite code, i.e., did the
> fts3_porter code come from there?

The bug is our own and was introduced along with our CJK support. The stock sqlite3 fts3_porter code is blissfully ignorant of all things UTF-8 and just happens to handle it, thanks to UTF-8 never introducing fake nul characters and the code never exposing its tokenized output when it manipulates the UTF-8 string in blatantly incorrect ways.

Pushed to trunk with nit addressed: http://hg.mozilla.org/comm-central/rev/eb656ca0b708
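The point about byte-oriented code coping with UTF-8 can be checked with a short Python sketch. It relies only on a documented property of UTF-8 (every byte of a multi-byte sequence has its high bit set); the function name is invented for this illustration and nothing here comes from the sqlite sources.

```python
def multibyte_bytes_are_non_ascii(text: str) -> bool:
    """Check that no byte of a multi-byte UTF-8 character is below 0x80.

    This is why a byte-oriented tokenizer never sees a stray NUL or ASCII
    delimiter in the middle of a non-ASCII character.
    """
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


if __name__ == "__main__":
    sample = "δοκιμαστικό bakalářské práce"
    print(multibyte_bytes_are_non_ascii(sample))
    # Splitting the raw bytes on an ASCII space still yields valid UTF-8 pieces.
    pieces = sample.encode("utf-8").split(b" ")
    print([p.decode("utf-8") for p in pieces])
```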

Status: ASSIGNED → RESOLVED

Closed: 16 years ago → 16 years ago

Resolution: --- → FIXED

Target Milestone: --- → Thunderbird 3.1a1

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

Attachment #418228 - Flags: approval-thunderbird3.0.1?

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

16 years ago

No longer blocks: 534080

Mark Banner (:standard8)

Updated

•

16 years ago

Attachment #418228 - Flags: approval-thunderbird3.0.1? → approval-thunderbird3.0.1+

Mark Banner (:standard8)

Updated

•

16 years ago

blocking-thunderbird3.0: ? → .1+

Flags: blocking-thunderbird3.1? → blocking-thunderbird3.1+

Mark Banner (:standard8)

Comment 30

•

16 years ago

Checked in on branch: http://hg.mozilla.org/releases/comm-1.9.1/rev/a1a6620276bc

status-thunderbird3.0: --- → .1-fixed

Ludovic Hirlimann [:Usul]

Comment 31

•

16 years ago

Verified fixed with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.7) Gecko/20100107 Shredder/3.0.1pre

Status: RESOLVED → VERIFIED

Keywords: verified-thunderbird3.0
