Closed
Bug 529824
Opened 16 years ago
Closed 16 years ago
tokenization bug breaks global search query strings ending in non-ASCII characters
Categories
(Thunderbird :: Search, defect)
Tracking
(blocking-thunderbird3.0: .1+, status-thunderbird3.0: .1-fixed)
VERIFIED
FIXED
Thunderbird 3.1a1
People
(Reporter: chrodos, Assigned: asuth)
References
Details
(Keywords: testcase)
Attachments
(5 files, 1 obsolete file)
817 bytes, message/rfc822
3.24 KB, message/rfc822
1.44 KB, message/rfc822
8.51 KB, text/plain
11.44 KB, patch (Bienvenu: review+, Bienvenu: superreview+, standard8: approval-thunderbird3.0.1+)
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Build Identifier: version 2.0.0.23 (20090812)
When I try to search for a Greek word using the search input field on the toolbar, I get no results. This happens when I select the Entire Message option. When I search using the Subject option or any other option, I get the expected results.
The search works fine when I search for an English word.
Reproducible: Always
Steps to Reproduce:
1. Select the Entire Message option from the search input box on the toolbar
2. Type a Greek word that exists in any of your emails
3. You will not get any results
Comment 1•16 years ago
Can you post an email as an attachment that should have been matched?
Reporter
Comment 2•16 years ago
Reporter
Comment 3•16 years ago
(In reply to comment #2)
> Created an attachment (id=413332) [details]
> An email for test purposes
Try to search for the word "δοκιμαστικό". It is in the message body
Comment 4•16 years ago
This seems to be true with TB 2.x.
In TB 3.x this problem is fixed by bug #472764. I think this bug should be closed as WONTFIX: 2.x is a security-fix and maintenance branch, so it is very unlikely that the fix will be backported to it.
Comment 5•16 years ago
Wontfix - as it works in 3.0, and 2.x is a stability and security release branch. 3.0rc1 is coming out next week - so it won't be long before it works.
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
Comment 6•16 years ago
Citing bug 472764 unfortunately doesn't mean it works. :)
I tried earlier this morning, and again just now. Neither the sample message nor the bugmail message is found with either the Entire Message or the message body filter.
Status: RESOLVED → UNCONFIRMED
Resolution: WONTFIX → ---
Comment 7•16 years ago
I'm using rc1 build 2 en-US.
The messages I expected to be found all have gloda IDs.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Version: unspecified → 3.0
Comment 8•16 years ago
(In reply to comment #7)
> i'm using rc1 build 2 en-us.
> the messages I expected to be found all have gloda ids
This is a duplicate of the "can't find mixed case in non-English" bug ... as the word starts with a capital letter.
Comment 9•16 years ago
Comment 10•16 years ago
Christos, does the word "δοκιμαστικό" have a capital letter?
Reporter
Comment 11•16 years ago
(In reply to comment #10)
> Christos word "δοκιμαστικό" has capital letter?
No, the word "δοκιμαστικο" is all lower-case letters; "Δοκιμαστικο" has a capital first letter.
Reporter
Comment 12•16 years ago
When I change the font encoding to iso-8859-7 it works. I think it has to do with the encoding.
Comment 13•16 years ago
Comment 14•16 years ago
With the 'search all messages' option in the Gloda search bar, entering "bakalářské práce" (without the quotes) yields no result, although it should: this exact phrase is part of the subject. If I change the character encoding from the default ISO-8859-2 in View -> Character encoding to something else and re-run the search, the outcome is the same: no results.
Comment 15•16 years ago
I've created a simple test case that demonstrates the problem well. To reproduce it:
1) Compose a new mail in Central European (ISO-8859-2) encoding consisting of word pairs - one word with diacritics (non-ASCII characters) and the other one without it.
2) Send the mail to yourself
3) Perform Gloda search (option: search all messages) on these words, one at a time.
4) Words with diacritics won't be found, words without them (only standard ASCII characters) will be.
Comment 16•16 years ago
One more note: the above example is not so clear-cut. The word "bludička" (a Czech word) is also found by Gloda. What's going on? I'll try other words and character combinations over the weekend to find out.
Please tell me if you need any other piece of information to move this bug forward! Gloda really rocks on ASCII words, but without support for national vocabulary it is quite useless...
Comment 17•16 years ago
Comment 18•16 years ago
Updated•16 years ago
Attachment #417248 - Attachment is obsolete: true
Updated•16 years ago
Flags: blocking-thunderbird3.1?
Comment 19•16 years ago
I've found out that with UTF-8 encoded e-mails the success rate when searching for non-ASCII words in them is a bit higher, about 50%.
Since it was quite difficult to tell conclusively when the Gloda search would find a word and when it would not, I've prepared an e-mail created from an article on Czech Wikipedia: http://cs.wikipedia.org/wiki/Tom%C3%A1%C5%A1_Garrigue_Masaryk . I simply copied part of that article into the message body using "paste without formatting" and sent that e-mail to myself.
When performing Gloda search for any words containing just standard characters, the above e-mail is found every time. When searching for words with Czech diacritics, the results were as follows (some examples I've tried):
Words found:
Slovácko
"Tomáš Jan"
krátce
bouřlivém
revolučním
moravském
Slovácku
Hanačka
národnosti
Words not found:
Moravské
rodině
zaměstnanců
žijící
"Tomáš
Tomáš
Kropáčková
německé
negramotný
I'd add that I'm using English Windows XP SP3 system with Czech regional and language settings.
Are any of the developers able to replicate this problem? Or am I just talking to myself? If the problem is as severe for non-English users of Gloda as I think it is, then it's a bit perplexing that everyone is so quiet...
P.S.: I suggest renaming this bug so that it more closely reflects the nature of the problem (i.e. it's not just Greek words and encoding).
Assignee
Updated•16 years ago
Attachment #417247 - Attachment mime type: message/rfc822 → text/plain
Assignee
Comment 20•16 years ago
Thank you for all the examples!
What input method are you using to search for the messages? Are you:
a) Just copying the text from the displayed e-mail into the search box?
b) Typing the characters in, using a sequence of keypresses for the diacritics? (For example, in some word processors, if I am writing in Spanish and I want an "e" with an accent, I can do something where I type a single quote and then an e and I get an accented e.)
c) Typing the characters in with a single (potentially shifted) keypress.
My concern is that the input method may produce two distinct Unicode characters that visually look like a single Unicode character but are not byte-wise identical to what is in the message.
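To illustrate that concern, here is a minimal sketch (not Thunderbird code). It assumes the word "Moravské" purely as an example and uses Python's standard unicodedata module; the same visible word can be stored with a precomposed character or with a base letter plus a combining accent, and the two forms only compare equal after normalization.

```python
import unicodedata

# "é" as one precomposed code point (U+00E9)
composed = "Moravsk\u00e9"
# "e" followed by a combining acute accent (U+0301)
decomposed = "Moravske\u0301"

# The two strings render the same but are different code point sequences,
# so their UTF-8 byte sequences differ as well.
print(composed == decomposed)
print(composed.encode("utf-8") == decomposed.encode("utf-8"))

# Normalizing to NFC folds the combining sequence into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

If the search box and the stored message disagreed in this way, a byte-wise comparison would fail even though the text looks identical on screen.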
For example, the byte sequence as you pasted "Moravské" above is identical to the message you attached and should be found by the search logic. If you were to copy and paste it from bugzilla, I would expect the gloda search would find the message...
So I guess that raises the question of whether "Words not found" from above are pasted from what you typed into the search box or from the document itself. The search box is the preferable choice, although I could see various pieces of software trying to normalize that.
In terms of re-titling, we have existing bugs on our lack of case-folding and whatnot, but I'm not sure where they are right now, so it's not super critical, but feel free... This bug is primarily invaluable for all the examples it provides.
(I want to add the examples to our unit tests but need to better understand why things are failing for you in cases where they should not fail.)
Comment 21•16 years ago
Andrew,
I see what you are driving at. I've tried searching again for the words above, this time entering them into the Gloda bar in different ways:
1) copy & paste the word from the Wikipedia article into the search bar
2) copy & paste the word from my Bugzilla Comment #19
3) copy & paste the word from the body of the e-mail
4) manually typed in using the special diacritics characters keys 'ěščřžýáíé' under keys F1-F12 (single key press)
5) manually typed in only using non-diacritics keys combined with the '´' or 'ˇ' symbol next to backspace key (combination of two key presses, sometimes with shift).
Unfortunately, none of the input methods described above performed differently. I tried all of the words listed in my Comment 19.
Assignee
Comment 22•16 years ago
Karel,
Thank you for trying and documenting the many different methods; it is very useful information to me. I will add a representative subset of your examples to our unit tests and see what happens and try and chase things down from there. Unfortunately it will likely be approximately a week before I am able to dedicate effort to investigating the problem more thoroughly (including modifying the unit tests).
I'm not sure what your level of development experience is, but the unit test we currently use for checking these things is:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/test/unit/test_intl.js
The unit test logic has a deficiency where test_intl_fulltextsearch assumes there is only ever one phrase in intlPhrases... if one were to modify the test without modifying the logic, one would want to replace the existing phrase (and search phrases in intlSearchPhrases.)
Assignee
Comment 23•16 years ago
Re-prioritized for 3.0.1 and did the tests. Turns out there is an off-by-one error in the multi-byte UTF-8 case when it is the last character in the string being tokenized.
In plain English, if the last character in a string being tokenized is non-ASCII, we will lop a byte off. Since query strings are really short strings that get tokenized, they are highly vulnerable. In contrast, body strings are less vulnerable, especially since they can have punctuation or whitespace shielding them. This leads to a difference in tokenization between the two cases and a resulting inability to find things.
This makes a lot of sense when you look at the words not found in comment 19 and they all end in multi-byte UTF-8 encoded characters.
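A minimal sketch of that failure mode, assuming a simplified byte-oriented tokenizer. The function `tokenize` and its `buggy` flag are hypothetical and only imitate the effect described here; this is not the actual fts3_porter code.

```python
def tokenize(data: bytes, buggy: bool = False) -> list:
    """Split UTF-8 bytes into tokens on ASCII non-alphanumeric bytes.

    With buggy=True, imitate the off-by-one: if the input ends inside a
    token whose last character is multi-byte, drop that token's final byte.
    """
    tokens = []
    start = None
    for i, b in enumerate(data):
        # Bytes >= 0x80 belong to multi-byte characters and count as token bytes.
        is_token_byte = b >= 0x80 or chr(b).isalnum()
        if is_token_byte and start is None:
            start = i
        elif not is_token_byte and start is not None:
            tokens.append(data[start:i])
            start = None
    if start is not None:
        end = len(data)
        if buggy and data[-1] >= 0x80:
            end -= 1  # the simulated off-by-one
        tokens.append(data[start:end])
    return tokens


query = "Moravské".encode("utf-8")         # short query ending in non-ASCII
body = "Moravské rodině.".encode("utf-8")  # body text shielded by punctuation

# The body token is intact, but the query token loses its last byte,
# so the two no longer match.
print(tokenize(body, buggy=True)[0] == tokenize(query, buggy=True)[0])

# A word ending in an ASCII letter is unaffected.
print(tokenize("Slovácko".encode("utf-8"), buggy=True) == ["Slovácko".encode("utf-8")])
```

Under these assumptions, only inputs whose final character is multi-byte are damaged, which is consistent with the found and not-found lists in comment 19.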
In terms of needing to blow away the gloda database, there should be no need for the body as noted above. Unfortunately, for the author and recipient cases there is a greater chance of the bug having wormed its way into being stored on disk and resulting in an inconsistency between that disk representation and the query.
I think we have to accept this for now and make sure that we tell people for whom it is a huge issue that they should blow away their gloda database. Our autocompletion on contacts should provide an extra mitigating factor. I expect the next major change to tokenization will provide for better case/accent-folding for non-ASCII characters, in which case we might blow away the database based on the user's locale, or at least consider it.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Attachment #418228 - Flags: superreview?(bienvenu)
Attachment #418228 - Flags: review?(bienvenu)
Assignee
Updated•16 years ago
blocking-thunderbird3.0: --- → ?
Assignee
Updated•16 years ago
Summary: Cannot search for a Greek word in the Entire message option → tokenization bug breaks global search query strings ending in non-ASCII characters
Comment 25•16 years ago
Comment on attachment 418228 [details] [diff] [review]
v1 do not have an off-by-one bug for utf8, do test more.
One nit -
- * Test that i18n goes through das pipes acceptably. Currently this means:
- * - Subject, Body, and Attachment names are properly indexed.
Since you've dropped the " Currently this means:" part of the comment, the - Check and - That look a bit odd.
Is this something we want to get back into the core sqlite code, i.e., did the fts3_porter code come from there?
Attachment #418228 - Flags: superreview?(bienvenu)
Attachment #418228 - Flags: superreview+
Attachment #418228 - Flags: review?(bienvenu)
Attachment #418228 - Flags: review+
Assignee
Comment 26•16 years ago
(In reply to comment #25)
> Is this something we want to get back into the core sqlite code, i.e., did the
> fts3_porter code come from there?
The bug is our own and was introduced along with our CJK support. The stock sqlite3 fts3_porter code is blissfully ignorant of all things UTF-8 and just happens to handle it, thanks to UTF-8 never introducing fake nul characters and the code never exposing its tokenized output when it manipulates the UTF-8 string in blatantly incorrect ways.
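As an illustration of the UTF-8 property relied on here (a sketch, not part of the patch): every byte of a multi-byte UTF-8 character is 0x80 or higher, so a byte-oriented ASCII tokenizer never sees a spurious NUL or ASCII delimiter inside such a character. The helper name `non_ascii_bytes_are_high` and the sample text are assumptions made for this example.

```python
def non_ascii_bytes_are_high(text: str) -> bool:
    """Return True if every byte of every non-ASCII character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


sample = "δοκιμαστικό bakalářské práce"

# Multi-byte characters never contain ASCII-range bytes.
print(non_ascii_bytes_are_high(sample))

# The encoded text contains no NUL byte either.
print(0 in sample.encode("utf-8"))
```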
pushed to trunk with nit addressed:
http://hg.mozilla.org/comm-central/rev/eb656ca0b708
Status: ASSIGNED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.1a1
Assignee
Updated•16 years ago
Attachment #418228 - Flags: approval-thunderbird3.0.1?
Updated•16 years ago
Attachment #418228 - Flags: approval-thunderbird3.0.1? → approval-thunderbird3.0.1+
Updated•16 years ago
blocking-thunderbird3.0: ? → .1+
Flags: blocking-thunderbird3.1? → blocking-thunderbird3.1+
Comment 30•16 years ago
Checked in on branch: http://hg.mozilla.org/releases/comm-1.9.1/rev/a1a6620276bc
status-thunderbird3.0: --- → .1-fixed
Comment 31•16 years ago
Verified fixed with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.7) Gecko/20100107 Shredder/3.0.1pre
Status: RESOLVED → VERIFIED
Keywords: verified-thunderbird3.0