Closed Bug 324413 Opened 14 years ago Closed 12 years ago
search for utf-8 string fails because of CAST AS BINARY
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5 1. a mysql database with character encoding utf-8 2. one bug with an umlaut in its short_desc ( e.g. "für" / "f\xc3\xbcr" ) 3. searching for 'für' using the advanced search 4. result: "Zarro Boogs found.", expected result: buglist with one bug Bugzilla uses the following query to search for the bug. SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status, bugs.resolution, bugs.bug_severity, bugs.priority, bugs.rep_platform, map_assigned_to.login_name, bugs.bug_status, bugs.resolution, bugs.short_desc FROM bugs INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to = map_assigned_to.userid) LEFT JOIN bug_group_map ON bug_group_map.bug_id = bugs.bug_id AND bug_group_map.group_id NOT IN (13,12,3,14,4,10,1,6,2,5,7) LEFT JOIN cc ON cc.bug_id = bugs.bug_id AND cc.who = 2 WHERE ((bugs.bug_status IN ('NEW','ASSIGNED','REOPENED')) AND (INSTR(CAST(LOWER(bugs.short_desc) AS BINARY), CAST('fÃ¼r' AS BINARY)) > 0)) AND bugs.creation_ts IS NOT NULL AND ((bug_group_map.group_id IS NULL) OR (bugs.reporter_accessible = 1 AND bugs.reporter = 2) OR (bugs.cclist_accessible = 1 AND cc.who IS NOT NULL) OR (bugs.assigned_to = 2) OR (bugs.qa_contact = 2) ) GROUP BY bugs.bug_id at /usr/share/perl5/Bugzilla/Search.pm line 1405. The issue is the cast to binary in this part of the query "INSTR(CAST(LOWER(bugs.short_desc) AS BINARY), CAST('fÃ¼r' AS BINARY))". Without the cast "INSTR(LOWER(bugs.short_desc), 'fÃ¼r')" the bug is found and the results are as expected. Interestingly for older version of mysql the cast is omitted. See Mysql.pm:sql_position. Is there a reason for this cast? The entire issue may be caused by a bug in mysql, but if the cast is unnecessary, it should be removed. "mysql -V" output: mysql Ver 14.12 Distrib 5.0.16, for pc-linux-gnu (i486) using readline 5.1 Reproducible: Always Steps to Reproduce:
In order to get that CAST AS BINARY, you must have specified "exact case." Try it without that.
No, I haven't searched with the exact case option. I used "contains all of the words/strings". I think that is the default. Isn't the LOWER call a sign that this is not a exact case search? In an exact case search this LOWER call is missing. If I understand this correctly the CAST AS BINARY forces case sensitive comparison. So it is necessary for the exact case search. But for case-insensitive searches the CAST AS BINARY and the LOWER just result in the behaviour INSTR already has. That means that at least the exact case search isn't fixable because of the underlieing bug in mysql. For reference the bug I filed with mysql. http://bugs.mysql.com/bug.php?id=16725&edit=2
Seeing this with 2.22.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
Target Milestone: --- → Bugzilla 2.22
Version: unspecified → 2.22
Tryed with bugzilla 3.0.2 - search fail for cyrillic characters.
Test database. Created from scratch under newly-installed Bugzilla 3.0.2 installation. OS FreeBSD 6.2. Mysql 5.0.51. p5-DBD-mysql-4.005 In this database, all attempts to find any bug fail if one of the classifications is selected. The classification name is on Russian. If no classification is selected, search succeed.
I try to log the SELECT statements which was send to MySQL. I try to search for bugs in product 'ИТ' in my test database. From mysql log: 1 Query SELECT products.id FROM products WHERE products.name IN ('ÐÐ¢') Hexdump show this: 00000000 09 09 20 20 20 20 20 20 31 20 51 75 65 72 79 20 |.. 1 Query | 00000010 20 20 20 20 20 20 53 45 4c 45 43 54 20 70 72 6f | SELECT pro| 00000020 64 75 63 74 73 2e 69 64 20 46 52 4f 4d 20 70 72 |ducts.id FROM pr| 00000030 6f 64 75 63 74 73 20 57 48 45 52 45 20 70 72 6f |oducts WHERE pro| 00000040 64 75 63 74 73 2e 6e 61 6d 65 20 49 4e 20 28 27 |ducts.name IN ('| 00000050 c3 90 c2 98 c3 90 c2 a2 27 29 0a |�.�.�.¢').| As you can see, 'ИТ' - two-letters cyrillic - converted to c3 90 c2 98 c3 90 c2 a2 - 8 bytes. Correct utf-8 for 'ИТ' is: d0 98 d0 a2 - 4 bytes. As i understand, somewhere inside Bugzilla unneded conversion was performed from utf-8 to utf-8. I mean, from d0 -> c3 90; 98 -> c2 98; a2 -> c2 a2. Please, everybody watching for this bug - vote for it. I think, it must be corrected as soon as possible.
I tryes to debug the source code (dumb, just debug output) and it seems like the problem is in DBI selectcol_arrayref function. Bugzilla calls it with utf8 string, but this function again convert it to utf8.
Hi Aleksandr. This is probably fixed in the current CVS HEAD code, then.
Can you give me a link to change which fix it ? I can't use CVS HEAD in production environment, but i can try to apply fix in 3.0.2. Just now i tryed 3.1.2 - the same problem.
(In reply to comment #9) > Can you give me a link to change which fix it ? It's not a single change, it's a very complex series of changes that cannot be backported.
OK. I find solution for this problem: Bugzilla/Search.pm 36d35 < use Encode qw(encode decode); 400c399 < my $s = trim(decode('UTF-8',$params->param($f))); --- > my $s = trim($params->param($f)); 1519c1518 < my $q = decode( 'UTF-8',"SELECT $inner FROM $table WHERE $cond"); --- > my $q = "SELECT $inner FROM $table WHERE $cond"; I.e. in all cases where we need to contact DBI, we need to explicitly set encoding UTF-8. The changes above is enought to make context search work with any language, and search for bugs inside products/components named with non-English symbols will work too. I think, this patch will work even with older Bugzilla installations with non-Unicode encodings: just replace UTF-8 with correct encoding for you install. It seems for me that this problem arise with last version of p5-DBI-1.60.1. Previously DBI doesn't try to convert everything to UTF-8, and it works correctly if database is also in native encoding.
(In reply to comment #11) > I.e. in all cases where we need to contact DBI, we need to explicitly set > encoding UTF-8. Aleksandr: Yes, I already know this. In fact, it's already fixed in CVS HEAD, as I told you. Based on Aleksandr's information, this has been fixed in 3.2, by several different bugs. (Starting with bug 363153 and then many bugs after that. I swear that if anybody just applies the patch from bug 363153 and then complains that it doesn't work, I will come to their house and hit them over the head with a frying pan. At least virtually. :-) )
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Target Milestone: Bugzilla 2.22 → Bugzilla 3.2
I guess this bug is also the cause why I´m not able to search for the flag "Änderung" with German umlauts. Can someone confirm this or should I post more detailed information?
Lars, this is already fixed in CVS (see the bugs Max mentioned), so 3.2rc1 (or maybe even some 3.1.x) should work for you.
You need to log in before you can comment on or make changes to this bug.