Closed Bug 324413 Opened 14 years ago Closed 12 years ago

search for utf-8 string fails because of CAST AS BINARY

Categories

(Bugzilla :: Database, defect)

2.22
defect
Not set

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: bugzilla.20.scyt, Unassigned)

References

Details

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

1. a mysql database with character encoding utf-8
2. one bug with an umlaut in its short_desc ( e.g. "für" / "f\xc3\xbcr" )
3. searching for 'für' using the advanced search
4. result: "Zarro Boogs found.", expected result: buglist with one bug 


Bugzilla uses the following query to search for the bug.

SELECT bugs.bug_id, bugs.bug_severity, bugs.priority, bugs.bug_status, bugs.resolution, bugs.bug_severity, bugs.priority, bugs.rep_platform, map_assigned_to.login_name, bugs.bug_status, bugs.resolution, bugs.short_desc
FROM bugs
INNER JOIN profiles AS map_assigned_to ON (bugs.assigned_to = map_assigned_to.userid) LEFT JOIN bug_group_map  ON bug_group_map.bug_id = bugs.bug_id  AND bug_group_map.group_id NOT IN (13,12,3,14,4,10,1,6,2,5,7)  LEFT JOIN cc ON cc.bug_id = bugs.bug_id AND cc.who = 2
WHERE ((bugs.bug_status IN ('NEW','ASSIGNED','REOPENED')) AND (INSTR(CAST(LOWER(bugs.short_desc) AS BINARY), CAST('für' AS BINARY)) > 0)) AND bugs.creation_ts IS NOT NULL AND ((bug_group_map.group_id IS NULL)    OR (bugs.reporter_accessible = 1 AND bugs.reporter = 2)     OR (bugs.cclist_accessible = 1 AND cc.who IS NOT NULL)     OR (bugs.assigned_to = 2) OR (bugs.qa_contact = 2) ) GROUP BY bugs.bug_id at /usr/share/perl5/Bugzilla/Search.pm line 1405.


The issue is the cast to binary in this part of the query "INSTR(CAST(LOWER(bugs.short_desc) AS BINARY), CAST('für' AS BINARY))". Without the cast "INSTR(LOWER(bugs.short_desc), 'für')" the bug is found and the results are as expected.

Interestingly for older version of mysql the cast is omitted. See Mysql.pm:sql_position. Is there a reason for this cast?

The entire issue may be caused by a bug in mysql, but if the cast is unnecessary, it should be removed.

"mysql -V" output:
mysql  Ver 14.12 Distrib 5.0.16, for pc-linux-gnu (i486) using readline 5.1


Reproducible: Always

Steps to Reproduce:
In order to get that CAST AS BINARY, you must have specified "exact case." Try it without that.
No, I haven't searched with the exact case option. I used "contains all of the words/strings". I think that is the default.

Isn't the LOWER call a sign that this is not a exact case search? In an exact case search this LOWER call is missing.

If I understand this correctly the CAST AS BINARY forces case sensitive comparison. So it is necessary for the exact case search. But for case-insensitive searches the CAST AS BINARY and the LOWER just result in the behaviour INSTR already has.

That means that at least the exact case search isn't fixable because of the underlieing bug in mysql.

For reference the bug I filed with mysql. http://bugs.mysql.com/bug.php?id=16725&edit=2
Depends on: 332190
Seeing this with 2.22.
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
Hardware: PC → All
Target Milestone: --- → Bugzilla 2.22
Version: unspecified → 2.22
Tryed with bugzilla 3.0.2 - search fail for cyrillic characters.
Test database. Created from scratch under newly-installed Bugzilla 3.0.2 installation. OS FreeBSD 6.2. Mysql 5.0.51. p5-DBD-mysql-4.005
In this database, all attempts to find any bug fail if one of the classifications is selected. The classification name is on Russian. If no classification is selected, search succeed.
I try to log the SELECT statements which was send to MySQL. I try to search for bugs in product 'ИТ' in my test database. 
From mysql log:
 1 Query       SELECT products.id FROM products WHERE products.name IN ('ÐТ')

Hexdump show this:

00000000  09 09 20 20 20 20 20 20  31 20 51 75 65 72 79 20  |..      1 Query |
00000010  20 20 20 20 20 20 53 45  4c 45 43 54 20 70 72 6f  |      SELECT pro|
00000020  64 75 63 74 73 2e 69 64  20 46 52 4f 4d 20 70 72  |ducts.id FROM pr|
00000030  6f 64 75 63 74 73 20 57  48 45 52 45 20 70 72 6f  |oducts WHERE pro|
00000040  64 75 63 74 73 2e 6e 61  6d 65 20 49 4e 20 28 27  |ducts.name IN ('|
00000050  c3 90 c2 98 c3 90 c2 a2  27 29 0a                 |�.�.�.¢').|

As you can see, 'ИТ' - two-letters cyrillic - converted to 
c3 90 c2 98 c3 90 c2 a2 - 8 bytes.
Correct utf-8 for 'ИТ' is:
d0 98 d0 a2 - 4 bytes.
As i understand, somewhere inside Bugzilla unneded conversion was performed from utf-8 to utf-8. I mean, from d0 -> c3 90; 98 -> c2 98; a2 -> c2 a2.

Please, everybody watching for this bug - vote for it. I think, it must be corrected as soon as possible.
I tryes to debug the source code (dumb, just debug output) and it seems like the problem is in DBI selectcol_arrayref function. Bugzilla calls it with utf8 string, but this function again convert it to utf8.
Hi Aleksandr. This is probably fixed in the current CVS HEAD code, then.
Can you give me a link to change which fix it ? I can't use CVS HEAD in production environment, but i can try to apply fix in 3.0.2.
Just now i tryed 3.1.2 - the same problem.
(In reply to comment #9)
> Can you give me a link to change which fix it ?

  It's not a single change, it's a very complex series of changes that cannot be backported.
OK. I find solution for this problem:

Bugzilla/Search.pm

36d35
< use Encode qw(encode decode);
400c399
<             my $s = trim(decode('UTF-8',$params->param($f)));
---
>             my $s = trim($params->param($f));
1519c1518
<     my $q = decode( 'UTF-8',"SELECT $inner FROM $table WHERE $cond");
---
>     my $q = "SELECT $inner FROM $table WHERE $cond";


I.e. in all cases where we need to contact DBI, we need to explicitly set encoding UTF-8. The changes above is enought to make context search work with any language, and search for bugs inside products/components named with non-English symbols will work too.
I think, this patch will work even with older Bugzilla installations with non-Unicode encodings: just replace UTF-8 with correct encoding for you install.

It seems for me that this problem arise with last version of p5-DBI-1.60.1. Previously DBI doesn't try to convert everything to UTF-8, and it works correctly if database is also in native encoding.
(In reply to comment #11)
> I.e. in all cases where we need to contact DBI, we need to explicitly set
> encoding UTF-8.

  Aleksandr: Yes, I already know this. In fact, it's already fixed in CVS HEAD, as I told you.

  Based on Aleksandr's information, this has been fixed in 3.2, by several different bugs. (Starting with bug 363153 and then many bugs after that. I swear that if anybody just applies the patch from bug 363153 and then complains that it doesn't work, I will come to their house and hit them over the head with a frying pan. At least virtually. :-) )
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Target Milestone: Bugzilla 2.22 → Bugzilla 3.2
Target Milestone: Bugzilla 3.2 → ---
I guess this bug is also the cause why I´m not able to search for the flag "Änderung" with German umlauts. Can someone confirm this or should I post more detailed information?
Lars, this is already fixed in CVS (see the bugs Max mentioned), so 3.2rc1 (or maybe even some 3.1.x) should work for you.
You need to log in before you can comment on or make changes to this bug.