Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails

RESOLVED FIXED in Bugzilla 3.2

Status


Bugzilla
Query/Bug List
RESOLVED FIXED
12 years ago
9 years ago

People

(Reporter: lɛʁi לערי ריינהארט, Assigned: Max Kanat-Alexander)

Tracking

unspecified
Bugzilla 3.2
Bug Flags:
approval +
approval3.2 +
blocking3.2 +

Attachments

(1 attachment, 1 obsolete attachment)

v2
3.16 KB, patch
Frédéric Buclin
: review+
(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

Hallo!

See first:
http://www.fileformat.info/info/unicode/char/0130/index.htm
Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130
HTML Entity (decimal) &#304; (hex) &#x130;
UTF-8 (hex) 0xC4 0xB0 (c4b0) &c4%b0 &C4%B0
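
The encodings listed above can be re-derived mechanically. A minimal sketch in Python (not part of the original report; it uses only standard-library calls):

```python
from urllib.parse import quote

ch = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

decimal_entity = f"&#{ord(ch)};"   # HTML decimal character reference
hex_entity = f"&#x{ord(ch):X};"    # HTML hexadecimal character reference
utf8_bytes = ch.encode("utf-8")    # raw UTF-8 bytes
url_form = quote(ch)               # percent-encoded form used in query URLs

print(decimal_entity, hex_entity, utf8_bytes.hex(), url_form)
```

The percent-encoded form is the one that appears in the search URLs quoted in this bug.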

Please read:
http://bugzilla.wikimedia.org/show_bug.cgi?id=2761
== [Bug MediaZilla 2761]: Capitalization of "i" is not "I" in Turkish

LATIN CAPITAL LETTER I WITH DOT ABOVE is contained both inside a comment of
http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296 and inside " Keywords:" at the same bug.

Nevertheless, Advanced search fails to find this bug at landfill.

I noticed this problem because
http://bugzilla.wikimedia.org/query.cgi?format=advanced
generates false positives / pages that should not belong to the search result.

best regards reinhardt [[user:gangleri]]

Reproducible: Always

Steps to Reproduce:
every time - follow the instructions
use copy and paste to insert the special characters
- or use the keyboard as described at http://www.fileformat.info/info/unicode/char/0130/index.htm
- or change the search url using &long_desc=%C4%B0
Actual Results:  
"Zarro Boogs found."

Expected Results:  
only bugs containing LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 should be found

*notes*
Search should work independently of the interface language.
Because case-conversion functions (whatever they are called, e.g. capital() and lowercase()) are language-dependent, Bugzilla should offer an "exact search" option in which *no* "normalisation" is applied to the search string.
Such a feature would be better than the current behaviour.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Reporter)

Updated

12 years ago
Summary: Advanced search for Turkish capital of Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails → Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails
(Reporter)

Comment 1

12 years ago
This bug is about Advanced search at landfill,
not simple search / "Find a Specific Bug".

https://bugzilla.mozilla.org/show_bug.cgi?id=316836
== Search bugs http://bugzilla.wikimedia.org/query.cgi?format=specific does not handle Unicode strings correctly
being marked as a duplicate of
https://bugzilla.mozilla.org/show_bug.cgi?id=126266
== Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations

"Find a Specific Bug" works at landfill with İ
(Assignee)

Comment 3

11 years ago
This should work now in Bugzilla 3.0.
Status: NEW → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → WORKSFORME
Clicking the links in comment 2 shows it doesn't.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(Assignee)

Comment 5

11 years ago
Hrm. Maybe with a different MySQL collation this would work properly?

Comment 6

10 years ago
Search for UTF-8 is completely broken. I just installed a clean Bugzilla-3.0.2, and search doesn't work if I search for something outside the latin1 encoding.
Flags: blocking3.2?
(Assignee)

Comment 7

10 years ago
Okay, this should definitely at least be looked into before 3.2.
Flags: blocking3.2? → blocking3.2+

Updated

10 years ago
Status: REOPENED → NEW
Target Milestone: --- → Bugzilla 3.2

Comment 8

10 years ago
(In reply to comment #2)
> buglist.cgi?long_desc=%C4%B0&long_desc_type=regexp
> finds the bug.
> buglist.cgi?long_desc=%C4%B0&long_desc_type=allwordssubstr
> doesn't.

Let's add a 3rd query:
buglist.cgi?long_desc_type=casesubstring&long_desc=%C4%B0


Appending &debug=1 to all three queries shows that:

1) the regexp one uses:
   longdescs_.thetext REGEXP 'İ'

2) the allwordssubstr one (case insensitive) uses:
   INSTR(CAST(LOWER(longdescs_.thetext) AS BINARY), CAST('i̇' AS BINARY)) > 0

3) the casesubstring one (case sensitive) uses:
   INSTR(CAST(longdescs_.thetext AS BINARY), CAST('İ' AS BINARY)) > 0

So the problem seems to be that 'i̇' is not seen as the lowercase flavor of 'İ', and so MySQL returns no match.
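
The mismatch in query 2 can be reproduced outside the database. The sketch below is in Python rather than Bugzilla's Perl; Python's str.lower(), like the lc() call discussed in this bug, applies Unicode's default case mapping and turns U+0130 into "i" followed by U+0307 (combining dot above). The db_lower function is a hypothetical stand-in for a database LOWER() that leaves İ untouched; how MySQL or PostgreSQL actually treat it depends on charset and collation, which this thread does not state.

```python
needle = "\u0130".lower()              # what the application sends after lowercasing
needle_bytes = needle.encode("utf-8")

def db_lower(text: str) -> str:
    """Hypothetical stand-in for SQL LOWER(): lowercases ASCII only."""
    return "".join(c.lower() if c.isascii() else c for c in text)

stored = "Keywords: \u0130stanbul"
haystack_bytes = db_lower(stored).encode("utf-8")

# Byte-wise substring test, analogous to INSTR(CAST(... AS BINARY), ...)
found = needle_bytes in haystack_bytes
print(needle_bytes, found)
```

Under these assumptions the needle is the three bytes 69 CC 87, while the stored text still holds C4 B0, so a binary comparison cannot match.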

Comment 9

10 years ago
I tested with PostgreSQL 8.2.6, and it has the same problem.

Comment 10

10 years ago
In Search::GetByWordListSubstr(), I tried replacing (using PostgreSQL):

            push(@list, $dbh->sql_position(lc($sql_word),
                                           "LOWER($field)") . " > 0");

by:
            push(@list, $dbh->sql_position("LOWER($sql_word)",
                                           "LOWER($field)") . " > 0");

but this doesn't help. Instead of 0 bugs, it now returns all bugs.

Comment 11

10 years ago
As reported by bbaetz on IRC, there isn't a one to one mapping between lowercase and uppercase for Turkish, see http://rt.perl.org/rt3/Public/Bug/Display.html?id=36953 and also perldoc perlunicode /lc:

"Things to do with locales (Lithuanian, Turkish, Azeri) do not work since Perl does not understand the concept of Unicode locales."
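
The missing one-to-one mapping shows up in any locale-unaware case conversion. A minimal sketch in Python, whose str.upper() and str.lower() also ignore locale; turkish_lower is a hypothetical helper written only for illustration and is not something Perl or Bugzilla provides.

```python
def turkish_lower(text: str) -> str:
    """Hypothetical helper: apply the Turkish rules before default lowercasing."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

# Default (locale-independent) mappings
assert "i".upper() == "I"              # Turkish expects U+0130 here
assert "\u0131".upper() == "I"         # dotless i also maps to I
assert "I".lower() == "i"              # Turkish expects U+0131 here
assert "\u0130".lower() == "i\u0307"   # one character becomes two

# A round trip through the default mapping does not restore U+0130
round_trip = "\u0130".lower().upper()
print(round_trip == "\u0130")

print(turkish_lower("\u0130I"))
```

Because two different lowercase letters share the uppercase "I", no single locale-independent table can be correct for both Turkish and non-Turkish text.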
(Assignee)

Comment 12

10 years ago
Okay. So we should find a way to use sql_istrcmp or something like it for case-insensitive substring location, instead of using Perl's lc.

Updated

10 years ago
Assignee: query-and-buglist → jjclark1982

Comment 13

10 years ago
In theory this should work if we replace code like

$$term = $dbh->sql_position(lc($$q), "LOWER($$ff)") . " > 0";

with 

$$term = $dbh->sql_position($dbh->sql_istring($$q), $dbh->sql_istring($$ff)) . " > 0";

However, I am having a lot of trouble ensuring that the entered value ($$q) is in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the correct value, but passing this to mysql does not match correctly.
(Assignee)

Comment 14

10 years ago
(In reply to comment #13)
> However, I am having a lot of trouble ensuring that the entered value ($$q) is
> in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the
> correct value, but passing this to mysql does not match correctly.

  Oh, don't mess with the encoding of anything--that shouldn't be necessary at all, if this is 3.1.x.
(Assignee)

Comment 15

10 years ago
Hey jjclark, any progress on this? This is one of our few code blockers for 3.2.

Comment 16

10 years ago
Created attachment 327334 [details] [diff] [review]
patch, v1

Is it as simple as that? I didn't test this patch.
Attachment #327334 - Flags: review?(jjclark1982)
(Assignee)

Comment 17

10 years ago
Comment on attachment 327334 [details] [diff] [review]
patch, v1

This won't work on MySQL. Our sql_position for MySQL was made case-sensitive:

INSTR(CAST($text AS BINARY), CAST($fragment AS BINARY))

We could make a sql_iposition, though, which could handle it. It could default to calling istring on both its arguments, and MySQL could have its own version.
Attachment #327334 - Flags: review?(jjclark1982) → review-
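
The sql_iposition design proposed in comment 17 can be sketched as follows. This is a Python model that only builds SQL strings; the class and method names mirror the Perl names used in this bug, but the bodies are assumptions. In particular, the generic POSITION syntax, the LOWER()-based sql_istring, and the MySQL override (a plain INSTR that relies on a case-insensitive collation) are guesses for illustration, not the patch that was committed.

```python
class DB:
    """Generic driver: builds SQL fragments as strings."""

    def sql_istring(self, expr: str) -> str:
        return f"LOWER({expr})"

    def sql_position(self, fragment: str, text: str) -> str:
        return f"POSITION({fragment} IN {text})"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Default: let the database lowercase both sides.
        return self.sql_position(self.sql_istring(fragment),
                                 self.sql_istring(text))


class Mysql(DB):
    def sql_position(self, fragment: str, text: str) -> str:
        # Case-sensitive form quoted in comment 17.
        return f"INSTR(CAST({text} AS BINARY), CAST({fragment} AS BINARY))"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Assumption: rely on the column's collation instead of a BINARY cast.
        return f"INSTR({text}, {fragment})"


for db in (DB(), Mysql()):
    print(db.sql_iposition("'İ'", "longdescs_.thetext") + " > 0")
```

The point of the design is that the search term is never lowercased in application code, so the database applies one consistent case rule to both sides of the comparison.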
(Assignee)

Comment 18

10 years ago
I didn't realize there were so few LOWER/lc calls in Search.pm, I can probably fix this myself.
Assignee: jjclark1982 → mkanat
(Assignee)

Comment 19

10 years ago
Created attachment 327344 [details] [diff] [review]
v2

I've tested this and it generates the right SQL. So at this point, if it doesn't work, it's a bug in the database, not in Bugzilla. :-)
Attachment #327334 - Attachment is obsolete: true
Attachment #327344 - Flags: review?(LpSolit)
(Assignee)

Comment 20

10 years ago
Comment on attachment 327344 [details] [diff] [review]
v2

I want to write a more extensive patch for the tip that uses sql_iposition everywhere that we currently use LOWER() in sql_position.
Attachment #327344 - Attachment description: v2 → v2 (3.2)
(Assignee)

Comment 21

10 years ago
Comment on attachment 327344 [details] [diff] [review]
v2

Actually, I'll just do that in a separate bug.
Attachment #327344 - Attachment description: v2 (3.2) → v2
(Assignee)

Updated

10 years ago
Blocks: 442582

Comment 22

10 years ago
Comment on attachment 327344 [details] [diff] [review]
v2

Looks correct to me, so r=LpSolit. Someone who is used to Turkish characters will have to test it for us after checkin.
Attachment #327344 - Flags: review?(LpSolit) → review+
(Assignee)

Updated

10 years ago
Flags: approval3.2+
Flags: approval+
(Assignee)

Comment 23

10 years ago
tip:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.115; previous revision: 1.114
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.160; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.62; previous revision: 1.61
done

3.2:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.112.2.1; previous revision: 1.112
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.159.2.1; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.60.2.1; previous revision: 1.60
done
Status: NEW → RESOLVED
Last Resolved: 11 years ago → 10 years ago
Resolution: --- → FIXED

Comment 24

10 years ago
Will try to get Pardus team involved

Comment 26

10 years ago
Right now landfill returns 16 bugs:

http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=%C4%B0

Correct test case (http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296) is found, but all accented 'i' variants (í, Î, Ì) are returned also.
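
One plausible explanation, which is an assumption and not confirmed anywhere in this bug, is that the comparison is now delegated to an accent-insensitive database collation. The Python sketch below approximates such folding by decomposing each character and dropping combining marks; fold is a hypothetical helper, not the database's actual algorithm.

```python
import unicodedata

def fold(text: str) -> str:
    """Hypothetical accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.lower()

variants = ["\u0130", "\u00ed", "\u00ce", "\u00cc"]  # İ, í, Î, Ì
keys = {fold(v) for v in variants}
print(keys)
```

Under that kind of folding all four characters collapse to the same key, which would account for the extra results.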

Comment 27

9 years ago
confirmed by Bugzilla-tr staff:

http://bugs.pardus.org.tr/show_bug.cgi?id=7621#c7

QA passed, one can safely pronounce this CLOSED :-)