321427 - Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails

Reporter

Description

•

19 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

Hello!

See first:
http://www.fileformat.info/info/unicode/char/0130/index.htm
Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130
HTML Entity (decimal) &#304; (hex) &#x130;
UTF-8 (hex) 0xC4 0xB0 (c4b0), URL-encoded %c4%b0 or %C4%B0
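
The values listed above can be reproduced with a short Python sketch. Python is used here only for illustration and is not part of Bugzilla; the variable names are made up for this example.

```python
from urllib.parse import quote

CAPITAL_I_WITH_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# UTF-8 byte sequence of the character.
utf8_bytes = CAPITAL_I_WITH_DOT.encode("utf-8")

# Percent-encoding as it appears in a query string (quote() defaults to UTF-8).
percent_encoded = quote(CAPITAL_I_WITH_DOT)

# Numeric character references, decimal and hexadecimal.
decimal_entity = f"&#{ord(CAPITAL_I_WITH_DOT)};"
hex_entity = f"&#x{ord(CAPITAL_I_WITH_DOT):x};"

print(utf8_bytes.hex())
print(percent_encoded)
print(decimal_entity, hex_entity)
```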

Please read:
http://bugzilla.wikimedia.org/show_bug.cgi?id=2761
== [Bug MediaZilla 2761]: Capitalization of "i" is not "I" in Turkish

LATIN CAPITAL LETTER I WITH DOT ABOVE is contained both inside a comment of
http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296 and inside " Keywords:" at the same bug.

Nevertheless, Advanced search fails to find this bug on landfill.

I noticed this problem because
http://bugzilla.wikimedia.org/query.cgi?format=advanced
generates false positives / pages that should not belong to the search result.

best regards reinhardt [[user:gangleri]]

Reproducible: Always

Steps to Reproduce:
Happens every time. Enter the character in the Advanced search form in one of these ways:
- use copy and paste to insert the special character
- or use the keyboard as described at http://www.fileformat.info/info/unicode/char/0130/index.htm
- or change the search URL using &long_desc=%C4%B0
Actual Results:  
"Zarro Boogs found."

Expected Results:  
Only bugs containing LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 should be found.

*notes*
Search should work independently of the interface language.
Because the functions (whatever they are called) capital() and lowercase() are language dependent, Bugzilla should offer an "exact search" option where *no* "normalisation" is applied to the search string.
Such a feature would be better than the current behaviour.

victory <never@receive.bug.mails.i.hate.spammer>

Updated

•

19 years ago

Status: UNCONFIRMED → NEW

Ever confirmed: true

lɛʁi לערי ריינהארט

Reporter

Updated

•

19 years ago

Summary: Advanced search for Turkish capital of Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails → Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails

lɛʁi לערי ריינהארט

Reporter

Comment 1

•

19 years ago

This bug is about Advanced search at landfill, not simple search / "Find a Specific Bug".

https://bugzilla.mozilla.org/show_bug.cgi?id=316836
== Search bugs http://bugzilla.wikimedia.org/query.cgi?format=specific does not handle Unicode strings correctly
being marked as a duplicate of
https://bugzilla.mozilla.org/show_bug.cgi?id=126266
== Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations

"Find a Specific Bug" works on landfill with &#304;

Marc Schumann [:Wurblzap]

Comment 2

•

19 years ago

http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&long_desc=%C4%B0&long_desc_type=regexp finds the bug.
http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&long_desc=%C4%B0&long_desc_type=allwordssubstr doesn't.

Strange.

MySQL?

Max Kanat-Alexander

Assignee

Comment 3

•

17 years ago

This should work now in Bugzilla 3.0.

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → WORKSFORME

Marc Schumann [:Wurblzap]

Comment 4

•

17 years ago

Clicking the links in comment 2 shows it doesn't.

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Max Kanat-Alexander

Assignee

Comment 5

•

17 years ago

Hrm. Maybe with a different MySQL collation this would work properly?

Aleksandr Derevianko

Comment 6

•

17 years ago

Search for UTF-8 is completely broken. I just installed a clean Bugzilla-3.0.2, and search doesn't work if I search for something outside the latin1 encoding.

Marc Schumann [:Wurblzap]

Updated

•

17 years ago

Flags: blocking3.2?

Max Kanat-Alexander

Assignee

Comment 7

•

17 years ago

Okay, this should definitely at least be looked into before 3.2.

Flags: blocking3.2? → blocking3.2+

Frédéric Buclin

Updated

•

16 years ago

Status: REOPENED → NEW

Target Milestone: --- → Bugzilla 3.2

Frédéric Buclin

Comment 8

•

16 years ago

(In reply to comment #2)
> buglist.cgi?long_desc=%C4%B0&long_desc_type=regexp
> finds the bug.
> buglist.cgi?long_desc=%C4%B0&long_desc_type=allwordssubstr
> doesn't.

Let's add a 3rd query:
buglist.cgi?long_desc_type=casesubstring&long_desc=%C4%B0


Appending &debug=1 to all three queries shows that:

1) the regexp one uses:
   longdescs_.thetext REGEXP 'İ'

2) the allwordssubstr one (case insensitive) uses:
   INSTR(CAST(LOWER(longdescs_.thetext) AS BINARY), CAST('i̇' AS BINARY)) > 0

3) the casesubstring one (case sensitive) uses:
   INSTR(CAST(longdescs_.thetext AS BINARY), CAST('İ' AS BINARY)) > 0

So the problem seems to be that 'i̇' is not seen as the lowercase flavor of 'İ', and so MySQL returns no match.
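
A minimal Python sketch of the mismatch described above. It does not run against MySQL: `db_lower` is a hypothetical stand-in for a database LOWER() that maps 'İ' to a plain 'i' (an assumption; the real result depends on the server and collation), and `binary_instr` only loosely imitates a byte-wise INSTR. Python's `str.lower()` plays the role of the application-side lowercasing and, under Unicode's default case mapping, yields 'i' followed by U+0307 COMBINING DOT ABOVE.

```python
def db_lower(text: str) -> str:
    """Hypothetical database LOWER(): maps U+0130 to plain 'i' (assumption)."""
    return text.replace("\u0130", "i").lower()


def binary_instr(haystack: str, needle: str) -> int:
    """Byte-wise substring search, loosely like INSTR(CAST(.. AS BINARY), ..).

    Returns a 1-based byte position, or 0 when there is no match.
    """
    return haystack.encode("utf-8").find(needle.encode("utf-8")) + 1


stored_comment = "Turkish capital \u0130 appears here"
search_term = "\u0130"

app_lowered_term = search_term.lower()      # 'i' + U+0307, two code points
db_lowered_text = db_lower(stored_comment)  # contains only a plain 'i'

# Case-insensitive query: lowered term compared byte-wise with lowered text.
case_insensitive_hit = binary_instr(db_lowered_text, app_lowered_term) > 0
# Case-sensitive query: original term compared byte-wise with original text.
case_sensitive_hit = binary_instr(stored_comment, search_term) > 0

print([hex(ord(ch)) for ch in app_lowered_term])
print(case_insensitive_hit, case_sensitive_hit)
```

Under these assumptions the two sides disagree on what the lowercase form is, so the byte-wise comparison can only succeed when no lowercasing is involved.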

Frédéric Buclin

Comment 9

•

16 years ago

I tested with PostgreSQL 8.2.6, and it has the same problem.

Frédéric Buclin

Comment 10

•

16 years ago

In Search::GetByWordListSubstr(), I tried replacing (using PostgreSQL):

            push(@list, $dbh->sql_position(lc($sql_word),
                                           "LOWER($field)") . " > 0");

by:
            push(@list, $dbh->sql_position("LOWER($sql_word)",
                                           "LOWER($field)") . " > 0");

but this doesn't help. Instead of 0 bugs, it now returns all bugs.

Frédéric Buclin

Comment 11

•

16 years ago

As reported by bbaetz on IRC, there isn't a one-to-one mapping between lowercase and uppercase for Turkish; see http://rt.perl.org/rt3/Public/Bug/Display.html?id=36953 and also perldoc perlunicode /lc:

"Things to do with locales (Lithuanian, Turkish, Azeri) do not work since Perl does not understand the concept of Unicode locales."

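The missing one-to-one mapping can be illustrated with Python's locale-independent case mappings, which follow the Unicode default rules rather than Turkish ones. Python is only a stand-in here; Perl's `lc` is not invoked, and the variable names are made up for this sketch.

```python
dotted_capital = "\u0130"  # İ
dotless_small = "\u0131"   # ı

lowered = dotted_capital.lower()  # 'i' + U+0307 under the default rules
round_trip = lowered.upper()      # 'I' + U+0307, which is not U+0130

print(len(dotted_capital), len(lowered))
print(round_trip == dotted_capital)

# The default rules pair I with i, so the Turkish pairs I/ı and İ/i are lost.
print("I".lower() == dotless_small)
print(dotless_small.upper() == "I")
print("i".upper() == dotted_capital)
```
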
Max Kanat-Alexander

Assignee

Comment 12

•

16 years ago

Okay. So we should find a way to use sql_istrcmp or something like it for case-insensitive substring location, instead of using Perl's lc.

Frédéric Buclin

Updated

•

16 years ago

Assignee: query-and-buglist → jjclark1982

Jesse Clark

Comment 13

•

16 years ago

In theory this should work if we replace code like

$$term = $dbh->sql_position(lc($$q), "LOWER($$ff)") . " > 0";

with 

$$term = $dbh->sql_position($dbh->sql_istring($$q), $dbh->sql_istring($$ff)) . " > 0";

However, I am having a lot of trouble ensuring that the entered value ($$q) is in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the correct value, but passing this to mysql does not match correctly.

Max Kanat-Alexander

Assignee

Comment 14

•

16 years ago

(In reply to comment #13)
> However, I am having a lot of trouble ensuring that the entered value ($$q) is
> in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the
> correct value, but passing this to mysql does not match correctly.

  Oh, don't mess with the encoding of anything--that shouldn't be necessary at all, if this is 3.1.x.

Max Kanat-Alexander

Assignee

Comment 15

•

16 years ago

Hey jjclark, any progress on this? This is one of our few code blockers for 3.2.

Frédéric Buclin

Comment 16

•

16 years ago

Attached patch: patch, v1 (obsolete)

Is it as simple as that? I didn't test this patch.

Attachment #327334 - Flags: review?(jjclark1982)

Max Kanat-Alexander

Assignee

Comment 17

•

16 years ago

Comment on attachment 327334
patch, v1

This won't work on MySQL. Our sql_position for MySQL was made case-sensitive:

INSTR(CAST($text AS BINARY), CAST($fragment AS BINARY))

We could make a sql_iposition, though, which could handle it. It could default to calling istring on both its arguments, and MySQL could have its own version.
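
The proposal can be sketched as follows. This is a Python analogue written for illustration, not Bugzilla's Perl code: the class names are hypothetical, the base `sql_istring` is assumed to wrap its argument in LOWER(), and the MySQL override is assumed to drop the BINARY casts so that the column's collation decides case sensitivity.

```python
class GenericDB:
    """Hypothetical generic driver that only builds SQL fragments as strings."""

    def sql_istring(self, expr: str) -> str:
        # Assumed default: make an expression case-insensitive via LOWER().
        return f"LOWER({expr})"

    def sql_position(self, fragment: str, text: str) -> str:
        # Standard SQL substring position.
        return f"POSITION({fragment} IN {text})"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Default: apply istring to both arguments, then reuse sql_position.
        return self.sql_position(self.sql_istring(fragment),
                                 self.sql_istring(text))


class MysqlDB(GenericDB):
    """Hypothetical MySQL driver with its own position helpers."""

    def sql_position(self, fragment: str, text: str) -> str:
        # Case-sensitive form quoted in the comment above.
        return f"INSTR(CAST({text} AS BINARY), CAST({fragment} AS BINARY))"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Assumed override: no BINARY cast, so the collation handles case.
        return f"INSTR({text}, {fragment})"


# The fragment is assumed to be already quoted by the caller.
for db in (GenericDB(), MysqlDB()):
    print(db.sql_iposition("'\u0130'", "longdescs_.thetext") + " > 0")
```

In this arrangement the search term reaches the database unmodified, so only one case-folding implementation (the database's) takes part in the comparison.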

Attachment #327334 - Flags: review?(jjclark1982) → review-

Max Kanat-Alexander

Assignee

Comment 18

•

16 years ago

I didn't realize there were so few LOWER/lc calls in Search.pm, I can probably fix this myself.

Assignee: jjclark1982 → mkanat

Max Kanat-Alexander

Assignee

Comment 19

•

16 years ago

Attached patch: v2

I've tested this and it generates the right SQL. So at this point, if search doesn't work, it's a bug in the database, not in Bugzilla. :-)

Attachment #327334 - Attachment is obsolete: true

Attachment #327344 - Flags: review?(LpSolit)

Max Kanat-Alexander

Assignee

Comment 20

•

16 years ago

Comment on attachment 327344
v2

I want to write a more extensive patch for the tip that uses sql_iposition everywhere that we currently use LOWER() in sql_position.

Attachment #327344 - Attachment description: v2 → v2 (3.2)

Max Kanat-Alexander

Assignee

Comment 21

•

16 years ago

Comment on attachment 327344
v2

Actually, I'll just do that in a separate bug.

Attachment #327344 - Attachment description: v2 (3.2) → v2

Max Kanat-Alexander

Assignee

Updated

•

16 years ago

Blocks: 442582

Frédéric Buclin

Comment 22

•

16 years ago

Comment on attachment 327344
v2

Looks correct to me, so r=LpSolit. Someone who is used to Turkish characters will have to test it for us after checkin.

Attachment #327344 - Flags: review?(LpSolit) → review+

Max Kanat-Alexander

Assignee

Updated

•

16 years ago

Flags: approval3.2+

Flags: approval+

Max Kanat-Alexander

Assignee

Comment 23

•

16 years ago

tip:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.115; previous revision: 1.114
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.160; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.62; previous revision: 1.61
done

3.2:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.112.2.1; previous revision: 1.112
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.159.2.1; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.60.2.1; previous revision: 1.60
done

Status: NEW → RESOLVED

Closed: 17 years ago → 16 years ago

Resolution: --- → FIXED

Vitaly Fedrushkov

Comment 24

•

16 years ago

Will try to get Pardus team involved

Vitaly Fedrushkov

Comment 25

•

16 years ago

http://bugs.pardus.org.tr/show_bug.cgi?id=7621 filed

Vitaly Fedrushkov

Comment 26

•

16 years ago

Right now landfill returns 16 bugs:

http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=%C4%B0

Correct test case (http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296) is found, but all accented 'i' variants (í, Î, Ì) are returned also.
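
One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that the comparison now relies on a case- and accent-insensitive collation. The Python sketch below approximates such a comparison by stripping combining marks; it is not MySQL's actual collation algorithm, and `collation_key` is a hypothetical helper.

```python
import unicodedata


def collation_key(text: str) -> str:
    """Rough case- and accent-insensitive key: decompose, drop marks, lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


search_term = "\u0130"  # İ
candidates = [
    "\u0130",  # İ
    "\u00ed",  # í
    "\u00ce",  # Î
    "\u00cc",  # Ì
    "i",
    "I",
    "\u0131",  # ı (dotless i, has no decomposition)
    "o",
]

# Keep every candidate whose key equals the key of the search term.
matches = [c for c in candidates if collation_key(c) == collation_key(search_term)]
print(matches)
```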

Vitaly Fedrushkov

Comment 27

•

16 years ago

confirmed by Bugzilla-tr staff:

http://bugs.pardus.org.tr/show_bug.cgi?id=7621#c7

QA passed, one can safely pronounce this CLOSED :-)

patch, v1 | 16 years ago | Frédéric Buclin | 1.38 KB, patch | mkanat: review-
v2 | 16 years ago | Max Kanat-Alexander | 3.16 KB, patch | LpSolit: review+
