Closed Bug 321427 Opened 19 years ago Closed 16 years ago

Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails

Categories

(Bugzilla :: Query/Bug List, defect)

defect
Not set
normal

RESOLVED FIXED
Bugzilla 3.2

People

(Reporter: gangleri, Assigned: mkanat)

Details

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5

Hello!

See first:
http://www.fileformat.info/info/unicode/char/0130/index.htm
Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130
HTML Entity (decimal) &#304; (hex) &#x130;
UTF-8 (hex) 0xC4 0xB0 (c4b0) %c4%b0 %C4%B0

Please read:
http://bugzilla.wikimedia.org/show_bug.cgi?id=2761
== [Bug MediaZilla 2761]: Capitalization of "i" is not "I" in Turkish

LATIN CAPITAL LETTER I WITH DOT ABOVE appears both in a comment on
http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296 and in the "Keywords:" field of the same bug.

Nevertheless, Advanced search fails to find this bug at landfill.

I noticed this problem because
http://bugzilla.wikimedia.org/query.cgi?format=advanced
generates false positives / pages that should not belong to the search result.

best regards reinhardt [[user:gangleri]]

Reproducible: Always

Steps to Reproduce:
every time - follow the instructions
use copy and paste to insert the special characters
- or use the keyboard as described at http://www.fileformat.info/info/unicode/char/0130/index.htm
- or change the search url using &long_desc=%C4%B0
Actual Results:  
"Zarro Boogs found."

Expected Results:  
Only bugs containing LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 should be found.

*notes*
Search should work independently of the interface language.
Since the case-conversion functions (whatever they are called, e.g. capital() and lowercase()) are language-dependent, Bugzilla should offer an "exact search" option where *no* normalisation is applied to the search string.
Such a feature would be better than the current behaviour.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Advanced search for Turkish capital of Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails → Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails
This bug is about Advanced search at landfill,
not simple search / "Find a Specific Bug".

https://bugzilla.mozilla.org/show_bug.cgi?id=316836
== Search bugs http://bugzilla.wikimedia.org/query.cgi?format=specific does not handle Unicode strings correctly
has been marked as a duplicate of
https://bugzilla.mozilla.org/show_bug.cgi?id=126266
== Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations

"Find a Specific Bug" works at landfill with İ
This should work now in Bugzilla 3.0.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME
Clicking the links in comment 2 shows it doesn't.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Hrm. Maybe with a different MySQL collation this would work properly?
Search for UTF-8 is completely broken. I just installed a clean Bugzilla 3.0.2, and search doesn't work if I search for anything outside the latin1 encoding.
Flags: blocking3.2?
Okay, this should definitely at least be looked into before 3.2.
Flags: blocking3.2? → blocking3.2+
Status: REOPENED → NEW
Target Milestone: --- → Bugzilla 3.2
(In reply to comment #2)
> buglist.cgi?long_desc=%C4%B0&long_desc_type=regexp
> finds the bug.
> buglist.cgi?long_desc=%C4%B0&long_desc_type=allwordssubstr
> doesn't.

Let's add a 3rd query:
buglist.cgi?long_desc_type=casesubstring&long_desc=%C4%B0


Appending &debug=1 to all three queries shows that:

1) the regexp one uses:
   longdescs_.thetext REGEXP 'İ'

2) the allwordssubstr one (case insensitive) uses:
   INSTR(CAST(LOWER(longdescs_.thetext) AS BINARY), CAST('i̇' AS BINARY)) > 0

3) the casesubstring one (case sensitive) uses:
   INSTR(CAST(longdescs_.thetext AS BINARY), CAST('İ' AS BINARY)) > 0

So the problem seems to be that 'i̇' is not seen as the lowercase flavor of 'İ', and so MySQL returns no match.
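
The mismatch can be reproduced outside the database. The following is a minimal Python sketch, not Bugzilla code. `simple_lower` is a hypothetical stand-in for a database LOWER() that maps each code point to exactly one code point (an assumption about the database's behaviour, not something confirmed in this bug), and `binary_instr` imitates INSTR over BINARY casts by comparing UTF-8 bytes.

```python
# Illustrative sketch only; none of these helpers exist in Bugzilla.

CAPITAL_I_DOT = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE


def full_lower(text):
    """Full Unicode lowercasing, as Python's str.lower() performs it."""
    return text.lower()


def simple_lower(text):
    """Hypothetical one-code-point-per-code-point LOWER() (an assumption)."""
    return text.replace(CAPITAL_I_DOT, "i").lower()


def binary_instr(haystack, fragment):
    """Byte-wise, 1-based substring position; 0 when absent, like INSTR."""
    return haystack.encode("utf-8").find(fragment.encode("utf-8")) + 1


stored = "Capitalization of \u0130 in Turkish"
needle = full_lower(CAPITAL_I_DOT)  # 'i' followed by U+0307 COMBINING DOT ABOVE

# Case-insensitive path: search term lowered in the application,
# column lowered by the (hypothetical) database function.
case_insensitive_hit = binary_instr(simple_lower(stored), needle)

# Case-sensitive path: both sides untouched.
case_sensitive_hit = binary_instr(stored, CAPITAL_I_DOT)
```

Under these assumptions the case-insensitive path finds nothing while the case-sensitive path matches, which is the same pattern as queries 2 and 3 above.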
I tested with PostgreSQL 8.2.6, and it has the same problem.
In Search::GetByWordListSubstr(), I tried replacing (using PostgreSQL):

            push(@list, $dbh->sql_position(lc($sql_word),
                                           "LOWER($field)") . " > 0");

by:
            push(@list, $dbh->sql_position("LOWER($sql_word)",
                                           "LOWER($field)") . " > 0");

but this doesn't help. Instead of 0 bugs, it now returns all bugs.
As reported by bbaetz on IRC, there isn't a one-to-one mapping between lowercase and uppercase for Turkish; see http://rt.perl.org/rt3/Public/Bug/Display.html?id=36953 and also perldoc perlunicode, under lc:

"Things to do with locales (Lithuanian, Turkish, Azeri) do not work since Perl does not understand the concept of Unicode locales."
Okay. So we should find a way to use sql_istrcmp or something like it to do case-insensitive substring matching, instead of using Perl's lc.
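
The missing one-to-one mapping mentioned above is easy to see with default, locale-independent case conversion. This is a small Python sketch for illustration (Python's `str.lower()` and `str.upper()` stand in for any non-locale-aware conversion; it is not what Bugzilla runs):

```python
import unicodedata

capital_i_dot = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_i = "\u0131"      # LATIN SMALL LETTER DOTLESS I

# Lowercasing yields two code points: 'i' plus U+0307 COMBINING DOT ABOVE.
lowered = capital_i_dot.lower()

# Uppercasing again gives 'I' plus U+0307, not the original single code point.
round_trip = lowered.upper()

# The two forms only compare equal again after canonical composition (NFC).
recomposed = unicodedata.normalize("NFC", round_trip)

# Default mappings pair 'I' with 'i', so the dotless i cannot round-trip either.
dotless_round_trip = dotless_i.upper().lower()
```

Because the application and the database may each apply a different flavour of case conversion, doing the folding on both operands in one place (the database) avoids comparing the output of two different mappings.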
Assignee: query-and-buglist → jjclark1982
In theory this should work if we replace code like

$$term = $dbh->sql_position(lc($$q), "LOWER($$ff)") . " > 0";

with 

$$term = $dbh->sql_position($dbh->sql_istring($$q), $dbh->sql_istring($$ff)) . " > 0";

However, I am having a lot of trouble ensuring that the entered value ($$q) is in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the correct value, but passing this to mysql does not match correctly.
(In reply to comment #13)
> However, I am having a lot of trouble ensuring that the entered value ($$q) is
> in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the
> correct value, but passing this to mysql does not match correctly.

  Oh, don't mess with the encoding of anything--that shouldn't be necessary at all, if this is 3.1.x.
Hey jjclark, any progress on this? This is one of our few code blockers for 3.2.
Attached patch patch, v1 (obsolete)
Is it as simple as that? I didn't test this patch.
Attachment #327334 - Flags: review?(jjclark1982)
Comment on attachment 327334 [details] [diff] [review]
patch, v1

This won't work on MySQL. Our sql_position for MySQL was made case-sensitive:

INSTR(CAST($text AS BINARY), CAST($fragment AS BINARY))

We could make a sql_iposition, though, which could handle it. It could default to calling istring on both its arguments, and MySQL could have its own version.
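
The proposed layering can be modelled as follows. This is a Python sketch of the idea, not the Perl that was checked in: the class names are hypothetical, the generic `POSITION(... IN ...)` and `LOWER(...)` forms are assumptions, and the body of the MySQL-specific `sql_iposition` (dropping the BINARY casts so the column's collation decides case sensitivity) is also an assumption about what "its own version" could look like.

```python
class DB:
    """Hypothetical model of a generic database layer that emits SQL text."""

    def sql_istring(self, expr):
        # Assumed generic case-folding wrapper.
        return f"LOWER({expr})"

    def sql_position(self, fragment, text):
        # Assumed generic substring-position expression.
        return f"POSITION({fragment} IN {text})"

    def sql_iposition(self, fragment, text):
        # Default: fold both operands in SQL, then reuse sql_position.
        return self.sql_position(self.sql_istring(fragment),
                                 self.sql_istring(text))


class Mysql(DB):
    """Hypothetical MySQL-specific layer."""

    def sql_position(self, fragment, text):
        # Case-sensitive form quoted in the review comment above.
        return f"INSTR(CAST({text} AS BINARY), CAST({fragment} AS BINARY))"

    def sql_iposition(self, fragment, text):
        # Assumption: no BINARY casts, so the collation handles case.
        return f"INSTR({text}, {fragment})"


generic_sql = DB().sql_iposition("?", "longdescs_.thetext") + " > 0"
mysql_sql = Mysql().sql_iposition("?", "longdescs_.thetext") + " > 0"
```

The point of the design is that the search code calls one method and never lowercases the search term itself, so the application's case mapping is no longer mixed with the database's.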
Attachment #327334 - Flags: review?(jjclark1982) → review-
I didn't realize there were so few LOWER/lc calls in Search.pm, I can probably fix this myself.
Assignee: jjclark1982 → mkanat
Attached patch v2
I've tested this and it generates the right SQL. So at this point, if it doesn't work, it's a bug in the database, not in Bugzilla. :-)
Attachment #327334 - Attachment is obsolete: true
Attachment #327344 - Flags: review?(LpSolit)
Comment on attachment 327344 [details] [diff] [review]
v2

I want to write a more extensive patch for the tip that uses sql_iposition everywhere that we currently use LOWER() in sql_position.
Attachment #327344 - Attachment description: v2 → v2 (3.2)
Comment on attachment 327344 [details] [diff] [review]
v2

Actually, I'll just do that in a separate bug.
Attachment #327344 - Attachment description: v2 (3.2) → v2
Blocks: 442582
Comment on attachment 327344 [details] [diff] [review]
v2

Looks correct to me, so r=LpSolit. Someone who is used to Turkish characters will have to test it for us after checkin.
Attachment #327344 - Flags: review?(LpSolit) → review+
Flags: approval3.2+
Flags: approval+
tip:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.115; previous revision: 1.114
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.160; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.62; previous revision: 1.61
done

3.2:

Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v  <--  DB.pm
new revision: 1.112.2.1; previous revision: 1.112
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v  <--  Search.pm
new revision: 1.159.2.1; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v  <--  Mysql.pm
new revision: 1.60.2.1; previous revision: 1.60
done
Status: NEW → RESOLVED
Closed: 17 years ago → 16 years ago
Resolution: --- → FIXED
Will try to get the Pardus team involved.
Right now landfill returns 16 bugs:

http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=%C4%B0

The correct test case (http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296) is found, but all accented 'i' variants (í, Î, Ì) are also returned.
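
One possible explanation for the extra matches is a comparison that ignores accents as well as case. The thread does not confirm which collation landfill uses, so the following Python sketch only illustrates that assumption: `accent_insensitive_key` is a hypothetical helper that strips combining marks and lowercases, roughly what an accent- and case-insensitive comparison would do.

```python
import unicodedata


def accent_insensitive_key(text):
    """Hypothetical comparison key: drop combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.lower()


# The searched character followed by the variants reported above.
variants = ["\u0130", "\u00ed", "\u00ce", "\u00cc"]  # İ, í, Î, Ì
keys = {accent_insensitive_key(v) for v in variants}
```

If the database compares strings this way, all four characters share a single key and are indistinguishable from a plain "i", which would be consistent with the 16 bugs returned.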
confirmed by Bugzilla-tr staff:

http://bugs.pardus.org.tr/show_bug.cgi?id=7621#c7

QA passed, one can safely pronounce this CLOSED :-)