Closed
Bug 321427
Opened 19 years ago
Closed 16 years ago
Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails
Categories: Bugzilla :: Query/Bug List, defect
Status: RESOLVED FIXED
Target Milestone: Bugzilla 3.2
People: Reporter: gangleri, Assigned: mkanat
Attachments (1 file, 1 obsolete file): patch, 3.16 KB, LpSolit: review+
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8) Gecko/20051111 Firefox/1.5
Hello!
See first:
http://www.fileformat.info/info/unicode/char/0130/index.htm
Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130
HTML Entity (decimal) &#304; (hex) &#x130;
UTF-8 (hex) 0xC4 0xB0 (c4b0) %c4%b0 %C4%B0
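For reference, the byte values and URL escape listed above can be reproduced with a short Python sketch. This is not Bugzilla code; it uses only the standard library, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
ch = "\u0130"

utf8_bytes = ch.encode("utf-8")       # the two UTF-8 bytes 0xC4 0xB0
escaped = quote(ch)                   # percent-encoded form used in search URLs
round_trip = unquote("%c4%b0")        # lowercase hex digits decode to the same character
html_decimal = f"&#{ord(ch)};"        # decimal HTML entity
html_hex = f"&#x{ord(ch):X};"         # hexadecimal HTML entity

print(utf8_bytes.hex(), escaped, html_decimal, html_hex)
```

The `escaped` value is the same string that the steps to reproduce append to the search URL as `&long_desc=%C4%B0`.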
Please read:
http://bugzilla.wikimedia.org/show_bug.cgi?id=2761
== [Bug MediaZilla 2761]: Capitalization of "i" is not "I" in Turkish
LATIN CAPITAL LETTER I WITH DOT ABOVE appears both in a comment of
http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296 and in "Keywords:" of the same bug.
Nevertheless, Advanced search fails to find this bug at landfill.
I noticed this problem because
http://bugzilla.wikimedia.org/query.cgi?format=advanced
generates false positives / pages that should not belong to the search result.
best regards reinhardt [[user:gangleri]]
Reproducible: Always
Steps to Reproduce:
every time - follow the instructions
use copy and paste to insert the special characters
- or use the keyboard as described at http://www.fileformat.info/info/unicode/char/0130/index.htm
- or change the search url using &long_desc=%C4%B0
Actual Results:
"Zarro Boogs found."
Expected Results:
Only bugs containing LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 should be found.
*notes*
Search should work independently of the interface language.
Because the case-conversion functions (whatever they are called, e.g. capital() and lowercase()) are language-dependent, Bugzilla should offer an "exact search" option in which *no* "normalisation" is applied to the search string.
Such a feature would be better than the current behaviour.
Updated • 19 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated • 19 years ago (Reporter)
Summary: Advanced search for Turkish capital of Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails → Advanced search for Turkish İ - Unicode Character LATIN CAPITAL LETTER I WITH DOT ABOVE - U+0130 İ fails
Comment 1 • 19 years ago (Reporter)
This bug is about Advanced search at landfill,
not simple search / "Find a Specific Bug".
https://bugzilla.mozilla.org/show_bug.cgi?id=316836
== Search bugs http://bugzilla.wikimedia.org/query.cgi?format=specific does not handle Unicode strings correctly
being marked as a duplicate of
https://bugzilla.mozilla.org/show_bug.cgi?id=126266
== Use UTF-8 (Unicode) charset encoding for pages and email for NEW installations
"Find a Specific Bug" works at Landfill with İ
Comment 2 • 19 years ago
Comment 3 • 17 years ago (Assignee)
This should work now in Bugzilla 3.0.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME
Comment 4 • 17 years ago
Clicking the links in comment 2 shows it doesn't.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 5 • 17 years ago (Assignee)
Hrm. Maybe with a different MySQL collation this would work properly?
Comment 6 • 17 years ago
The search for UTF-8 is completely broken. I just installed a clean Bugzilla-3.0.2, and search doesn't work if I search for anything outside the latin1 encoding.
Updated • 17 years ago
Flags: blocking3.2?
Comment 7 • 17 years ago (Assignee)
Okay, this should definitely at least be looked into before 3.2.
Flags: blocking3.2? → blocking3.2+
Updated • 17 years ago
Status: REOPENED → NEW
Target Milestone: --- → Bugzilla 3.2
Comment 8 • 17 years ago
(In reply to comment #2)
> buglist.cgi?long_desc=%C4%B0&long_desc_type=regexp
> finds the bug.
> buglist.cgi?long_desc=%C4%B0&long_desc_type=allwordssubstr
> doesn't.
Let's add a 3rd query:
buglist.cgi?long_desc_type=casesubstring&long_desc=%C4%B0
Appending &debug=1 to all three queries shows that:
1) the regexp one uses:
longdescs_.thetext REGEXP 'İ'
2) the allwordssubstr one (case insensitive) uses:
INSTR(CAST(LOWER(longdescs_.thetext) AS BINARY), CAST('i̇' AS BINARY)) > 0
3) the casesubstring one (case sensitive) uses:
INSTR(CAST(longdescs_.thetext AS BINARY), CAST('İ' AS BINARY)) > 0
So the problem seems to be that 'i̇' is not seen as the lowercase flavor of 'İ', and so MySQL returns no match.
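The mismatch described in this comment can be illustrated outside the database with a small Python sketch. It rests on assumptions: `app_lower` stands in for the application-side lowercasing of the search term, and `db_lower` is a hypothetical model of a database `LOWER()` that maps each character to exactly one code point. Neither function is Bugzilla or MySQL code, and the stored text is invented for the example.

```python
QUERY = "\u0130"                        # the search term, İ
STORED = "test \u0130 in a comment"     # hypothetical stored comment text


def app_lower(s: str) -> str:
    """Application-side lowercasing using full Unicode case mapping."""
    return s.lower()


def db_lower(s: str) -> str:
    """Hypothetical database LOWER(): one output code point per input character."""
    return "".join(ch.lower()[0] for ch in s)


needle = app_lower(QUERY)      # 'i' followed by U+0307 COMBINING DOT ABOVE
haystack = db_lower(STORED)    # contains a plain 'i' where İ used to be

print([f"U+{ord(c):04X}" for c in needle])
print(needle in haystack)      # case-insensitive path: no match
print(QUERY in STORED)         # case-sensitive path: match
```

Under these assumptions the two-code-point needle never occurs in the lowercased text, which mirrors the allwordssubstr query returning nothing while the casesubstring query finds the bug.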
Comment 9 • 17 years ago
I tested with PostgreSQL 8.2.6, and it has the same problem.
Comment 10 • 17 years ago
In Search::GetByWordListSubstr(), I tried replacing (using PostgreSQL):
push(@list, $dbh->sql_position(lc($sql_word),
"LOWER($field)") . " > 0");
by:
push(@list, $dbh->sql_position("LOWER($sql_word)",
"LOWER($field)") . " > 0");
but this doesn't help. Instead of 0 bugs, it now returns all bugs.
Comment 11 • 17 years ago
As reported by bbaetz on IRC, there isn't a one to one mapping between lowercase and uppercase for Turkish, see http://rt.perl.org/rt3/Public/Bug/Display.html?id=36953 and also perldoc perlunicode /lc:
"Things to do with locales (Lithuanian, Turkish, Azeri) do not work since Perl does not understand the concept of Unicode locales."
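The lack of a one-to-one mapping can be seen with locale-independent case conversion in any Unicode-aware language. The following Python sketch is only an illustration of that point (Python's `str.lower()` and `str.upper()` are likewise unaware of the Turkish locale); the constant names are made up for the example.

```python
import unicodedata

DOTTED_CAPITAL = "\u0130"   # İ
DOTLESS_SMALL = "\u0131"    # ı

# İ lowercases to two code points: 'i' plus U+0307 COMBINING DOT ABOVE.
lowered = DOTTED_CAPITAL.lower()

# Uppercasing again yields 'I' plus U+0307, not the original single code point.
round_trip = lowered.upper()

# Only after NFC normalization does the pair compose back to U+0130.
recomposed = unicodedata.normalize("NFC", round_trip)

# Dotless ı uppercases to plain 'I', which lowercases to dotted 'i'.
dotless_round_trip = DOTLESS_SMALL.upper().lower()

print(len(lowered), round_trip == DOTTED_CAPITAL, recomposed == DOTTED_CAPITAL)
print(dotless_round_trip == DOTLESS_SMALL)
```

Neither character survives a lower/upper round trip unchanged, so comparing strings by lowercasing both sides cannot be relied on for Turkish text without locale-aware handling.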
Comment 12 • 17 years ago (Assignee)
Okay. So we should find a way to use sql_istrcmp or something like that for case-insensitive substring location, instead of using Perl's lc.
Updated • 17 years ago
Assignee: query-and-buglist → jjclark1982
Comment 13 • 17 years ago
In theory this should work if we replace code like
$$term = $dbh->sql_position(lc($$q), "LOWER($$ff)") . " > 0";
with
$$term = $dbh->sql_position($dbh->sql_istring($$q), $dbh->sql_istring($$ff)) . " > 0";
However, I am having a lot of trouble ensuring that the entered value ($$q) is in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the correct value, but passing this to mysql does not match correctly.
Comment 14 • 17 years ago (Assignee)
(In reply to comment #13)
> However, I am having a lot of trouble ensuring that the entered value ($$q) is
> in the correct encoding. encode('utf8',decode('utf8',$$q)) appears to print the
> correct value, but passing this to mysql does not match correctly.
Oh, don't mess with the encoding of anything--that shouldn't be necessary at all, if this is 3.1.x.
Comment 15 • 16 years ago (Assignee)
Hey jjclark, any progress on this? This is one of our few code blockers for 3.2.
Comment 16 • 16 years ago
Is it as simple as that? I didn't test this patch.
Attachment #327334 -
Flags: review?(jjclark1982)
Comment 17 • 16 years ago (Assignee)
Comment on attachment 327334 [details] [diff] [review]
patch, v1
This won't work on MySQL. Our sql_position for MySQL was made case-sensitive:
INSTR(CAST($text AS BINARY), CAST($fragment AS BINARY))
We could make a sql_iposition, though, which could handle it. It could default to calling istring on both its arguments, and MySQL could have its own version.
Attachment #327334 -
Flags: review?(jjclark1982) → review-
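The sql_iposition idea from comment 17 can be sketched as a pair of classes that generate SQL strings. This is a hypothetical Python model, not the Perl patch attached to this bug: the method names mirror those mentioned in the thread, the generic `POSITION(... IN ...)` form and the MySQL `INSTR` form without `BINARY` casts are assumptions, and the arguments are taken to be SQL expressions that have already been quoted.

```python
class DB:
    """Hypothetical generic database layer that builds SQL fragments."""

    def sql_istring(self, expr: str) -> str:
        # Case-fold an SQL expression inside the database.
        return f"LOWER({expr})"

    def sql_position(self, fragment: str, text: str) -> str:
        # Case-sensitive substring position (assumed standard SQL form).
        return f"POSITION({fragment} IN {text})"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Default: apply sql_istring to both arguments, as comment 17 suggests.
        return self.sql_position(self.sql_istring(fragment), self.sql_istring(text))


class Mysql(DB):
    """Hypothetical MySQL-specific overrides."""

    def sql_position(self, fragment: str, text: str) -> str:
        # Case-sensitive form quoted in comment 17.
        return f"INSTR(CAST({text} AS BINARY), CAST({fragment} AS BINARY))"

    def sql_iposition(self, fragment: str, text: str) -> str:
        # Assumption: without BINARY casts the column collation decides case handling.
        return f"INSTR({text}, {fragment})"


print(DB().sql_iposition("'İ'", "longdescs_.thetext"))
print(Mysql().sql_iposition("'İ'", "longdescs_.thetext"))
```

The point of the design is that the search term is never lowercased by the application, so both sides of the comparison are case-folded by the same engine.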
Comment 18 • 16 years ago (Assignee)
I didn't realize there were so few LOWER/lc calls in Search.pm, I can probably fix this myself.
Assignee: jjclark1982 → mkanat
Comment 19 • 16 years ago (Assignee)
I've tested this and it generates the right SQL. So at this point, if we don't work, it's a bug in the database, not in Bugzilla. :-)
Attachment #327334 -
Attachment is obsolete: true
Attachment #327344 -
Flags: review?(LpSolit)
Comment 20 • 16 years ago (Assignee)
Comment on attachment 327344 [details] [diff] [review]
v2
I want to write a more extensive patch for the tip that uses sql_iposition everywhere that we currently use LOWER() in sql_position.
Attachment #327344 -
Attachment description: v2 → v2 (3.2)
Comment 21 • 16 years ago (Assignee)
Comment on attachment 327344 [details] [diff] [review]
v2
Actually, I'll just do that in a separate bug.
Attachment #327344 -
Attachment description: v2 (3.2) → v2
Comment 22 • 16 years ago
Comment on attachment 327344 [details] [diff] [review]
v2
Looks correct to me, so r=LpSolit. Someone who is used to Turkish characters will have to test it for us after checkin.
Attachment #327344 -
Flags: review?(LpSolit) → review+
Updated • 16 years ago (Assignee)
Flags: approval3.2+
Flags: approval+
Comment 23 • 16 years ago (Assignee)
tip:
Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v <-- DB.pm
new revision: 1.115; previous revision: 1.114
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v <-- Search.pm
new revision: 1.160; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v <-- Mysql.pm
new revision: 1.62; previous revision: 1.61
done
3.2:
Checking in Bugzilla/DB.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB.pm,v <-- DB.pm
new revision: 1.112.2.1; previous revision: 1.112
done
Checking in Bugzilla/Search.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Search.pm,v <-- Search.pm
new revision: 1.159.2.1; previous revision: 1.159
done
Checking in Bugzilla/DB/Mysql.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/DB/Mysql.pm,v <-- Mysql.pm
new revision: 1.60.2.1; previous revision: 1.60
done
Status: NEW → RESOLVED
Closed: 17 years ago → 16 years ago
Resolution: --- → FIXED
Comment 24 • 16 years ago
Will try to get Pardus team involved
Comment 25 • 16 years ago
Comment 26 • 16 years ago
Right now landfill returns 16 bugs:
http://landfill.bugzilla.org/bugzilla-tip/buglist.cgi?query_format=advanced&short_desc_type=allwordssubstr&short_desc=%C4%B0
Correct test case (http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?id=3296) is found, but all accented 'i' variants (í, Î, Ì) are returned also.
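One possible explanation, not confirmed anywhere in this thread, is that the database compares strings with an accent-insensitive, case-insensitive collation (comment 5 raised collation as a factor). The Python sketch below models such a comparison by stripping combining marks and lowercasing; the `fold` function is a hypothetical approximation, not the rule set of any particular collation.

```python
import unicodedata


def fold(s: str) -> str:
    """Rough stand-in for an accent- and case-insensitive comparison key."""
    decomposed = unicodedata.normalize("NFD", s)
    # Drop combining marks (accents, the dot above) and then ignore case.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


variants = ["\u0130", "\u00ed", "\u00ce", "\u00cc"]   # İ, í, Î, Ì
keys = {fold(v) for v in variants}

print(keys)
```

With this folding all four characters share the same key, so a substring search for İ would also match text containing the other accented variants.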
Comment 27 • 16 years ago
confirmed by Bugzilla-tr staff:
http://bugs.pardus.org.tr/show_bug.cgi?id=7621#c7
QA passed, one can safely pronounce this CLOSED :-)