Open
Bug 640856
Opened 13 years ago
Updated 2 years ago
Unable to find accented characters when using differing normalization forms
Categories
(Core :: Find Backend, defect, P2)
Tracking
()
ASSIGNED
People
(Reporter: bugzilla, Assigned: emk)
References
(Blocks 3 open bugs)
Details
Attachments
(1 file, 3 obsolete files)
30.63 KB, patch
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15

All the content on my website (in development) is served in Unicode decomposed form because of the way it is stored internally, and find-as-you-type can't match characters with accents. Requiring all data to be output in Unicode NFC form in order for search functionality to work seems a bit excessive on the part of Firefox. Admittedly, after testing, Opera and IE show the same character-stream matching behaviour. Only Chrome handles this adequately.

Also: my keyboard input on Windows 7 uses combining characters (as opposed to dead keys) to input accents. The end result is that while I can find text on my website, I can't find accented text on the web in general. I type a and then the -̈ key to yield ä, but this cannot find ä on a page. If I use my local German layout, I can find the latter by pressing the ä key, but not the former.

Something needs to be done to ensure that the search text and the page content are in the same normalization form. At the very least, it seems very odd to visitors testing my app to see bären but get no match when searching for bären.

Chrome default matching:
JP: Katakana=Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)
1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)

From the looks of all this, it seems that they are using ICU collation data with the sort-key strength set to primary differences only, ignoring all secondary (accent), tertiary (case) and quaternary (style) differences, and including expansion handling (ß = ss), with the caveat that the full primary sort key for any expansion must match for a result to be shown. That is: "⒓" will match "12." but "12" cannot find the partial match to "⒓".

Given the primary ICU-EN sort key for each of the characters, it becomes more obvious:
1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E
http://www.unicode.org/charts/collation/

Reproducible: Always

Steps to Reproduce:
Given the page content NFD: "café or bären" or NFC: "café or bären"
Search for café

Actual Results:
A search will only find one of the two cases, depending on the keyboard input method.

Expected Results:
Search will find both results.

In addition, the searching is very literal, in that case-insensitivity doesn't work reliably in non-Latin scripts. The way that Chrome searches is very robust. й can be found by searching for й (I can type that in using the Russian base character and the combining diacritic from a different layout (AltGr-u); unrealistic, but possible).

おおかみ オオカミ
This is a casing difference only, and Chrome will find both for either input; Firefox will only find a direct character match. Because both are very commonly in use, being unable to do a quick search on both is another usability issue that should be addressed.
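The steps to reproduce can be sketched outside the browser. The following is a minimal Python illustration, not Firefox code: `naive_find_all` and `normalized_find_all` are hypothetical helper names, and the strings are written with escapes so that the precomposed and decomposed forms are distinguishable in the source.

```python
import unicodedata

# The same visible word in two encodings.
nfc_cafe = "caf\u00e9"    # precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd_cafe = "cafe\u0301"   # decomposed: "e" followed by U+0301 COMBINING ACUTE ACCENT

# A page that contains both forms, as in the steps to reproduce.
page = "NFD: " + nfd_cafe + " or NFC: " + nfc_cafe


def naive_find_all(haystack, needle):
    """Return the start offset of every code-point-for-code-point match."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + 1)
    return hits


def normalized_find_all(haystack, needle, form="NFC"):
    """Normalize both sides to the same form before searching.

    Offsets refer to the normalized haystack, not to the original text.
    """
    return naive_find_all(
        unicodedata.normalize(form, haystack),
        unicodedata.normalize(form, needle),
    )


naive_hits = naive_find_all(page, nfc_cafe)            # one match only
normalized_hits = normalized_find_all(page, nfc_cafe)  # both occurrences
```

Because the offsets come from the normalized copy, a real implementation would also need to map them back to positions in the original text before highlighting a match; that mapping is left out of this sketch.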
Assignee | ||
Comment 1•13 years ago
|
||
Confirming. This is also important for searching text which contains Unicode Ideographic Variation Sequences.
Assignee: nobody → smontagu
Status: UNCONFIRMED → NEW
Component: General → Internationalization
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → i18n
Comment 2•13 years ago
|
||
See also bug 202251 (but this is not a dupe)
Comment 3•13 years ago
|
||
(In reply to comment #1)
> Confirming. This is also important for searching text which contains Unicode
> Ideographic Variation Sequences.

Variation Sequences are a somewhat different issue, more related to whether matching should be "strict" or "loose"; text with IVSs is not canonically equivalent to text without them. (So the IVS case is more akin to bug 202251; in many cases, users would prefer somewhat "loose" matching that ignores diacritics, variation selectors, and similar characters in the text. It's conceptually very similar to case-insensitive comparison, which we do by default.)

This bug is an example of the issue that operations involving Unicode text should treat canonically-equivalent code sequences as identical. This affects spell-check, for example, as well as searching. This should be done even if matching is "strict" in the sense that case differences, diacritics, IVS, etc are _not_ being ignored.
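The distinction between canonical equivalence and "loose" matching can be made concrete with a small Python sketch. The function names are hypothetical, and the loose key below (case folding plus removal of nonspacing marks) is only one possible definition of loose matching, not the one proposed in bug 202251.

```python
import unicodedata


def canonically_equal(a, b):
    """Strict comparison that still treats canonically equivalent text as identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def loose_key(text):
    """One possible loose key: fold case, decompose, drop nonspacing marks.

    Variation selectors have general category Mn, so they are dropped
    together with combining diacritics.
    """
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def loosely_equal(a, b):
    return loose_key(a) == loose_key(b)


# Precomposed vs decomposed e-acute: canonically equivalent.
strict_same = canonically_equal("\u00e9", "e\u0301")

# An ideograph with and without an ideographic variation selector:
# not canonically equivalent, but equal under the loose key.
ivs_strict = canonically_equal("\u845b\U000E0100", "\u845b")
ivs_loose = loosely_equal("\u845b\U000E0100", "\u845b")
```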
Assignee | ||
Comment 4•13 years ago
|
||
(In reply to comment #3) Hmm, I was thinking of Chrome's behavior, but bug 202251 looks like the better place for that topic. Thanks for pointing that out.
Assignee | ||
Comment 5•13 years ago
|
||
With this patch, nsFind will normalize strings to NFKC and strip default ignorable characters before comparing them.
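As a rough illustration of that approach (not the patch itself, which is not reproduced here), the Python sketch below normalizes to NFKC and then strips ignorable characters. Python's `unicodedata` does not expose the Default_Ignorable_Code_Point property, so `_IGNORABLE_RANGES` is a hand-picked, incomplete subset chosen for this example, and `find_key` is a hypothetical name.

```python
import unicodedata

# Assumption: a small, incomplete subset of Default_Ignorable_Code_Point.
# The authoritative list is in the Unicode DerivedCoreProperties.txt file.
_IGNORABLE_RANGES = [
    (0x00AD, 0x00AD),    # SOFT HYPHEN
    (0x200B, 0x200F),    # zero-width space/joiners and directional marks
    (0x2060, 0x2064),    # WORD JOINER and invisible operators
    (0xFE00, 0xFE0F),    # variation selectors 1-16
    (0xFEFF, 0xFEFF),    # ZERO WIDTH NO-BREAK SPACE
    (0xE0100, 0xE01EF),  # variation selectors 17-256 (used in IVSs)
]


def _is_ignorable(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in _IGNORABLE_RANGES)


def find_key(text):
    """Normalize to NFKC, then drop the ignorable characters listed above."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if not _is_ignorable(ch))


def contains(page_text, pattern):
    return find_key(pattern) in find_key(page_text)
```

For example, `contains("co\u00adoperate", "cooperate")` is true because the soft hyphen is dropped from the page text before the comparison.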
Comment 6•13 years ago
|
||
I'm not sure normalizing to NFKC is always a good idea - this discards distinctions that users may legitimately expect to be recognized by Find. I'd like this to be an _option_ ("loose matching" or something like that), but NFC might be more appropriate for now, at least until we consider how to expose such an option in the UI (similar to case sensitivity).
Comment 7•13 years ago
|
||
In the testcase I think it would be helpful to use \uXXXX escapes for the accented letters and for the combining accents, rather than literal UTF8 text; otherwise it's difficult to understand when looking at the test file what it's actually supposed to be testing. Also, how about testing the reverse situation, where the document contains precomposed characters but the search text uses decomposed sequences?
Assignee | ||
Comment 8•13 years ago
|
||
(In reply to comment #6)
> I'm not sure normalizing to NFKC is always a good idea - this discards
> distinctions that users may legitimately expect to be recognized by Find. I'd
> like this to be an _option_ ("loose matching" or something like that), but NFC
> might be more appropriate for now, at least until we consider how to expose
> such an option in the UI (similar to case sensitivity).

We want loose matching between Hankaku and Zenkaku kana. IE9 matches those Kana variants only when case sensitive option is checked. What about using NFKC when ignore case option is specified?

(In reply to comment #7)
> In the testcase I think it would be helpful to use \uXXXX escapes for the
> accented letters and for the combining accents, rather than literal UTF8 text;
> otherwise it's difficult to understand when looking at the test file what it's
> actually supposed to be testing.
>
> Also, how about testing the reverse situation, where the document contains
> precomposed characters but the search text uses decomposed sequences?

Will do.
Assignee | ||
Comment 9•13 years ago
|
||
> IE9 matches those Kana variants only when case sensitive option is checked.
Sorry, only when case sensitive option is _unchecked_.
Assignee | ||
Comment 10•13 years ago
|
||
Changes:
* Use NFC for case sensitive match.
* Use charref instead of raw UTF-8 char.
* Added a testcase finding a decomposed pattern from a precomposed text.
* Fixed a bug found by the updated test.
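The policy discussed in comments 6 to 10 (NFC for case-sensitive matching, and, as proposed in comment 8, NFKC when case is ignored) might look roughly like the following Python sketch. This is an illustration under those assumptions, not the attached patch; `match_key` and `contains` are hypothetical names.

```python
import unicodedata


def match_key(text, case_sensitive):
    """Build the comparison key for one side of the match."""
    if case_sensitive:
        # Strict mode: only canonical equivalence is folded away.
        return unicodedata.normalize("NFC", text)
    # Loose mode: compatibility normalization plus case folding.
    # Case folding can produce unnormalized text, so normalize again afterwards.
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def contains(page_text, pattern, case_sensitive=False):
    return match_key(pattern, case_sensitive) in match_key(page_text, case_sensitive)


# Decomposed pattern, precomposed page text: found even in strict mode.
strict_accent = contains("caf\u00e9", "cafe\u0301", case_sensitive=True)

# Halfwidth "ga" (U+FF76 U+FF9E) on the page, fullwidth U+30AC in the pattern:
# only the loose mode treats them as the same.
strict_kana = contains("\uff76\uff9e", "\u30ac", case_sensitive=True)
loose_kana = contains("\uff76\uff9e", "\u30ac", case_sensitive=False)
```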
Attachment #523794 -
Attachment is obsolete: true
Attachment #523794 -
Flags: review?(smontagu)
Attachment #523857 -
Flags: review?(smontagu)
Assignee | ||
Comment 11•13 years ago
|
||
The previous patch didn't handle halfwidth katakana sound marks correctly.
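The comment does not say what went wrong, so the following Python sketch is only an illustration of why halfwidth sound marks are easy to mishandle in general. The halfwidth voiced sound mark (U+FF9E) is a separate code point that follows its base kana, so normalizing one character at a time gives a different result from normalizing the whole string.

```python
import unicodedata

halfwidth_ga = "\uff76\uff9e"  # HALFWIDTH KATAKANA LETTER KA + HALFWIDTH VOICED SOUND MARK
fullwidth_ga = "\u30ac"        # KATAKANA LETTER GA

# Normalizing each code point separately maps the sound mark to the
# combining mark U+3099 but never composes it with the preceding kana.
per_char = "".join(unicodedata.normalize("NFKC", ch) for ch in halfwidth_ga)

# Normalizing the string as a whole decomposes and then recomposes,
# yielding the single precomposed letter.
whole = unicodedata.normalize("NFKC", halfwidth_ga)
```

Here `whole` equals `fullwidth_ga`, while `per_char` is the two-code-point sequence U+30AB U+3099, which only becomes U+30AC after a further normalization pass over the full string.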
Attachment #523857 -
Attachment is obsolete: true
Attachment #523857 -
Flags: review?(smontagu)
Attachment #523858 -
Flags: review?(smontagu)
Comment 12•13 years ago
|
||
Just a thought that occurred to me when I had finished reporting the related Bug 647805: would it be possible, or make sense, to have different default find behaviours in the normal browser window and in the view source window? That way, normal users could happily ignore the different forms, but interested users could still find them in the view source window.
Comment 13•13 years ago
|
||
To the extent that we're introducing NFKC behaviors, I'd like to see the Unicode ligatures decomposed as well for find: ﬂ/fl, ﬃ/ffi, ﬄ/ffl and more (discussed at http://en.wikipedia.org/wiki/Typographical_ligature and in the Unicode Normalization Charts, http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt). The larger set includes some Arabic ligatures discussed on the Wikipedia page (and many more, of course), which we can reasonably expect to be covered. As discussed above, this is all covered in the ICU collation tables. Is there a chance those can be incorporated wholesale here and elsewhere? In the interests of full disclosure, I should say that I'm also interested in Mozilla incorporating ICU data so it can eventually provide ICU-based collations to SQLite for the Storage API.
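For reference, compatibility normalization already decomposes these ligature code points. The Python sketch below checks a few of them with the standard `unicodedata` module; the sample table and the helper name `fold_compat` are chosen for this example only.

```python
import unicodedata

# Ligature code point -> expected NFKC result.
samples = {
    "\ufb02": "fl",            # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",           # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",           # LATIN SMALL LIGATURE FFL
    "\ufefb": "\u0644\u0627",  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
}


def fold_compat(text):
    """Apply compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


all_decomposed = all(fold_compat(lig) == plain for lig, plain in samples.items())

# NFC, by contrast, leaves the ligature code points untouched.
nfc_keeps_ligature = unicodedata.normalize("NFC", "\ufb02") == "\ufb02"
```

A word such as "o\ufb03ce" therefore becomes "office" under `fold_compat`, which is what lets a plain-ASCII search term match it.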
Assignee | ||
Comment 14•13 years ago
|
||
Attachment #523858 -
Attachment is obsolete: true
Attachment #523858 -
Flags: review?(smontagu)
Attachment #530107 -
Flags: review?(smontagu)
Comment 16•13 years ago
|
||
This reminds me of bug 389651. Maybe someone can have a look?
Comment 17•13 years ago
|
||
A complete treatment of Unicode equivalence would, or at least could, address bug 389651 as well, since the zero-width space is included in the same set of rules.
Comment 18•12 years ago
|
||
An older report: bug 374795
Comment 19•11 years ago
|
||
Experienced this issue on Mac OS X (10.7.5) with FF 23.0.1 as well. I had a page in UTF-8 with accented characters, then copied matching text from the Finder and pasted it into the 'Find' field of the search toolbar. This text could not be found. Only by retyping the accented letters in the field was the match found. It looks like interaction with the clipboard is important too.
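One plausible explanation (an assumption, not confirmed in this report) is that the text copied from the Finder was in decomposed form while the page used precomposed characters. A small diagnostic like the Python sketch below can show which code points a pasted string actually contains; `describe` and `is_precomposed` are hypothetical helper names.

```python
import unicodedata


def describe(text):
    """List each code point as 'U+XXXX NAME' so invisible differences show up."""
    return [
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def is_precomposed(text):
    """True if the text is already in NFC."""
    return unicodedata.normalize("NFC", text) == text


pasted = "e\u0301"  # stands in for clipboard text; decomposed e-acute
typed = "\u00e9"    # the same letter typed directly; precomposed

pasted_report = describe(pasted)
typed_report = describe(typed)
```

The two strings render identically, but `pasted_report` lists two code points and `typed_report` lists one, which is enough to make a code-point-for-code-point search fail.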
Updated•10 years ago
|
Updated•10 years ago
|
Hardware: x86_64 → All
Updated•10 years ago
|
Flags: firefox-backlog+
Updated•8 years ago
|
Priority: -- → P2
Updated•2 years ago
|
Component: Find Toolbar → Find Backend
Flags: needinfo?(enndeakin)
Product: Toolkit → Core
Comment 24•2 years ago
|
||
Sorry, there was a problem with the detection of inactive users. I'm reverting the change.
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Updated•2 years ago
|
Severity: normal → S3