Open Bug 640856 Opened 13 years ago Updated 2 years ago

Unable to find accented characters when using differing normalization forms

Categories

(Core :: Find Backend, defect, P2)

ASSIGNED

People

(Reporter: bugzilla, Assigned: emk)

References

(Blocks 3 open bugs)

Details

Attachments

(1 file, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15

All the content on my website (in development) is served as decomposed Unicode because of the way it is stored internally, and find-as-you-type cannot match characters with accents.

Requiring all data to be output in Unicode NFC form in order for search functionality to work seems a bit excessive on the part of Firefox.  Admittedly, after testing, Opera and IE show the same character-stream matching behavior.  Only Chrome handles this adequately.


Also: my keyboard input on Windows 7 uses combining characters (as opposed to dead keys) to input accents.  The end result is that while I can find text on my website, I can't find accented text on the web in general.

The net result is that I type in a then the -̈ key to  yield ä  but this cannot find ä on a page.  And if I use my local german layout, I can find the latter by pressing the ä key, but not the former.

Something needs to be done to ensure that the search text and the page content are compared in the same normalization form. At the very least, it seems very odd to visitors testing my app to see bären on the page and then find that searching for bären gives no match.


Chrome default matching:
JP: Katakana=Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)


From the looks of all this, it seems that they are using ICU collation data with the sort-key strength set to primary differences only, ignoring all secondary (accent), tertiary (case), and quaternary (style) differences, and including expansion handling such as ß = ss.  The caveat is that the full primary sort key for any expansion must match.  That is: "⒓" will match "12.", but searching for "12" will not find a partial match in "⒓".

Given the primary ICU-EN sort-key for each of the characters, it becomes more obvious:
1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E

http://www.unicode.org/charts/collation/
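
For illustration only: real primary-strength matching needs ICU collation data, which the Python standard library does not ship. The sketch below approximates the idea with compatibility decomposition, combining-mark removal, and case folding. `loose_key` is a hypothetical helper, not anything Chrome or Firefox is known to use, and it does not cover kana folding, æ = ae, or cross-script numerals.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Rough stand-in for a primary-strength key (an approximation, not ICU)."""
    # Compatibility decomposition expands symbols such as U+2493 to "12."
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop characters with a non-zero combining class (accents and similar marks).
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.combining(ch) == 0
    )
    # Case folding removes case differences and expands U+00DF to "ss".
    return without_marks.casefold()


samples = {
    "\u2493": "12.",            # DIGIT TWELVE FULL STOP
    "Stra\u00dfe": "strasse",   # sharp s expands to ss
    "\u24d5\u24de\u24e7": "fox",  # circled letters
    "\u3391": "khz",            # SQUARE KHZ
    "cafe\u0301": "cafe",       # combining acute is ignored
}
for original, expected in samples.items():
    print(repr(original), "->", loose_key(original), loose_key(original) == expected)
```

Comparing whole keys mirrors the "full expansion must match" caveat above; a plain substring test over these keys would be looser than what Chrome appears to do.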



Reproducible: Always

Steps to Reproduce:
Given the page content NFD: "café or bären " or NFC: "café or bären"
Search for café
Actual Results:  
A search will only find one of the two cases, depending on the keyboard input method.


Expected Results:  
Search will find both results
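
The steps above can be reproduced outside the browser. This is a minimal sketch using Python's `unicodedata` module; `contains_canonical` is a hypothetical helper name, and the escapes make the two forms explicit.

```python
import unicodedata

nfc_word = "caf\u00e9"    # precomposed U+00E9
nfd_word = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def contains_canonical(haystack: str, needle: str) -> bool:
    """Substring test after bringing both sides to the same form (NFC)."""
    return unicodedata.normalize("NFC", needle) in unicodedata.normalize(
        "NFC", haystack
    )


# Page content mixing both forms: decomposed "cafe" + accent, precomposed "bären".
page = nfd_word + " or b\u00e4ren"

print(nfc_word in page)                          # raw code point comparison
print(contains_canonical(page, nfc_word))        # normalized comparison
print(contains_canonical(page, "ba\u0308ren"))   # decomposed search text
```

The raw comparison misses the decomposed word, while the normalized comparison finds it regardless of which form the search text or the page uses.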

In addition, the search is very literal: case-insensitive matching does not work reliably in non-Latin scripts.


The way that Chrome searches is very robust.
й can be found by searching for й (I can type that using the Russian base character and the combining diacritic from a different layout, AltGr-u; unrealistic, but possible).


おおかみ オオカミ  This is a casing difference only, and Chrome will find both for either input; Firefox will only find a direct character match.  Because both are very commonly in use, being unable to do a quick search on both is another usability issue that should be addressed.
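
As a rough sketch of the kana folding described above: the main hiragana block (U+3041 to U+3096) and the corresponding katakana (U+30A1 to U+30F6) sit exactly 0x60 code points apart, so one direction can be folded into the other with a translation table. `fold_kana` is a hypothetical helper; it ignores halfwidth katakana and the few katakana with no hiragana counterpart.

```python
# Hiragana U+3041..U+3096 map onto katakana U+30A1..U+30F6 at a fixed offset.
HIRAGANA_FIRST = 0x3041
HIRAGANA_LAST = 0x3096
KANA_OFFSET = 0x60

# Translation table: katakana code point -> hiragana code point.
KATAKANA_TO_HIRAGANA = {
    cp + KANA_OFFSET: cp for cp in range(HIRAGANA_FIRST, HIRAGANA_LAST + 1)
}


def fold_kana(text: str) -> str:
    """Fold katakana to hiragana so both spellings compare equal."""
    return text.translate(KATAKANA_TO_HIRAGANA)


hiragana = "\u304a\u304a\u304b\u307f"   # おおかみ
katakana = "\u30aa\u30aa\u30ab\u30df"   # オオカミ
print(fold_kana(katakana) == fold_kana(hiragana))
```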
OS: Windows 7 → All
Confirming. This is also important for searching text which contains Unicode Ideographic Variation Sequences.
Assignee: nobody → smontagu
Status: UNCONFIRMED → NEW
Component: General → Internationalization
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → i18n
See also bug 202251 (but this is not a dupe)
(In reply to comment #1)
> Confirming. This is also important for searching text which contains Unicode
> Ideographic Variation Sequences.

Variation Sequences are a somewhat different issue, more related to whether matching should be "strict" or "loose"; text with IVSs is not canonically equivalent to text without them. (So the IVS case is more akin to bug 202251; in many cases, users would prefer somewhat "loose" matching that ignores diacritics, variation selectors, and similar characters in the text. It's conceptually very similar to case-insensitive comparison, which we do by default.)

This bug is an example of the issue that operations involving Unicode text should treat canonically-equivalent code sequences as identical. This affects spell-check, for example, as well as searching. This should be done even if matching is "strict" in the sense that case differences, diacritics, IVS, etc are _not_ being ignored.
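
To make the distinction concrete, here is a small sketch (Python, with the hypothetical helper `canonically_equal`) of a "strict" comparison that still treats canonically equivalent sequences as identical. Case, diacritics, and variation selectors remain significant because NFC does not remove them.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Strict equality up to canonical equivalence (compare NFC forms)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Canonically equivalent: precomposed vs decomposed e-acute.
print(canonically_equal("\u00e9", "e\u0301"))
# Canonically equivalent: ANGSTROM SIGN vs LATIN CAPITAL LETTER A WITH RING ABOVE.
print(canonically_equal("\u212b", "\u00c5"))
# Not equivalent: a diacritic difference is still a difference.
print(canonically_equal("\u00e9", "e"))
# Not equivalent: a case difference is still a difference.
print(canonically_equal("\u00e9", "\u00c9"))
# Not equivalent: an ideograph with and without a variation selector (U+E0100).
print(canonically_equal("\u8fbb\U000e0100", "\u8fbb"))
```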
(In reply to comment #3)
Hmm, I was thinking of Chrome's behavior, but bug 202251 looks like the better place for that topic. Thanks for pointing that out.
Attached patch patch (obsolete) — Splinter Review
With this patch, nsFind will normalize strings to NFKC and strip default ignorable characters before comparing.
Assignee: smontagu → VYV03354
Status: NEW → ASSIGNED
Attachment #523794 - Flags: review?(smontagu)
I'm not sure normalizing to NFKC is always a good idea - this discards distinctions that users may legitimately expect to be recognized by Find. I'd like this to be an _option_ ("loose matching" or something like that), but NFC might be more appropriate for now, at least until we consider how to expose such an option in the UI (similar to case sensitivity).
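
A few examples of the distinctions in question, as a sketch in Python: NFC leaves these characters alone, while NFKC maps them to plain letters and digits. `lost_under_nfkc` is a hypothetical helper that reports whether the two forms differ for a given string.

```python
import unicodedata


def lost_under_nfkc(text: str) -> bool:
    """True when NFKC changes the text beyond what NFC already does."""
    return unicodedata.normalize("NFC", text) != unicodedata.normalize(
        "NFKC", text
    )


examples = [
    "x\u00b2",   # SUPERSCRIPT TWO becomes a plain "2"
    "\uff21",    # FULLWIDTH LATIN CAPITAL LETTER A becomes "A"
    "\u2460",    # CIRCLED DIGIT ONE becomes "1"
    "caf\u00e9", # ordinary precomposed text is unaffected
]
for text in examples:
    nfkc = unicodedata.normalize("NFKC", text)
    print(repr(text), "->", repr(nfkc), "changed:", lost_under_nfkc(text))
```

A user searching for "x2" would match "x²" under NFKC but not under NFC, which is the kind of difference an explicit option would let them control.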
In the testcase I think it would be helpful to use \uXXXX escapes for the accented letters and for the combining accents, rather than literal UTF8 text; otherwise it's difficult to understand when looking at the test file what it's actually supposed to be testing.

Also, how about testing the reverse situation, where the document contains precomposed characters but the search text uses decomposed sequences?
(In reply to comment #6)
> I'm not sure normalizing to NFKC is always a good idea - this discards
> distinctions that users may legitimately expect to be recognized by Find. I'd
> like this to be an _option_ ("loose matching" or something like that), but NFC
> might be more appropriate for now, at least until we consider how to expose
> such an option in the UI (similar to case sensitivity).
We want loose matching between Hankaku and Zenkaku kana. IE9 matches those Kana variants only when case sensitive option is checked. What about using NFKC when ignore case option is specified?
(In reply to comment #7)
> In the testcase I think it would be helpful to use \uXXXX escapes for the
> accented letters and for the combining accents, rather than literal UTF8 text;
> otherwise it's difficult to understand when looking at the test file what it's
> actually supposed to be testing.
> 
> Also, how about testing the reverse situation, where the document contains
> precomposed characters but the search text uses decomposed sequences?
Will do.
> IE9 matches those Kana variants only when case sensitive option is checked.
Sorry, only when case sensitive option is _unchecked_.
Attached patch patch v2 (obsolete) — Splinter Review
Changes:
* Use NFC for case sensitive match.
* Use charref instead of raw UTF-8 char.
* Added a testcase finding a decomposed pattern from a precomposed text.
* Fixed a bug found by the updated test.
Attachment #523794 - Attachment is obsolete: true
Attachment #523794 - Flags: review?(smontagu)
Attachment #523857 - Flags: review?(smontagu)
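
The rule in patch v2 (NFC for a case-sensitive match, NFKC otherwise) can be sketched as follows. This is not the nsFind code; `find_in_text` is a hypothetical Python function, and using `casefold` for the case-insensitive path is an assumption of this sketch.

```python
import unicodedata


def find_in_text(text: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Pick the normalization form from the case-sensitivity option."""
    form = "NFC" if case_sensitive else "NFKC"
    norm_text = unicodedata.normalize(form, text)
    norm_pattern = unicodedata.normalize(form, pattern)
    if not case_sensitive:
        # Loose mode: also ignore case differences.
        norm_text = norm_text.casefold()
        norm_pattern = norm_pattern.casefold()
    return norm_pattern in norm_text


precomposed_page = "Caf\u00e9 menu"
decomposed_pattern = "Cafe\u0301"

# A decomposed pattern finds precomposed text in both modes.
print(find_in_text(precomposed_page, decomposed_pattern, case_sensitive=True))
print(find_in_text(precomposed_page, decomposed_pattern, case_sensitive=False))

# Fullwidth letters only match in the loose (NFKC) mode.
print(find_in_text("\uff26\uff4f\uff58", "Fox", case_sensitive=True))
print(find_in_text("\uff26\uff4f\uff58", "fox", case_sensitive=False))
```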
Attached patch patch v3 (obsolete) — Splinter Review
The previous patch didn't handle halfwidth katakana sound marks correctly.
Attachment #523857 - Attachment is obsolete: true
Attachment #523857 - Flags: review?(smontagu)
Attachment #523858 - Flags: review?(smontagu)
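
The comment does not say what went wrong in v2, so the following is only one plausible illustration of why halfwidth sound marks are awkward. The halfwidth voiced sound mark (U+FF9E) is a separate character that NFKC turns into a combining mark and then composes with the preceding kana, so normalizing character by character gives a different result from normalizing the whole string.

```python
import unicodedata

# HALFWIDTH KATAKANA LETTER KA + HALFWIDTH KATAKANA VOICED SOUND MARK
halfwidth_ga = "\uff76\uff9e"

# Normalizing the whole string composes the pair into one code point (U+30AC).
whole = unicodedata.normalize("NFKC", halfwidth_ga)

# Normalizing each character separately leaves KA (U+30AB) followed by the
# combining voiced sound mark (U+3099), which is not yet composed.
per_char = "".join(unicodedata.normalize("NFKC", ch) for ch in halfwidth_ga)

print([hex(ord(ch)) for ch in whole])
print([hex(ord(ch)) for ch in per_char])
print(whole == per_char)
```

The length also changes from two code points to one, so any mapping from normalized offsets back to the original text has to account for it.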
Blocks: 647805
Just a thought that occurred to me when I had finished reporting the related Bug 647805: Would it be possible/make sense to have different default find behaviours in the normal browser window and in the view source window? That way, normal users could happily ignore the different forms, but interested users could still find them in the view source window.
To the extent that we're introducing NFKC behaviors, I'd like to see the Unicode ligatures decomposed as well for find: ﬂ/fl, ﬃ/ffi, ﬄ/ffl and more (discussed at http://en.wikipedia.org/wiki/Typographical_ligature and in the Unicode normalization test data, http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt). The larger set includes some Arabic ligatures discussed on the Wikipedia page (and many more, of course), which we can reasonably expect to be covered.

As discussed above, this is all covered in the ICU Collation tables-- is there a chance those can be incorporated wholesale here and elsewhere?

In the interests of full disclosure, I should say that I'm also interested in Mozilla incorporating ICU data so it can eventually provide ICU-based collations to Sqlite for the Storage API.
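
For reference, NFKC already expands the Latin presentation-form ligatures and the Arabic ones mentioned above. A minimal Python sketch, with `expand_ligatures` as a hypothetical name for what is just an NFKC call:

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility ligatures by normalizing to NFKC."""
    return unicodedata.normalize("NFKC", text)


ligatures = {
    "\ufb01": "LATIN SMALL LIGATURE FI",
    "\ufb02": "LATIN SMALL LIGATURE FL",
    "\ufb03": "LATIN SMALL LIGATURE FFI",
    "\ufb04": "LATIN SMALL LIGATURE FFL",
    "\ufefb": "ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM",
}
for ligature, name in ligatures.items():
    expanded = expand_ligatures(ligature)
    print(name, "->", [hex(ord(ch)) for ch in expanded])

# A word typed with plain letters now matches text set with a ligature.
print("office" in expand_ligatures("the o\ufb03ce door"))
```

Ligatures that Unicode encodes as ordinary letters, such as æ, have no compatibility decomposition and are not affected by this.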
Attached patch updated to tip — Splinter Review
Attachment #523858 - Attachment is obsolete: true
Attachment #523858 - Flags: review?(smontagu)
Attachment #530107 - Flags: review?(smontagu)
This reminds me of bug 389651. Maybe someone can have a look?
A complete treatment of Unicode equivalence would, or at least could, address bug 389651 as well, since the zero-width space is included in the same set of rules.
Blocks: 389651
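
A sketch of the zero-width space case, in Python. The standard `unicodedata` module does not expose the Default_Ignorable_Code_Point property, so this uses the general category Cf (format characters) as an approximation. That is an assumption of the sketch: Cf covers U+200B and U+00AD, but it is not the same set as the default ignorables. `strip_format_chars` and `find_ignoring_format` are hypothetical helpers.

```python
import unicodedata


def strip_format_chars(text: str) -> str:
    """Remove format characters (category Cf), e.g. ZWSP and soft hyphen."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def find_ignoring_format(text: str, pattern: str) -> bool:
    """Substring test that ignores invisible format characters on both sides."""
    return strip_format_chars(pattern) in strip_format_chars(text)


page = "super\u200bcalifragilistic and co\u00adoperate"

print("supercali" in page)                      # raw comparison
print(find_ignoring_format(page, "supercali"))  # ZWSP ignored
print(find_ignoring_format(page, "cooperate"))  # soft hyphen ignored
```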
An older report: bug 374795
Experienced this issue on Mac OS X (10.7.5) with FF 23.0.1 as well. I had a page in UTF-8 with accented characters, then copied matching text from the Finder and pasted it into the 'Find' field of the search toolbar. This text could not be found. The match was found only after retyping the accented letters in the field. It looks like the interaction with the clipboard is important too.
Blocks: 565552
Component: Internationalization → Find Toolbar
Product: Core → Toolkit
Hardware: x86_64 → All
Flags: firefox-backlog+
Priority: -- → P2
Blocks: 658986
Component: Find Toolbar → Find Backend
Flags: needinfo?(enndeakin)
Product: Toolkit → Core

Sorry, there was a problem with the detection of inactive users. I'm reverting the change.

Assignee: nobody → VYV03354
Status: NEW → ASSIGNED
Severity: normal → S3