Open Bug 640856 Opened 14 years ago Updated 3 years ago

Unable to find accented characters when using differing normalization forms

Status:

ASSIGNED

People

(Reporter: bugzilla, Assigned: emk)

References

(Blocks 2 open bugs)

Details

Attachments

(1 file, 3 obsolete files)

patch, 14 years ago, Masatoshi Kimura [:emk], 28.92 KB, patch
patch v2, 14 years ago, Masatoshi Kimura [:emk], 29.95 KB, patch
patch v3, 14 years ago, Masatoshi Kimura [:emk], 30.61 KB, patch
updated to tip, 14 years ago, Masatoshi Kimura [:emk], 30.63 KB, patch (emk: review? smontagu)

Ice Wolf

Reporter

Description

•

14 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15

All the content on my website (in development) is served in Unicode decomposed form because of the way it is stored internally, and find-as-you-type can't match characters with accents. Requiring all data to be output in Unicode NFC form in order for search to work seems a bit excessive on Firefox's part. Admittedly, after testing, Opera and IE show the same character-stream matching behaviour. Only Chrome handles this adequately.

Also: my keyboard input on Windows 7 uses combining characters (as opposed to dead keys) to input accents. The end result is that while I can find text on my website, I can't find accented text on the web in general. I type a, then the -̈ key, to yield ä, but this cannot find ä on a page. And if I use my local German layout, I can find the latter by pressing the ä key, but not the former.

Something needs to be done to ensure that the search text and the page content are in the same normalization form. At the very least, it seems very odd to visitors testing my app to see bären while searching for bären gives no match.

Chrome default matching:

JP: Katakana = Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)
1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)

From the looks of all this, it seems that they are using ICU collation data with the sort-key strength set to primary differences only, ignoring all secondary (accent), tertiary (case) and quaternary (style) differences, and including expansion handling (ß = ss), with the caveat that the full primary sort key for any expansion must match for the result to display. That is: "⒓" will match "12." but "12" cannot find the partial match to "⒓". Given the primary ICU-EN sort key for each of the characters, it becomes more obvious:

1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E

http://www.unicode.org/charts/collation/

Reproducible: Always

Steps to Reproduce:
Given the page content NFD: "café or bären" or NFC: "café or bären", search for café.

Actual Results:
A search finds only one of the two cases, depending on the keyboard input method.

Expected Results:
Search finds both results.

In addition, the searching is very literal, in that case-insensitivity doesn't work reliably in non-Latin scripts. The way that Chrome searches is very robust.

й can be found by searching for й (I can type that using the Russian base character and the combining diacritic from a different layout (altgr-u)). Unrealistic, but possible.

おおかみオオカミ

This is a casing difference only, and Chrome will find both for either input; Firefox will only find a direct character match. Because both are very commonly in use, being unable to do a quick search on both is another usability issue that should be addressed.
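The primary-strength matching described in this report can be loosely approximated without ICU. The sketch below is only a stand-in under stated assumptions, not what Chrome or ICU actually does: `loose_key` and `loose_contains` are hypothetical helpers that apply NFKD, drop nonspacing marks, and case-fold. It does not cover collation-only equivalences such as æ = ae or Katakana = Hiragana, it matches partial expansions that the report says Chrome rejects, and removing every nonspacing mark would be too aggressive for scripts where marks carry meaning.

```python
import unicodedata


def loose_key(text: str) -> str:
    """Rough stand-in for a primary-strength key (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop nonspacing marks (general category Mn), e.g. combining accents.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.casefold()


def loose_contains(haystack: str, needle: str) -> bool:
    """Substring search on the loose keys of both strings."""
    return loose_key(needle) in loose_key(haystack)


# Page text mixes a precomposed e-acute with a decomposed a-diaeresis.
page = "caf\u00e9 or ba\u0308ren"
print(loose_contains(page, "cafe\u0301"))   # decomposed query  -> True
print(loose_contains(page, "b\u00e4ren"))   # precomposed query -> True
print(loose_key("Stra\u00dfe"))             # -> strasse
```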

Ice Wolf

Reporter

Updated

•

14 years ago

OS: Windows 7 → All

Masatoshi Kimura [:emk]

Assignee

Comment 1

•

14 years ago

Confirming. This is also important for searching text which contains Unicode Ideographic Variation Sequences.

Assignee: nobody → smontagu

Status: UNCONFIRMED → NEW

Component: General → Internationalization

Ever confirmed: true

Product: Firefox → Core

QA Contact: general → i18n

Simon Montagu :smontagu

Comment 2

•

14 years ago

See also bug 202251 (but this is not a dupe)

Jonathan Kew [:jfkthame]

Comment 3

•

14 years ago

(In reply to comment #1)
> Confirming. This is also important for searching text which contains Unicode
> Ideographic Variation Sequences.

Variation Sequences are a somewhat different issue, more related to whether matching should be "strict" or "loose"; text with IVSs is not canonically equivalent to text without them. (So the IVS case is more akin to bug 202251; in many cases, users would prefer somewhat "loose" matching that ignores diacritics, variation selectors, and similar characters in the text. It's conceptually very similar to case-insensitive comparison, which we do by default.)

This bug is an example of the issue that operations involving Unicode text should treat canonically-equivalent code sequences as identical. This affects spell-check, for example, as well as searching. This should be done even if matching is "strict" in the sense that case differences, diacritics, IVS, etc are _not_ being ignored.
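The distinction drawn here can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch, and `canonically_equal` is a hypothetical helper name: two strings count as canonically equivalent when their NFD forms are identical. A decomposed and a precomposed "bären" pass that check, while an ideograph followed by a variation selector does not match the bare ideograph, so ignoring the selector belongs to "loose" matching.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """True when both strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


precomposed = "b\u00e4ren"        # U+00E4
decomposed = "ba\u0308ren"        # a + U+0308 COMBINING DIAERESIS
plain = "\u845b"                  # CJK ideograph
with_ivs = "\u845b\U000e0100"     # same ideograph + VARIATION SELECTOR-17

print(canonically_equal(precomposed, decomposed))  # True
print(canonically_equal(plain, with_ivs))          # False
```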

Masatoshi Kimura [:emk]

Assignee

Comment 4

•

14 years ago

(In reply to comment #3)
Hmm, I was thinking of Chrome's behavior, but bug 202251 looks like the better place for that topic. Thanks for pointing that out.

Masatoshi Kimura [:emk]

Assignee

Comment 5

•

14 years ago

Attached patch: patch (obsolete)

With this patch, nsFind normalizes strings to NFKC and strips default ignorable characters before comparing them.
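As an illustration of the approach described in this comment (not the patch itself, which changes nsFind in C++), the following Python sketch normalizes to NFKC and then removes a few default ignorable code points. `search_key` and `DEFAULT_IGNORABLE_SUBSET` are hypothetical names, and the set is a deliberately small subset chosen for the example; the full Default_Ignorable_Code_Point property is defined in the Unicode Character Database.

```python
import unicodedata

# Illustrative subset only (an assumption for this sketch), not the full
# Default_Ignorable_Code_Point property.
DEFAULT_IGNORABLE_SUBSET = (
    {0x00AD, 0x034F, 0x2060, 0xFEFF}   # soft hyphen, CGJ, word joiner, ZWNBSP
    | set(range(0x200B, 0x2010))       # ZWSP, ZWNJ, ZWJ, LRM, RLM
    | set(range(0xFE00, 0xFE10))       # variation selectors 1-16
    | set(range(0xE0100, 0xE01F0))     # variation selectors 17-256
)


def search_key(text: str) -> str:
    """NFKC-normalize, then drop the ignorable code points listed above."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized if ord(ch) not in DEFAULT_IGNORABLE_SUBSET
    )


print(search_key("zero\u200bwidth"))                      # zerowidth
print(search_key("cafe\u0301") == search_key("caf\u00e9"))  # True
```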

Assignee: smontagu → VYV03354

Status: NEW → ASSIGNED

Attachment #523794 - Flags: review?(smontagu)

Jonathan Kew [:jfkthame]

Comment 6

•

14 years ago

I'm not sure normalizing to NFKC is always a good idea - this discards distinctions that users may legitimately expect to be recognized by Find. I'd like this to be an _option_ ("loose matching" or something like that), but NFC might be more appropriate for now, at least until we consider how to expose such an option in the UI (similar to case sensitivity).
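A few concrete cases show the kind of distinction that NFKC discards and NFC keeps. This is a small sketch using Python's `unicodedata`; the sample strings are chosen for illustration and `forms` is a hypothetical helper.

```python
import unicodedata


def forms(text: str) -> tuple:
    """Return the (NFC, NFKC) forms of the given text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


# x squared, circled digit one, fullwidth Latin capital A
for sample in ["x\u00b2", "\u2460", "\uff21"]:
    nfc, nfkc = forms(sample)
    # NFC leaves each sample unchanged; NFKC maps it to plain characters.
    print(repr(sample), "NFC:", repr(nfc), "NFKC:", repr(nfkc))
```

Under NFKC a search for "x2" would match "x²", which is the sort of loose match a user may or may not want.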

Jonathan Kew [:jfkthame]

Comment 7

•

14 years ago

In the testcase I think it would be helpful to use \uXXXX escapes for the accented letters and for the combining accents, rather than literal UTF8 text; otherwise it's difficult to understand when looking at the test file what it's actually supposed to be testing. Also, how about testing the reverse situation, where the document contains precomposed characters but the search text uses decomposed sequences?
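To make both suggestions concrete, here is a hedged sketch of what such a test table could look like, written in Python and not in the patch's own test harness. `find_canonical` is a hypothetical stand-in for the find operation; every accented letter is spelled with a \uXXXX escape, and each pair is tested in both directions.

```python
import unicodedata


def find_canonical(document: str, pattern: str) -> bool:
    """Substring search after normalizing both sides to NFC."""
    doc = unicodedata.normalize("NFC", document)
    pat = unicodedata.normalize("NFC", pattern)
    return pat in doc


CASES = [
    # (document text, search pattern)
    ("cafe\u0301", "caf\u00e9"),    # decomposed document, precomposed pattern
    ("caf\u00e9", "cafe\u0301"),    # precomposed document, decomposed pattern
    ("ba\u0308ren", "b\u00e4ren"),
    ("b\u00e4ren", "ba\u0308ren"),
]

for document, pattern in CASES:
    assert find_canonical(document, pattern), (document, pattern)
print("all cases matched")
```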

Masatoshi Kimura [:emk]

Assignee

Comment 8

•

14 years ago

(In reply to comment #6)
> I'm not sure normalizing to NFKC is always a good idea - this discards
> distinctions that users may legitimately expect to be recognized by Find. I'd
> like this to be an _option_ ("loose matching" or something like that), but NFC
> might be more appropriate for now, at least until we consider how to expose
> such an option in the UI (similar to case sensitivity).

We want loose matching between Hankaku and Zenkaku kana. IE9 matches those Kana variants only when case sensitive option is checked. What about using NFKC when ignore case option is specified?

(In reply to comment #7)
> In the testcase I think it would be helpful to use \uXXXX escapes for the
> accented letters and for the combining accents, rather than literal UTF8 text;
> otherwise it's difficult to understand when looking at the test file what it's
> actually supposed to be testing.
>
> Also, how about testing the reverse situation, where the document contains
> precomposed characters but the search text uses decomposed sequences?

Will do.

Masatoshi Kimura [:emk]

Assignee

Comment 9

•

14 years ago

> IE9 matches those Kana variants only when case sensitive option is checked.

Sorry, only when case sensitive option is _unchecked_.

Masatoshi Kimura [:emk]

Assignee

Comment 10

•

14 years ago

Attached patch: patch v2 (obsolete)

Changes:
* Use NFC for case sensitive match.
* Use charref instead of raw UTF-8 char.
* Added a testcase finding a decomposed pattern from a precomposed text.
* Fixed a bug found by the updated test.
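The first item in this list, combined with the earlier proposal to use NFKC when case is ignored, amounts to picking the normalization form from the case-sensitivity option. The Python sketch below shows that idea under stated assumptions: `find` is a hypothetical function, and using `casefold` for the case-insensitive path is a choice made for the example, not something taken from the patch.

```python
import unicodedata


def find(document: str, pattern: str, case_sensitive: bool) -> bool:
    """NFC for case-sensitive search, NFKC plus case folding otherwise."""
    form = "NFC" if case_sensitive else "NFKC"
    doc = unicodedata.normalize(form, document)
    pat = unicodedata.normalize(form, pattern)
    if not case_sensitive:
        doc, pat = doc.casefold(), pat.casefold()
    return pat in doc


fullwidth_ka = "\u30ab"   # KATAKANA LETTER KA
halfwidth_ka = "\uff76"   # HALFWIDTH KATAKANA LETTER KA

print(find(fullwidth_ka, halfwidth_ka, case_sensitive=False))  # True
print(find(fullwidth_ka, halfwidth_ka, case_sensitive=True))   # False
print(find("caf\u00e9", "cafe\u0301", case_sensitive=True))    # True
```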

Attachment #523794 - Attachment is obsolete: true

Attachment #523794 - Flags: review?(smontagu)

Attachment #523857 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Comment 11

•

14 years ago

Attached patch: patch v3 (obsolete)

The previous patch didn't handle halfwidth katakana sound marks correctly.
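The comment does not say what exactly went wrong, so the following is only one plausible illustration of why halfwidth sound marks need care, not a description of the actual bug. A halfwidth kana and its sound mark are two separate code points; normalizing them one character at a time leaves a base letter plus a combining mark, while normalizing the sequence as a whole composes them into a single letter.

```python
import unicodedata

# HALFWIDTH KATAKANA LETTER KA + HALFWIDTH KATAKANA VOICED SOUND MARK
halfwidth_ga = "\uff76\uff9e"

# Normalizing each code point on its own cannot compose across the boundary.
per_char = "".join(unicodedata.normalize("NFKC", ch) for ch in halfwidth_ga)

# Normalizing the whole sequence composes it into KATAKANA LETTER GA.
whole = unicodedata.normalize("NFKC", halfwidth_ga)

print([hex(ord(ch)) for ch in per_char])  # ['0x30ab', '0x3099']
print([hex(ord(ch)) for ch in whole])     # ['0x30ac']
```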

Attachment #523857 - Attachment is obsolete: true

Attachment #523857 - Flags: review?(smontagu)

Attachment #523858 - Flags: review?(smontagu)

Masatoshi Kimura [:emk]

Assignee

Updated

•

14 years ago

Blocks: 647805

j. 'mach' wust

Comment 12

•

14 years ago

Just a thought that occurred to me after I finished reporting the related Bug 647805: would it be possible, and would it make sense, to have different default find behaviours in the normal browser window and in the view source window? That way, normal users could happily ignore the different forms, but interested users could still find them in the view source window.

Avram Lyon

Comment 13

•

14 years ago

To the extent that we're introducing NFKC behaviors, I'd like to see the Unicode ligatures decomposed as well for find-- fl/ﬂ, ffi/ﬃ, ffl/ﬄ and more (discussed http://en.wikipedia.org/wiki/Typographical_ligature and in the Unicode Normalization Charts, http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt). The larger set includes some Arabic ligatures discussed on the Wikipedia page (and many more, of course), and which we can reasonably expect to be covered. As discussed above, this is all covered in the ICU Collation tables-- is there a chance those can be incorporated wholesale here and elsewhere? In the interests of full disclosure, I should say that I'm also interested in Mozilla incorporating ICU data so it can eventually provide ICU-based collations to Sqlite for the Storage API.
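For reference, NFKC already expands the ligatures named here, as the short Python check below shows. `expand` is a hypothetical helper, and the Arabic entry (the isolated lam-alef presentation form) is one example picked for this sketch, not a survey of the Arabic ligatures.

```python
import unicodedata

# Presentation-form ligatures and their expected NFKC expansions.
LIGATURES = {
    "\ufb02": "fl",             # LATIN SMALL LIGATURE FL
    "\ufb03": "ffi",            # LATIN SMALL LIGATURE FFI
    "\ufb04": "ffl",            # LATIN SMALL LIGATURE FFL
    "\ufefb": "\u0644\u0627",   # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
}


def expand(text: str) -> str:
    """Expand compatibility characters, including ligatures, via NFKC."""
    return unicodedata.normalize("NFKC", text)


for ligature, expected in LIGATURES.items():
    assert expand(ligature) == expected

# Plain letters typed by the user now match the ligature on the page.
print("ffl" in expand("ba\ufb04e"))  # True
```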

Masatoshi Kimura [:emk]

Assignee

Comment 14

•

14 years ago

Attached patch: updated to tip

Attachment #523858 - Attachment is obsolete: true

Attachment #523858 - Flags: review?(smontagu)

Attachment #530107 - Flags: review?(smontagu)

j.j.

Comment 16

•

14 years ago

This reminds me of bug 389651. Maybe someone can have a look?

Avram Lyon

Comment 17

•

14 years ago

A complete treatment of Unicode equivalence would, or at least could, address bug 389651 as well, since the zero-width space is included in the same set of rules.

Martijn Wargers (dead)

Updated

•

14 years ago

Blocks: 389651

[:Aleksej]

Comment 18

•

13 years ago

An older report: bug 374795

Andre-John Mas

Comment 19

•

12 years ago

Experienced this issue on Mac OS X (10.7.5) with FF 23.0.1 as well. I had a page in UTF-8 with accented characters, copied matching text from the Finder, and pasted it into the 'Find' field of the search toolbar. The text could not be found. The match was found only after I retyped the accented letters in the field. It looks like the interaction with the clipboard matters too.

Mike de Boer [:mikedeboer]

Updated

•

11 years ago

Blocks: 565552

Component: Internationalization → Find Toolbar

Product: Core → Toolkit

Mike de Boer [:mikedeboer]

Updated

•

11 years ago

Hardware: x86_64 → All

Jenn Chaulk (:jchaulk)

Updated

•

11 years ago

Flags: firefox-backlog+

Mike de Boer [:mikedeboer]

Updated

•

9 years ago

Priority: -- → P2

Wayne Mery (:wsmwk)

Updated

•

9 years ago

Blocks: 658986

Comment hidden (off-topic)

Neil Deakin

Updated

•

3 years ago

Component: Find Toolbar → Find Backend

Flags: needinfo?(enndeakin)

Product: Toolkit → Core

Suhaib Mujahid [:suhaib]

Comment 24

•

3 years ago

Sorry, there was a problem with the detection of inactive users. I'm reverting the change.

Assignee: nobody → VYV03354

Status: NEW → ASSIGNED

BMO Automation

Updated

•

3 years ago

Severity: normal → S3
