Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages

Status: NEW
Assignee: Unassigned
Product: Core
Component: Find Backend
Priority: --
Severity: enhancement
Opened: 14 years ago
Last modified: 3 days ago

People

(Reporter: tomer, Unassigned, NeedInfo)

Tracking

(Blocks: 3 bugs, {intl})

Version: Trunk
Platform: x86
OS: All
Points: ---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [parity-chrome], URL)

Attachments

(1 attachment, 7 obsolete attachments)

(Reporter)

Description

14 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4a) Gecko/20030401
Build Identifier: 

Following is a link to a Passover Haggadah (the Jewish text for the Passover holiday), in Hebrew with inline punctuation (nikud). If you search for Passover ("פסח"), you get only one result - the one in the title bar, which has no punctuation at all - but if you search for the punctuated version of the same word ("פֶּסַח"), you get the other results.

Reproducible: Always

Steps to Reproduce:



Expected Results:  
Mozilla should ignore punctuation characters in search queries, to make it possible to omit them in the search.
(Reporter)

Comment 1

14 years ago
The original post I sent was in ISO-8859-8-I encoding, which does not support punctuation marks the way the windows-1255 encoding does. Here are the same words as in the previous post, hopefully more readable this time.

Passover - "פסח"
Passover with punctuations - "פֶּסַח"
CCing akkana; the problem is that we basically do an opaque match, right?

Comment 3

14 years ago
Right.  We just compare characters in the dom against characters in the search
string, one after another, with no added smarts except for whitespace.  Should
this go to an I18n person who's more familiar with how inline punctuation works?
 Should the find backend be ignoring certain characters in certain encodings?
I don't think we want to automatically ignore all diacritical characters (for
example I might want to search for a specific form with diacritics and not find
the same base characters without diacritics or with different diacritics), but
it would often be useful as an option.

Comment 5

14 years ago
I tend to agree with Simon. Maybe we should add a checkbox somewhere: "ignore diacritics".

Comment 6

14 years ago
Adding a checkbox also means altering the find interfaces, since neither nsIFind
nor nsIWebBrowserFind offers that option.  nsIWebBrowserFind (the one used for
searching web pages) is frozen, so you can't add anything to it; you'd have to
make a new version of it.  nsIFind is not frozen, so I guess something like that
could be added.

These are public interfaces, so think carefully about how the interface should
be specified so it will be useful for different charsets under different
circumstances, and will cause minimal disruption to anyone already using the
current interfaces.

Cc'ing Kin and the other Simon, who were involved with the design of the current
interfaces and should have a chance at reviewing any proposal to change them.

Comment 7

14 years ago
Confirmed with WinXP/1.4a/20030321.

IMO, Find-As-You-Type should ignore diacritics by default, while the
dialog-box-search should include a Match Diacritic option, as in IE6.

Prog.

Updated

14 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
*** Bug 247635 has been marked as a duplicate of this bug. ***

Updated

13 years ago
Summary: Find/FindAsYouType will not find text if entered with punctuation ("nikud") in Hebrew, but probably other languages as well. → Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages

Updated

13 years ago
Keywords: intl

Comment 9

13 years ago
I am the one who opened #247635, and of course, I agree that there should be an option to disable 'opaque match'. My main problem is on French boards, when searching for a keyword with accents. Many people misspell the accents, and for big pages it can be very difficult to find what I am searching for (I must make multiple tries).

Comment 10

13 years ago
Would this feature require us to create a data table of the accent-less versions of every accented Unicode char? Or does such a table or library function already exist in open source somewhere?
Component: XP Apps → Keyboard: Find as you Type

Comment 11

12 years ago
It does not ignore the French accents (and it should).
For example, I typed "telephonie" on a page where "Téléphonie" was written, and it did not find it.
I have an American keyboard without those French accents (like 'é').
It should be so:
é = ê = è = ë = e
and the same for a, i, ... I guess other languages have the same problem.
*** Bug 312656 has been marked as a duplicate of this bug. ***
Assignee: jag → nobody
Severity: normal → enhancement
QA Contact: pawyskoczka → keyboard.fayt

Comment 13

11 years ago
*** Bug 353647 has been marked as a duplicate of this bug. ***

Comment 14

11 years ago
>Would this feature require us to create a data table of
>the accent-less versions of every accented unicode char?

For Latin alphabets at least, this can be easily generated from
<http://www.unicode.org/Public/UNIDATA/NamesList.txt>.  Map 'LATIN SMALL LETTER Y WITH WHATEVER' to 'LATIN SMALL LETTER Y' and so on.
(Reporter)

Comment 15

11 years ago
As for Hebrew, we should ignore characters 0x0591 to 0x05C7. 

Comment 16

10 years ago
I don't like this. In German, for example, "Bär" and "Bar" are different words with different meanings, and they shouldn't be mixed up by the search engine. On the other hand, somebody with a US keyboard might find it handy to search for "Bar" and find "Bär", because he cannot easily enter an "ä".

I think adding a checkbox will not make much sense; it will probably confuse users.

Any ideas?
Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 
Without diacritics: פסח
With diacritics: פֶּסַח
(In reply to comment #17)
> Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 

I just corrected the charsets in the database on those two comments.
(Reporter)

Comment 19

9 years ago
URL changed, since the previous one has gone. (Thanks Yehuda!)
(In reply to comment #16)
> I don't like this. In German, for example, "Bär" and "Bar" are different
> words with different meanings, and they shouldn't be mixed up by the search
> engine. On the other hand, somebody with a US keyboard might find it handy
> to search for "Bar" and find "Bär", because he cannot easily enter an "ä".
> 
> I think adding a checkbox will not make much sense; it will probably
> confuse users.
> 
> Any ideas?

Even in German there are edge cases, such as "Mass" (mass, large amounts) vs. "Maß" (measure), which are homographs in Switzerland but not in Germany (because the Swiss normally don't use the ß); hence the slur "How do the Swiss drink beer?". OTOH in French, accents are rarely printed on uppercase letters, except sometimes the É; I think French words differing only in accentuation would be regarded as homographs anyway.

I see the following possibilities:
- Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);
- Additional key (in addition to / etc.) to trigger search regardless of accents;
- Always ignore accents while searching, except when the string input by the user includes accents (but this would find German Bär "bear" when looking for Bar "bar");
- New preference [ ] Ignore accents when searching (where in the prefs UI would depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb prefs -- or only in about:config but IMHO it ought to have a UI);
- Maybe there are other possibilities: Any other ideas?

Comment 21

9 years ago
(In reply to comment #20)
> Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> "Maß" (measure), which are homographs in Switzerland 

You mean "Masse" and "Maß", which have the same plural ("Massen" / "Maßen") in Switzerland. Note that ß is not an accent of the s but a ligature.

But it is a good idea to treat "ß" and "ss" the same when ignoring accents is activated.

> - Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);

There is already a very rarely used "ignore case" checkbox. I don't think that this is a good idea.

> - Additional key (in addition to / etc.) to trigger search regardless of
> accents;

I can't say much on this point: for German users like me, the / key isn't very easy to enter (Shift+7), so most of the time it's easier to just activate FAYT or use Ctrl+F.

> - Always ignore accents while searching, except when the string input by the
> user includes accents (but this would find German Bär "bear" when looking for
> Bar "bar");

... and this behaviour would be considered a bug by most German users, I think.

> - New preference [ ] Ignore accents when searching (where in the prefs UI would
> depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb
> prefs 

I like that idea. This pref could be on by default, but German users who have problems when searching, or who consider this behavior a bug, can deactivate it.

Maybe it's even possible to activate the pref by default only for non-German* Firefox versions?
(In reply to comment #21)
> (In reply to comment #20)
> > Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> > "Maß" (measure), which are homographs in Switzerland 
> 
> You mean "Masse" and "Maß" which have the same plural ("Massen" / "Maßen") 
> in Switzerland. Note that ß is not an accent of the s but a ligature.
> 
> But it is a good idea to treat "ß" and "ss" the same when ignoring accents
> is activated.

Yes, I meant treating ß like ss when ignoring accents; and they also have the same dative: "Wie trinken die Schweizer Bier? -- In Masse."

Comment 23

9 years ago
Finding extra matches (e.g. Bär when the user meant just Bar) is a bug, but not a serious one.  The user can just skip to the next match.

On the other hand, failing to find a match is a serious deficiency in a search function.  So I think the right default is to search accent-blind when the search string is just ASCII text.  If the user explicitly enters accents, of course, it's different.
(Assignee)

Updated

9 years ago
Product: Core → SeaMonkey

Updated

9 years ago
Duplicate of this bug: 358307

Updated

9 years ago
Component: Find In Page → Find Backend
Product: SeaMonkey → Core
QA Contact: keyboard.fayt → find-backend

Comment 25

9 years ago
See also bug 475306, a similar bug for smart-quote characters and such.
Hi -- I'm the reporter of Bug 475306

Not sure if it's the best solution, but I feel I should throw this out there: How about a "match case and diacritics etc." box or something to that effect?

Also, if a separate checkbox or pref ends up being the best solution, wouldn't it be possible to disable it by default in German builds?
Oddly, I just ran across the following article while reading something completely unrelated to this bug: http://en.wikipedia.org/wiki/Unicode_normalization

These charts of equivalent characters are a few links from there: http://www.unicode.org/charts/normalization/

Small world, huh?  :)
I'm beginning to think bug 276757 and this should be merged and renamed "Unicode strings not normalized in Find [As You Type]" or something to that effect. They're really the same problem, and fixing this one would (or at least should) fix the other.

Comment 29

9 years ago
I agree.

Normalization of quotation marks and combined characters (ä = ä) should happen in any case.

But I think the replacement of diacritics should depend on the language the text is in.

For example, in German:

ä = ae
ö = oe
ü = ue
ë = e (but aë ≠ ä etc.)
ß = ss

while in some other European languages:

ä = a
ö = o
ü = u
etc.

and in Dutch, for example, there is

ij = ij = ÿ in some areas

I'm not an expert in typography or in the various languages, but doing this right seems to be a big load of work.
Normalization is orthogonal to what this bug and bug 276757 are about.

As comment 29 says, normalization means that searching for ä (a sequence of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS) should find ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) and vice versa. This bug requests an option to search for a alone, and find ä.

Normalizing to NFC or NFD forms (canonical normalization) will solve the ä = ä issue, but will not help with bug 276757. Even NFKC or NFKD (compatibility normalization) will only solve some of the cases listed in bug 276757 comment 10.
How about treating a character the same as the first character of its canonical-equivalent (decomposed) string?

For example, "à" (LATIN SMALL LETTER A WITH GRAVE) is 00E0, and the normalization chart says it's equivalent to 0061 (LATIN SMALL LETTER A) + 0300 (COMBINING GRAVE ACCENT). It shouldn't be too difficult to infer that if 00E0 = 0061 0300, then 00E0 = 0061.
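
For what it's worth, ICU can answer this directly: the canonical decomposition is available per code point, and the base letter is its first character. A minimal standalone sketch (hypothetical illustration code, not anything in the tree):

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
      if (U_FAILURE(status)) {
        return 1;
      }
      // U+00E0 LATIN SMALL LETTER A WITH GRAVE decomposes canonically to
      // U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT, so the
      // base letter is the first code point of the decomposition.
      icu::UnicodeString decomposition;
      if (nfd->getDecomposition(0x00E0, decomposition)) {
        UChar32 base = decomposition.char32At(0);
        return base == 0x0061 ? 0 : 1;  // exits 0: 00E0 maps to 0061
      }
      return 1;
    }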

Comment 32

9 years ago
My humble opinion is that the search tool should be as tolerant and permissive as possible. While false positives (see ex. 1) would be rare and unimportant, this would solve all the false-negative issues (see ex. 2), which are really painful.

Ex. 1: search for 'daem'. Either you are German and you probably want this to match both 'daem' and 'däm'; or you are not German and the letter 'ä' probably doesn't appear in your page; or the string 'däm' does appear in your page and does match, in which case you find it odd and just skip it yourself (this should not happen that often).

Ex. 2: you are French and using someone else's computer, with a US keyboard and US locale. Not being able to type "café", you search for "cafe" and expect it to match, whatever the computer's locale!

My suggestion emphasizes that the search tool is a quick and ergonomic feature which doesn't need to be strict and exact like an industrial process, but does need to let you locate a word in the page, even a non-English one.

Updated

7 years ago
Duplicate of this bug: 576945

Updated

7 years ago
Duplicate of this bug: 442070
Duplicate of this bug: 601373
Duplicate of this bug: 596862
Duplicate of this bug: 551960

Comment 38

6 years ago
This info is probably more suited to this ticket.


Chrome default matching:
JP: Katakana=Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)


From the looks of all this, it seems that they are using ICU collation data with the sort-key length set to primary differences only, ignoring all secondary (accent), tertiary (case) and quaternary (style) differences, including expansion handling (ß = ss) -- with the caveat that the full primary sort key of any expansion must match for a hit.  That is: "⒓" will match "12." but "12" cannot find a partial match inside "⒓".

Given the primary ICU-EN sort-key for each of the characters, it becomes more
obvious:
1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E
a: 15A3
ä: 15A3
℀: 15A3 036B 15D1

The differences ignored by find-as-you-type for a permissive match should be based only on the primary sort key; this would give the best user experience and be the most intuitive.

http://www.unicode.org/charts/collation/
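
As a rough illustration of the primary-strength idea (a standalone sketch, not Gecko code), a root-locale collator limited to primary differences compares base letters only:

    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <memory>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::Collator> coll(
          icu::Collator::createInstance(icu::Locale::getRoot(), status));
      if (U_FAILURE(status)) {
        return 1;
      }
      // Primary strength ignores secondary (accent), tertiary (case) and
      // quaternary (style) differences.
      coll->setStrength(icu::Collator::PRIMARY);

      icu::UnicodeString plain(u"a");
      icu::UnicodeString umlaut(u"\u00E4");  // ä
      // Equal at the primary level even though the code points differ.
      return coll->compare(plain, umlaut, status) == UCOL_EQUAL ? 0 : 1;
    }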

Updated

6 years ago
Duplicate of this bug: 652434
Blocks: 670601

Comment 40

5 years ago
The same problem occurs with the combining acute accent (U+0301, see http://en.wikipedia.org/wiki/Acute_accent), and I believe with other diacritic marks.

Test case for Russian (fails in Firefox 11.0): "за́мок" == "замо́к" == "замок". Although "за́мок" and "замо́к" have different meanings, the accent should be ignored when searching for "замок" (which is ambiguous).
Duplicate of this bug: 595359
Duplicate of this bug: 812837

Comment 43

4 years ago
Still not resolved? This bug is from 2003... This is the thing I miss most from Chromium after switching.

Comment 44

4 years ago
With bug 276757 landed, Firefox now matches curly quotes when searching for straight quotes. Perhaps the same mechanism can be used to solve the very similar issues in this bug.

By the way, this bug is [parity-ie], since Internet Explorer has had a way of ignoring diacritics for ages.

Comment 45

4 years ago
Please, I'd also like to see this resolved! Not only should at least there be an option to ignore accented characters (and even characters that aren't technically accented, such as "c" and "ç"), but also any characters considered "equivalent," just like bug 413927 proposes.

Also, I believe this should apply everywhere in Firefox where you can search text (not only the findbar), such as the history and the awesomebar.
(In reply to Gustavo Silva from comment #45)
> Please, I'd also like to see this resolved! Not only should at least there
> be an option to ignore accented characters (and even characters that aren't
> technically accented, such as "c" and "ç"), but also any characters
> considered "equivalent," just like bug 413927 proposes.
> 
> Also, I believe this should apply everywhere in Firefox where you can search
> text (not only the findbar), such as the history and the awesomebar.

The current bug is not about treating precomposed ç as c, ø as o, à as a, but about ignoring combining characters such as Hebrew nikkud, Arabic harakaat, or combining accents, cedillas, ogoneks, etc., when a separate "combining" character is used (e.g. U+0301 COMBINING ACUTE ACCENT) together with an unaccented letter. In the latter case it might be nice, though I don't know how feasible, to treat precomposed é (U+00E9 LATIN SMALL LETTER E WITH ACUTE) and decomposed é (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT) as equivalent (and similarly for other canonical equivalences), but I suppose that would be a different bug from this one.

Updated

4 years ago
See Also: → bug 374795

Updated

3 years ago
Duplicate of this bug: 969984
Blocks: 565552
I imagine this is related to bug 640856 in some way. (Not sure if enough to dupe, though.)
I've got the same problem so I've taken a look at it. One possible way to solve it is to use ICU Transliterator to do the work for us:

Given the example given by Ice Wolf:
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)

Using the ICU Transform tool (http://www.icu-project.org/icu-bin/translit) with the rule ":: latin; :: nfkd; :: Latin-ASCII;" gives a not-so-bad result:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: sin = sin (combining jamo vs precomposed hangeul)
  : s = s (stylized characters = base form)

  1 = 1 = 1 = 1 = 1̱ = ፩ = ... (numerals, cross script)

One problem with this rule is that non-Latin characters have been converted (due to the rule ::latin); if we disable it, we get:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ᄉ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = 1 = ۱ = ፩ = ... (numerals, cross script)

In this case, numerals are not transformed. I do not have much knowledge of ICU's transliteration, but if anybody thinks this can be used to solve this bug, I can look into it more deeply.

One problem remains if we consider this solution: by default we compile with -DUCONFIG_NO_TRANSLITERATION, which disables these features. We also have to check that enabling it does not break something else.
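
For reference, a minimal standalone sketch of the mark-stripping transform using the stock "NFD; [:Nonspacing Mark:] Remove; NFC" compound ID (this assumes an ICU build with transliteration enabled, per the caveat above):

    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      // Decompose, drop combining marks, recompose.
      std::unique_ptr<icu::Transliterator> strip(
          icu::Transliterator::createInstance(
              "NFD; [:Nonspacing Mark:] Remove; NFC", UTRANS_FORWARD, status));
      if (U_FAILURE(status)) {
        return 1;
      }
      icu::UnicodeString s(u"T\u00E9l\u00E9phonie");  // "Téléphonie"
      strip->transliterate(s);

      std::string utf8;
      s.toUTF8String(utf8);
      std::cout << utf8 << std::endl;  // prints "Telephonie"
      return 0;
    }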
Flags: needinfo?(gphemsley)
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
Flags: needinfo?(gphemsley)
Great idea, Mohamed! It certainly looks as if ICU Transliteration/Transforms can do what we need for this bug and bug 640856 (as comment 46 and earlier comments point out, the two bugs have rather different scope).

We need to spec out what it is that we want to do, and what UI we need (if any). As an opening bid, my feeling is that we should always be doing canonical normalization, and omitting accents should be available as an option (an option to do compatibility normalization is probably too esoteric to expose in the UI). Retaining needinfo=jfkthame for all of this.
Blocks: 724529
Flags: needinfo?(smontagu)
BTW, I deliberately didn't make this depend on bug 864843 and bug 866301 like other bugs that depend on using ICU: on the one hand, this feature could be especially useful on mobile devices, where entering accented characters can be even more challenging than with a conventional keyboard; on the other, enabling the feature isn't going to break anything even if it isn't available on such devices.
Actually, while I certainly agree that we want a solution here, I don't think the Transliteration/Transforms APIs are the appropriate tool. For one thing, running the content to be searched through one of these APIs would make it difficult to retain a mapping back to the original text -- note that the number of characters present may change considerably; it wouldn't be 1:1. So after finding a "hit" in the transformed content, it would be difficult to identify the correct range in the actual document.

The correct approach, IMO, is what's outlined already in comment 38: what "matches" for search purposes should be based on collation data. Depending whether we use primary keys only, or include secondary/tertiary keys, we could readily support varying levels of "strictness".

http://userguide.icu-project.org/collation

Note that ICU also includes a String Search Service API that is built on top of collation, although I have not studied this API to see how readily we could use it within Gecko.

http://userguide.icu-project.org/collation/icu-string-search-service
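
As a rough sketch of what the string search service looks like in isolation (hypothetical standalone code; the question of fitting it to Gecko's DOM iteration remains):

    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/stsearch.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      icu::UnicodeString haystack(u"Le caf\u00E9 est ouvert");  // contains "café"
      icu::UnicodeString needle(u"cafe");                       // no accent

      icu::StringSearch search(needle, haystack, icu::Locale::getRoot(),
                               nullptr /* no break iterator */, status);
      if (U_FAILURE(status)) {
        return 1;
      }
      // Match at the primary collation level so accents are ignored;
      // reset() is required after changing the collator's attributes.
      search.getCollator()->setStrength(icu::Collator::PRIMARY);
      search.reset();

      for (int32_t pos = search.first(status); pos != USEARCH_DONE;
           pos = search.next(status)) {
        std::cout << "match at " << pos << ", length "
                  << search.getMatchedLength() << std::endl;
      }
      return 0;
    }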
Flags: needinfo?(jfkthame)
Jonathan, I agree with you :)

I've written a simple POC using ICU's collation library. I'm not familiar with it, so I read up a little and wrote what seems to be a working example: http://pastebin.mozilla.org/5311663
The output is at http://pastebin.mozilla.org/5311662
To test the code a little more, I then tried different locales: http://pastebin.mozilla.org/5311881, with the output at http://pastebin.mozilla.org/5311882
In addition to the basic collator I took a look at the string search. The results are quite interesting:
Sample code (French and Arabic used): http://pastebin.mozilla.org/5314754 
Output: http://pastebin.mozilla.org/5314757

I don't know how you feel about it, but it seems that with the different features of ICU's string search (next word, backward search, case sensitive and insensitive) we can do something good here.
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
The real question here isn't whether the collation-based approach (whether using the basic collator interface, or the string-search module) can provide the kind of "loose matching" that would be useful. That's already pretty well understood. The interesting (perhaps tricky?) part here will be figuring out how to efficiently use those tools within Gecko, where the text to be searched isn't simply a C or C++ string, it's data within the DOM, probably spread across an extensive hierarchy of different elements.

Perhaps we could implement mozilla::DOMIterator as a concrete subclass of icu::SearchIterator, and then use that as the tool to implement Find-in-Page behavior. But I haven't looked into exactly what that would involve.
Flags: needinfo?(jfkthame)
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

I have no idea how this would/could work in C++, as I barely have any knowledge of it, but FWIW I was (and still am, until this bug is resolved at least) going to implement this search feature in FindBar Tweak by making use of the nodes' .textContent property, stripping out the text of head, style and script tags through a custom method I wrote: https://github.com/Quicksaver/Bootstraped-Addon-Backbone/blob/master/resource/modules/utils/HTMLElements.jsm

I haven't done any extensive testing, but from what I could tell when I first wrote it, it's a pretty fast routine, perfectly usable on any normal-to-large-sized webpage. Also, I'm guessing this question comes up because that API needs the full text as one string or something, and the current fastFind (which, if I read the code correctly, goes through the page's nodes on the spot) wouldn't work with something like this?
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

Isn't that a solved problem? I was thinking that we could plug the ICU search tools into the current implementations of nsIWebBrowserFind and nsIFind and not have to reinvent the wheel.
Flags: needinfo?(smontagu)
(In reply to Simon Montagu :smontagu from comment #58)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

We can do that by `normalize`-ing the character data extracted from text nodes. nsFind is basically a tree walker that extracts matches by pulling one character at a time off the text content and advancing to the next character in the needle upon a match.

That's why the ICU search methods/algorithms won't work for nsFind as-is; it's not a simple substring-matching search. However, since we only _really_ need to normalize individual characters in the most common case, I'd say using normalization is a good first step.

If we'd like locale- and grammar-aware normalization, we'd need to keep track of character positions before and after. Or can we ensure that no character positions change when we feed it the entire node contents?
(In reply to Simon Montagu :smontagu from comment #58)
> (In reply to Jonathan Kew (:jfkthame) from comment #56)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

Well, we certainly have existing code that handles this - what I'm unsure about is how easily we'll be able to fit the ICU tools into that. Maybe it'll be straightforward, but I suspect not entirely. Ultimately, nsFind iterates through the document elements by characters, but we can't plug in a collator-based match at that level; it needs to see strings (which may span element boundaries), not individual characters.
Yes, clearly at the lowest level of nsFind we would have to pull whole strings out of nodes to pass to ICU, probably with some degree of look-ahead, rather than iterate by character as it does now.
(In reply to Mike de Boer [:mikedeboer] from comment #59)
> That's why the ICU search methods/ algo won't work for nsFind; it's not a
> simple substring-matching-search. However, since we only _really_ need to
> normalize individual characters in the most common case, I'd say using
> normalization is a good first step.

Normalization is a good step, but note that it can't be done on a codepoint-by-codepoint basis, which is how nsFind currently walks; it may involve composing, decomposing and reordering sequences of codepoints.

Updated

3 years ago
Duplicate of this bug: 1028739

Comment 64

3 years ago
Hi, I am interested in working on this bug.  I am working on a solution that has two parts:

1) For Hebrew, it ignores combining code points in the web page.

2) For Latin alphabets, I have some small tables, generated by Perl, that map characters to base letters.
I have included them in nsFind.cpp.  The Perl script uses normalization algorithms to generate the mappings to the base letter.  Additionally, I map some pairs of characters to handle ligatures.
It seems to perform pretty well and handles quite a few languages. It matches the Chrome behavior pretty closely.

Does nsFind have an owner?  This is the first time I have tried to contribute to Firefox.

Comment 65

3 years ago
Created attachment 8536092 [details]
Here is a Perl script that can generate tables for mapping an accented character to base characters.  I ran it with Perl v5.14.2.

Comment 66

3 years ago
Created attachment 8536095 [details]
Here is a modified version of nsFind.cpp that contains a possible fix.

This nsFind.cpp should ignore Hebrew and Arabic vowel-sign combiners and Latin-alphabet accents, and should allow you to search Vietnamese without typing accents.  It supports some ligatures.

Comment 67

3 years ago
Additionally, it seems desirable to ignore kashida.  Chrome has that behavior.

Updated

2 years ago
Flags: needinfo?(jst)

Updated

2 years ago
Flags: needinfo?(mdeboer)
Hi Jonathan! Thanks for your interest in working on this bug! Also, apologies for the late reaction; I was on holiday for the past two weeks.

It would help _a lot_ if you'd take a bit of time to read the following - quite short - guide:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch

As for the approach you're describing in comment 64; I don't know if that approach is the correct one, I simply don't have enough knowledge to give you a reasonable answer. For that, I need to defer your question to jfkthame. I hope he doesn't mind ;-)
Flags: needinfo?(mdeboer)
Flags: needinfo?(jst)
Flags: needinfo?(jfkthame)
I'm not really comfortable with the approach here of adding ad hoc character-mapping tables to nsFind.cpp. We should instead be leveraging the Unicode character property tables already present in intl/unicharutil and/or ICU.

Regarding the overall description in comment 64: it sounds like this is focused on just a couple of scripts. Really, an "ignore diacritics" option should be applied uniformly across all scripts.

The "ignore combining code points" part of this could probably be implemented based on looking up the Unicode general category of the character; and support for stripping accents off precomposed letters could be implemented by applying NFD normalization and then stripping any combining marks from the normalized form.

Given that these basic functions are already present in the codebase (best to use the ICU implementations, as nsIUnicodeNormalizer implements an old Unicode version), we shouldn't be adding lookup tables here. If the performance of looking up decompositions, etc., on the fly and then stripping diacritics turns out to be a problem (I'm not sure if it will), then nsFind could perhaps cache the results for individual characters as they're encountered.
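
As a sketch of that lookup-table-free approach in isolation (hypothetical standalone code, not an nsFind patch; note that the following comments show blanket mark-stripping is too aggressive for some scripts):

    #include <unicode/normalizer2.h>
    #include <unicode/uchar.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>

    // Strip diacritics: NFD-decompose, then drop anything whose Unicode
    // general category is a combining mark (Mn, Mc or Me).
    static icu::UnicodeString StripDiacritics(const icu::UnicodeString& aInput) {
      UErrorCode status = U_ZERO_ERROR;
      const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
      if (U_FAILURE(status)) {
        return aInput;
      }
      icu::UnicodeString decomposed = nfd->normalize(aInput, status);
      icu::UnicodeString result;
      for (int32_t i = 0; i < decomposed.length();) {
        UChar32 c = decomposed.char32At(i);
        int8_t cat = u_charType(c);
        if (cat != U_NON_SPACING_MARK && cat != U_COMBINING_SPACING_MARK &&
            cat != U_ENCLOSING_MARK) {
          result.append(c);
        }
        i += U16_LENGTH(c);  // advance by one code point, not one code unit
      }
      return result;
    }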
Flags: needinfo?(jfkthame)

Comment 70

2 years ago
If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts, that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й, which NFD decomposes into U+0418 + U+0306.  I suspect this would not be a desirable equivalence when searching Russian and Ukrainian text.  I tried it in Chrome, and Chrome did not equate U+0418 И and U+0419 Й, though it ignores accent marks for Latin text.

For that reason, along with my general lack of expertise outside of a few languages, it seems to me to be difficult to apply an NFD based fix without special casing across all languages.
(In reply to JonathanW from comment #70)
> If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts,
> that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й,
> which NFD decomposes into U+0418 + U+0306.  I suspect this would not be a
> desirable equivalence when searching Russian and Ukrainian text.  I tried it
> in Chrome, and Chrome did not equate U+0418 И and U+0419 Й, though it
> ignores accent marks for Latin text.

I doubt equating 'И' and 'Й' is any less desirable for searching Russian and Ukrainian than it is to equate a/å, a/ä, o/ö when searching in Swedish; they're considered entirely separate letters of the alphabet.

The fact that Chrome ignores diacritics in Latin but not in Cyrillic is just a limitation of their implementation, I suspect. Let's do better -- and be more consistent!

Comment 72

2 years ago
Created attachment 8672412 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators

bug-2022251-icu-collator-fix.patch attached implements asymmetric matching.

It uses ICU properties to find "ignorable" combiners in the target text.

Try it on some of the suggested behavior in this bug, and see what you think.

Updated

2 years ago
Flags: needinfo?(jfkthame)
This sounds promising, thanks! I haven't had time to test or go through it carefully yet but will try to do so soon; leaving the needinfo? flag for now to keep it on my radar.

One note: from a brief look, I think the behavior will be a bit inconsistent in that accents in the *search* string will be ignored if they're part of precomposed letters (so searching for "naïve" will find "naive"), but NOT if they're encoded using combining marks; whereas in the *target* string the combining mark versions would also be ignored. That seems unfortunate... either we should consistently ignore all accents, or we should respect accents in the search string regardless whether they're precomposed or combining-char.

Comment 74

2 years ago
Are there some unit tests that cover the find-in-page functionality?
Flags: needinfo?(mdeboer)

Comment 75

2 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #73)
> This sounds promising, thanks! I haven't had time to test or go through it
> carefully yet but will try to do so soon; leaving the needinfo? flag for now
> to keep it on my radar.
> 
> One note: from a brief look, I think the behavior will be a bit inconsistent
> in that accents in the *search* string will be ignored if they're part of
> precomposed letters (so searching for "naïve" will find "naive"), but NOT if
> they're encoded using combining marks; whereas in the *target* string the
> combining mark versions would also be ignored. That seems unfortunate...
> either we should consistently ignore all accents, or we should respect
> accents in the search string regardless whether they're precomposed or
> combining-char.

I tried the "naive" variations, and they all seem to find each other as expected.

Comment 76

2 years ago
JonathanW: not unit tests, but there's an old page with a collection of manual regression tests for find-in-page covering some of the regressions that used to slip in when the find code was updated:
http://www-archive.mozilla.org/quality/browser/front-end/testcases/xpapps-gui/find-on-page/find-in-page.html
It was written before Mozilla had any automated testing framework (in fact, it probably even predates find-as-you-type); perhaps it would be possible to automate it now. It does not include any diacritics, and should.
There are some automated tests at https://dxr.mozilla.org/mozilla-central/source/embedding/test/test_nsFind.html
Jonathan, apart from the test that Matt mentioned (which is probably the best one for you to start with and copy), the ones covering the findbar are:

https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/chrome/test_findbar.xul
https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/chrome/test_findbar_events.xul
https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/browser/browser_findbar.js
Flags: needinfo?(mdeboer)
Comment on attachment 8672412 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators

Review of attachment 8672412 [details] [diff] [review]:
-----------------------------------------------------------------

This shows a lot of promise -- I tried it in a local build, and it makes a lot of common cases behave as expected, with Find happily ignoring accents.

However, there are still cases that don't work as well, e.g. once multiple accents get involved. With the testcase

  data:text/html;charset=utf-8,x&%23x1d8;x x&%23xfc;&%23x301;x xu&%23x308;&%23x301;x

(which displays "xǘx xǘx xǘx"), searching for "xux" will indeed find all three instances; but searching for xǘx may or may not work depending on whether the search string is entered in precomposed or decomposed form.

I think -- without having attempted to code it -- that the right way forward here is indeed to use a collator, but rather than walking through the search and target strings one character at a time, we should walk one *cluster* at a time, where a cluster includes any trailing combining marks. We have an existing ClusterIterator class somewhere, but because nsFind is iterating over the DOM content rather than over simple text strings, I don't think we'll be able to use that; it'll need to be reimplemented on top of the existing nsFindContentIterator, I guess. Or maybe nsFindContentIterator itself should be modified to work in terms of clusters.

Then we'll be calling coll->compare() on a pair of clusters rather than individual codepoints, and that will take care of the accents as well as all canonical-equivalence issues within the cluster.

This will also allow us (in strict-matching mode, i.e. when the user doesn't want to ignore accents -- which I think we need to offer, though what UI if any to expose is still a question) to avoid the problem of finding "u" when the document contains "u&#x308;&#x301;".

The other thing that would be helpful would be if you can take a look at the mozilla coding style guidelines[1] and try to ensure that your patch follows these -- in particular, things like whitespace (no tabs!), bracing of control structures, etc. This makes it easier to review patches, and issues like this will need to be tidied up before a patch here is landed in the tree.

And thanks for working on this -- I believe it'll make a lot of people happier!

::: embedding/components/find/nsFind.cpp
@@ +862,5 @@
> +// ignore them in the target text while matching the pattern
> +// for instance, and A in the pattern can match A followed by
> +// any number of combiners in the text
> +// additionally some other ignorable characters can be coded here
> +// for instance kashida

Please use capitals and periods to punctuate long comments as proper sentences; this makes them much more readable.

@@ +867,5 @@
> +static bool IsCombiner(const char16_t c) {
> +
> +	// ignore Kashida in Arabic
> +	if (0x0640 == c)
> +		return true;

There are a whole bunch of Unicode characters that would probably make sense to ignore. We should try checking for the Unicode property "Default_Ignorable" (I expect ICU exposes an API for this) as a first step, before we consider ad hoc additions like this.

(Maybe better to name this IsIgnorableForFind(), to be more accurate, as it will check for various things besides actual combiners.)
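
For reference, ICU does expose that property; a minimal sketch of the check (not part of the patch under review):

    #include <unicode/uchar.h>

    // True for characters with the Unicode Default_Ignorable_Code_Point
    // property (soft hyphen, ZWJ/ZWNJ, variation selectors, ...). Note that
    // kashida (U+0640 ARABIC TATWEEL) is not Default_Ignorable, so it would
    // still need an ad hoc addition.
    static bool IsDefaultIgnorable(UChar32 aCh) {
      return u_hasBinaryProperty(aCh, UCHAR_DEFAULT_IGNORABLE_CODE_POINT);
    }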

@@ +884,5 @@
> +  }
> +}
> +
> +/* make an ICU collator */
> +using namespace icu;

As there aren't a huge number of ICU types/functions being used, I think I'd prefer to explicitly prefix them with icu:: rather than having the 'using namespace' here.

@@ +886,5 @@
> +
> +/* make an ICU collator */
> +using namespace icu;
> +
> +static RuleBasedCollator* static_collator=NULL;

Please move this inside the getCollator function.
(In reply to Jonathan Kew (:jfkthame) from comment #79)
> And thanks for working on this -- I believe it'll make a lot of people
> happier!

I couldn't agree more! I'm stoked and looking forward to the next iteration of the patch!
(In reply to Jonathan Kew (:jfkthame) from comment #79)

Oops, sorry .... I intended to drop in a link to the Coding Style page:

> The other thing that would be helpful would be if you can take a look at the
> mozilla coding style guidelines[1] and try to ensure that your patch follows
> these -- in particular, things like whitespace (no tabs!), bracing of
> control structures, etc. This makes it easier to review patches, and issues
> like this will need to be tidied up before a patch here is landed in the
> tree.

[1] See https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style
Flags: needinfo?(jfkthame)

Comment 82

2 years ago
Created attachment 8680276 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators, includes test, handles multiple combiners in pattern

Here is a new version of the patch; it includes a mochitest and handles the case of multiple combiners in the pattern.
Attachment #8536092 - Attachment is obsolete: true
Attachment #8536095 - Attachment is obsolete: true
Attachment #8672412 - Attachment is obsolete: true

Updated

2 years ago
Flags: needinfo?(jfkthame)

Comment 83

2 years ago
Does this new version of the patch fail any use cases?
Sorry, I haven't had time to build and test this yet; I'll try to get to it soon. Leaving the needinfo? flag so that it'll remind me....

One question: will it find things like "ae" in the target text if the search string contains "æ", or is that only supported the other way round (find "æ" when searching for "ae")?

Comment 85

2 years ago
If the search string contains "æ" it will not find "ae" in the target.

Comment 86

2 years ago
Created attachment 8683344 [details] [diff] [review]
New patch with formatting and code review fixes and rebased to latest.
Attachment #8680276 - Attachment is obsolete: true
Comment on attachment 8683344 [details] [diff] [review]
New patch with formatting and code review fixes and rebased to latest.

Review of attachment 8683344 [details] [diff] [review]:
-----------------------------------------------------------------

::: embedding/components/find/nsFind.cpp
@@ +41,5 @@
> +#include "unicode/uloc.h"
> +#include "unicode/ucol.h"
> +#include "unicode/locid.h"
> +#include "unicode/tblcoll.h"
> +#include "unicode/sortkey.h"

I don't think it's necessary to #include all these headers. AFAICT, just "unicode/tblcoll.h" should be sufficient; this is where RuleBasedCollator is declared, and it'll pull in whatever else is needed for that declaration to compile.

@@ +862,5 @@
> +// Ignore them in the target text while matching the pattern.
> +// For instance, an A in the pattern can match A followed by
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.

Thinking about this some more, I'm having doubts as to whether it's a valid approach here.... the trouble is that not everything that has a "combining mark" character type is equally valid to ignore. In particular, this will result in ignoring all vowel marks in all the Indic scripts, so that Find will be unable to distinguish between (for example) बेटा "son" and बेटी "daughter" in Hindi, or between किताब "book" and कातिब "writer", and many more such cases.

To get this right, I think we really need to let the collator compare grapheme clusters (not just individual characters) from the pattern and the target text; and the ignoring of (some) diacritics then occurs because the collator knows what level of difference they represent.

(See also comment 79, where I tried to suggest this approach; I'm increasingly convinced that's what we need to do in order to get this right.)

@@ +864,5 @@
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.
> +
> +static bool IsIgnorableForFind(const char16_t c) {

Note that this won't be able to detect ignorable chars outside the BMP. There are a bunch of combining marks in plane 1, which will require handling surrogate pairs (although not all of them should actually be ignored); and then there are the variation selectors in plane 14.

(However, see above; I don't think this approach in general is sufficiently flexible to handle the variety of things Unicode throws at us.)

@@ +890,5 @@
> +// and does not depend on the language of the html page
> +//
> +
> +static icu::RuleBasedCollator *
> +getCollator()

Capitalize function name.

@@ +892,5 @@
> +
> +static icu::RuleBasedCollator *
> +getCollator()
> +{
> +  static icu::RuleBasedCollator* static_collator = NULL;

We use |nullptr| rather than |NULL|.

@@ +902,5 @@
> +      static_cast<icu::RuleBasedCollator*>
> +        (icu::Collator::createInstance(icu::Locale::getRoot(), status));
> +
> +    if (U_FAILURE(status))
> +      return NULL;

Braces even for a single-line |if| body. (And again later on.)

Comment 88

2 years ago
I think ICU is currently compiled without the break iterator.  

ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

Is it okay to add that break iterator for this feature?
(In reply to JonathanW from comment #88)
> I think ICU is currently compiled without the break iterator.  
> 
> ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> Is it okay to add that break iterator for this feature?

I expect so. (How much does it add to the compiled size?)

An alternative would be to use IsClusterExtender (from nsUnicodeProperties.h), perhaps encapsulated in a simple iterator object, to manage iteration-by-cluster. If the ICU break-iterator pulls in a lot of additional bulk (maybe because it supports iterating by various break types?), that might be preferable.

I'd say try whichever is easiest to code, for a start; if that's ICU break-iterator, but we find that it does add a lot of bulk, we could look into replacing it with a smaller, single-purpose version once the overall functionality is proven to work.
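
For illustration, the per-cluster loop with ICU's character break iterator would look roughly like this (a sketch only; an IsClusterExtender-based iterator would have the same shape):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <memory>

    // Visit each grapheme cluster (base character plus any combining marks)
    // of `text`, using character (cluster) boundaries from ICU.
    template <typename Callback>
    static void ForEachCluster(const icu::UnicodeString& text, Callback&& cb) {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::BreakIterator> bi(
          icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(),
                                                      status));
      if (U_FAILURE(status)) {
        return;
      }
      bi->setText(text);
      int32_t start = bi->first();
      for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
           start = end, end = bi->next()) {
        cb(text.tempSubStringBetween(start, end));  // one cluster at a time
      }
    }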
Flags: needinfo?(jfkthame)

Comment 91

2 years ago
Created attachment 8704321 [details]
Possible fix using the ICU character break iterator to do cluster-by-cluster comparisons

Here is some code that handles clusters.
The test code includes the Hindi Son and Daughter case.

In addition to modifications included in the patch, it is also necessary to change the configure script to include ICU Break iterators by commenting out this line:

30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

I am not sure why Mercurial did not include that file in the patch.

This patch adds a small set of brkitr data files to the ICU data.  I am not sure that I have determined the minimum set of data files correctly.  I excluded the large dictionary files.

While this solution shows some promise, it is not perfect, and not totally symmetrical.

Another direction to consider would be to use the ICU string search service.  It seems like that would be possible to code, too.  However, Firefox currently supports two find features that I do not see an easy way to implement with the search service.  First, the "whitespace" feature, which allows one or more whitespace characters in the pattern to match a greater number of whitespace characters in the page.  Second, the feature that ignores LF between CJ characters.  I would be happy to try to code an alternate solution with the search service if we are willing to consider removing those two features, or if someone can suggest a way to implement them using collation rules or other methods.
Attachment #8683344 - Attachment is obsolete: true

Updated

2 years ago
Flags: needinfo?(jfkthame)
Thanks for continuing to work on this!

(In reply to JonathanW from comment #91)
> In addition to modifications included in the patch, it is also necessary to
> change the configure script to include ICU Break iterators by commenting out
> this line:
> 
> 30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> I am not sure why Mercurial did not include that file in the patch.

This bit I can answer. :) It's because 'configure' is a generated file (it's created from configure.in and auxiliary stuff). So rather than editing it here, you should modify build/autoconf/icu.m4; you'll find the code to "remove chunks of the library that we don't need (yet)" at about line 140 of that file.

(I haven't looked at the actual patch yet; leaving the needinfo? flag to remind myself.)
Jonathan, in comment 92 you mentioned you wanted to take a look at the patch? It'd be nice if we could move this one forward bit by bit!
Duplicate of this bug: 922010
Duplicate of this bug: 1147464
Duplicate of this bug: 1226963
Duplicate of this bug: 374795
Whiteboard: [parity-chrome]

Comment 98

a year ago
Created attachment 8756035 [details] [diff] [review]
Update the possible fix to accommodate changes to the ICU build system integration

This is the same nsFind.cpp code as in the previous patch. This version includes changes to use the new ICU build system.

The sizes added in a Windows build using this patch are 43520 bytes to xul.dll and 535424 bytes to icudt56l.dat.

It appears to work on Ubuntu and Windows.
Attachment #8704321 - Attachment is obsolete: true

Comment 99

a year ago
Hi Jonathan Kew, could you please try out this patch and let me know your thoughts?  Thanks, Jonathan W
Jonathan, friendly n-i ping ;-)
I've been experimenting with this a bit... it seems to work nicely for "loose" matching (i.e. ignoring diacritics) in Latin script.

I'm less comfortable with the behavior that results for Indic, however; we may need to re-think this and find a slightly different approach. The user experience that results from the cluster-by-cluster comparison (which was what I suggested we should do) seems less satisfactory than I'd like....

Example: load a page of Hindi text, such as http://www.unicode.org/standard/translations/hindi.html, and then bring up the Find bar. Suppose we want to find "यूनिकोड" (Unicode). So we start typing the word, one character at a time, into the Find field; but see what happens as it grows:

    य    -> highlights a "य" at the end of a word in the middle of the first paragraph
    यू    -> highlights "यू" in the middle of the word "कम्प्यूटर", second paragraph
    यून   -> Phrase not found
    यूनि   -> highlights the beginning of "यूनिकोड" in the first sub-heading
    यूनिक  -> Phrase not found
    यूनिको  -> highlights more of "यूनिकोड" in the first sub-heading
    यूनिकोड -> highlights more of "यूनिकोड" in the first sub-heading

It's quite disconcerting that at certain points while typing the word, we get "Phrase not found", yet adding another character causes it to be found again; but at no point does it actually highlight text in the initial "यूनिकोड" in the title, or the occurrence within the first paragraph, although as a user I would have expected it to search from the beginning of the page. (This is because the first character of the search string, "य", fails to match the initial character of "यूनिकोड", but skips forward to find an occurrence of "य" with no following vowel.)

Still thinking over this; it's not clear to me quite what the best approach is going to be, but I'm toying with some ideas... Maybe something as simple as allowing a partial-cluster match at the end of the search string would help, so that "य" can successfully find the "य" at the beginning of "यूनिकोड".
Flags: needinfo?(jfkthame)

Comment 102

10 months ago
Hi Jonathan Kew,  Thanks for the feedback.  I will look into that undesired behavior.

Updated

10 months ago
Blocks: 1297691

Comment 103

10 months ago
Hi Jonathan Kew,

I did some experiments with ICU collators and the Hindi characters in the example यूनिकोड. In addition to using a collator based on the root locale with strength primary, I also tried Locale("hi", "IN").  In both cases, YA and YA + vowel sign UU compare as different.  See the output below.

ICU String Search does find YA as an initial match in YA + UU.

I think we could tweak the proposed fix to look for binary matches in the case where the collator returns not equal.  I plan to code that and resubmit.

Another option would be to refactor the code to use icu StringSearch service as in comment 91.


Using Root locale collator

Cluster #य# len=1 YA
Coll key:[-128, 0, 9, 48, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[-128, 0, 9, 48, -128, 0, 9, 67, 0, 0, 0, 0]
compare य यू => -1


Using Hindi locale collator
Cluster #य# len=1  YA
Coll key:[0, -92, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[0, -92, 0, -76, 0, 0, 0, 0]
compare य यू => -1

Thanks, Jonathan Walden
(In reply to JonathanW from comment #103)
> Hi Jonathan Kew,
> 
> I did some experiments with ICU collators and the Hindi characters in the
> example यूनिकोड. In addition to using a collator based on the root locale
> with strength primary, I also tried Locale("hi", "IN").  In both cases,
> YA and YA + vowel sign UU compare as different.  See the output below.

Right, considered as clusters they definitely differ. The trouble is that while per-cluster matching is the appropriate way to compare whole strings, it is also reasonable for a user to search for a single letter such as YA and expect to find it as part of the YA+UU cluster, not _only_ when it occurs _without_ a following vowel sign.

I think the only situation where this becomes an issue is at the end of the search string. _Within_ the search string, we have clear boundaries on both sides of the current cluster, but at the end, we don't really know if the trailing edge of the search text must be a cluster boundary or may fall in the middle of a cluster.

> 
> ICU String Search does find YA as an initial match in YA + UU.

OK, good to know it seems to agree this is an expected result.

> 
> I think we could tweak the proposed fix to look for binary matches in the
> case where the collator returns not equal.  I plan to code that and resubmit.

Rather than looking for binary matches, I'd suggest trying something along the lines of

    if (clusters don't match) and
       (we are at the last cluster of the search-pattern) {
        while (search cluster length > one codepoint) {
            trim last codepoint from the search cluster (don't forget to check for surrogate pairs!)
            repeat comparison with the text cluster, if it matches we have a hit
        }
    }
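
A rough C++ rendering of that sketch (hypothetical code; assumes `coll` is the shared primary-strength collator and both arguments are single grapheme clusters):

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>

    static bool MatchesTrimmedLastCluster(const icu::Collator& coll,
                                          icu::UnicodeString patternCluster,
                                          const icu::UnicodeString& textCluster) {
      UErrorCode status = U_ZERO_ERROR;
      int32_t end = patternCluster.length();
      while (end > 0) {
        // Trim one code point, not one code unit; U16_BACK_1 keeps
        // surrogate pairs intact.
        U16_BACK_1(patternCluster.getBuffer(), 0, end);
        if (end == 0) {
          break;  // nothing left to compare
        }
        if (coll.compare(patternCluster.tempSubString(0, end), textCluster,
                         status) == UCOL_EQUAL) {
          return true;  // a shorter prefix of the final cluster matches
        }
      }
      return false;
    }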


> Another option would be to refactor the code to use icu StringSearch service
> as in comment 91.

Yes, that might also be an interesting possibility, though I suspect it may be harder to interface this to the DOM text we're trying to search.

Comment 105

9 months ago
Created attachment 8795052 [details] [diff] [review]
update possible fix to handle Hindi

Here is another version of the test patch.  This patch addresses the concern with Hindi reported by J. Kew in comment 101.  It also supports the find-in-page features of whole-word matching and strict quote matching.

There are some internationalization issues with the whole-word feature which I have not tried to fix in this patch.

Please try it out, and let me know your thoughts.
Attachment #8756035 - Attachment is obsolete: true

Updated

9 months ago
Flags: needinfo?(jfkthame)