Closed Bug 202251 Opened 21 years ago Closed 4 years ago

Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages

Categories: Core :: Find Backend (enhancement)
Platform: x86 / All
Type: enhancement; Priority: Not set; Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla73
Tracking Status: firefox73 --- fixed

People

(Reporter: tomer, Assigned: alexhenrie24)

References

(Blocks 3 open bugs)

Details

(Keywords: intl, parity-chrome)

Attachments

(2 files, 7 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4a) Gecko/20030401
Build Identifier: 

Following is a link to a Passover Haggadah (the Jewish text read on the Passover
holiday), in Hebrew with inline punctuation (nikud). If you search for Passover
("פסח"), you will get only one result - the one in the title bar, which has no
punctuation marks at all - but if you search for the punctuated version of the
same word ("פֶּסַח"), you will get the other results.

Reproducible: Always

Expected Results:  
Mozilla should ignore punctuation (nikud) characters in search queries, to make
it possible to omit them in the search.
The original post I sent was in ISO-8859-8-I encoding, which does not support
the punctuation marks that the windows-1255 encoding has. Here are the same
words as in the previous post, hopefully more readable this time.

Passover - "פסח"
Passover with punctuations - "פֶּסַח"
CC'ing akkana; the problem is that we basically do an opaque match, right?
Right.  We just compare characters in the DOM against characters in the search
string, one after another, with no added smarts except for whitespace.  Should
this go to an I18n person who's more familiar with how inline punctuation works?
Should the find backend be ignoring certain characters in certain encodings?
I don't think we want to automatically ignore all diacritical characters (for
example I might want to search for a specific form with diacritics and not find
the same base characters without diacritics or with different diacritics), but
it would often be useful as an option.
i tend to agree with simon. maybe we should add a checkbox somewhere, "ignore
diacritics".
Adding a checkbox also means altering the find interfaces, since neither nsIFind
nor nsIWebBrowserFind offers that option.  nsIWebBrowserFind (the one used for
searching web pages) is frozen, so you can't add anything to it; you'd have to
make a new version of it.  nsIFind is not frozen, so I guess something like that
could be added.

These are public interfaces, so think carefully about how the interface should
be specified so it will be useful for different charsets under different
circumstances, and will cause minimal disruption to anyone already using the
current interfaces.

Cc'ing Kin and the other Simon, who were involved with the design of the current
interfaces and should have a chance at reviewing any proposal to change them.
Confirmed with WinXP/1.4a/20030321.

IMO, Find-As-You-Type should ignore diacritics by default, while the
dialog-box-search should include a Match Diacritic option, as in IE6.

Prog.
Status: UNCONFIRMED → NEW
Ever confirmed: true
*** Bug 247635 has been marked as a duplicate of this bug. ***
Summary: Find/FindAsYouType will not find text if entered with punctuation ("nikud") in Hebrew, but probably other languages as well. → Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages
Keywords: intl
I am the one who opened #247635, and of course, I agree that there should be an
option to disable 'opaque match'. My main problem is on French boards, when
searching for a keyword with accents. Many people misspell accents, and for big
pages, it can be very difficult to find what I am searching for (I have to try
multiple variants).
Would this feature require us to create a data table of
the accent-less versions of every accented Unicode char? Or does such a
table or library function already exist in open source somewhere?
Component: XP Apps → Keyboard: Find as you Type
It does not ignore the French accents (and it should).
For example, I typed "telephonie" on a page where "Téléphonie" was written, and
it did not find it.
I have an American keyboard without those French accents (like 'é').
It should be so:
é = ê = è = ë = e
same thing for a, i... I guess other languages have the same problem.
*** Bug 312656 has been marked as a duplicate of this bug. ***
Assignee: jag → nobody
Severity: normal → enhancement
QA Contact: pawyskoczka → keyboard.fayt
*** Bug 353647 has been marked as a duplicate of this bug. ***
>Would this feature require us to create a data table of
>the accent-less versions of every accented unicode char?

For Latin alphabets at least, this can be easily generated from
<http://www.unicode.org/Public/UNIDATA/NamesList.txt>.  Map 'LATIN SMALL LETTER Y WITH WHATEVER' to 'LATIN SMALL LETTER Y' and so on.
As for Hebrew, we should ignore characters 0x0591 to 0x05C7. 
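For illustration, the Hebrew part of that could be as small as the following check (a sketch only; the helper name is made up and this is not existing Gecko code):

    // Hebrew cantillation marks and points (nikud) occupy U+0591..U+05C7.
    static bool IsHebrewPoint(char16_t aChar) {
      return aChar >= 0x0591 && aChar <= 0x05C7;
    }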
I don't like this. In German, for example, "Bär" and "Bar" are different words with different meanings and they shouldn't be mixed up by the search engine. On the other hand, somebody with a US keyboard might find it handy to search for "Bar" and find "Bär" because he cannot easily enter an "ä".

I think adding a checkbox will not make much sense; it will probably confuse users.

Any ideas?
Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 
Without diacritics: פסח
With diacritics: פֶּסַח
(In reply to comment #17)
> Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 

I just corrected the charsets in the database on those two comments.
URL changed since the previous one has gone. (Thanks Yehuda!)
(In reply to comment #16)
> I don't like this. In german language for example "Bär" and "Bar" are
> different words with different meanings and they shouldn't be mixed up by the
> search engine. On the other hand, somebody with an US keyboard might find it
> handy to search for "Bar" and find "Bär" because he cannot easily enter an
> "ä".
> 
> I think adding a checkbox will not make much sense, it will probably confuse
> users.
> 
> Any ideas?
> 

Even in German there are edge cases, such as "Mass" (mass, large amounts) vs. "Maß" (measure), which are homographs in Switzerland but not in Germany (because the Swiss normally don't use the ß); hence the slur "How do the Swiss drink beer?". OTOH in French, accents are rarely printed on uppercase letters, except sometimes the É; I think French words differing only in accentuation would be regarded as homographs anyway.

I see the following possibilities:
- Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);
- Additional key (in addition to / etc.) to trigger search regardless of accents;
- Always ignore accents while searching, except when the string input by the user includes accents (but this would find German Bär "bear" when looking for Bar "bar");
- New preference [ ] Ignore accents when searching (where in the prefs UI would depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb prefs -- or only in about:config but IMHO it ought to have a UI);
- Maybe there are other possibilities: Any other ideas?
(In reply to comment #20)
> Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> "Maß" (measure), which are homographs in Switzerland 

You mean "Masse" and "Maß" which have the same plural ("Massen" / "Maßen")  in Switzerland. Note that ß is not an accent of the s but a ligature.

But it is a good idea, to threat "ß" and "ss" the same when ignoring accents is activated.

> - Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);

There is already a very rarely used "ignore case" checkbox. I don't think that this is a good idea.

> - Additional key (in addition to / etc.) to trigger search regardless of
> accents;

I can't say much on this point: for German users like me, the / key isn't very easy to enter (Shift+7), so most of the time it's easier to just activate FAYT or use Ctrl+F.

> - Always ignore accents while searching, except when the string input by the
> user includes accents (but this would find German Bär "bear" when looking for
> Bar "bar");

... and this behaviour would be considered a bug by most German users, I think.

> - New preference [ ] Ignore accents when searching (where in the prefs UI would
> depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb
> prefs 

I like that idea. This pref could be on by default, but German users who have problems when searching, or who consider that behaviour a bug, can deactivate it.

Maybe it's even possible to activate the pref by default only for non-German Firefox versions?
(In reply to comment #21)
> (In reply to comment #20)
> > Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> > "Maß" (measure), which are homographs in Switzerland 
> 
> You mean "Masse" and "Maß" which have the same plural ("Massen" / "Maßen") 
> in Switzerland. Note that ß is not an accent of the s but a ligature.
> 
> But it is a good idea to treat "ß" and "ss" the same when ignoring accents
> is activated.

Yes, I meant treating ß like ss when ignoring accents; and they also have the same dative: "Wie trinken die Schweizer Bier? -- In Masse."
Finding extra matches (e.g. Bär when the user meant just Bar) is a bug, but it is not a serious one.  The user can just skip to the next match.

On the other hand, failing to find a match is a serious deficiency in a search function.  So I think the right default is to search accent-blind when the search string is just ASCII text.  If the user explicitly enters accents, of course, it's different.
Product: Core → SeaMonkey
Component: Find In Page → Find Backend
Product: SeaMonkey → Core
QA Contact: keyboard.fayt → find-backend
See also bug 475306, a similar bug for smart-quote characters and such.
Hi -- I'm the reporter of Bug 475306

Not sure if it's the best solution, but I feel I should throw this out there: How about a "match case and diacritics etc." box or something to that effect?

Also, if a separate checkbox or pref ends up being the best solution, wouldn't it be possible to disable it by default in German builds?
Oddly, I just ran across the following article while reading something completely unrelated to this bug: http://en.wikipedia.org/wiki/Unicode_normalization

These charts of equivalent characters are a few links from there: http://www.unicode.org/charts/normalization/

Small world, huh?  :)
I'm beginning to think bug 276757 and this should be merged and renamed "Unicode strings not normalized in Find [As You Type]" or something to that effect. They're really the same problem, and fixing this one would (or at least should) fix the other.
I agree.

Normalization of quotation marks and combined characters (ä = ä) should happen in any case.

But I think the replacement of diacritics should depend on the language the text is in.

For example in german language:

ä = ae
ö = oe
ü = ue
ë = e (but aë ≠ ä etc.)
ß = ss

while in some other european languages:

ä = a
ö = o
ü = u
etc.

and in dutch for example there is 

ij = ij = ÿ in some areas

I'm not an expert in typography and different languages, but doing this right seems to be a big load of work.
Normalization is orthogonal to what this bug and bug 276757 are about.

As comment 29 says, normalization means that searching for ä (a sequence of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS) should find ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) and vice versa. This bug requests an option to search for a alone, and find ä.

Normalizing to NFC or NFD forms (canonical normalization) will solve the ä = ä issue, but will not help with bug 276757. Even NFKC or NFKD (compatibility normalization) will only solve some of the cases listed in bug 276757 comment 10.
How about treating characters the same as the first character in one of the equivalent strings?

For example, "à" (LATIN SMALL LETTER A WITH GRAVE) is 00E0, and the normalization chart says it's equivalent to 0061 (LATIN SMALL LETTER A) + 0300 (COMBINING GRAVE ACCENT). Shouldn't be too difficult to imply that if 00E0 = 0061 0300, then 00E0 = 0061.
My humble opinion is that the search tool should be as tolerant and permissive as possible. While false positives (see ex. 1) will be rare and unimportant, this would solve all false-negative issues (see ex. 2), which are really painful.

Ex. 1: search for 'daem'. Either you are German and you probably want this to match both 'daem' and 'däm'. Or you are not German and the letter 'ä' probably doesn't appear in your page. Or the string 'däm' does appear in your page and does match; you find it odd and just skip it yourself (this should not happen that often).

Ex. 2: you are French and using someone else's computer, with a US keyboard and US locale. Not being able to type "café", you search for "cafe" and expect it to match, whatever the computer locale!

My suggestion emphasizes that the search tool is a quick and ergonomic feature which doesn't need to be strict and exact like an industrial process, but does need to let you locate a word in the page, even a non-English one.
This info is probably more suited to this ticket.


Chrome default matching:
JP: Katakana=Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)


From the looks of all this, it seems that they are using ICU collation data
with the sort-key length set to primary differences only, ignoring all
secondary (accent), tertiary (case) and quaternary (style) differences,
including expansion handling (ß = ss), with the caveat that the full primary
sort key of any expansion must match to count as a hit.  That is: "⒓" will
match "12." but "12" cannot find the partial match in "⒓".

Given the primary ICU-EN sort-key for each of the characters, it becomes more
obvious:
1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E
a: 15A3
ä: 15A3
℀: 15A3 036B 15D1

The differences that should be ignored in find-as-you-type for a permissive match should be based only on the primary sort key; that would give the best user experience and be the most intuitive.

http://www.unicode.org/charts/collation/
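For illustration, the primary-strength comparison described above maps onto ICU's C++ API roughly like this (a standalone sketch, not Gecko code; the function name is made up):

    #include <memory>
    #include <unicode/coll.h>
    #include <unicode/ucol.h>
    #include <unicode/unistr.h>

    // Compare two strings at primary strength only, so that case, accent and
    // style differences are ignored: "a", "A" and "ä" all compare equal.
    static bool PrimaryEqual(const icu::UnicodeString& aLeft,
                             const icu::UnicodeString& aRight) {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::Collator> coll(
          icu::Collator::createInstance(icu::Locale::getRoot(), status));
      if (U_FAILURE(status)) {
        return false;
      }
      coll->setStrength(icu::Collator::PRIMARY);
      return coll->compare(aLeft, aRight, status) == UCOL_EQUAL;
    }

Expansions such as ß = ss and the partial-match caveat above need the string search service rather than a single compare() call.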
Blocks: 670601
The same problem occurs with the combining acute accent (U+0301, see http://en.wikipedia.org/wiki/Acute_accent), and I believe with other diacritic symbols.

Test case for Russian (fails in FF 11.0): "за́мок" == "замо́к" == "замок". Although "за́мок" and "замо́к" have different meanings, the accent should be ignored when searching for "замок" (the ambiguous form).
Still not resolved? This bug is from 2003... This is the thing I miss the most from Chromium after switching.
With bug 276757 landed, Firefox now matches curly quotes when searching for straight quotes. Perhaps the same mechanism can be used to solve the very similar issues in this bug.

By the way, this bug is [parity-ie], since Internet Explorer has had a way of ignoring diacritics for ages.
Please, I'd also like to see this resolved! Not only should there at least be an option to ignore accented characters (and even characters that aren't technically accented, such as "c" and "ç"), but also any characters considered "equivalent", just as bug 413927 proposes.

Also, I believe this should apply everywhere in Firefox where you can search text (not only the findbar), such as the history and the awesomebar.
(In reply to Gustavo Silva from comment #45)
> Please, I'd also like to see this resolved! Not only should at least there
> be an option to ignore accented characters (and even characters that aren't
> technically accented, such as "c" and "ç"), but also any characters
> considered "equivalent," just like bug 413927 proposes.
> 
> Also, I believe this should apply everywhere in Firefox where you can search
> text (not only the findbar), such as the history and the awesomebar.

The current bug is not about treating precomposed ç as c, ø as o, à as a, but about ignoring combining characters such as Hebrew nikkud, Arabic harakaat, or combining accents, cedillas, ogoneks, etc., when a separate "combining" character is used (e.g. U+0301 COMBINING ACUTE ACCENT) together with an unaccented letter. In the latter case it might be nice, but I don't know how feasible, to treat precomposed é (U+00E9 LATIN SMALL LETTER E WITH ACUTE) and decomposed é (U+0065 LATIN SMALL LETTER E then U+0301 COMBINING ACUTE ACCENT) as equivalent (and similarly for other "canonical" equivalences), and anyway I suppose it would be a different bug than this one.
See Also: → 374795
Blocks: 565552
I imagine this is related to bug 640856 in some way. (Not sure if enough to dupe, though.)
I've got the same problem so I've taken a look at it. One possible way to solve it is to use ICU Transliterator to do the work for us:

Given the example given by Ice Wolf:
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)

Using ICU Transform tool (http://www.icu-project.org/icu-bin/translit) with the rule ":: latin; :: nfkd; :: Latin-ASCII;" gives a not so bad result:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: sin = sin (combining jamo vs precomposed hangeul)
  : s = s (stylized characters = base form)

  1 = 1 = 1 = 1 = 1̱ = ፩ = ... (numerals, cross script)

One problem with this rule is the fact that non-Latin characters have also been converted (due to the rule ::latin); if we disable it we get:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ᄉ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = 1 = ۱ = ፩ = ... (numerals, cross script)

In this case, numerals are not transformed. I do not have much knowledge of ICU's transliteration, but if anybody thinks that this can be used to solve this bug I can look into it more deeply.

One remaining problem with this solution: by default we compile with -DUCONFIG_NO_TRANSLITERATION, which disables these features. We would also have to check that enabling it does not break anything else.
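For reference, the same rule can be driven from C++ roughly as follows (a sketch that assumes transliteration support is compiled in, i.e. UCONFIG_NO_TRANSLITERATION is not defined; the function name is made up):

    #include <memory>
    #include <unicode/translit.h>
    #include <unicode/unistr.h>

    // Apply the ":: NFKD; :: Latin-ASCII;" transform discussed above to a
    // string in place, e.g. "Téléphonie" should come out as "Telephonie".
    static void FoldForSearch(icu::UnicodeString& aText) {
      UErrorCode status = U_ZERO_ERROR;
      UParseError parseError;
      std::unique_ptr<icu::Transliterator> trans(
          icu::Transliterator::createFromRules(
              "FoldForSearch", ":: NFKD; :: Latin-ASCII;",
              UTRANS_FORWARD, parseError, status));
      if (U_SUCCESS(status)) {
        trans->transliterate(aText);
      }
    }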
Flags: needinfo?(gphemsley)
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
Flags: needinfo?(gphemsley)
Great idea, Mohamed! It certainly looks as if ICU Transliteration/Transforms can do what we need for this bug and bug 640856 (as comment 46 and earlier comments point out the two bugs have rather different scope).

We need to spec out what it is that we want to do, and what UI we need (if any). As an opening bid, my feeling is that we should always be doing canonical normalization, and omitting accents should be available as an option (An option to do compatibility normalization is probably too esoteric to expose in UI). Retaining needinfo=jfkthame for all of this.
Blocks: 724529
Flags: needinfo?(smontagu)
BTW, I deliberately didn't make this depend on bug 864843 and bug 866301 like other bugs that depend on using ICU: on the one hand this feature could be especially useful on mobile devices, where entering accented characters can be even more challenging than with a conventional keyboard, but on the other hand enabling the feature isn't going to break anything even if it isn't available on such devices.
Actually, while I certainly agree that we want a solution here, I don't think the Transliteration/Transforms APIs are the appropriate tool. For one thing, running the content to be searched through one of these APIs would make it difficult to retain a mapping back to the original text -- note that the number of characters present may change considerably; it wouldn't be 1:1. So after finding a "hit" in the transformed content, it would be difficult to identify the correct range in the actual document.

The correct approach, IMO, is what's outlined already in comment 38: what "matches" for search purposes should be based on collation data. Depending whether we use primary keys only, or include secondary/tertiary keys, we could readily support varying levels of "strictness".

http://userguide.icu-project.org/collation

Note that ICU also includes a String Search Service API that is built on top of collation, although I have not studied this API to see how readily we could use it within Gecko.

http://userguide.icu-project.org/collation/icu-string-search-service
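For reference, the plain-string usage of that service looks roughly like this (a sketch only; it sidesteps the DOM question entirely, and the function name is made up):

    #include <unicode/coll.h>
    #include <unicode/stsearch.h>

    // Offset of the first primary-strength match of aPattern in aText, or
    // USEARCH_DONE if there is none. At primary strength, searching for
    // "cafe" will match "café".
    static int32_t FindFirstMatch(const icu::UnicodeString& aPattern,
                                  const icu::UnicodeString& aText) {
      UErrorCode status = U_ZERO_ERROR;
      icu::StringSearch search(aPattern, aText, icu::Locale::getRoot(),
                               nullptr, status);
      if (U_FAILURE(status)) {
        return USEARCH_DONE;
      }
      search.getCollator()->setStrength(icu::Collator::PRIMARY);
      search.reset();  // re-sync the iterator after changing the collator
      return search.first(status);
    }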
Flags: needinfo?(jfkthame)
Jonathan, I agree with you :)

I've written a simple POC using ICU's collation library. I'm not familiar with it, so I read up a little and wrote what seems to be a working example: http://pastebin.mozilla.org/5311663
The output is http://pastebin.mozilla.org/5311662
To test the code a little bit more, I have now tried different locales: http://pastebin.mozilla.org/5311881 and the output http://pastebin.mozilla.org/5311882
In addition to the basic collator I took a look at the string search. The results are quite interesting:
Sample code (French and Arabic used): http://pastebin.mozilla.org/5314754 
Output: http://pastebin.mozilla.org/5314757

I don't know how you feel about it, but it seems that with the different features of ICU's string search (next word, backward search, case sensitive and insensitive) we can do something good here.
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
The real question here isn't whether the collation-based approach (whether using the basic collator interface, or the string-search module) can provide the kind of "loose matching" that would be useful. That's already pretty well understood. The interesting (perhaps tricky?) part here will be figuring out how to efficiently use those tools within Gecko, where the text to be searched isn't simply a C or C++ string, it's data within the DOM, probably spread across an extensive hierarchy of different elements.

Perhaps we could implement mozilla::DOMIterator as a concrete subclass of icu::SearchIterator, and then use that as the tool to implement Find-in-Page behavior. But I haven't looked into exactly what that would involve.
Flags: needinfo?(jfkthame)
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

I have no idea how this would/could work in C++ as I barely have any knowledge of it, but FWIW I was (and still am, until this bug is resolved at least) going to implement this search feature in FindBar Tweak by making use of the nodes' .textContent property, stripping out the text of head, style and script tags through a custom method I wrote: https://github.com/Quicksaver/Bootstraped-Addon-Backbone/blob/master/resource/modules/utils/HTMLElements.jsm

I haven't done any extensive testing, but from what I could tell when I first wrote that, it's a pretty fast routine, perfectly usable in any normal-to-large sized webpage. Also, I'm guessing this question comes up because that API needs to be provided with a full text string or something, and the current fastFind (which, if I read the code correctly, goes through the page's nodes on the spot) wouldn't work with something like this?
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

Isn't that a solved problem? I was thinking that we could plug the ICU search tools into the current implementations of nsIWebBrowserFind and nsIFind and not have to reinvent the wheel.
Flags: needinfo?(smontagu)
(In reply to Simon Montagu :smontagu from comment #58)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

We can do that by `normalize`-ing the character data extracted from text nodes. nsFind is basically a tree walker that extracts matches by pulling one character off the text content and advancing to the next character in the needle upon match.

That's why the ICU search methods/algorithm won't work for nsFind; it's not a simple substring-matching search. However, since we only _really_ need to normalize individual characters in the most common case, I'd say using normalization is a good first step.

If we'd like locale- and grammar-aware normalization, we'd need to keep track of character positions before and after. Or can we be sure that no character positions change when we feed it the entire node contents?
(In reply to Simon Montagu :smontagu from comment #58)
> (In reply to Jonathan Kew (:jfkthame) from comment #56)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

Well, we certainly have existing code that handles this - what I'm unsure about is how easily we'll be able to fit the ICU tools into that. Maybe it'll be straightforward, but I suspect not entirely. Ultimately, nsFind iterates through the document elements by characters, but we can't plug in a collator-based match at that level; it needs to see strings (which may span element boundaries), not individual characters.
Yes, clearly at the lowest level of nsFind we would have to pull whole strings out of nodes to pass to ICU, probably with some degree of look-ahead, rather than iterate by character as it does now.
(In reply to Mike de Boer [:mikedeboer] from comment #59)
> That's why the ICU search methods/ algo won't work for nsFind; it's not a
> simple substring-matching-search. However, since we only _really_ need to
> normalize individual characters in the most common case, I'd say using
> normalization is a good first step.

Normalization is a good step, but note that it can't be done on a codepoint-by-codepoint basis, which is how nsFind currently walks; it may involve composing, decomposing and reordering sequences of codepoints.
Hi, I am interested in working on this bug.  I am working on a solution that has two parts:

1) For Hebrew, it ignores combining code points in the web page.

2) For Latin alphabets, I have some small tables generated by Perl that map characters to base letters.
I have included them in nsFind.cpp.  The Perl script uses normalization algorithms to generate the mappings to the base letter.  Additionally, I map some pairs of characters to handle ligatures.
It seems to perform pretty well, and handles quite a few languages. It matches up pretty well with the Chrome behavior.

Does nsFind have an owner?  This is the first time I have tried to contribute to Firefox.
This nsFind.cpp should ignore Hebrew and Arabic vowel-sign combiners and Latin-alphabet accents, and should allow you to search Vietnamese without typing accents.  It supports some ligatures.
Additionally it seems desirable to ignore kashida.  Chrome has that behavior.
Flags: needinfo?(jst)
Flags: needinfo?(mdeboer)
Hi Jonathan! Thanks for your interest in working on this bug! Also, apologies for the late reaction; I was on holiday for the past two weeks.

It would help _a lot_ if you'd take a bit of time to read the following - quite short - guide:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch

As for the approach you're describing in comment 64; I don't know if that approach is the correct one, I simply don't have enough knowledge to give you a reasonable answer. For that, I need to defer your question to jfkthame. I hope he doesn't mind ;-)
Flags: needinfo?(mdeboer)
Flags: needinfo?(jst)
Flags: needinfo?(jfkthame)
I'm not really comfortable with the approach here of adding ad hoc character-mapping tables to nsFind.cpp. We should instead be leveraging the Unicode character property tables already present in intl/unicharutil and/or ICU.

Regarding the overall description in comment 64: it sounds like this is focused on just a couple of scripts. Really, an "ignore diacritics" option should be applied uniformly across all scripts.

The "ignore combining code points" part of this could probably be implemented based on looking up the Unicode general category of the character; and support for stripping accents off precomposed letters could be implemented by applying NFD normalization and then stripping any combining marks from the normalized form.

Given that these basic functions are already present in the codebase (best to use the ICU implementations, as nsIUnicodeNormalizer implements an old Unicode version), we shouldn't be adding lookup tables here. If the performance of looking up decompositions, etc., on the fly and then stripping diacritics turns out to be a problem (I'm not sure if it will), then nsFind could perhaps cache the results for individual characters as they're encountered.
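As a rough sketch of that per-character approach using ICU directly (illustrative only; the function name is made up and this is not what nsFind or intl/unicharutil actually expose):

    #include <unicode/uchar.h>
    #include <unicode/unorm2.h>
    #include <unicode/utf16.h>

    // NFD-decompose one character and return the first code point that is
    // not a combining mark; characters with no decomposition are returned
    // unchanged. E.g. U+00E0 'à' -> U+0061 'a'.
    static UChar32 StripDiacritics(UChar32 aChar) {
      UErrorCode status = U_ZERO_ERROR;
      const UNormalizer2* nfd = unorm2_getNFDInstance(&status);
      if (U_FAILURE(status)) {
        return aChar;
      }
      UChar buf[16];
      int32_t len = unorm2_getDecomposition(nfd, aChar, buf, 16, &status);
      if (U_FAILURE(status) || len <= 0) {
        return aChar;  // no canonical decomposition
      }
      for (int32_t i = 0; i < len; /* i is advanced by U16_NEXT */) {
        UChar32 c;
        U16_NEXT(buf, i, len, c);
        if ((U_GET_GC_MASK(c) & U_GC_M_MASK) == 0) {
          return c;  // first non-mark code point is the base letter
        }
      }
      return aChar;
    }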
Flags: needinfo?(jfkthame)
If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts, that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й, because the latter NFD-decomposes into U+0418 + U+0306.  I suspect this would not be a desirable equivalence when searching Russian and Ukrainian text.  I tried it in Chrome, and Chrome did not equate U+0418 И and U+0419 Й, though it ignores accent marks for Latin text.

For that reason, along with my general lack of expertise outside of a few languages, it seems to me difficult to apply an NFD-based fix to all languages without special-casing.
(In reply to JonathanW from comment #70)
> If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts,
> that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й,
> which NFD decomposes into U+0418 + U+0306.  I suspect this would not be a
> desirable equivalence in searching Russian and Ukrainian text.  I tried it
> in chrome, and chrome did not equate  U+0418 И and U+0419 Й, though it
> ignores accent marks for latin text.

I doubt equating 'И' and 'Й' is any less desirable for searching Russian and Ukrainian than it is to equate a/å, a/ä, o/ö when searching in Swedish; they're considered entirely separate letters of the alphabet.

The fact that Chrome ignores diacritics in Latin but not in Cyrillic is just a limitation of their implementation, I suspect. Let's do better -- and be more consistent!
bug-2022251-icu-collator-fix.patch attached implements asymmetric matching.

It uses ICU properties to find "ignorable" combiners in the target text.

Try it on some of the suggested behavior in this bug, and see what you think.
Flags: needinfo?(jfkthame)
This sounds promising, thanks! I haven't had time to test or go through it carefully yet but will try to do so soon; leaving the needinfo? flag for now to keep it on my radar.

One note: from a brief look, I think the behavior will be a bit inconsistent in that accents in the *search* string will be ignored if they're part of precomposed letters (so searching for "naïve" will find "naive"), but NOT if they're encoded using combining marks; whereas in the *target* string the combining mark versions would also be ignored. That seems unfortunate... either we should consistently ignore all accents, or we should respect accents in the search string regardless whether they're precomposed or combining-char.
Are there some unit tests that cover the find in page functionality?
Flags: needinfo?(mdeboer)
(In reply to Jonathan Kew (:jfkthame) from comment #73)
> This sounds promising, thanks! I haven't had time to test or go through it
> carefully yet but will try to do so soon; leaving the needinfo? flag for now
> to keep it on my radar.
> 
> One note: from a brief look, I think the behavior will be a bit inconsistent
> in that accents in the *search* string will be ignored if they're part of
> precomposed letters (so searching for "naïve" will find "naive"), but NOT if
> they're encoded using combining marks; whereas in the *target* string the
> combining mark versions would also be ignored. That seems unfortunate...
> either we should consistently ignore all accents, or we should respect
> accents in the search string regardless whether they're precomposed or
> combining-char.

I tried the "naive" variations, and they all seem to find each other as expected.
JonathanW: not unit tests, but there's an old page with a collection of manual regression tests for find-in-page covering some of the regressions that used to slip in when the find code was updated:
http://www-archive.mozilla.org/quality/browser/front-end/testcases/xpapps-gui/find-on-page/find-in-page.html
It was written before Mozilla had any automated testing framework (in fact, it probably even predates find-as-you-type); perhaps it would be possible to automate it now. It does not include any diacritics, and should.
Comment on attachment 8672412 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators

Review of attachment 8672412 [details] [diff] [review]:
-----------------------------------------------------------------

This shows a lot of promise -- I tried it in a local build, and it makes a lot of common cases behave as expected, with Find happily ignoring accents.

However, there are still cases that don't work as well, e.g. once multiple accents get involved. With the testcase

  data:text/html;charset=utf-8,x&%23x1d8;x x&%23xfc;&%23x301;x xu&%23x308;&%23x301;x

(which displays "xǘx xǘx xǘx"), searching for "xux" will indeed find all three instances; but searching for xǘx may or may not work depending whether the search string is entered in precomposed or decomposed form.

I think -- without having attempted to code it -- that the right way forward here is indeed to use a collator, but rather than walking through the search and target strings one character at a time, we should walk one *cluster* at a time, where a cluster includes any trailing combining marks. We have an existing ClusterIterator class somewhere, but because nsFind is iterating over the DOM content rather than over simple text strings, I don't think we'll be able to use that; it'll need to be reimplemented on top of the existing nsFindContentIterator, I guess. Or maybe nsFindContentIterator itself should be modified to work in terms of clusters.

Then we'll be calling coll->compare() on a pair of clusters rather than individual codepoints, and that will take care of the accents as well as all canonical-equivalence issues within the cluster.

This will also allow us (in strict-matching mode, i.e. when the user doesn't want to ignore accents -- which I think we need to offer, though what UI if any to expose is still a question) to avoid the problem of finding "u" when the document contains "u&#x308;&#x301;".

The other thing that would be helpful would be if you can take a look at the mozilla coding style guidelines[1] and try to ensure that your patch follows these -- in particular, things like whitespace (no tabs!), bracing of control structures, etc. This makes it easier to review patches, and issues like this will need to be tidied up before a patch here is landed in the tree.

And thanks for working on this -- I believe it'll make a lot of people happier!

::: embedding/components/find/nsFind.cpp
@@ +862,5 @@
> +// ignore them in the target text while matching the pattern
> +// for instance, and A in the pattern can match A followed by
> +// any number of combiners in the text
> +// additionally some other ignorable characters can be coded here
> +// for instance kashida

Please use capitals and periods to punctuate long comments as proper sentences; this makes them much more readable.

@@ +867,5 @@
> +static bool IsCombiner(const char16_t c) {
> +
> +	// ignore Kashida in Arabic
> +	if (0x0640 == c)
> +		return true;

There are a whole bunch of Unicode characters that would probably make sense to ignore. We should try checking for the Unicode property "Default_Ignorable" (I expect ICU exposes an API for this) as a first step, before we consider ad hoc additions like this.

(Maybe better to name this IsIgnorableForFind(), to be more accurate, as it will check for various things besides actual combiners.)
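For illustration, the suggested property check could look roughly like this (a sketch; note that kashida U+0640 is not Default_Ignorable, so it would remain an explicit extra case):

    #include <unicode/uchar.h>

    static bool IsIgnorableForFind(UChar32 aChar) {
      // Let the Unicode character database decide what is ignorable
      // (ZWJ/ZWNJ, soft hyphen, variation selectors, ...), and keep the
      // Arabic kashida as an explicit addition.
      return aChar == 0x0640 ||
             u_hasBinaryProperty(aChar, UCHAR_DEFAULT_IGNORABLE_CODE_POINT);
    }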

@@ +884,5 @@
> +  }
> +}
> +
> +/* make an ICU collator */
> +using namespace icu;

As there aren't a huge number of ICU types/functions being used, I think I'd prefer to explicitly prefix them with icu:: rather than having the 'using namespace' here.

@@ +886,5 @@
> +
> +/* make an ICU collator */
> +using namespace icu;
> +
> +static RuleBasedCollator* static_collator=NULL;

Please move this inside the getCollator function.
(In reply to Jonathan Kew (:jfkthame) from comment #79)
> And thanks for working on this -- I believe it'll make a lot of people
> happier!

I couldn't agree more! I'm stoked and looking forward to the next iteration of the patch!
(In reply to Jonathan Kew (:jfkthame) from comment #79)

Oops, sorry .... I intended to drop in a link to the Coding Style page:

> The other thing that would be helpful would be if you can take a look at the
> mozilla coding style guidelines[1] and try to ensure that your patch follows
> these -- in particular, things like whitespace (no tabs!), bracing of
> control structures, etc. This makes it easier to review patches, and issues
> like this will need to be tidied up before a patch here is landed in the
> tree.

[1] See https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style
Flags: needinfo?(jfkthame)
Here is a new version of the patch that include a mochitest and which handles the case of multiple combiners in the pattern.
Attachment #8536092 - Attachment is obsolete: true
Attachment #8536095 - Attachment is obsolete: true
Attachment #8672412 - Attachment is obsolete: true
Flags: needinfo?(jfkthame)
Does this new version of the patch fail any use cases?
Sorry, I haven't had time to build and test this yet; I'll try to get to it soon. Leaving the needinfo? flag so that it'll remind me....

One question: will it find things like "ae" in the target text if the search string contains "æ", or is that only supported the other way round (find "æ" when searching for "ae")?
If the search string contains "æ" it will not find "ae" in the target.
Comment on attachment 8683344 [details] [diff] [review]
New patch with formatting and code review fixes and rebased to latest.

Review of attachment 8683344 [details] [diff] [review]:
-----------------------------------------------------------------

::: embedding/components/find/nsFind.cpp
@@ +41,5 @@
> +#include "unicode/uloc.h"
> +#include "unicode/ucol.h"
> +#include "unicode/locid.h"
> +#include "unicode/tblcoll.h"
> +#include "unicode/sortkey.h"

I don't think it's necessary to #include all these headers. AFAICT, just "unicode/tblcoll.h" should be sufficient; this is where RuleBasedCollator is declared, and it'll pull in whatever else is needed for that declaration to compile.

@@ +862,5 @@
> +// Ignore them in the target text while matching the pattern.
> +// For instance, an A in the pattern can match A followed by
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.

Thinking about this some more, I'm having doubts as to whether it's a valid approach here.... the trouble is that not everything that has a "combining mark" character type is equally valid to ignore. In particular, this will result in ignoring all vowel marks in all the Indic scripts, so that Find will be unable to distinguish between (for example) बेटा "son" and बेटी "daughter" in Hindi, or between किताब "book" and कातिब "writer", and many more such cases.

To get this right, I think we really need to let the collator compare grapheme clusters (not just individual characters) from the pattern and the target text; and the ignoring of (some) diacritics then occurs because the collator knows what level of difference they represent.

(See also comment 79, where I tried to suggest this approach; I'm increasingly convinced that's what we need to do in order to get this right.)

@@ +864,5 @@
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.
> +
> +static bool IsIgnorableForFind(const char16_t c) {

Note that this won't be able to detect ignorable chars outside the BMP. There are a bunch of combining marks in plane 1, which will require handling surrogate pairs (although not all of them should actually be ignored); and then there are the variation selectors in plane 14.

(However, see above; I don't think this approach in general is sufficiently flexible to handle the variety of things Unicode throws at us.)

@@ +890,5 @@
> +// and does not depend on the language of the html page
> +//
> +
> +static icu::RuleBasedCollator *
> +getCollator()

Capitalize function name.

@@ +892,5 @@
> +
> +static icu::RuleBasedCollator *
> +getCollator()
> +{
> +  static icu::RuleBasedCollator* static_collator = NULL;

We use |nullptr| rather than |NULL|.

@@ +902,5 @@
> +      static_cast<icu::RuleBasedCollator*>
> +        (icu::Collator::createInstance(icu::Locale::getRoot(), status));
> +
> +    if (U_FAILURE(status))
> +      return NULL;

Braces even for a single-line |if| body. (And again later on.)
I think ICU is currently compiled without the break iterator.  

ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

Is it okay to add that break iterator for this feature?
(In reply to JonathanW from comment #88)
> I think ICU is currently compiled without the break iterator.  
> 
> ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> Is it okay to add that break iterator for this feature?

I expect so. (How much size does it add to the compiled size?)

An alternative would be to use IsClusterExtender (from nsUnicodeProperties.h), perhaps encapsulated in a simple iterator object, to manage iteration-by-cluster. If the ICU break-iterator pulls in a lot of additional bulk (maybe because it supports iterating by various break types?), that might be preferable.

I'd say try whichever is easiest to code, for a start; if that's ICU break-iterator, but we find that it does add a lot of bulk, we could look into replacing it with a smaller, single-purpose version once the overall functionality is proven to work.
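For illustration, a small single-purpose cluster step based on IsClusterExtender might look like this (a sketch that assumes ICU's U16_NEXT macro and nsUnicodeProperties.h are available in this translation unit; not actual nsFind code):

    #include <unicode/utf16.h>
    #include "nsUnicodeProperties.h"

    // Return the offset just past the grapheme cluster that starts at aPos:
    // one base character plus any trailing cluster extenders.
    static int32_t ClusterEnd(const char16_t* aText, int32_t aLength,
                              int32_t aPos) {
      UChar32 ch;
      U16_NEXT(aText, aPos, aLength, ch);  // consume the base character
      while (aPos < aLength) {
        int32_t next = aPos;
        U16_NEXT(aText, next, aLength, ch);
        if (!mozilla::unicode::IsClusterExtender(ch)) {
          break;  // ch begins the next cluster
        }
        aPos = next;  // the extender belongs to the current cluster
      }
      return aPos;
    }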
Flags: needinfo?(jfkthame)
Here is some code that handles clusters.
The test code includes the Hindi Son and Daughter case.

In addition to modifications included in the patch, it is also necessary to change the configure script to include ICU Break iterators by commenting out this line:

30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

I am not sure why Mercurial did not include that file in the patch.

This patch adds a small set of brkitr data files to icu data.  I am not sure that I have determined the minimum set of data files correctly.  I excluded the large dictionary files.  

While this solution shows some promise, it is not perfect, and not totally symmetrical.

Another direction to consider would be to use the ICU search service.  It seems like that would be possible to code, too.  However, Firefox currently supports two find features that I do not see an easy way to implement with the search service.  First, the "whitespace" feature, which allows one or more whitespace characters in the pattern to match a greater number of whitespace characters in the page.  Second, the feature of ignoring a LF between CJ characters.  I would be happy to try to code an alternate solution with the search service if we are willing to consider removing those two features, or if someone can suggest a way to implement them using collation rules or other methods.
Attachment #8683344 - Attachment is obsolete: true
Flags: needinfo?(jfkthame)
Thanks for continuing to work on this!

(In reply to JonathanW from comment #91)
> In addition to modifications included in the patch, it is also necessary to
> change the configure script to include ICU Break iterators by commenting out
> this line:
> 
> 30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> I am not sure why mercurial did note include that file in the patch.

This bit I can answer. :) It's because 'configure' is a generated file (it's created from configure.in and auxiliary stuff). So rather than editing it here, you should modify build/autoconf/icu.m4; you'll find the code to "remove chunks of the library that we don't need (yet)" at about line 140 of that file.

(I haven't looked at the actual patch yet; leaving the needinfo? flag to remind myself.)
Jonathan, in comment 92 you mentioned you wanted to take a look at the patch? It'd be nice if we could move this one forward bit by bit!
Whiteboard: [parity-chrome]
This is the same nsFind.cpp code as the previous patch. This version includes changes to use the new ICU build system.

The sizes added in a Windows build using this patch are 43520 bytes to xul.dll and 535424 bytes to icudt56l.dat.

It appears to work on Ubuntu and Windows.
Attachment #8704321 - Attachment is obsolete: true
Hi Jonathan Kew, could you please try out this patch and let me know your thoughts.  Thanks, Jonathan W
Jonathan, friendly n-i ping ;-)
I've been experimenting with this a bit... it seems to work nicely for "loose" matching (i.e. ignoring diacritics) in Latin script.

I'm less comfortable with the behavior that results for Indic, however; we may need to re-think this and find a slightly different approach. The user experience that results from the cluster-by-cluster comparison (which was what I suggested we should do) seems less satisfactory than I'd like....

Example: load a page of Hindi text, such as http://www.unicode.org/standard/translations/hindi.html, and then bring up the Find bar. Suppose we want to find "यूनिकोड" (Unicode). So we start typing the word, one character at a time, into the Find field; but see what happens as it grows:

    य    -> highlights a "य" at the end of a word in the middle of the first paragraph
    यू    -> highlights "यू" in the middle of the word "कम्प्यूटर", second paragraph
    यून   -> Phrase not found
    यूनि   -> highlights the beginning of "यूनिकोड" in the first sub-heading
    यूनिक  -> Phrase not found
    यूनिको  -> highlights more of "यूनिकोड" in the first sub-heading
    यूनिकोड -> highlights more of "यूनिकोड" in the first sub-heading

It's quite disconcerting that at certain points while typing the word, we get "Phrase not found", yet adding another character causes it to be found again; but at no point does it actually highlight text in the initial "यूनिकोड" in the title, or the occurrence within the first paragraph, although as a user I would have expected it to search from the beginning of the page. (This is because the first character of the search string, "य", fails to match the initial character of "यूनिकोड", but skips forward to find an occurrence of "य" with no following vowel.)

Still thinking over this; it's not clear to me quite what the best approach is going to be, but I'm toying with some ideas... Maybe something as simple as allowing a partial-cluster match at the end of the search string would help, so that "य" can successfully find the "य" at the beginning of "यूनिकोड".
Flags: needinfo?(jfkthame)
Hi Jonathan Kew,  Thanks for the feedback.  I will look into that undesired behavior.
Blocks: 1297691
Hi Jonathan Kew,

I did some experiments with ICU collators and the Hindi characters in the example यूनिकोड. In addition to using a collator based on the root locale with strength primary, I also tried the Locale("hi", "IN").  In both cases, the YA and YA + vowel sign UU compare as different.  See output below.

ICU String Search does find YA as an initial match in YA + UU.

I think we could tweak the proposed fix to look for binary matches in the case where the collator returns not equal.  I plan to code that and resubmit.

Another option would be to refactor the code to use icu StringSearch service as in comment 91.


Using Root locale collator

Cluster #य# len=1 YA
Coll key:[-128, 0, 9, 48, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[-128, 0, 9, 48, -128, 0, 9, 67, 0, 0, 0, 0]
compare य यू => -1


Using Hindi locale collator
Cluster #य# len=1  YA
Coll key:[0, -92, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[0, -92, 0, -76, 0, 0, 0, 0]
compare य यू => -1

Thanks, Jonathan Walden
(In reply to JonathanW from comment #103)
> Hi Jonthan Kew,
> 
> I did some experiments with icu collators and the Hindi characters in the
> example यूनिकोड.In addition to using a collator based on the Root locale
> with strength primary, I also tried the Locale("hi", "IN").   In both cases,
> the YA and YA + vowel sign UU compare as different.  See output below.  

Right, considered as clusters they definitely differ. The trouble is that while per-cluster matching is the appropriate way to compare whole strings, it is also reasonable for a user to search for a single letter such as YA and expect to find it as part of the YA+UU cluster, not _only_ when it occurs _without_ a following vowel sign.

I think the only situation where this becomes an issue is at the end of the search string. _Within_ the search string, we have clear boundaries on both sides of the current cluster, but at the end, we don't really know if the trailing edge of the search text must be a cluster boundary or may fall in the middle of a cluster.

> 
> ICU String Search does find YA as an initial match in YA + UU.

OK, good to know it seems to agree this is an expected result.

> 
> I think we could tweak the proposed fix to look for binary matches in the
> case where the collator returns not equal.  I plan to code that and resubmit.

Rather than looking for binary matches, I'd suggest trying something along the lines of

    if (clusters don't match) and
       (we are at the last cluster of the search-pattern) {
        while (search cluster length > one codepoint) {
            trim last codepoint from the search cluster (don't forget to check for surrogate pairs!)
            repeat comparison with the text cluster, if it matches we have a hit
        }
    }


> Another option would be to refactor the code to use icu StringSearch service
> as in comment 91.

Yes, that might also be an interesting possibility, though I suspect it may be harder to interface this to the DOM text we're trying to search.
Here is another version of the test patch.  This patch addresses the concern with Hindi reported by J. Kew in comment 101.  It also supports the find-in-page features of whole-word matching and strict quote matching.

There are some internationalization issues with the whole-word feature which I have not tried to fix in this revision.

Please try it out, and let me know your thoughts.
Attachment #8756035 - Attachment is obsolete: true
Flags: needinfo?(jfkthame)
Jonathan (Kew) and Jet, I'd really love to see something happen here, since:

1. The Find Backend is a rather important accessibility feature,
2. ICU has been integrated into m-c for quite a while now, so there should be more(?) domain expertise spread out to other devs,
3. This is something where we really need to catch up with WebKit/Blink.

What can I do to make this bug move forward?
Flags: needinfo?(bugs)
This is a hard one partly because it's difficult to pin down exactly what the "desired" behavior would be, once you start considering a wide variety of languages/writing systems. Rules that seem obviously "correct/useful" from a Western point of view, for example, may give results that are equally obviously "wrong" in some Asian language, etc. One person's "insignificant diacritic that should be ignored when searching" may be another person's "critical distinguishing feature". That's a large part of why we've been through several iterations of ideas, attempts to implement, and then reconsidering when we see how the ideas actually behave in practice. Sorry it's been stalled for so long, though; we really should try to move it forward.

It's not clear how much we can directly benefit from the ICU integration here, because although ICU has a StringSearch service that offers potentially useful behaviors, the task here is not actually to search strings; it's to search the textual content in the DOM. Interfacing text in the DOM to ICU's service would be non-trivial. We might be able to do this by providing an implementation of icu::CharacterIterator that iterates over text in the DOM, instead of over a simple string.

Anyhow, part of what we need here -- besides people to actually hack on the code, where JonathanW was making good progress -- is a way to get testing and feedback from people using a wide variety of languages, and having a good awareness of how different levels of search (like exact vs approximate matching) ought to behave. Not many reviewers/testers have that sort of global perspective, unfortunately; it's about a different kind of experience than just having knowledge of the code.


JonathanW, I'm sorry this has languished so long, after all the time and effort you put into it; if we try to get it moving again, are you still available/interested to continue working on this?

Mike, would you be the right person to discuss what potential UI should look like here? One of the tricky questions is figuring out how many "knobs" can/should be exposed to users. Currently, we have "Match Case" and "Whole Words" toggles in the Find bar; can we have an additional one for "Match Diacritics"? Or should we replace "Match Case" with "Exact Match" and expand its scope, rather than a separate option? Or have a more general "Fuzzy Match"? All sorts of possibilities/questions.... but knowing what we can reasonably expose in UI may also affect what options the code needs to implement.
Flags: needinfo?(mdeboer)
Flags: needinfo?(jonathan_walden)
Flags: needinfo?(jfkthame)
Hi Jonathan Kew,

In internationalization organizations where I have worked in the past, we identified individuals in the company who could provide specific advice and review for the languages/locales that we support.  That seems like something Mozilla needs, and something that potential contributors around the world could provide.

Naturally, Mozilla has more locales to support than those past commercial efforts, which never even got to 20 languages.  Even then, there was plenty of hair-splitting over the correct handling of relatively simple things, like Canadian French capitalization versus France French capitalization.

There are a million varieties of feature we could explore here.  For instance, as someone who learned Traditional Chinese characters, and, oddly, Pinyin, I would like a search that "transliterates" Pinyin and Simplified Chinese, along with Zhuyin Fuhao.

As someone who has a hard time even picking out Hebrew and Arabic characters, a transliteration search system would be useful for me with those languages.  No browser has that yet.

But in order to actually implement something useful, we have to settle on something.  That was the spirit of my first submission 3 years ago: to handle the 90% case with Latin diacritics, and Hebrew, since users had asked for those, and to do so in a way that has very limited impact on the code base and on the memory and disk-space footprint.

Since then we have gone down the path of more complex trials, involving significant disk space and performance impacts.  Including some of the features in ICU created a lot of bloat.
Flags: needinfo?(jonathan_walden)
(In reply to JonathanW from comment #109)
> But in order to actually implement something useful, we have to settle on something...  
<snip>
> ...to handle the 90% case with Latin diacritics, and Hebrew, since users had asked for those, and to do so in a way that has very limited impact to the code base and the memory and diskspace footprint.

Indeed. At this time, we are settling for what we have today. I know that sounds unsatisfying, but I feel it should be clear that there is currently no work going on to fix this bug.

I can be convinced to take a fix that gets us to this proverbial 90% case if we can devise something clever, as we did for bug 196175 (another bug for which a wonderfully complex solution was proposed and ultimately a simpler fix did the trick).
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(bugs)
Resolution: --- → WONTFIX
I'm re-opening this bug, because I don't want to cancel out our wish to eventually have this supported. I don't have the required skills to implement this, so it'd need time from another knowledgeable contributor! I'm certainly available to mentor.
Status: RESOLVED → REOPENED
Flags: needinfo?(mdeboer)
Resolution: WONTFIX → ---
Keywords: parity-chrome
Whiteboard: [parity-chrome]
Status: REOPENED → NEW
QA Contact: mdeboer

I'm trying to permanently switch from Chrome, and I'm hitting this issue almost every time I search for something. Chrome's behavior seems very natural; I'd never realized or appreciated what it was doing until I tried to search for something in Firefox and it didn't find anything.

I think a reasonable approach forward here is to introduce a toggle for it (to be consistent with the existing options (Match Case, Whole Words...)), start with languages where the right behavior is more or less obvious (ě -> e, á -> a, etc.), and handle more languages in new tickets as requested by users. You will never be able to handle every language in existence, and trying to encapsulate everything in one go is how you get these 17-year-old tickets.

Also, I find it funny that there are people in Chromium's bug tracker complaining about the opposite and calling for the same behavior as Firefox.

https://bugs.chromium.org/p/chromium/issues/detail?id=71741
https://bugs.chromium.org/p/chromium/issues/detail?id=687552

I just uploaded a relatively simple patch that implements this feature using Unicode decomposition and a cache. It's a 90% solution that will not work as well in some languages as in others, but it keeps searches fast. I turned it on by default and added a new button to the find toolbar to turn it off.

For the same reason as in bug 969980 (because a complete solution would also do things like split digraphs into two characters), I put the diacritic-stripping functions in the "internal API" section.

What I've implemented is exactly what was proposed in comment #69. I realize that there has been a lot of discussion on this topic and short of implementing locale-specific rules, there is no solution that will make everyone happy. However, I believe that what I have is a reasonable solution that will help a lot more people than it annoys.
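For readers following along, here is a minimal sketch (mine, not the actual patch; StripDiacritics and sCache are hypothetical names) of the general approach described above: decompose each character with ICU's NFD normalizer, keep only the base character, drop combining marks, and cache the results so find-as-you-type stays fast.

// A rough sketch of the approach, not the landed code.
#include <unicode/normalizer2.h>
#include <unicode/uchar.h>
#include <unicode/unistr.h>
#include <unordered_map>

static UChar32 StripDiacritics(UChar32 aChar) {
  // Standalone combining marks (e.g. Hebrew nikud written as separate code
  // points in the page text) are reported as 0 so the caller can skip them.
  if (u_getCombiningClass(aChar) != 0) {
    return 0;
  }

  static std::unordered_map<UChar32, UChar32> sCache;  // assumes single-threaded use
  auto cached = sCache.find(aChar);
  if (cached != sCache.end()) {
    return cached->second;
  }

  UChar32 result = aChar;
  UErrorCode status = U_ZERO_ERROR;
  const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
  if (U_SUCCESS(status)) {
    // For a precomposed character like U+00E9 (é), NFD yields the base
    // letter followed by combining marks; keeping the first code point
    // discards the accents.
    icu::UnicodeString decomposed =
        nfd->normalize(icu::UnicodeString(aChar), status);
    if (U_SUCCESS(status) && decomposed.length() > 0) {
      result = decomposed.char32At(0);
    }
  }
  sCache[aChar] = result;
  return result;
}

Keeping only the first code point of the decomposition is what makes this a 90% solution: it drops combining marks like Hebrew nikud and Latin accents, but does nothing for digraphs, multi-character decompositions, or locale-specific mappings.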

Thank you Alex!

Just to make my position from Phabricator clear: I believe we should minimize our reliance on ICU. We currently use it where needed, but we're also slowly replacing pieces of it with Rust Unicode crates, and in the future I can see us replacing it completely.

If Manish says that plugging this into unicode-normalization is easy, I'd love to see us using that crate (which is already vendored in, so we already carry the data), but I believe it's more important to get this feature, so if it's much easier to land it as-is, I'd just ask that we file a follow-up bug to replace the backend with unicode-normalization.

I wouldn't mind using Rust here; it just seems like a big pile of extra work for something that we'll eventually have to redesign anyway to support locale-specific and multi-character decompositions. So yeah, I'd rather leave the transition to Rust for a follow-up bug.

> So yeah, I'd rather leave the transition to Rust for a follow-up bug.

Sounds good to me! It would be good then to use the review process here to verify that the way the API is written makes the switch to unicode-normalization easy.

Since you just landed bug 1590167, which migrated hyphenation to Rust: do you have any thoughts on using ICU vs. unicode-normalization (both vendored in) for this?

Flags: needinfo?(jfkthame)

It looks like it shouldn't be very hard to use the Rust unicode-normalization crate for this, by implementing a GetNaked function based on https://docs.rs/unicode-normalization/0.1.9/unicode_normalization/char/fn.decompose_canonical.html, but it's a little more effort than using ICU, as it doesn't look like unicode-normalization currently offers an appropriate FFI interface.

So given that we're already using ICU's normalizer2 in Firefox C++ code (it's used in at least js/src/builtin/String.cpp and gfx/thebes/gfxHarfBuzzShaper.cpp), ISTM the simplest way forward is to use it here as well. We could have a followup bug to replace the use of ICU normalization APIs with the Rust crate, which will presumably involve first adding FFI functions on the Rust side.

Flags: needinfo?(jfkthame)
Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b89936db7178
Add an option to ignore diacritics when searching. r=fluent-reviewers,mikedeboer,jfkthame,flod
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/828a10a79e71
Add an option to ignore diacritics when searching. r=fluent-reviewers,mikedeboer,jfkthame,flod
Status: NEW → RESOLVED
Closed: 6 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla73
Assignee: nobody → alexhenrie24
Flags: needinfo?(alexhenrie24)
Regressions: 1603636

Thank you for working on this! Works like a charm.

on firefox 73.0.1, this is not working for me with Hebrew diacritics

e.g. on: https://www.mechon-mamre.org/i/t/t0101.htm
בְּרֵאשִׁית

is not matched when searching for בראשית

(In reply to eyal gruss (eyaler) from comment #130)

> on firefox 73.0.1, this is not working for me with Hebrew diacritics
> 
> e.g. on: https://www.mechon-mamre.org/i/t/t0101.htm
> בְּרֵאשִׁית
> 
> is not matched when searching for בראשית

That is bug 1611568, and it will be fixed in Firefox 75.

On Firefox 74.0, this is not working with the "ı" letter. ı/I should be matched with i/İ.

"I" is described in Turkic section here https://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

Ozkuslar, could you open a new bug report about that and add me to its CC list please?

Depends on: 1624244

(In reply to Tomer Cohen :tomer from comment #0)

<snip>

After the fix for https://bugzilla.mozilla.org/show_bug.cgi?id=1611568 in Firefox 75, it works.
Happy Passover!

I just noticed that this feature was never documented
https://support.mozilla.org/en-US/kb/search-contents-current-page-text-or-links

Not exactly sure what the process is to get it fixed.

Flags: needinfo?(jsavage)

(In reply to Francesco Lodolo [:flod] from comment #135)

> I just noticed that this feature was never documented
> https://support.mozilla.org/en-US/kb/search-contents-current-page-text-or-links
> 
> Not exactly sure what the process is to get it fixed.

Thanks for flagging me on this. We'll get this documented by the end of the week.

Flags: needinfo?(jsavage) → needinfo?(anlazar)
Blocks: 1640217
Depends on: 1620826