Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages

Status: NEW
Assignee: Unassigned
Product: Core
Component: Find Backend
Priority: --
Severity: enhancement
Opened: 14 years ago
Last modified: 3 days ago

People

(Reporter: tomer, Unassigned, NeedInfo)

Tracking

(Blocks: 3 bugs, {intl})

Version: Trunk
Platform: x86
OS: All
Points: ---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [parity-chrome], URL)

Attachments

(1 attachment, 7 obsolete attachments)

(Reporter)

Description

14 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4a) Gecko/20030401
Build Identifier: 

Following is a link to a Passover Haggadah (the Jewish text for the Passover holiday), in Hebrew with inline punctuation (nikud). If you search for Passover ("פסח"), you get only one result - the one in the title bar, which has no punctuation at all - but if you search for the punctuated version of the same word ("פֶּסַח"), you get the other results.

Reproducible: Always

Steps to Reproduce:



Expected Results:  
Mozilla should ignore punctuation characters in search queries, to make it possible to omit them in the search.
(Reporter)

Comment 1

14 years ago
The original post I sent was in ISO-8859-8-I encoding, which does not support punctuation marks the way the windows-1255 encoding does. Here are the same words as in the previous post, hopefully more readable this time.

Passover - "פסח"
Passover with punctuations - "פֶּסַח"
CCing akkana; the problem is that we basically do an opaque match, right?

Comment 3

14 years ago
Right.  We just compare characters in the dom against characters in the search
string, one after another, with no added smarts except for whitespace.  Should
this go to an I18n person who's more familiar with how inline punctuation works?
 Should the find backend be ignoring certain characters in certain encodings?
I don't think we want to automatically ignore all diacritical characters (for
example I might want to search for a specific form with diacritics and not find
the same base characters without diacritics or with different diacritics), but
it would often be useful as an option.

Comment 5

14 years ago
I tend to agree with Simon. Maybe we should add a checkbox somewhere: "ignore diacritics".

Comment 6

14 years ago
Adding a checkbox also means altering the find interfaces, since neither nsIFind
nor nsIWebBrowserFind offers that option.  nsIWebBrowserFind (the one used for
searching web pages) is frozen, so you can't add anything to it; you'd have to
make a new version of it.  nsIFind is not frozen, so I guess something like that
could be added.

These are public interfaces, so think carefully about how the interface should
be specified so it will be useful for different charsets under different
circumstances, and will cause minimal disruption to anyone already using the
current interfaces.

Cc'ing Kin and the other Simon, who were involved with the design of the current
interfaces and should have a chance at reviewing any proposal to change them.

Comment 7

14 years ago
Confirmed with WinXP/1.4a/20030321.

IMO, Find-As-You-Type should ignore diacritics by default, while the
dialog-box-search should include a Match Diacritic option, as in IE6.

Prog.

Updated

14 years ago
Status: UNCONFIRMED → NEW
Ever confirmed: true
*** Bug 247635 has been marked as a duplicate of this bug. ***

Updated

13 years ago
Summary: Find/FindAsYouType will not find text if entered with punctuation ("nikud") in Hebrew, but probably other languages as well. → Find/FindAsYouType will not find text if entered with diacritics ("nikud") in Hebrew, or accented characters in other languages

Updated

13 years ago
Keywords: intl

Comment 9

13 years ago
I am the one who opened #247635, and of course, I agree that there should be an option to disable 'opaque match'. My main problem is on French boards, when searching for a keyword with accents. Many people misspell the accents, and for big pages it can be very difficult to find what I am searching for (I must make multiple tries).

Comment 10

13 years ago
Would this feature require us to create a data table of the accent-less versions of every accented Unicode char? Or does such a table or library function already exist in open source somewhere?
Component: XP Apps → Keyboard: Find as you Type

Comment 11

12 years ago
It does not ignore the French accents (and it should).
For example, I typed "telephonie" on a page where "Téléphonie" was written, and it did not find it.
I have an American keyboard without those French accents (like 'é').
It should be so:
é = ê = è = ë = e
and the same for a, i, ... I guess other languages have the same problem.
*** Bug 312656 has been marked as a duplicate of this bug. ***
Assignee: jag → nobody
Severity: normal → enhancement
QA Contact: pawyskoczka → keyboard.fayt

Comment 13

11 years ago
*** Bug 353647 has been marked as a duplicate of this bug. ***

Comment 14

11 years ago
>Would this feature require us to create a data table of
>the accent-less versions of every accented unicode char?

For Latin alphabets at least, this can be easily generated from
<http://www.unicode.org/Public/UNIDATA/NamesList.txt>.  Map 'LATIN SMALL LETTER Y WITH WHATEVER' to 'LATIN SMALL LETTER Y' and so on.
(Reporter)

Comment 15

11 years ago
As for Hebrew, we should ignore characters 0x0591 to 0x05C7. 

Comment 16

10 years ago
I don't like this. In German, for example, "Bär" and "Bar" are different words with different meanings, and they shouldn't be mixed up by the search engine. On the other hand, somebody with a US keyboard might find it handy to search for "Bar" and find "Bär", because he cannot easily enter an "ä".

I think adding a checkbox will not make much sense; it will probably confuse users.

Any ideas?
Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 
Without diacritics: פסח
With diacritics: פֶּסַח
(In reply to comment #17)
> Comment 0 and comment 1 got corrupted when bugzilla.m.o was converted to UTF-8. 

I just corrected the charsets in the database on those two comments.
(Reporter)

Comment 19

9 years ago
URL changed, since the previous one has gone. (Thanks Yehuda!)
(In reply to comment #16)
> I don't like this. In German, for example, "Bär" and "Bar" are different
> words with different meanings, and they shouldn't be mixed up by the search
> engine. On the other hand, somebody with a US keyboard might find it handy
> to search for "Bar" and find "Bär", because he cannot easily enter an "ä".
> 
> I think adding a checkbox will not make much sense; it will probably
> confuse users.
> 
> Any ideas?

Even in German there are edge cases, such as "Mass" (mass, large amounts) vs. "Maß" (measure), which are homographs in Switzerland but not in Germany (because the Swiss normally don't use the ß); hence the slur "How do the Swiss drink beer?". OTOH in French, accents are rarely printed on uppercase letters, except sometimes the É; I think French words differing only in accentuation would be regarded as homographs anyway.

I see the following possibilities:
- Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);
- Additional key (in addition to / etc.) to trigger search regardless of accents;
- Always ignore accents while searching, except when the string input by the user includes accents (but this would find German Bär "bear" when looking for Bar "bar");
- New preference [ ] Ignore accents when searching (where in the prefs UI would depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb prefs -- or only in about:config but IMHO it ought to have a UI);
- Maybe there are other possibilities: Any other ideas?

Comment 21

9 years ago
(In reply to comment #20)
> Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> "Maß" (measure), which are homographs in Switzerland 

You mean "Masse" and "Maß", which have the same plural ("Massen" / "Maßen") in Switzerland. Note that ß is not an accent of the s but a ligature.

But it is a good idea to treat "ß" and "ss" the same when ignoring accents is activated.

> - Checkbox "Ignore accents etc." in the search dialog (but this is not FAYT);

There is already a very rarely used "ignore case" checkbox. I don't think that this is a good idea.

> - Additional key (in addition to / etc.) to trigger search regardless of
> accents;

I can't say much on this point: for German users like me, the / key isn't very easy to enter (Shift+7), so most of the time it's easier to just activate FAYT or use Ctrl+F.

> - Always ignore accents while searching, except when the string input by the
> user includes accents (but this would find German Bär "bear" when looking for
> Bar "bar");

... and this behaviour would be considered a bug by most German users, I think.

> - New preference [ ] Ignore accents when searching (where in the prefs UI would
> depend on the application => additional bugs for Fx prefs, Sm prefs and/or Tb
> prefs 

I like that idea. This pref could be on by default, but German users who have problems when searching, or who consider this behavior a bug, can deactivate it.

Maybe it's even possible to activate the pref by default only for non-German* Firefox versions?
(In reply to comment #21)
> (In reply to comment #20)
> > Even in German there are edge cases, such as "Mass" (mass, large amounts) vs.
> > "Maß" (measure), which are homographs in Switzerland 
> 
> You mean "Masse" and "Maß" which have the same plural ("Massen" / "Maßen") 
> in Switzerland. Note that ß is not an accent of the s but a ligature.
> 
> But it is a good idea to treat "ß" and "ss" the same when ignoring accents
> is activated.

Yes, I meant treating ß like ss when ignoring accents; and they also have the same dative: "Wie trinken die Schweizer Bier? -- In Masse."

Comment 23

9 years ago
Finding extra matches (e.g. Bär when the user meant just Bar) is a bug, but not a serious one.  The user can just skip to the next match.

On the other hand, failing to find a match is a serious deficiency in a search function.  So I think the right default is to search accent-blind when the search string is just ASCII text.  If the user explicitly enters accents, of course, it's different.
(Assignee)

Updated

9 years ago
Product: Core → SeaMonkey

Updated

9 years ago
Duplicate of this bug: 358307

Updated

9 years ago
Component: Find In Page → Find Backend
Product: SeaMonkey → Core
QA Contact: keyboard.fayt → find-backend

Comment 25

9 years ago
See also bug 475306, a similar bug for smart-quote characters and such.
Hi -- I'm the reporter of Bug 475306

Not sure if it's the best solution, but I feel I should throw this out there: How about a "match case and diacritics etc." box or something to that effect?

Also, if a separate checkbox or pref ends up being the best solution, wouldn't it be possible to disable it by default in German builds?
Oddly, I just ran across the following article while reading something completely unrelated to this bug: http://en.wikipedia.org/wiki/Unicode_normalization

These charts of equivalent characters are a few links from there: http://www.unicode.org/charts/normalization/

Small world, huh?  :)
I'm beginning to think bug 276757 and this should be merged and renamed "Unicode strings not normalized in Find [As You Type]" or something to that effect. They're really the same problem, and fixing this one would (or at least should) fix the other.

Comment 29

9 years ago
I agree.

Normalization of quotation marks and combined characters (ä = ä) should happen in any case.

But I think the replacement of diacritics should depend on the language the text is in.

For example, in German:

ä = ae
ö = oe
ü = ue
ë = e (but aë ≠ ä etc.)
ß = ss

while in some other European languages:

ä = a
ö = o
ü = u
etc.

and in Dutch, for example, there is

ij = ij = ÿ in some areas

I'm not an expert in typography or in the various languages, but doing this right seems to be a big load of work.
Normalization is orthogonal to what this bug and bug 276757 are about.

As comment 29 says, normalization means that searching for ä (a sequence of U+0061 LATIN SMALL LETTER A and U+0308 COMBINING DIAERESIS) should find ä (U+00E4 LATIN SMALL LETTER A WITH DIAERESIS) and vice versa. This bug requests an option to search for a alone, and find ä.

Normalizing to NFC or NFD forms (canonical normalization) will solve the ä = ä issue, but will not help with bug 276757. Even NFKC or NFKD (compatibility normalization) will only solve some of the cases listed in bug 276757 comment 10.
How about treating a character the same as the first character of its canonical-equivalent (decomposed) string?

For example, "à" (LATIN SMALL LETTER A WITH GRAVE) is 00E0, and the normalization chart says it's equivalent to 0061 (LATIN SMALL LETTER A) + 0300 (COMBINING GRAVE ACCENT). It shouldn't be too difficult to infer that if 00E0 = 0061 0300, then 00E0 = 0061.
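
For what it's worth, ICU can answer this directly: the canonical decomposition is available per code point, and the base letter is its first character. A minimal standalone sketch (hypothetical illustration code, not anything in the tree):

    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
      if (U_FAILURE(status)) {
        return 1;
      }
      // U+00E0 LATIN SMALL LETTER A WITH GRAVE decomposes canonically to
      // U+0061 LATIN SMALL LETTER A + U+0300 COMBINING GRAVE ACCENT, so the
      // base letter is the first code point of the decomposition.
      icu::UnicodeString decomposition;
      if (nfd->getDecomposition(0x00E0, decomposition)) {
        UChar32 base = decomposition.char32At(0);
        return base == 0x0061 ? 0 : 1;  // exits 0: 00E0 maps to 0061
      }
      return 1;
    }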

Comment 32

9 years ago
My humble opinion is that the search tool should be as tolerant and permissive as possible. While false positives (see ex. 1) would be rare and unimportant, this would solve all the false-negative issues (see ex. 2), which are really painful.

Ex. 1: search for 'daem'. Either you are German and you probably want this to match both 'daem' and 'däm'; or you are not German and the letter 'ä' probably doesn't appear in your page; or the string 'däm' does appear in your page and does match, in which case you find it odd and just skip it yourself (this should not happen that often).

Ex. 2: you are French and using someone else's computer, with a US keyboard and US locale. Not being able to type "café", you search for "cafe" and expect it to match, whatever the computer's locale!

My suggestion emphasizes that the search tool is a quick and ergonomic feature which doesn't need to be strict and exact like an industrial process, but does need to let you locate a word in the page, even a non-English one.

Updated

7 years ago
Duplicate of this bug: 576945

Updated

7 years ago
Duplicate of this bug: 442070
Duplicate of this bug: 601373
Duplicate of this bug: 596862
Duplicate of this bug: 551960

Comment 38

6 years ago
This info is probably more suited to this ticket.


Chrome default matching:
JP: Katakana=Hiragana (casing difference)
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)


From the looks of all this, it seems that they are using ICU collation data with the sort-key length set to primary differences only, ignoring all secondary (accent), tertiary (case) and quaternary (style) differences, including expansion handling (ß = ss) -- with the caveat that the full primary sort key of any expansion must match for a hit.  That is: "⒓" will match "12." but "12" cannot find a partial match inside "⒓".

Given the primary ICU-EN sort-key for each of the characters, it becomes more
obvious:
1: 159A
2: 159B
.: 028E
⒓: 159A 159B 028E
a: 15A3
ä: 15A3
℀: 15A3 036B 15D1

The differences ignored by find-as-you-type for a permissive match should be based only on the primary sort key; this would give the best user experience and be the most intuitive.

http://www.unicode.org/charts/collation/
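
As a rough illustration of the primary-strength idea (a standalone sketch, not Gecko code), a root-locale collator limited to primary differences compares base letters only:

    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>
    #include <memory>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::Collator> coll(
          icu::Collator::createInstance(icu::Locale::getRoot(), status));
      if (U_FAILURE(status)) {
        return 1;
      }
      // Primary strength ignores secondary (accent), tertiary (case) and
      // quaternary (style) differences.
      coll->setStrength(icu::Collator::PRIMARY);

      icu::UnicodeString plain(u"a");
      icu::UnicodeString umlaut(u"\u00E4");  // ä
      // Equal at the primary level even though the code points differ.
      return coll->compare(plain, umlaut, status) == UCOL_EQUAL ? 0 : 1;
    }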

Updated

6 years ago
Duplicate of this bug: 652434
Blocks: 670601

Comment 40

5 years ago
The same problem occurs with the combining acute accent (U+0301, see http://en.wikipedia.org/wiki/Acute_accent), and I believe with other diacritic marks.

Test case for Russian (fails in Firefox 11.0): "за́мок" == "замо́к" == "замок". Although "за́мок" and "замо́к" have different meanings, the accent should be ignored when searching for "замок" (which is ambiguous).
Duplicate of this bug: 595359
Duplicate of this bug: 812837

Comment 43

4 years ago
Still not resolved? This bug is from 2003... This is the thing I miss most from Chromium after switching.

Comment 44

4 years ago
With bug 276757 landed, Firefox now matches curly quotes when searching for straight quotes. Perhaps the same mechanism can be used to solve the very similar issues in this bug.

By the way, this bug is [parity-ie], since Internet Explorer has had a way of ignoring diacritics for ages.

Comment 45

4 years ago
Please, I'd also like to see this resolved! Not only should at least there be an option to ignore accented characters (and even characters that aren't technically accented, such as "c" and "ç"), but also any characters considered "equivalent," just like bug 413927 proposes.

Also, I believe this should apply everywhere in Firefox where you can search text (not only the findbar), such as the history and the awesomebar.
(In reply to Gustavo Silva from comment #45)
> Please, I'd also like to see this resolved! Not only should at least there
> be an option to ignore accented characters (and even characters that aren't
> technically accented, such as "c" and "ç"), but also any characters
> considered "equivalent," just like bug 413927 proposes.
> 
> Also, I believe this should apply everywhere in Firefox where you can search
> text (not only the findbar), such as the history and the awesomebar.

The current bug is not about treating precomposed ç as c, ø as o, à as a, but about ignoring combining characters such as Hebrew nikkud, Arabic harakaat, or combining accents, cedillas, ogoneks, etc., when a separate "combining" character is used (e.g. U+0301 COMBINING ACUTE ACCENT) together with an unaccented letter. In the latter case it might be nice, though I don't know how feasible, to treat precomposed é (U+00E9 LATIN SMALL LETTER E WITH ACUTE) and decomposed é (U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT) as equivalent (and similarly for other canonical equivalences), but I suppose that would be a different bug from this one.

Updated

4 years ago
See Also: → bug 374795

Updated

3 years ago
Duplicate of this bug: 969984
Blocks: 565552
I imagine this is related to bug 640856 in some way. (Not sure if enough to dupe, though.)
I've got the same problem so I've taken a look at it. One possible way to solve it is to use ICU Transliterator to do the work for us:

Given the example given by Ice Wolf:
DE: ß = ss (not s)
EU: áéíóú = aeiou = áéíóú  
  : æ = ae; ſ = s (historical variants)
  : ㎑ = khz №=no ℡=tel (precomposed letterlike symbols)
  : ⓕⓞⓧ = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ㉦ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = ₁ = ۱ = ፩ = ... (numerals, cross script)

Using the ICU Transform tool (http://www.icu-project.org/icu-bin/translit) with the rule ":: latin; :: nfkd; :: Latin-ASCII;" gives a not-so-bad result:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: sin = sin (combining jamo vs precomposed hangeul)
  : s = s (stylized characters = base form)

  1 = 1 = 1 = 1 = 1̱ = ፩ = ... (numerals, cross script)

One problem with this rule is that non-Latin characters have been converted (due to the rule ::latin); if we disable it, we get:

DE: ss = ss (not s)
EU: aeiou = aeiou = aeiou  
  : ae = ae; s = s (historical variants)
  : kHz = khz No=no TEL=tel (precomposed letterlike symbols)
  : fox = fox (circled letters)
KR: 신 = 신 (combining jamo vs precomposed hangeul)
  : ᄉ = ᄉ (stylized characters = base form)

  1 = ੧ = ௧ = 1 = ۱ = ፩ = ... (numerals, cross script)

In this case, numerals are not transformed. I do not have much knowledge of ICU's transliteration, but if anybody thinks this can be used to solve this bug, I can look into it more deeply.

One problem remains if we consider this solution: by default we compile with -DUCONFIG_NO_TRANSLITERATION, which disables these features. We also have to check that enabling it does not break something else.
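
For reference, a minimal standalone sketch of the mark-stripping transform using the stock "NFD; [:Nonspacing Mark:] Remove; NFC" compound ID (this assumes an ICU build with transliteration enabled, per the caveat above):

    #include <unicode/translit.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      // Decompose, drop combining marks, recompose.
      std::unique_ptr<icu::Transliterator> strip(
          icu::Transliterator::createInstance(
              "NFD; [:Nonspacing Mark:] Remove; NFC", UTRANS_FORWARD, status));
      if (U_FAILURE(status)) {
        return 1;
      }
      icu::UnicodeString s(u"T\u00E9l\u00E9phonie");  // "Téléphonie"
      strip->transliterate(s);

      std::string utf8;
      s.toUTF8String(utf8);
      std::cout << utf8 << std::endl;  // prints "Telephonie"
      return 0;
    }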
Flags: needinfo?(gphemsley)
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
Flags: needinfo?(gphemsley)
Great idea, Mohamed! It certainly looks as if ICU Transliteration/Transforms can do what we need for this bug and bug 640856 (as comment 46 and earlier comments point out, the two bugs have rather different scope).

We need to spec out what it is that we want to do, and what UI we need (if any). As an opening bid, my feeling is that we should always be doing canonical normalization, and omitting accents should be available as an option (an option to do compatibility normalization is probably too esoteric to expose in the UI). Retaining needinfo=jfkthame for all of this.
Blocks: 724529
Flags: needinfo?(smontagu)
BTW, I deliberately didn't make this depend on bug 864843 and bug 866301 like other bugs that depend on using ICU: on the one hand, this feature could be especially useful on mobile devices, where entering accented characters can be even more challenging than with a conventional keyboard; on the other, enabling the feature isn't going to break anything even if it isn't available on such devices.
Actually, while I certainly agree that we want a solution here, I don't think the Transliteration/Transforms APIs are the appropriate tool. For one thing, running the content to be searched through one of these APIs would make it difficult to retain a mapping back to the original text -- note that the number of characters present may change considerably; it wouldn't be 1:1. So after finding a "hit" in the transformed content, it would be difficult to identify the correct range in the actual document.

The correct approach, IMO, is what's outlined already in comment 38: what "matches" for search purposes should be based on collation data. Depending whether we use primary keys only, or include secondary/tertiary keys, we could readily support varying levels of "strictness".

http://userguide.icu-project.org/collation

Note that ICU also includes a String Search Service API that is built on top of collation, although I have not studied this API to see how readily we could use it within Gecko.

http://userguide.icu-project.org/collation/icu-string-search-service
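
As a rough sketch of what the string search service looks like in isolation (hypothetical standalone code; the question of fitting it to Gecko's DOM iteration remains):

    #include <unicode/coll.h>
    #include <unicode/locid.h>
    #include <unicode/stsearch.h>
    #include <unicode/unistr.h>
    #include <iostream>

    int main() {
      UErrorCode status = U_ZERO_ERROR;
      icu::UnicodeString haystack(u"Le caf\u00E9 est ouvert");  // contains "café"
      icu::UnicodeString needle(u"cafe");                       // no accent

      icu::StringSearch search(needle, haystack, icu::Locale::getRoot(),
                               nullptr /* no break iterator */, status);
      if (U_FAILURE(status)) {
        return 1;
      }
      // Match at the primary collation level so accents are ignored;
      // reset() is required after changing the collator's attributes.
      search.getCollator()->setStrength(icu::Collator::PRIMARY);
      search.reset();

      for (int32_t pos = search.first(status); pos != USEARCH_DONE;
           pos = search.next(status)) {
        std::cout << "match at " << pos << ", length "
                  << search.getMatchedLength() << std::endl;
      }
      return 0;
    }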
Flags: needinfo?(jfkthame)
Jonathan, I agree with you :)

I've written a simple POC using ICU's collation library. I'm not familiar with it, so I read up a little and wrote what seems to be a working example: http://pastebin.mozilla.org/5311663
The output is at http://pastebin.mozilla.org/5311662
To test the code a little more, I then tried different locales: http://pastebin.mozilla.org/5311881, with the output at http://pastebin.mozilla.org/5311882
In addition to the basic collator I took a look at the string search. The results are quite interesting:
Sample code (French and Arabic used): http://pastebin.mozilla.org/5314754 
Output: http://pastebin.mozilla.org/5314757

I don't know how you feel about it, but it seems that with the different features of ICU's string search (next word, backward search, case sensitive and insensitive) we can do something good here.
Flags: needinfo?(smontagu)
Flags: needinfo?(jfkthame)
The real question here isn't whether the collation-based approach (whether using the basic collator interface, or the string-search module) can provide the kind of "loose matching" that would be useful. That's already pretty well understood. The interesting (perhaps tricky?) part here will be figuring out how to efficiently use those tools within Gecko, where the text to be searched isn't simply a C or C++ string, it's data within the DOM, probably spread across an extensive hierarchy of different elements.

Perhaps we could implement mozilla::DOMIterator as a concrete subclass of icu::SearchIterator, and then use that as the tool to implement Find-in-Page behavior. But I haven't looked into exactly what that would involve.
Flags: needinfo?(jfkthame)
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

I have no idea how this would/could work in C++, as I barely have any knowledge of it, but FWIW I was (and still am, until this bug is resolved at least) going to implement this search feature in FindBar Tweak by making use of the nodes' .textContent property, stripping out the text of head, style and script tags through a custom method I wrote: https://github.com/Quicksaver/Bootstraped-Addon-Backbone/blob/master/resource/modules/utils/HTMLElements.jsm

I haven't done any extensive testing, but from what I could tell when I first wrote it, it's a pretty fast routine, perfectly usable on any normal-to-large-sized webpage. Also, I'm guessing this question comes up because that API needs the full text as one string or something, and the current fastFind (which, if I read the code correctly, goes through the page's nodes on the spot) wouldn't work with something like this?
(In reply to Jonathan Kew (:jfkthame) from comment #56)
> The interesting (perhaps tricky?) part here will be
> figuring out how to efficiently use those tools within Gecko, where the text
> to be searched isn't simply a C or C++ string, it's data within the DOM,
> probably spread across an extensive hierarchy of different elements.

Isn't that a solved problem? I was thinking that we could plug the ICU search tools into the current implementations of nsIWebBrowserFind and nsIFind and not have to reinvent the wheel.
Flags: needinfo?(smontagu)
(In reply to Simon Montagu :smontagu from comment #58)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

We can do that by `normalize`-ing the character data extracted from text nodes. nsFind is basically a tree walker that extracts matches by pulling one character at a time off the text content and advancing to the next character in the needle upon a match.

That's why the ICU search methods/algorithms won't work for nsFind as-is; it's not a simple substring-matching search. However, since we only _really_ need to normalize individual characters in the most common case, I'd say using normalization is a good first step.

If we'd like locale- and grammar-aware normalization, we'd need to keep track of character positions before and after. Or can we ensure that no character positions change when we feed it the entire node contents?
(In reply to Simon Montagu :smontagu from comment #58)
> (In reply to Jonathan Kew (:jfkthame) from comment #56)
> Isn't that a solved problem? I was thinking that we could plug the ICU
> search tools into the current implementations of nsIWebBrowserFind and
> nsIFind and not have to reinvent the wheel.

Well, we certainly have existing code that handles this - what I'm unsure about is how easily we'll be able to fit the ICU tools into that. Maybe it'll be straightforward, but I suspect not entirely. Ultimately, nsFind iterates through the document elements by characters, but we can't plug in a collator-based match at that level; it needs to see strings (which may span element boundaries), not individual characters.
Yes, clearly at the lowest level of nsFind we would have to pull whole strings out of nodes to pass to ICU, probably with some degree of look-ahead, rather than iterate by character as it does now.
(In reply to Mike de Boer [:mikedeboer] from comment #59)
> That's why the ICU search methods/ algo won't work for nsFind; it's not a
> simple substring-matching-search. However, since we only _really_ need to
> normalize individual characters in the most common case, I'd say using
> normalization is a good first step.

Normalization is a good step, but note that it can't be done on a codepoint-by-codepoint basis, which is how nsFind currently walks; it may involve composing, decomposing and reordering sequences of codepoints.

Updated

3 years ago
Duplicate of this bug: 1028739

Comment 64

3 years ago
Hi, I am interested in working on this bug.  I am working on a solution that has two parts:

1) For Hebrew, it ignores combining code points in the web page.

2) For Latin alphabets, I have some small tables, generated by Perl, that map characters to base letters.
I have included them in nsFind.cpp.  The Perl script uses normalization algorithms to generate the mappings to the base letter.  Additionally, I map some pairs of characters to handle ligatures.
It seems to perform pretty well and handles quite a few languages. It matches the Chrome behavior pretty closely.

Does nsFind have an owner?  This is the first time I have tried to contribute to Firefox.

Comment 65

3 years ago
Created attachment 8536092 [details]
Here is a Perl script that can generate tables for mapping an accented character to base characters.  I ran it with Perl v5.14.2.

Comment 66

3 years ago
Created attachment 8536095 [details]
Here is a modified version of nsFind.cpp that contains a possible fix.

This nsFind.cpp should ignore Hebrew and Arabic vowel-sign combiners and Latin-alphabet accents, and should allow you to search Vietnamese without typing accents.  It supports some ligatures.

Comment 67

3 years ago
Additionally, it seems desirable to ignore kashida.  Chrome has that behavior.

Updated

2 years ago
Flags: needinfo?(jst)

Updated

2 years ago
Flags: needinfo?(mdeboer)
Hi Jonathan! Thanks for your interest in working on this bug! Also, apologies for the late reaction; I was on holiday for the past two weeks.

It would help _a lot_ if you'd take a bit of time to read the following - quite short - guide:
https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch

As for the approach you're describing in comment 64; I don't know if that approach is the correct one, I simply don't have enough knowledge to give you a reasonable answer. For that, I need to defer your question to jfkthame. I hope he doesn't mind ;-)
Flags: needinfo?(mdeboer)
Flags: needinfo?(jst)
Flags: needinfo?(jfkthame)
I'm not really comfortable with the approach here of adding ad hoc character-mapping tables to nsFind.cpp. We should instead be leveraging the Unicode character property tables already present in intl/unicharutil and/or ICU.

Regarding the overall description in comment 64: it sounds like this is focused on just a couple of scripts. Really, an "ignore diacritics" option should be applied uniformly across all scripts.

The "ignore combining code points" part of this could probably be implemented based on looking up the Unicode general category of the character; and support for stripping accents off precomposed letters could be implemented by applying NFD normalization and then stripping any combining marks from the normalized form.

Given that these basic functions are already present in the codebase (best to use the ICU implementations, as nsIUnicodeNormalizer implements an old Unicode version), we shouldn't be adding lookup tables here. If the performance of looking up decompositions, etc., on the fly and then stripping diacritics turns out to be a problem (I'm not sure if it will), then nsFind could perhaps cache the results for individual characters as they're encountered.
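
As a sketch of that lookup-table-free approach in isolation (hypothetical standalone code, not an nsFind patch; note that the following comments show blanket mark-stripping is too aggressive for some scripts):

    #include <unicode/normalizer2.h>
    #include <unicode/uchar.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>

    // Strip diacritics: NFD-decompose, then drop anything whose Unicode
    // general category is a combining mark (Mn, Mc or Me).
    static icu::UnicodeString StripDiacritics(const icu::UnicodeString& aInput) {
      UErrorCode status = U_ZERO_ERROR;
      const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
      if (U_FAILURE(status)) {
        return aInput;
      }
      icu::UnicodeString decomposed = nfd->normalize(aInput, status);
      icu::UnicodeString result;
      for (int32_t i = 0; i < decomposed.length();) {
        UChar32 c = decomposed.char32At(i);
        int8_t cat = u_charType(c);
        if (cat != U_NON_SPACING_MARK && cat != U_COMBINING_SPACING_MARK &&
            cat != U_ENCLOSING_MARK) {
          result.append(c);
        }
        i += U16_LENGTH(c);  // advance by one code point, not one code unit
      }
      return result;
    }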
Flags: needinfo?(jfkthame)

Comment 70

2 years ago
If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts, that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й, which NFD decomposes into U+0418 + U+0306.  I suspect this would not be a desirable equivalence when searching Russian and Ukrainian text.  I tried it in Chrome, and Chrome did not equate U+0418 И and U+0419 Й, though it ignores accent marks for Latin text.

For that reason, along with my general lack of expertise outside of a few languages, it seems to me to be difficult to apply an NFD based fix without special casing across all languages.
(In reply to JonathanW from comment #70)
> If we extend the Latin, Arabic, Hebrew NFD analysis to include all scripts,
> that would create an equivalence in Cyrillic between U+0418 И and U+0419 Й,
> which NFD decomposes into U+0418 + U+0306.  I suspect this would not be a
> desirable equivalence when searching Russian and Ukrainian text.  I tried it
> in Chrome, and Chrome did not equate U+0418 И and U+0419 Й, though it
> ignores accent marks for Latin text.

I doubt equating 'И' and 'Й' is any less desirable for searching Russian and Ukrainian than it is to equate a/å, a/ä, o/ö when searching in Swedish; they're considered entirely separate letters of the alphabet.

The fact that Chrome ignores diacritics in Latin but not in Cyrillic is just a limitation of their implementation, I suspect. Let's do better -- and be more consistent!

Comment 72

2 years ago
Created attachment 8672412 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators

bug-2022251-icu-collator-fix.patch attached implements asymmetric matching.

It uses ICU properties to find "ignorable" combiners in the target text.

Try it on some of the suggested behavior in this bug, and see what you think.

Updated

2 years ago
Flags: needinfo?(jfkthame)
This sounds promising, thanks! I haven't had time to test or go through it carefully yet but will try to do so soon; leaving the needinfo? flag for now to keep it on my radar.

One note: from a brief look, I think the behavior will be a bit inconsistent in that accents in the *search* string will be ignored if they're part of precomposed letters (so searching for "naïve" will find "naive"), but NOT if they're encoded using combining marks; whereas in the *target* string the combining mark versions would also be ignored. That seems unfortunate... either we should consistently ignore all accents, or we should respect accents in the search string regardless whether they're precomposed or combining-char.

Comment 74

2 years ago
Are there some unit tests that cover the find-in-page functionality?
Flags: needinfo?(mdeboer)

Comment 75

2 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #73)
> This sounds promising, thanks! I haven't had time to test or go through it
> carefully yet but will try to do so soon; leaving the needinfo? flag for now
> to keep it on my radar.
> 
> One note: from a brief look, I think the behavior will be a bit inconsistent
> in that accents in the *search* string will be ignored if they're part of
> precomposed letters (so searching for "naïve" will find "naive"), but NOT if
> they're encoded using combining marks; whereas in the *target* string the
> combining mark versions would also be ignored. That seems unfortunate...
> either we should consistently ignore all accents, or we should respect
> accents in the search string regardless whether they're precomposed or
> combining-char.

I tried the "naive" variations, and they all seem to find each other as expected.

Comment 76

2 years ago
JonathanW: not unit tests, but there's an old page with a collection of manual regression tests for find-in-page covering some of the regressions that used to slip in when the find code was updated:
http://www-archive.mozilla.org/quality/browser/front-end/testcases/xpapps-gui/find-on-page/find-in-page.html
It was written before Mozilla had any automated testing framework (in fact, it probably even predates find-as-you-type); perhaps it would be possible to automate it now. It does not include any diacritics, and should.
There are some automated tests at https://dxr.mozilla.org/mozilla-central/source/embedding/test/test_nsFind.html
Jonathan, apart from the test that Matt mentioned (which is probably the best one for you to start with and copy), the ones covering the findbar are:

https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/chrome/test_findbar.xul
https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/chrome/test_findbar_events.xul
https://dxr.mozilla.org/mozilla-central/source/toolkit/content/tests/browser/browser_findbar.js
Flags: needinfo?(mdeboer)
Comment on attachment 8672412 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators

Review of attachment 8672412 [details] [diff] [review]:
-----------------------------------------------------------------

This shows a lot of promise -- I tried it in a local build, and it makes a lot of common cases behave as expected, with Find happily ignoring accents.

However, there are still cases that don't work as well, e.g. once multiple accents get involved. With the testcase

  data:text/html;charset=utf-8,x&%23x1d8;x x&%23xfc;&%23x301;x xu&%23x308;&%23x301;x

(which displays "xǘx xǘx xǘx"), searching for "xux" will indeed find all three instances; but searching for xǘx may or may not work depending on whether the search string is entered in precomposed or decomposed form.

I think -- without having attempted to code it -- that the right way forward here is indeed to use a collator, but rather than walking through the search and target strings one character at a time, we should walk one *cluster* at a time, where a cluster includes any trailing combining marks. We have an existing ClusterIterator class somewhere, but because nsFind is iterating over the DOM content rather than over simple text strings, I don't think we'll be able to use that; it'll need to be reimplemented on top of the existing nsFindContentIterator, I guess. Or maybe nsFindContentIterator itself should be modified to work in terms of clusters.

Then we'll be calling coll->compare() on a pair of clusters rather than individual codepoints, and that will take care of the accents as well as all canonical-equivalence issues within the cluster.

This will also allow us (in strict-matching mode, i.e. when the user doesn't want to ignore accents -- which I think we need to offer, though what UI if any to expose is still a question) to avoid the problem of finding "u" when the document contains "u&#x308;&#x301;".

The other thing that would be helpful would be if you can take a look at the mozilla coding style guidelines[1] and try to ensure that your patch follows these -- in particular, things like whitespace (no tabs!), bracing of control structures, etc. This makes it easier to review patches, and issues like this will need to be tidied up before a patch here is landed in the tree.

And thanks for working on this -- I believe it'll make a lot of people happier!

::: embedding/components/find/nsFind.cpp
@@ +862,5 @@
> +// ignore them in the target text while matching the pattern
> +// for instance, and A in the pattern can match A followed by
> +// any number of combiners in the text
> +// additionally some other ignorable characters can be coded here
> +// for instance kashida

Please use capitals and periods to punctuate long comments as proper sentences; this makes them much more readable.

@@ +867,5 @@
> +static bool IsCombiner(const char16_t c) {
> +
> +	// ignore Kashida in Arabic
> +	if (0x0640 == c)
> +		return true;

There are a whole bunch of Unicode characters that would probably make sense to ignore. We should try checking for the Unicode property "Default_Ignorable" (I expect ICU exposes an API for this) as a first step, before we consider ad hoc additions like this.

(Maybe better to name this IsIgnorableForFind(), to be more accurate, as it will check for various things besides actual combiners.)
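
For reference, ICU does expose that property; a minimal sketch of the check (not part of the patch under review):

    #include <unicode/uchar.h>

    // True for characters with the Unicode Default_Ignorable_Code_Point
    // property (soft hyphen, ZWJ/ZWNJ, variation selectors, ...). Note that
    // kashida (U+0640 ARABIC TATWEEL) is not Default_Ignorable, so it would
    // still need an ad hoc addition.
    static bool IsDefaultIgnorable(UChar32 aCh) {
      return u_hasBinaryProperty(aCh, UCHAR_DEFAULT_IGNORABLE_CODE_POINT);
    }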

@@ +884,5 @@
> +  }
> +}
> +
> +/* make an ICU collator */
> +using namespace icu;

As there aren't a huge number of ICU types/functions being used, I think I'd prefer to explicitly prefix them with icu:: rather than having the 'using namespace' here.

@@ +886,5 @@
> +
> +/* make an ICU collator */
> +using namespace icu;
> +
> +static RuleBasedCollator* static_collator=NULL;

Please move this inside the getCollator function.
(In reply to Jonathan Kew (:jfkthame) from comment #79)
> And thanks for working on this -- I believe it'll make a lot of people
> happier!

I couldn't agree more! I'm stoked and looking forward to the next iteration of the patch!
(In reply to Jonathan Kew (:jfkthame) from comment #79)

Oops, sorry .... I intended to drop in a link to the Coding Style page:

> The other thing that would be helpful would be if you can take a look at the
> mozilla coding style guidelines[1] and try to ensure that your patch follows
> these -- in particular, things like whitespace (no tabs!), bracing of
> control structures, etc. This makes it easier to review patches, and issues
> like this will need to be tidied up before a patch here is landed in the
> tree.

[1] See https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Coding_Style
Flags: needinfo?(jfkthame)

Comment 82

2 years ago
Created attachment 8680276 [details] [diff] [review]
bug-2022251-icu-collator-fix.patch has a possible fix based on icu collators, includes test, handles multiple combiners in pattern

Here is a new version of the patch; it includes a mochitest and handles the case of multiple combiners in the pattern.
Attachment #8536092 - Attachment is obsolete: true
Attachment #8536095 - Attachment is obsolete: true
Attachment #8672412 - Attachment is obsolete: true

Updated

2 years ago
Flags: needinfo?(jfkthame)

Comment 83

2 years ago
Does this new version of the patch fail any use cases?
Sorry, I haven't had time to build and test this yet; I'll try to get to it soon. Leaving the needinfo? flag so that it'll remind me....

One question: will it find things like "ae" in the target text if the search string contains "æ", or is that only supported the other way round (find "æ" when searching for "ae")?

Comment 85

2 years ago
If the search string contains "æ" it will not find "ae" in the target.

Comment 86

2 years ago
Created attachment 8683344 [details] [diff] [review]
New patch with formatting and code review fixes and rebased to latest.
Attachment #8680276 - Attachment is obsolete: true
Comment on attachment 8683344 [details] [diff] [review]
New patch with formatting and code review fixes and rebased to latest.

Review of attachment 8683344 [details] [diff] [review]:
-----------------------------------------------------------------

::: embedding/components/find/nsFind.cpp
@@ +41,5 @@
> +#include "unicode/uloc.h"
> +#include "unicode/ucol.h"
> +#include "unicode/locid.h"
> +#include "unicode/tblcoll.h"
> +#include "unicode/sortkey.h"

I don't think it's necessary to #include all these headers. AFAICT, just "unicode/tblcoll.h" should be sufficient; this is where RuleBasedCollator is declared, and it'll pull in whatever else is needed for that declaration to compile.

@@ +862,5 @@
> +// Ignore them in the target text while matching the pattern.
> +// For instance, an A in the pattern can match A followed by
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.

Thinking about this some more, I'm having doubts as to whether it's a valid approach here.... the trouble is that not everything that has a "combining mark" character type is equally valid to ignore. In particular, this will result in ignoring all vowel marks in all the Indic scripts, so that Find will be unable to distinguish between (for example) बेटा "son" and बेटी "daughter" in Hindi, or between किताब "book" and कातिब "writer", and many more such cases.

To get this right, I think we really need to let the collator compare grapheme clusters (not just individual characters) from the pattern and the target text; and the ignoring of (some) diacritics then occurs because the collator knows what level of difference they represent.

(See also comment 79, where I tried to suggest this approach; I'm increasingly convinced that's what we need to do in order to get this right.)

@@ +864,5 @@
> +// any number of combiners in the text. Additionally some
> +// other ignorable characters can be coded here, including
> +// kashida.
> +
> +static bool IsIgnorableForFind(const char16_t c) {

Note that this won't be able to detect ignorable chars outside the BMP. There are a bunch of combining marks in plane 1, which will require handling surrogate pairs (although not all of them should actually be ignored); and then there are the variation selectors in plane 14.

(However, see above; I don't think this approach in general is sufficiently flexible to handle the variety of things Unicode throws at us.)

@@ +890,5 @@
> +// and does not depend on the language of the html page
> +//
> +
> +static icu::RuleBasedCollator *
> +getCollator()

Capitalize function name.

@@ +892,5 @@
> +
> +static icu::RuleBasedCollator *
> +getCollator()
> +{
> +  static icu::RuleBasedCollator* static_collator = NULL;

We use |nullptr| rather than |NULL|.

@@ +902,5 @@
> +      static_cast<icu::RuleBasedCollator*>
> +        (icu::Collator::createInstance(icu::Locale::getRoot(), status));
> +
> +    if (U_FAILURE(status))
> +      return NULL;

Braces even for a single-line |if| body. (And again later on.)

Comment 88

2 years ago
I think ICU is currently compiled without the break iterator.  

ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

Is it okay to add that break iterator for this feature?
(In reply to JonathanW from comment #88)
> I think ICU is currently compiled without the break iterator.  
> 
> ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> Is it okay to add that break iterator for this feature?

I expect so. (How much does it add to the compiled size?)

An alternative would be to use IsClusterExtender (from nsUnicodeProperties.h), perhaps encapsulated in a simple iterator object, to manage iteration-by-cluster. If the ICU break-iterator pulls in a lot of additional bulk (maybe because it supports iterating by various break types?), that might be preferable.

I'd say try whichever is easiest to code, for a start; if that's ICU break-iterator, but we find that it does add a lot of bulk, we could look into replacing it with a smaller, single-purpose version once the overall functionality is proven to work.
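
For illustration, the per-cluster loop with ICU's character break iterator would look roughly like this (a sketch only; an IsClusterExtender-based iterator would have the same shape):

    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>
    #include <memory>

    // Visit each grapheme cluster (base character plus any combining marks)
    // of `text`, using character (cluster) boundaries from ICU.
    template <typename Callback>
    static void ForEachCluster(const icu::UnicodeString& text, Callback&& cb) {
      UErrorCode status = U_ZERO_ERROR;
      std::unique_ptr<icu::BreakIterator> bi(
          icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(),
                                                      status));
      if (U_FAILURE(status)) {
        return;
      }
      bi->setText(text);
      int32_t start = bi->first();
      for (int32_t end = bi->next(); end != icu::BreakIterator::DONE;
           start = end, end = bi->next()) {
        cb(text.tempSubStringBetween(start, end));  // one cluster at a time
      }
    }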
Flags: needinfo?(jfkthame)

Comment 91

2 years ago
Created attachment 8704321 [details]
Possible fix using the ICU character break iterator to do cluster-by-cluster comparisons

Here is some code that handles clusters.
The test code includes the Hindi Son and Daughter case.

In addition to modifications included in the patch, it is also necessary to change the configure script to include ICU Break iterators by commenting out this line:

30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"

I am not sure why Mercurial did not include that file in the patch.

This patch adds a small set of brkitr data files to the ICU data.  I am not sure that I have determined the minimum set of data files correctly.  I excluded the large dictionary files.

While this solution shows some promise, it is not perfect, and not totally symmetrical.

Another direction to consider would be to use the ICU string search service.  It seems like that would be possible to code, too.  However, Firefox currently supports two find features that I do not see an easy way to implement with the search service.  First, the "whitespace" feature, which allows one or more whitespace characters in the pattern to match a greater number of whitespace characters in the page.  Second, the feature that ignores LF between CJ characters.  I would be happy to try to code an alternate solution with the search service if we are willing to consider removing those two features, or if someone can suggest a way to implement them using collation rules or other methods.
Attachment #8683344 - Attachment is obsolete: true

Updated

2 years ago
Flags: needinfo?(jfkthame)
Thanks for continuing to work on this!

(In reply to JonathanW from comment #91)
> In addition to modifications included in the patch, it is also necessary to
> change the configure script to include ICU Break iterators by commenting out
> this line:
> 
> 30965:        #ICU_CPPFLAGS="$ICU_CPPFLAGS -DUCONFIG_NO_BREAK_ITERATION"
> 
> I am not sure why Mercurial did not include that file in the patch.

This bit I can answer. :) It's because 'configure' is a generated file (it's created from configure.in and auxiliary stuff). So rather than editing it here, you should modify build/autoconf/icu.m4; you'll find the code to "remove chunks of the library that we don't need (yet)" at about line 140 of that file.

(I haven't looked at the actual patch yet; leaving the needinfo? flag to remind myself.)
Jonathan, in comment 92 you mentioned you wanted to take a look at the patch? It'd be nice if we could move this one forward bit by bit!
Duplicate of this bug: 922010
Duplicate of this bug: 1147464
Duplicate of this bug: 1226963
Duplicate of this bug: 374795
Whiteboard: [parity-chrome]

Comment 98

a year ago
Created attachment 8756035 [details] [diff] [review]
Update the possible fix to accommodate changes to the ICU build system integration

This is the same nsFind.cpp code as in the previous patch. This version includes changes to use the new ICU build system.

The sizes added in a Windows build using this patch are 43520 bytes to xul.dll and 535424 bytes to icudt56l.dat.

It appears to work on Ubuntu and Windows.
Attachment #8704321 - Attachment is obsolete: true

Comment 99

a year ago
Hi Jonathan Kew, could you please try out this patch and let me know your thoughts?  Thanks, Jonathan W
Jonathan, friendly n-i ping ;-)
I've been experimenting with this a bit... it seems to work nicely for "loose" matching (i.e. ignoring diacritics) in Latin script.

I'm less comfortable with the behavior that results for Indic, however; we may need to re-think this and find a slightly different approach. The user experience that results from the cluster-by-cluster comparison (which was what I suggested we should do) seems less satisfactory than I'd like....

Example: load a page of Hindi text, such as http://www.unicode.org/standard/translations/hindi.html, and then bring up the Find bar. Suppose we want to find "यूनिकोड" (Unicode). So we start typing the word, one character at a time, into the Find field; but see what happens as it grows:

    य    -> highlights a "य" at the end of a word in the middle of the first paragraph
    यू    -> highlights "यू" in the middle of the word "कम्प्यूटर", second paragraph
    यून   -> Phrase not found
    यूनि   -> highlights the beginning of "यूनिकोड" in the first sub-heading
    यूनिक  -> Phrase not found
    यूनिको  -> highlights more of "यूनिकोड" in the first sub-heading
    यूनिकोड -> highlights more of "यूनिकोड" in the first sub-heading

It's quite disconcerting that at certain points while typing the word, we get "Phrase not found", yet adding another character causes it to be found again; but at no point does it actually highlight text in the initial "यूनिकोड" in the title, or the occurrence within the first paragraph, although as a user I would have expected it to search from the beginning of the page. (This is because the first character of the search string, "य", fails to match the initial character of "यूनिकोड", but skips forward to find an occurrence of "य" with no following vowel.)

Still thinking over this; it's not clear to me quite what the best approach is going to be, but I'm toying with some ideas... Maybe something as simple as allowing a partial-cluster match at the end of the search string would help, so that "य" can successfully find the "य" at the beginning of "यूनिकोड".
Flags: needinfo?(jfkthame)

Comment 102

10 months ago
Hi Jonathan Kew,  Thanks for the feedback.  I will look into that undesired behavior.

Updated

10 months ago
Blocks: 1297691

Comment 103

10 months ago
Hi Jonathan Kew,

I did some experiments with ICU collators and the Hindi characters in the example यूनिकोड. In addition to using a collator based on the root locale with strength primary, I also tried Locale("hi", "IN").  In both cases, YA and YA + vowel sign UU compare as different.  See the output below.

ICU String Search does find YA as an initial match in YA + UU.

I think we could tweak the proposed fix to look for binary matches in the case where the collator returns not equal.  I plan to code that and resubmit.

Another option would be to refactor the code to use icu StringSearch service as in comment 91.


Using Root locale collator

Cluster #य# len=1 YA
Coll key:[-128, 0, 9, 48, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[-128, 0, 9, 48, -128, 0, 9, 67, 0, 0, 0, 0]
compare य यू => -1


Using Hindi locale collator
Cluster #य# len=1  YA
Coll key:[0, -92, 0, 0, 0, 0]
Cluster #यू# len=2 YA + vowel UU
Coll key:[0, -92, 0, -76, 0, 0, 0, 0]
compare य यू => -1

Thanks, Jonathan Walden
(In reply to JonathanW from comment #103)
> Hi Jonathan Kew,
> 
> I did some experiments with ICU collators and the Hindi characters in the
> example यूनिकोड. In addition to using a collator based on the root locale
> with strength primary, I also tried Locale("hi", "IN").  In both cases,
> YA and YA + vowel sign UU compare as different.  See the output below.

Right, considered as clusters they definitely differ. The trouble is that while per-cluster matching is the appropriate way to compare whole strings, it is also reasonable for a user to search for a single letter such as YA and expect to find it as part of the YA+UU cluster, not _only_ when it occurs _without_ a following vowel sign.

I think the only situation where this becomes an issue is at the end of the search string. _Within_ the search string, we have clear boundaries on both sides of the current cluster, but at the end, we don't really know if the trailing edge of the search text must be a cluster boundary or may fall in the middle of a cluster.

> 
> ICU String Search does find YA as an initial match in YA + UU.

OK, good to know it seems to agree this is an expected result.

> 
> I think we could tweak the proposed fix to look for binary matches in the
> case where the collator returns not equal.  I plan to code that and resubmit.

Rather than looking for binary matches, I'd suggest trying something along the lines of

    if (clusters don't match) and
       (we are at the last cluster of the search-pattern) {
        while (search cluster length > one codepoint) {
            trim last codepoint from the search cluster (don't forget to check for surrogate pairs!)
            repeat comparison with the text cluster, if it matches we have a hit
        }
    }
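
A rough C++ rendering of that sketch (hypothetical code; assumes `coll` is the shared primary-strength collator and both arguments are single grapheme clusters):

    #include <unicode/coll.h>
    #include <unicode/unistr.h>
    #include <unicode/utf16.h>

    static bool MatchesTrimmedLastCluster(const icu::Collator& coll,
                                          icu::UnicodeString patternCluster,
                                          const icu::UnicodeString& textCluster) {
      UErrorCode status = U_ZERO_ERROR;
      int32_t end = patternCluster.length();
      while (end > 0) {
        // Trim one code point, not one code unit; U16_BACK_1 keeps
        // surrogate pairs intact.
        U16_BACK_1(patternCluster.getBuffer(), 0, end);
        if (end == 0) {
          break;  // nothing left to compare
        }
        if (coll.compare(patternCluster.tempSubString(0, end), textCluster,
                         status) == UCOL_EQUAL) {
          return true;  // a shorter prefix of the final cluster matches
        }
      }
      return false;
    }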


> Another option would be to refactor the code to use icu StringSearch service
> as in comment 91.

Yes, that might also be an interesting possibility, though I suspect it may be harder to interface this to the DOM text we're trying to search.

Comment 105

9 months ago
Created attachment 8795052 [details] [diff] [review]
update possible fix to handle Hindi

Here is another version of the test patch.  This patch addresses the concern with Hindi reported by J. Kew in comment 101.  It also supports the find-in-page features of whole-word matching and strict quote matching.

There are some internationalization issues with the whole-word feature which I have not tried to fix in this patch.

Please try it out, and let me know your thoughts.
Attachment #8756035 - Attachment is obsolete: true

Updated

9 months ago
Flags: needinfo?(jfkthame)