Open Bug 842280 Opened 13 years ago Updated 3 years ago

Consider accents and umlauts for approximate search/matching

Categories

(Firefox :: Address Bar, defect, P3)

defect
Points:
8

People

(Reporter: bruant.d, Unassigned)

References

Details

(Whiteboard: [qx][fxsearch])

Attachments

(1 file)

Basically, I think the rule would be something like: if the query contains a letter that can carry an accent, also look for all accented variants of that letter. So typing "e" would find results with é, è, ê, ë, etc., and likewise for pretty much any letter. Typing "fete" would find bookmarks, history entries, etc. containing "fête".
If results are found for different accents, put the exact matches first and the accent-approximate matches afterwards.
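To make the proposed rule concrete, here is a minimal Python sketch, not Firefox code. It assumes accent folding can be approximated by Unicode canonical decomposition followed by dropping combining marks; the names `strip_accents` and `rank_matches` are made up for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop combining marks, so 'ê' becomes 'e'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def rank_matches(query: str, entries: list[str]) -> list[str]:
    """Return matching entries: exact matches first, accent-folded matches after."""
    q_exact = query.casefold()
    q_folded = strip_accents(q_exact)
    exact, approximate = [], []
    for entry in entries:
        candidate = entry.casefold()
        if q_exact in candidate:
            exact.append(entry)
        elif q_folded in strip_accents(candidate):
            approximate.append(entry)
    return exact + approximate


results = rank_matches("fete", ["La fête de la musique", "fete foraine", "unrelated"])
```

With this input, "fete foraine" is ranked before "La fête de la musique", and "unrelated" is dropped.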
Transliterating diacritics to their base letters is a naive yet effective form of fuzzy search. To get truly accurate search results, you need to know the language of the query in advance, which is probably not possible in this case. I don't know of any fuzzy search algorithm that works without a specific language context, but the transliteration described here should be a part of it. Some cases might be tricky, though. U umlaut, `Ü`, can be transliterated to `U` in French, but to `UE` in German. All these possible transliterations must be taken into account.
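One way to account for several transliterations without knowing the language is to generate every plausible folded form and match on any of them. The sketch below is only an illustration of that idea: the table `GERMAN_STYLE` is a hypothetical, deliberately tiny mapping, and `fold_variants` / `variants_match` are invented names.

```python
import unicodedata

# Hypothetical table of language-specific expansions (German style only).
GERMAN_STYLE = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_marks(text: str) -> str:
    """Drop combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_variants(text: str) -> set[str]:
    """Return the generic fold (ü -> u) and the German-style fold (ü -> ue)."""
    # Compose first so the table lookup sees precomposed characters.
    lowered = unicodedata.normalize("NFC", text).casefold()
    generic = strip_marks(lowered)
    german = strip_marks("".join(GERMAN_STYLE.get(ch, ch) for ch in lowered))
    return {generic, german}


def variants_match(query: str, entry: str) -> bool:
    """True if any folded form of the query occurs in any folded form of the entry."""
    return any(q in e for q in fold_variants(query) for e in fold_variants(entry))
```

Under these assumptions, both "Muller" and "Mueller" match an entry containing "Müller", at the cost of folding every string twice.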
Attached image awesomerbar.png
It's more than an issue of accents, though. Even common curly apostrophes and quotes break searches. See attached.
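The same folding idea extends to punctuation. As a rough sketch (the table below is an assumption covering only a few common typographic quotes, and `fold_punctuation` is a made-up name), curly apostrophes and quotes could be mapped to their ASCII counterparts before comparing:

```python
# Hypothetical mapping of a few typographic quotes to ASCII equivalents.
PUNCTUATION_FOLDS = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / curly apostrophe
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})


def fold_punctuation(text: str) -> str:
    """Replace curly quotes and apostrophes with their straight forms."""
    return text.translate(PUNCTUATION_FOLDS)
```

Applying `fold_punctuation` to both the query and the candidate lets a query typed with a straight apostrophe find a title that uses a curly one.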
Whiteboard: [qx]
Adding Marco at Shorlander’s suggestion. Mapping diacritics to ASCII characters seems like an easy win for many locales, especially Germany.
Flags: needinfo?(mak77)
Currently, to search we just use the plain case-insensitive string comparator CaseInsensitiveUTF8CharsEqual from intl/unicharutil/util/nsUnicharUtils.cpp. We could write a wrapper that maps those characters to ASCII on both sides (in the search string and in the searched string). It would add some string-walking cost, but that should be acceptable, at least until we move to an FTS index in the future, at which point we'd have to study something different.

I don't know whether this should be part of our intl library (so another util in intl) or whether we should look into using ICU now that we bundle it; I think it can do Latin -> ASCII transforms too. But might it be better to retain control over the mapping (something similar to comment 3)?

The result wouldn't be perfect since, as comment 2 points out, we know neither the language of the search string nor the language of the searched string. So the best we could do, off-hand, is an oracle-like conversion. This suggestion goes in the direction we have always followed in the awesomebar, which is to try to give more useful results rather than pedantically trusting the search string. I'm adding this to the backlog; prioritization still needs to be discussed.
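The wrapper idea can be sketched as a per-character comparator that folds both sides through a hand-maintained table before comparing. This is a Python illustration only, not the actual nsUnicharUtils code; `ASCII_FOLD`, `fold_char`, `chars_equal_folded` and `contains_folded` are hypothetical names, and the table covers just a handful of characters.

```python
# Hypothetical hand-maintained table; a real one would cover far more characters.
ASCII_FOLD = {
    "à": "a", "â": "a", "ä": "a", "ç": "c",
    "é": "e", "è": "e", "ê": "e", "ë": "e",
    "î": "i", "ï": "i", "ô": "o", "ö": "o",
    "ù": "u", "û": "u", "ü": "u",
}


def fold_char(ch: str) -> str:
    """Lowercase a character, then map it to ASCII if the table knows it."""
    lowered = ch.lower()
    return ASCII_FOLD.get(lowered, lowered)


def chars_equal_folded(a: str, b: str) -> bool:
    """Case- and accent-insensitive comparison of two single characters."""
    return fold_char(a) == fold_char(b)


def contains_folded(haystack: str, needle: str) -> bool:
    """Naive substring search that folds both strings as it walks them."""
    n = len(needle)
    return any(
        all(chars_equal_folded(haystack[i + j], needle[j]) for j in range(n))
        for i in range(len(haystack) - n + 1)
    )
```

Keeping the table explicit is what retains control over the mapping; the extra cost is one table lookup per character compared on each side.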
Flags: needinfo?(mak77)
Priority: -- → P3
Whiteboard: [qx] → [qx][fxsearch]
Summary: Consider accents for approximate search → Consider accents and umlauts for approximate search

How about a general "accent insensitive" approach based on regex equivalence classes (related to POSIX collating elements)? The underlying libs might already support it... err... just dreaming (libpcre2/10.32):

$ echo -e "Müller\nMuller" | pcre2grep '[[:u:]]'
pcre2grep: Error in command-line regex at offset 3: unknown POSIX class name
<marcoep@sphakka:~/tmp>
$ echo -e "Müller\nMuller" | pcre2grep '[[.u.]]'
pcre2grep: Error in command-line regex at offset 1: POSIX collating elements are not supported

Whatever the implementation details are, I'd love to see this working on anything searchable -- find in page (Ctrl+F), mail search (Ctrl+K), etc.
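Where the regex engine lacks equivalence classes, they can be emulated by expanding each base letter of the query into an explicit character class. The following is a sketch using Python's `re` module rather than PCRE2; the `EQUIVALENTS` table is a small hypothetical sample, `accent_insensitive_pattern` is an invented name, and the text being searched is assumed to use precomposed (NFC) characters.

```python
import re

# Hypothetical sample of equivalence classes, keyed by base letter.
EQUIVALENTS = {
    "a": "aàáâãäå",
    "e": "eèéêë",
    "i": "iìíîï",
    "o": "oòóôõö",
    "u": "uùúûü",
}


def accent_insensitive_pattern(query: str) -> re.Pattern:
    """Build a regex where each known base letter matches all its accented forms."""
    parts = []
    for ch in query.casefold():
        variants = EQUIVALENTS.get(ch)
        parts.append(f"[{variants}]" if variants else re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = accent_insensitive_pattern("Muller")
hits = [name for name in ["Müller", "Muller", "Miller"] if pattern.search(name)]
```

Here `hits` contains "Müller" and "Muller" but not "Miller".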

OS: Linux → All
Hardware: x86 → All
Summary: Consider accents and umlauts for approximate search → Consider accents and umlauts for approximate search/matching
Points: --- → 8
Severity: normal → S3