Open
Bug 842280
Opened 13 years ago
Updated 3 years ago
Consider accents and umlauts for approximate search/matching
Categories
(Firefox :: Address Bar, defect, P3)
Status: NEW
People
(Reporter: bruant.d, Unassigned)
Details
(Whiteboard: [qx][fxsearch])
Attachments
(1 file)
71.25 KB,
image/png
Basically, I think the rule would be something like: if there is a letter that can have an accent, look for all variants with an accent. So, typing "e" would find results with é, è, ê, ë, etc. Likewise for pretty much any letter.
And typing "fete" would find bookmarks/history entries, etc with "fête", etc.
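The rule described above can be sketched in a few lines. This is only an illustration in Python, not the address bar's code: it assumes that "variants with an accent" can be approximated by decomposing each character (Unicode NFD) and dropping the combining marks, and the helper names `fold_accents` and `accent_insensitive_contains` are made up for this example.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose characters, drop combining marks, and case-fold the result."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def accent_insensitive_contains(query: str, candidate: str) -> bool:
    """True if the folded query occurs inside the folded candidate."""
    return fold_accents(query) in fold_accents(candidate)


if __name__ == "__main__":
    print(accent_insensitive_contains("fete", "F\u00eate de la musique"))
```

Folding both the typed string and the stored title means "fete" finds "fête" and, with this particular sketch, "fête" also finds "fete".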
Reporter
Comment 1•13 years ago
If results are found for different accents, put the exact matching first and accent-approximate matching afterwards.
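One possible way to get that ordering, sketched in Python under the assumption that a simple two-bucket ranking is enough (real frecency scoring would interact with this). The names `rank_matches` and `fold_accents` are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose, drop combining marks, and case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()


def _exact_key(text: str) -> str:
    """Case-insensitive key that still keeps accents (NFC so forms compare equal)."""
    return unicodedata.normalize("NFC", text).casefold()


def rank_matches(query: str, candidates: list[str]) -> list[str]:
    """Return exact (accent-preserving) matches first, accent-approximate ones after."""
    exact, approximate = [], []
    query_exact = _exact_key(query)
    query_folded = fold_accents(query)
    for candidate in candidates:
        if query_exact in _exact_key(candidate):
            exact.append(candidate)
        elif query_folded in fold_accents(candidate):
            approximate.append(candidate)
    return exact + approximate
```

Within each bucket the original order of `candidates` is preserved, so any existing ranking survives as a tie-breaker.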
Comment 2•13 years ago
Transliterating diacritics to their regular forms is a naive yet effective form of fuzzy search.
To get truly effective search results, the language of the query has to be known in advance, which is probably not possible in this case.
I don't know of any fuzzy search algorithms working with no specific language context, but the transliteration described here should be a part of it.
Some cases might be tricky though. U umlaut, `Ü`, can be transliterated to `U` in French, but to `UE` in German. All these possible transliterations must be taken into account.
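To illustrate the "all possible transliterations" point, here is a rough Python sketch that produces every variant of a string instead of a single folded form. The `EXTRA_EXPANSIONS` table is a hypothetical, deliberately tiny list of German-style expansions; nothing in this bug specifies which mappings would actually be used.

```python
import itertools
import unicodedata

# Hypothetical language-specific expansions, on top of plain accent stripping.
EXTRA_EXPANSIONS = {
    "\u00e4": ("ae",),  # ä
    "\u00f6": ("oe",),  # ö
    "\u00fc": ("ue",),  # ü
    "\u00df": ("ss",),  # ß
}


def _options(ch: str) -> tuple[str, ...]:
    """All accepted renderings of one lowercase character."""
    base = "".join(
        c for c in unicodedata.normalize("NFD", ch) if not unicodedata.combining(c)
    )
    return (base,) + EXTRA_EXPANSIONS.get(ch, ())


def transliterations(text: str) -> set[str]:
    """Every combination of per-character renderings, lowercased."""
    lowered = unicodedata.normalize("NFC", text.lower())
    choices = [_options(ch) for ch in lowered]
    return {"".join(parts) for parts in itertools.product(*choices)}
```

With this table, `transliterations("Müller")` yields both "muller" and "mueller", so either spelling of the query could match. The number of variants grows exponentially with the number of expandable characters, so a real implementation would probably compare character by character rather than materialize the whole set.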
Comment 3•9 years ago
Comment 4•9 years ago
It's more than an issue of accents, though. Even common curly apostrophes and quotes break searches. See attached.
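The punctuation case can be handled with the same kind of folding step. The sketch below is Python and assumes only the four common curly quote characters need mapping; the table `PUNCTUATION_FOLD` and the function `fold_quotes` are names invented for this example.

```python
# Hypothetical mapping of typographic quotes to their ASCII counterparts.
PUNCTUATION_FOLD = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark / curly apostrophe
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
    }
)


def fold_quotes(text: str) -> str:
    """Replace curly quotes and apostrophes with straight ASCII ones."""
    return text.translate(PUNCTUATION_FOLD)


def quote_insensitive_contains(query: str, candidate: str) -> bool:
    """Compare after folding quotes (and case) on both sides."""
    return fold_quotes(query).casefold() in fold_quotes(candidate).casefold()
```

Applying the fold to both the typed text and the stored title means a straight apostrophe typed on a keyboard can find a title that was saved with a curly one.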
Updated•9 years ago
Whiteboard: [qx]
Comment 5•9 years ago
Adding Marco at Shorlander’s suggestion. Mapping diacritics to ASCII characters seems like an easy win for many locales, especially Germany.
Flags: needinfo?(mak77)
Comment 6•9 years ago
Currently, to search we just use the plain case-insensitive string comparator `CaseInsensitiveUTF8CharsEqual` from intl/unicharutil/util/nsUnicharUtils.cpp.
We could clearly write a wrapper that maps those chars to ASCII on both sides (in the search string and in the searched string). That would add some string-walking cost, but it should be acceptable. If we move to an FTS index in the future, we'd have to study something different.
I don't know whether this should be part of our intl library (so another util in intl) or whether we should look into using ICU now that we bundle it; I think it can do Latin -> ASCII transforms too. But could it be better to retain control over the mapping (something similar to comment 3)?
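To make the wrapper idea concrete, here is a rough Python sketch of a per-character comparator that folds both sides through a hand-maintained table before comparing. It does not reproduce `CaseInsensitiveUTF8CharsEqual` or any ICU call; `ASCII_FOLD`, `fold_char`, `chars_equal_folded` and `find_folded` are hypothetical names, and the table is intentionally tiny and assumes precomposed characters.

```python
# Hypothetical, hand-maintained mapping; a real table would be much larger.
ASCII_FOLD = {
    "\u00e0": "a", "\u00e2": "a", "\u00e4": "a",
    "\u00e7": "c",
    "\u00e8": "e", "\u00e9": "e", "\u00ea": "e", "\u00eb": "e",
    "\u00f1": "n",
    "\u00f6": "o",
    "\u00fc": "u",
}


def fold_char(ch: str) -> str:
    """Lowercase one character, then map it through the table if listed."""
    lowered = ch.lower()
    return ASCII_FOLD.get(lowered, lowered)


def chars_equal_folded(a: str, b: str) -> bool:
    """Wrapper comparator: fold both sides before comparing."""
    return fold_char(a) == fold_char(b)


def find_folded(needle: str, haystack: str) -> int:
    """Index of the first folded match of needle in haystack, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if all(
            chars_equal_folded(needle[i], haystack[start + i]) for i in range(n)
        ):
            return start
    return -1
```

Keeping the table in the tree gives full control over which characters are folded, at the price of maintaining it; a library transform would cover more scripts but leave less say over individual mappings.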
The result wouldn't be perfect since, as comment 2 points out, we know neither the language of the search string nor the language of the searched string. So the best we could do, off-hand, is an oracle-like conversion.
This suggestion goes in the direction we have always followed in the awesomebar, which is to try to give more useful results rather than pedantically trusting the search string.
I'm adding this to the backlog, prioritization needs to be discussed yet.
Flags: needinfo?(mak77)
Priority: -- → P3
Whiteboard: [qx] → [qx][fxsearch]
Updated•9 years ago
Summary: Consider accents for approximate search → Consider accents and umlauts for approximate search
How about a general "accent insensitive" approach based on regex equivalence classes (a.k.a. POSIX collating elements)? The underlying libs might already support it... err... just dreaming (libpcre2/10.32):
$ echo -e "Müller\nMuller" | pcre2grep '[[:u:]]'
pcre2grep: Error in command-line regex at offset 3: unknown POSIX class name
$ echo -e "Müller\nMuller" | pcre2grep '[[.u.]]'
pcre2grep: Error in command-line regex at offset 1: POSIX collating elements are not supported
Whatever the implementation details are, I'd love to see this working on anything searchable -- find in page (Ctrl+F), mail search (Ctrl+K), etc.
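Where the regex library has no equivalence classes (POSIX spells them `[[=u=]]`), the effect can be emulated by expanding each letter of the query into an explicit character class. The Python sketch below does that with the standard `re` module, which does not offer POSIX equivalence classes. It is only an approximation: it covers just the Latin-1 Supplement and Latin Extended-A blocks, it assumes precomposed text, and `accent_insensitive_pattern` is a name made up for this example.

```python
import re
import unicodedata


def _base(ch: str) -> str:
    """Lowercased character with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


# Map each ASCII letter to the accented characters that reduce to it.
_CLASSES: dict[str, set[str]] = {}
for _cp in range(0x00C0, 0x0180):
    _ch = chr(_cp)
    _b = _base(_ch)
    if len(_b) == 1 and _b != _ch and _b.isascii() and _b.isalpha():
        _CLASSES.setdefault(_b, set()).add(_ch)


def accent_insensitive_pattern(query: str) -> re.Pattern:
    """Compile a pattern in which every letter also matches its accented forms."""
    parts = []
    for ch in query:
        base = _base(ch)
        variants = _CLASSES.get(base)
        if variants:
            parts.append("[" + re.escape(base) + "".join(sorted(variants)) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)
```

With this sketch, `accent_insensitive_pattern("Muller")` matches both "Müller" and "Muller", which is the behaviour the `pcre2grep` experiment above was hoping for.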
Updated•6 years ago
OS: Linux → All
Hardware: x86 → All
Summary: Consider accents and umlauts for approximate search → Consider accents and umlauts for approximate search/matching
Updated•5 years ago
Points: --- → 8
Updated•3 years ago
Severity: normal → S3