Open Bug 842280 Opened 13 years ago Updated 3 years ago

Consider accents and umlauts for approximate search/matching

Status:

NEW

People

(Reporter: bruant.d, Unassigned)

References

Details

(Whiteboard: [qx][fxsearch])

Attachments

(1 file)

awesomerbar.png (image/png, 71.25 KB), attached 9 years ago by Ryan Feeley [:rfeeley]

David Bruant

Reporter

Description

•

13 years ago

Basically, I think the rule would be something like: if a typed letter can carry an accent, also match all of its accented variants. So typing "e" would find results with é, è, ê, ë, etc., and likewise for pretty much any letter. Typing "fete" would then find bookmarks, history entries, etc. containing "fête".
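As a rough illustration of that rule (not how Firefox does or would implement it), here is a minimal Python sketch. It assumes that "removing accents" can be approximated by decomposing each character with Unicode NFD and dropping the combining marks. The helper names `strip_accents` and `accent_insensitive_contains` are made up for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_contains(needle: str, haystack: str) -> bool:
    """True if needle occurs in haystack, ignoring accents and case."""
    return strip_accents(needle).casefold() in strip_accents(haystack).casefold()


print(strip_accents("fête"))
print(accent_insensitive_contains("fete", "La fête de la musique"))
```

Note that this sketch is symmetric: typing "fête" would also match "fete", which goes slightly beyond the rule as stated above.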

David Bruant

Reporter

Comment 1

•

13 years ago

If results are found for several accent variants, put the exact matches first and the accent-approximate matches after them.
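A possible way to get that ordering, sketched in Python under the assumption that accent folding is done with NFD decomposition. The `fold` and `rank_matches` names are hypothetical, and a real ranking would also have to combine this with the existing relevance signals.

```python
import unicodedata


def fold(text: str) -> str:
    """Lowercase-ish fold that also drops combining accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def rank_matches(query: str, candidates: list[str]) -> list[str]:
    """Return exact (case-insensitive) matches first, accent-approximate ones after."""
    # Normalize to NFC so that composed and decomposed input compare equal.
    q_exact = unicodedata.normalize("NFC", query).casefold()
    q_fold = fold(query)
    exact, approximate = [], []
    for candidate in candidates:
        if q_exact in unicodedata.normalize("NFC", candidate).casefold():
            exact.append(candidate)
        elif q_fold in fold(candidate):
            approximate.append(candidate)
    return exact + approximate


print(rank_matches("fete", ["fête du travail", "fete foraine", "unrelated"]))
```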

Guillaume Marty [:gmarty]

Comment 2

•

13 years ago

Transliterating diacritics to their regular forms is a naive yet effective form of fuzzy search. To get truly efficient search results, you need to know the language of the query in advance, which is probably not possible in this case. I don't know of any fuzzy search algorithm that works without a specific language context, but the transliteration described here should be part of it. Some cases might be tricky, though. U umlaut, `Ü`, can be transliterated to `U` in French but to `UE` in German. All of these possible transliterations must be taken into account.
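To make the "several possible transliterations" point concrete, here is a small Python sketch that produces every variant of a string when a character has more than one plausible ASCII form. The `DIGRAPH_EXPANSIONS` table is an assumption limited to the German umlauts, and `transliteration_variants` is a hypothetical helper, not an existing API.

```python
import itertools
import unicodedata

# Assumed, deliberately tiny table: German-style expansions only.
DIGRAPH_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue"}


def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliteration_variants(text: str) -> set[str]:
    """All ASCII-ish spellings of text, e.g. 'ü' gives both 'u' and 'ue'."""
    per_char = []
    for ch in unicodedata.normalize("NFC", text.lower()):
        choices = [strip_accents(ch)]  # plain accent removal
        if ch in DIGRAPH_EXPANSIONS:
            choices.append(DIGRAPH_EXPANSIONS[ch])  # language-specific form
        per_char.append(choices)
    return {"".join(combo) for combo in itertools.product(*per_char)}


print(sorted(transliteration_variants("Müller")))
```

The number of variants doubles with each ambiguous character, so a real implementation would probably cap it or index the variants ahead of time.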

Ryan Feeley [:rfeeley]

Comment 3

•

9 years ago

What Oracle uses: https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html

Ryan Feeley [:rfeeley]

Comment 4

•

9 years ago

Attached image awesomerbar.png

It's more than an issue of accents, though. Even common curly apostrophes and quotes break searches. See attached.
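The punctuation case could be handled with an explicit mapping, since Unicode normalization by itself leaves curly quotes unchanged. A minimal Python sketch, where the `PUNCTUATION_MAP` table and the `normalize_quotes` name are assumptions for illustration and the table only covers the four most common curly marks:

```python
# Assumed mapping from typographic quotes to their ASCII counterparts.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark / curly apostrophe
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Replace curly quotes and apostrophes with straight ASCII ones."""
    return text.translate(PUNCTUATION_MAP)


# A query typed with a straight apostrophe now matches a curly-quoted title.
title = "Shorlander\u2019s suggestion"
print("shorlander's" in normalize_quotes(title).lower())
```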

Ryan Feeley [:rfeeley]

Updated

•

9 years ago

Whiteboard: [qx]

Ryan Feeley [:rfeeley]

Comment 5

•

9 years ago

Adding Marco at Shorlander’s suggestion. Mapping diacritics to ASCII characters seems like an easy win for many locales, especially Germany.

Flags: needinfo?(mak77)

Marco Bonardo [:mak]

Comment 6

•

9 years ago

Currently, to search we just use the plain case-insensitive string comparator CaseInsensitiveUTF8CharsEqual from intl/unicharutil/util/nsUnicharUtils.cpp. We could write a wrapper that maps those characters to ASCII on both sides (in the search string and in the searched string). It would add some string-walking cost, but that should be acceptable until we move to an FTS index in the future, at which point we'd have to study something different.

I don't know whether this should be part of our intl library (another util in intl) or whether we should look into using ICU now that we bundle it; I think it can do Latin -> ASCII transforms too. But could it be better to retain control over the mapping (something similar to comment 3)?

The result wouldn't be perfect since, as comment 2 points out, we know neither the language of the search string nor the language of the searched string. So the best we could do, off-hand, is an Oracle-like conversion.

This suggestion goes in the direction we have always followed in the awesomebar, which is to try to give more useful results rather than pedantically trusting the search string. I'm adding this to the backlog; prioritization still needs to be discussed.
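The "wrapper that maps both sides" idea, with a mapping table kept under our own control, might look roughly like the following. This is a Python sketch rather than the C++ it would actually be written in, the `ASCII_FOLD` table is an assumed and very incomplete stand-in for an Oracle-style mapping, and `fold_to_ascii` and `matches` are hypothetical names.

```python
import unicodedata

# Assumed, hand-maintained mapping (a small excerpt for illustration).
ASCII_FOLD = str.maketrans({
    "à": "a", "á": "a", "â": "a", "ä": "a",
    "ç": "c",
    "è": "e", "é": "e", "ê": "e", "ë": "e",
    "ì": "i", "í": "i", "î": "i", "ï": "i",
    "ò": "o", "ó": "o", "ô": "o", "ö": "o",
    "ù": "u", "ú": "u", "û": "u", "ü": "u",
    "ß": "ss",
})


def fold_to_ascii(text: str) -> str:
    """Compose (NFC), lowercase, then apply the explicit mapping table."""
    return unicodedata.normalize("NFC", text).lower().translate(ASCII_FOLD)


def matches(search: str, searched: str) -> bool:
    """Fold both the search string and the searched string before comparing."""
    return fold_to_ascii(search) in fold_to_ascii(searched)


print(matches("muller", "Hans Müller"))
print(matches("strasse", "Hauptstraße"))
```

Folding both sides costs one extra pass over each string, which is the string-walking cost mentioned above.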

Flags: needinfo?(mak77)

Priority: -- → P3

Whiteboard: [qx] → [qx][fxsearch]

Marco Bonardo [:mak]

Updated

•

9 years ago

Summary: Consider accents for approximate search → Consider accents and umlauts for approximate search

sphakka

Comment 7

•

6 years ago

How about a general "accent insensitive" approach based on regex equivalence classes (a.k.a. POSIX collating elements)? The underlying libs might already support it... err... just dreaming (libpcre2/10.32):

$ echo -e "Müller\nMuller" | pcre2grep '[[:u:]]'
pcre2grep: Error in command-line regex at offset 3: unknown POSIX class name
<marcoep@sphakka:~/tmp>
$ echo -e "Müller\nMuller" | pcre2grep '[[.u.]]'
pcre2grep: Error in command-line regex at offset 1: POSIX collating elements are not supported

Whatever the implementation details are, I'd love to see this working on anything searchable -- find in page (Ctrl+F), mail search (Ctrl+K), etc.
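Where the regex engine has no equivalence classes, they can be emulated by expanding each base letter of the query into an explicit character class. A minimal Python sketch, where the `EQUIVALENTS` table is an assumed, incomplete list and `accent_insensitive_pattern` is a hypothetical helper:

```python
import re

# Assumed equivalence classes: each base letter plus a few accented variants.
EQUIVALENTS = {
    "a": "aàáâä",
    "e": "eèéêë",
    "i": "iìíîï",
    "o": "oòóôö",
    "u": "uùúûü",
}


def accent_insensitive_pattern(query: str) -> re.Pattern:
    """Compile a pattern where each plain vowel also matches its accented forms."""
    parts = []
    for ch in query.lower():
        variants = EQUIVALENTS.get(ch)
        parts.append(f"[{variants}]" if variants else re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = accent_insensitive_pattern("muller")
for name in ("Müller", "Muller", "Miller"):
    print(name, bool(pattern.search(name)))
```

This expansion is one-directional: a query that already contains "ü" only matches "ü", which is in line with the rule in the description.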

Drew Willcoxon :adw

Updated

•

6 years ago

OS: Linux → All

Hardware: x86 → All

Summary: Consider accents and umlauts for approximate search → Consider accents and umlauts for approximate search/matching

Marco Bonardo [:mak]

Updated

•

5 years ago

Points: --- → 8

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Categories

(Firefox :: Address Bar, defect, P3)
