Closed Bug 1649187 Opened 4 years ago Closed 4 years ago

L does not match Ł and O does not match Ø when "Match Diacritics" is off

Categories

(Core :: Find Backend, enhancement)

77 Branch
enhancement

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox88 --- fixed

People

(Reporter: alexhenrie24, Assigned: alexhenrie24)

Attachments

(3 files, 1 obsolete file)

Some Unicode characters (such as Ł and Ø) are clearly Latin letters plus a combining mark, but have no defined Unicode decomposition. We can use ICU's Latin-to-ASCII fallback table to ignore this kind of diacritic when "Match Diacritics" is off. This would be particularly helpful for people who can read multiple languages but whose keyboards lack some of the letters those languages use.
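To illustrate the gap described above, here is a minimal Python sketch, not the actual patch. It shows that "ó" has a canonical decomposition while "Ł" and "Ø" do not, and how a fallback table can cover the difference. The `FALLBACK` dictionary and the `strip_diacritics` helper are hypothetical names for illustration; the table is a tiny hand-written excerpt, whereas the real mappings come from ICU's Latin_ASCII.txt.

```python
import unicodedata

# Hypothetical hand-written excerpt; the real data is ICU's Latin-to-ASCII table.
FALLBACK = {"Ł": "L", "ł": "l", "Ø": "O", "ø": "o"}


def strip_diacritics(text: str) -> str:
    """Drop combining marks, then apply the fallback table to what is left."""
    out = []
    for ch in text:
        # Canonical decomposition splits e.g. "ó" into "o" + U+0301.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Characters with no decomposition (Ł, Ø) reach the table unchanged.
        out.append(FALLBACK.get(base, base))
    return "".join(out)


# "ó" decomposes; "Ł" has an empty decomposition field in the Unicode database.
print(repr(unicodedata.decomposition("ó")))
print(repr(unicodedata.decomposition("Ł")))
print(strip_diacritics("Łódź"))
```

Without the fallback lookup, the same function would leave "Ł" untouched, which is the behavior this bug reports.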

Additional advantages to using the transliteration table include i matching ı when "Match Case" is on and "Match Diacritics" is off (Bug 1622719) and being able to drop the custom code for straight/curly quotation mark matching.

Is there a convenient reference to show what this Latin-to-ASCII fallback table includes?

Yes, you can look at intl/icu/source/data/translit/Latin_ASCII.txt in the Mozilla source code or https://github.com/unicode-org/icu/blob/master/icu4c/source/data/translit/Latin_ASCII.txt for the current upstream version.

I have code that implements this feature, but moz-phab keeps giving me the following error when I try to submit it:

No configured storage engine can store this file. See "Configuring File Storage" in the documentation for information on configuring storage engines.

I'm guessing that this is because the commit changes icudt67l.dat, which is a binary file. Any ideas on how to resolve the error?

Sorry, I'm not sure what to do about that... :( I think :anba usually deals with ICU updates, so may be able to help here.

(There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)

Flags: needinfo?(andrebargull)

(In reply to Jonathan Kew (:jfkthame) from comment #3)

> (There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)

I spent a lot of time figuring out the ICU build system and I don't think that will be a problem.

Since Phabricator is not working, I'm uploading the patch here.

Assignee: nobody → alexhenrie24

Hmm, I've never seen that moz-phab error, so I can't really give any pointers on how to resolve it. :-(

Flags: needinfo?(andrebargull)

This looks pretty awesome, Alex -- do you think it's ready for review/landing, or is there further work you're aiming to do on it?

I think it's ready to land, I just can't get the patch uploaded to Phabricator. I have opened Bug 1649359 about that and I'm hoping that it will attract the attention of someone who knows what to do.

See Also: → 1647335

I'd love for this to be merged and eventually available! As a native Polish speaker, who's quite lazy, I find the diacritic insensitive search very useful. Thanks so, so much for implementing it!

However, the fact that "l" does not match "ł" is rather inconvenient — despite being aware of the issue, I'm still occasionally caught out by this while searching.

I understand the reason why the initial implementation did not allow matching "ł" with "l" (Unicode decomposition rules), but irrespective of what Unicode claims, from the point of view of the Polish language "ł" is as much and as little a separate letter as "ó" or "ź", so if "o" matches "ó" and "z" matches "ź", then it's just weird and inconsistent that "l" doesn't match "ł".

I apologise for writing a "+1" comment, but since the patch is already written and hence at least part of the effort has already been put in, I hope that I'm not being too annoying.

Attachment #9160689 - Attachment is obsolete: true
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d61a3c845eb6 Use a fallback table to strip diacritics from non-decomposable characters. r=jfkthame
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch

Due to an unfortunate typo I made in base_chars.py, I thought that there
were no mappings we care about outside of the Basic Multilingual Plane (BMP).
This patch adds back the non-BMP mappings that we do care about.
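For readers unfamiliar with the distinction, the following is a generic Python sketch of what "outside the BMP" means. It is not the contents of base_chars.py, and the typo itself is not reproduced here. The helper names and the sample mapping are hypothetical, and the entries are placeholders chosen only to exercise the code.

```python
def is_non_bmp(code_point: int) -> bool:
    """True if the code point lies above U+FFFF, i.e. outside the BMP."""
    return code_point > 0xFFFF


def utf16_units(code_point: int) -> int:
    """Number of 16-bit code units needed to encode the code point in UTF-16."""
    return len(chr(code_point).encode("utf-16-le")) // 2


def split_by_plane(mapping: dict[int, int]) -> tuple[dict[int, int], dict[int, int]]:
    """Partition a code-point mapping into BMP and non-BMP source characters."""
    bmp = {k: v for k, v in mapping.items() if not is_non_bmp(k)}
    non_bmp = {k: v for k, v in mapping.items() if is_non_bmp(k)}
    return bmp, non_bmp


# Placeholder entries: U+0141 (Ł) is in the BMP, U+1D400 is in a supplementary plane.
sample = {0x0141: ord("L"), 0x1D400: ord("A")}
bmp, non_bmp = split_by_plane(sample)
print(sorted(hex(k) for k in bmp), sorted(hex(k) for k in non_bmp))
```

A non-BMP character takes two UTF-16 code units, so a generator or table that only considers 16-bit values would silently skip such entries.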

Regressions: 1697076
Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0c69d6d5aab7 Fix diacritic stripping for characters outside the BMP. r=jfkthame