L does not match Ł and O does not match Ø when "Match Diacritics" is off
Categories
(Core :: Find Backend, enhancement)
Tracking
| Release | Tracking | Status |
|---|---|---|
| firefox88 | --- | fixed |
People
(Reporter: alexhenrie24, Assigned: alexhenrie24)
Attachments
(3 files, 1 obsolete file)
Some Unicode characters (such as Ł and Ø) are clearly Latin letters with a diacritic, but have no Unicode decomposition into a base letter plus a combining mark. We can use ICU's Latin-to-ASCII fallback table to ignore this kind of diacritic when "Match Diacritics" is off. This would be particularly helpful for people who can read multiple languages but whose keyboards lack some of the letters those languages use.
Additional advantages to using the transliteration table include i matching ı when "Match Case" is on and "Match Diacritics" is off (Bug 1622719) and being able to drop the custom code for straight/curly quotation mark matching.
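To illustrate the idea, here is a minimal Python sketch, not the Firefox implementation. It assumes a tiny hand-picked `FALLBACK` table standing in for ICU's Latin-ASCII data; the table contents and the function names are hypothetical. It strips combining marks after canonical decomposition, then applies the fallback table to characters such as Ł and Ø that have no decomposition.

```python
import unicodedata

# Hypothetical, hand-picked subset standing in for ICU's Latin-ASCII table.
FALLBACK = {
    "Ł": "L", "ł": "l", "Ø": "O", "ø": "o", "ı": "i",
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201C": '"', "\u201D": '"',   # curly double quotes
}

def strip_diacritics(text: str) -> str:
    """Drop combining marks, then apply the fallback table."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            continue  # combining mark produced by decomposition
        out.append(FALLBACK.get(ch, ch))
    return "".join(out)

def find_ignoring_diacritics(haystack: str, needle: str) -> bool:
    """Case-sensitive search with diacritics ignored on both sides."""
    return strip_diacritics(needle) in strip_diacritics(haystack)
```

With this sketch, `strip_diacritics("Łódź")` yields `"Lodz"`: ó and ź are handled by decomposition, while Ł only matches because of the fallback table. Because no case folding is applied, ı maps to i even in a case-sensitive search.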
Comment 1 • 4 years ago
Is there a convenient reference to show what this Latin-to-ASCII fallback table includes?
Comment 2 (Assignee) • 4 years ago
Yes, you can look at intl/icu/source/data/translit/Latin_ASCII.txt in the Mozilla source code, or at https://github.com/unicode-org/icu/blob/master/icu4c/source/data/translit/Latin_ASCII.txt for the current upstream version.
I have code that implements this feature, but moz-phab keeps giving me the following error when I try to submit it:
No configured storage engine can store this file. See "Configuring File Storage" in the documentation for information on configuring storage engines.
I'm guessing that this is because the commit changes icudt67l.dat, which is a binary file. Any ideas on how to resolve the error?
Comment 3 • 4 years ago
Sorry, I'm not sure what to do about that... :( I think :anba usually deals with ICU updates, so may be able to help here.
(There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)
Comment 4 (Assignee) • 4 years ago
(In reply to Jonathan Kew (:jfkthame) from comment #3)
> (There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)
I spent a lot of time figuring out the ICU build system and I don't think that will be a problem.
Comment 5 (Assignee) • 4 years ago
Since Phabricator is not working, I'm uploading the patch here.
Comment 6 • 4 years ago
Hmm, I've never seen that moz-phab error, so I can't really give any pointers on how to resolve it. :-(
Comment 7 • 4 years ago
This looks pretty awesome, Alex -- do you think it's ready for review/landing, or is there further work you're aiming to do on it?
Comment 8 (Assignee) • 4 years ago
I think it's ready to land; I just can't get the patch uploaded to Phabricator. I have opened Bug 1649359 about that and I'm hoping that it will attract the attention of someone who knows what to do.
Comment 10 • 4 years ago
I'd love for this to be merged and eventually available! As a native Polish speaker, who's quite lazy, I find the diacritic insensitive search very useful. Thanks so, so much for implementing it!
However, the fact that "l" does not match "ł" is rather inconvenient — despite being aware of the issue, I'm still occasionally caught out by this while searching.
I understand the reason why the initial implementation did not allow matching "ł" with "l" (Unicode decomposition rules), but irrespective of what Unicode claims, from the point of view of the Polish language "ł" is as much and as little a separate letter as "ó" or "ź", so if "o" matches "ó" and "z" matches "ź", then it's just weird and inconsistent that "l" doesn't match "ł".
I apologise for writing a "+1" comment, but since the patch is already written and hence at least part of the effort has been already put in, I hope that I'm not being too annoying.
Comment 14 (Assignee) • 4 years ago
Due to an unfortunate typo I made in base_chars.py, I thought that there were no mappings we care about outside of the Basic Multilingual Plane. This patch adds back the non-BMP mappings that we do care about.
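As a rough way to see that such mappings exist, the following Python sketch scans code points above U+FFFF for canonical decompositions. It does not reproduce base_chars.py, and which of these mappings Firefox actually cares about is not stated here; the helper names `canonical_base` and `non_bmp_mappings` are hypothetical.

```python
import sys
import unicodedata

def canonical_base(ch: str) -> str:
    """Return the first code point of ch's full canonical decomposition."""
    decomp = unicodedata.decomposition(ch)
    # Skip characters with no decomposition or only a compatibility
    # decomposition (those start with a tag such as "<font>").
    if not decomp or decomp.startswith("<"):
        return ch
    first = chr(int(decomp.split()[0], 16))
    return canonical_base(first)

def non_bmp_mappings() -> dict:
    """Map each non-BMP character to its canonical base, when they differ."""
    result = {}
    for cp in range(0x10000, sys.maxunicode + 1):
        ch = chr(cp)
        base = canonical_base(ch)
        if base != ch:
            result[ch] = base
    return result
```

The result is non-empty: for example, U+1109A (KAITHI LETTER DDDHA) decomposes canonically to U+11099 followed by a nukta, so a table limited to the BMP would miss it.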