Closed Bug 1649187 Opened 5 years ago Closed 5 years ago

L does not match Ł and O does not match Ø when "Match Diacritics" is off

Status:

RESOLVED FIXED

Milestone:

88 Branch

Tracking Flags:

            Tracking   Status
firefox88   ---        fixed

People

(Reporter: alexhenrie24, Assigned: alexhenrie24)

Attachments

(3 files, 1 obsolete file)

[PATCH] Use a transliterator to strip diacritics from non-decomposable characters 5 years ago Alex Henrie 471.74 KB, patch
Bug 1649187 - Use a transliterator to strip diacritics from non-decomposable characters. 5 years ago Alex Henrie 47 bytes, text/x-phabricator-request
Bug 1649187 - Use a fallback table to strip diacritics from non-decomposable characters. 5 years ago Alex Henrie 48 bytes, text/x-phabricator-request
Bug 1649187 - Fix diacritic stripping for characters outside the BMP. 5 years ago Alex Henrie 48 bytes, text/x-phabricator-request

Alex Henrie

Assignee

Description

•

5 years ago

Some Unicode characters (such as Ł and Ø) are clearly Latin letters plus a combining mark, but have no defined Unicode decomposition. We can use ICU's Latin-to-ASCII fallback table to ignore this kind of diacritic when "Match Diacritics" is off. This would be particularly helpful for people who can read multiple languages but do not have all of the letters that those languages use on their keyboard.

Additional advantages to using the transliteration table include i matching ı when "Match Case" is on and "Match Diacritics" is off (Bug 1622719) and being able to drop the custom code for straight/curly quotation mark matching.
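For illustration only, here is a minimal Python sketch of the idea described above: strip combining marks via canonical decomposition, then consult a small fallback table for letters that have no decomposition. The table contents, `FALLBACK_BASE_CHARS`, `strip_diacritics`, and `matches_ignoring_diacritics` are hypothetical names chosen for this example; they are not taken from the patch or from ICU's Latin_ASCII.txt, and the actual Firefox implementation is not written in Python.

```python
import unicodedata

# Hypothetical fallback entries, for illustration only. The real data in the
# patch and in ICU's Latin-ASCII table is far larger.
FALLBACK_BASE_CHARS = {
    "Ł": "L",
    "ł": "l",
    "Ø": "O",
    "ø": "o",
    "ı": "i",  # dotless i has no Unicode decomposition either
}


def strip_diacritics(text: str) -> str:
    """Return text with diacritics removed, case preserved."""
    # Step 1: canonical decomposition splits e.g. "ó" into "o" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop the combining marks produced by the decomposition.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Step 3: letters such as "Ł" survive steps 1-2 unchanged, so map them
    # through the fallback table.
    return "".join(FALLBACK_BASE_CHARS.get(ch, ch) for ch in without_marks)


def matches_ignoring_diacritics(needle: str, haystack: str) -> bool:
    """Case-sensitive substring search that ignores diacritics."""
    return strip_diacritics(needle) in strip_diacritics(haystack)
```

With this sketch, `matches_ignoring_diacritics("Lodz", "Łódź")` succeeds, whereas decomposition alone would leave the Ł in place and the search would fail.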

Jonathan Kew [:jfkthame]

Comment 1

•

5 years ago

Is there a convenient reference to show what this Latin-to-ASCII fallback table includes?

Alex Henrie

Assignee

Comment 2

•

5 years ago

Yes, you can look at intl/icu/source/data/translit/Latin_ASCII.txt in the Mozilla source code or https://github.com/unicode-org/icu/blob/master/icu4c/source/data/translit/Latin_ASCII.txt for the current upstream version.

I have code that implements this feature, but moz-phab keeps giving me the following error when I try to submit it:

No configured storage engine can store this file. See "Configuring File Storage" in the documentation for information on configuring storage engines.

I'm guessing that this is because the commit changes icudt67l.dat, which is a binary file. Any ideas on how to resolve the error?

Jonathan Kew [:jfkthame]

Comment 3

•

5 years ago

Sorry, I'm not sure what to do about that... :( I think :anba usually deals with ICU updates, so may be able to help here.

(There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)

Flags: needinfo?(andrebargull)

Alex Henrie

Assignee

Comment 4

•

5 years ago

(In reply to Jonathan Kew (:jfkthame) from comment #3)

> (There may also be concerns about making sure what you're doing here doesn't get stomped on by future ICU updates, unless you've already looked into how that is handled.)

I spent a lot of time figuring out the ICU build system and I don't think that will be a problem.

Alex Henrie

Assignee

Comment 5

•

5 years ago

Attached patch [PATCH] Use a transliterator to strip diacritics from non-decomposable characters

Since Phabricator is not working, I'm uploading the patch here.

Assignee: nobody → alexhenrie24

André Bargull [:anba]

Comment 6

•

5 years ago

Hmm, I've never seen that moz-phab error, so I can't really give any pointers on how to resolve it. :-(

Flags: needinfo?(andrebargull)

Jonathan Kew [:jfkthame]

Comment 7

•

5 years ago

This looks pretty awesome, Alex -- do you think it's ready for review/landing, or is there further work you're aiming to do on it?

Alex Henrie

Assignee

Comment 8

•

5 years ago

I think it's ready to land; I just can't get the patch uploaded to Phabricator. I have opened Bug 1649359 about that and I'm hoping that it will attract the attention of someone who knows what to do.

Alex Henrie

Assignee

Comment 9

•

5 years ago

Attached file Bug 1649187 - Use a transliterator to strip diacritics from non-decomposable characters. (obsolete)

Alex Henrie

Assignee

Updated

•

5 years ago

Comment 10

•

5 years ago

I'd love for this to be merged and eventually available! As a native Polish speaker who's quite lazy, I find the diacritic-insensitive search very useful. Thanks so, so much for implementing it!

However, the fact that "l" does not match "ł" is rather inconvenient: despite being aware of the issue, I'm still occasionally caught out by this while searching.

I understand the reason why the initial implementation did not allow matching "ł" with "l" (Unicode decomposition rules), but irrespective of what Unicode claims, from the point of view of the Polish language "ł" is as much and as little a separate letter as "ó" or "ź", so if "o" matches "ó" and "z" matches "ź", then it's just weird and inconsistent that "l" doesn't match "ł".

I apologise for writing a "+1" comment, but since the patch is already written and hence at least part of the effort has been already put in, I hope that I'm not being too annoying.

Alex Henrie

Assignee

Comment 11

•

5 years ago

Attached file Bug 1649187 - Use a fallback table to strip diacritics from non-decomposable characters.

Phabricator Automation

Updated

•

5 years ago

Attachment #9160689 - Attachment is obsolete: true

Pulsebot

Comment 12

•

5 years ago

Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d61a3c845eb6 Use a fallback table to strip diacritics from non-decomposable characters. r=jfkthame

Bogdan Tara[:bogdan_tara | bogdant]

Comment 13

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/d61a3c845eb6

Status: NEW → RESOLVED

Closed: 5 years ago

status-firefox88: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 88 Branch

Alex Henrie

Assignee

Comment 14

•

5 years ago

Attached file Bug 1649187 - Fix diacritic stripping for characters outside the BMP.

Due to an unfortunate typo I made in base_chars.py, I thought that there
were no mappings we care about outside of the basic multilingual plane.
This patch adds back the non-BMP mappings that we do care about.
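
As a rough way to see that decomposable characters do exist outside the BMP, the following Python sketch scans the supplementary planes for code points with a canonical decomposition. It is an assumption-laden illustration: it does not reproduce base_chars.py, the function name is hypothetical, and this bug does not state which non-BMP mappings the patch restores. The exact results also depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def non_bmp_with_canonical_decomposition() -> list[int]:
    """List code points above U+FFFF that have a canonical decomposition."""
    found = []
    for cp in range(0x10000, 0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility decompositions start with a tag such as "<font>";
        # canonical ones are just a list of hex code points.
        if decomp and not decomp.startswith("<"):
            found.append(cp)
    return found


if __name__ == "__main__":
    code_points = non_bmp_with_canonical_decomposition()
    print(f"{len(code_points)} non-BMP code points decompose canonically")
    for cp in code_points[:5]:
        print(f"U+{cp:04X} -> {unicodedata.decomposition(chr(cp))}")
```

A scan limited to U+0000..U+FFFF would miss every one of these, which is the kind of omission a typo in a range bound could cause.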

Tyson Smith [:tsmith]

Updated

•

5 years ago

Regressions: 1697076

Pulsebot

Comment 15

•

5 years ago

Pushed by jkew@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0c69d6d5aab7 Fix diacritic stripping for characters outside the BMP. r=jfkthame

Dorel Luca [:dluca]

Comment 16

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0c69d6d5aab7
