1620826 - Incorrect behavior of "Match diacritics" for cyrillic languages

vtd

Reporter

Description

•

5 years ago

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0

Steps to reproduce:

Search for "и" with "Match diacritics" turned on

Actual results:

Й (a different letter) is matched

Expected results:

"й" shouldn't be matched

This also happens for letters е and ё

In Russian, Й(й) and Ё(ё) are separate letters, not И(и) and Е(е) with diacritical marks. This probably also happens for Ў(ў) in Belarusian (also a separate letter, not У(у) with a diacritic) and for Ukrainian І(і) and Ї(ї).
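For reference, a minimal sketch of why a diacritic-insensitive search might treat these letters as base letter plus mark. It only inspects the Unicode canonical decompositions using Python's standard `unicodedata` module; it is not Firefox's find code, and `decompose` is a hypothetical helper name.

```python
import unicodedata

def decompose(ch: str) -> list[str]:
    """Return the code points of the canonical decomposition (NFD) of ch."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# Letters mentioned in the report. A letter whose list has more than one
# entry is encoded in Unicode as "base letter + combining mark".
for letter in "йёўїі":
    print(letter, decompose(letter))
```

In Unicode, й decomposes to и followed by U+0306 (combining breve), and ё to е followed by U+0308 (combining diaeresis), regardless of how a given alphabet classifies the letter. Cyrillic і has no canonical decomposition.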

Gingerbread Man

Comment 1

•

5 years ago

Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0
20200217142647

(In reply to vtd from comment #0)

Search for "и" with "Match diacritics" turned on

Works for me. With the option on, I only see 3 matches in comment 0.

With the option off, I think the whole point is to find similar letters, which is especially useful since a lot of web content doesn't use the correct characters.

Has STR: --- → yes

Component: Untriaged → Find Backend

Product: Firefox → Core

vtd

Reporter

Comment 2

•

5 years ago

Sorry, I meant "off".
The UI is a bit confusing. I don't know whether it is Firefox or my theme, but after toggling the state a couple of times, the button appears with a border.

BugBot [:suhaib / :marco/ :calixte]

Comment 3

•

5 years ago

The priority flag is not set for this bug.
:mikedeboer, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mdeboer)

Firefox Bug Husbandry Bot

Comment 4

•

5 years ago

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is -- (Backlog), indicating it has not been previously triaged, the bug's Severity is being updated to -- (default, untriaged).

Firefox Bug Husbandry Bot

Comment 5

•

5 years ago

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is -- (Backlog), indicating it has not been previously triaged, the bug's Severity is being updated to -- (default, untriaged).

Firefox Bug Husbandry Bot

Comment 6

•

5 years ago

Because this bug's Severity has not been changed from the default since it was filed, and its Priority is -- (Backlog), indicating it has not been previously triaged, the bug's Severity is being updated to -- (default, untriaged).

Severity: normal → S3

Firefox Bug Husbandry Bot

Comment 7

•

5 years ago

The severity of these bugs was changed, mistakenly, from normal to S3.

Because these bugs have a priority of --, indicating that they have not been previously triaged, these bugs should be changed to Severity of --.

Severity: S3 → --

Dão Gottwald [:dao]

Comment 8

•

5 years ago

Alex, what, if anything, should be done here? Is this the same as bug 1622719?

Blocks: 202251

Flags: needinfo?(mdeboer) → needinfo?(alexhenrie24)

Updated

•

5 years ago

Severity: -- → S3

Priority: -- → P3

Alex Henrie

Comment 9

•

5 years ago

This has been discussed several times before:

https://bugzilla.mozilla.org/show_bug.cgi?id=202251#c70
https://bugzilla.mozilla.org/show_bug.cgi?id=202251#c71
https://bugzilla.mozilla.org/show_bug.cgi?id=202251#c108
https://bugzilla.mozilla.org/show_bug.cgi?id=202251#c118
https://phabricator.services.mozilla.com/D51841#1715069
https://phabricator.services.mozilla.com/D51841#1715773
https://phabricator.services.mozilla.com/D51841#1716236
https://phabricator.services.mozilla.com/D51841#1718197
https://phabricator.services.mozilla.com/D51841#1718211
https://phabricator.services.mozilla.com/D51841#1718316

It's essentially the same problem as n and ñ. In Spanish ñ is considered a separate letter, but in other languages (e.g. Breton) it is not. Furthermore, users who do not have ñ on their keyboards like to be able to search Spanish text using n instead.

Similarly, й is used in Russian and Ukrainian, but not in Serbian or Macedonian, and indeed the letter is not present on Serbian or Macedonian keyboards.

We could try to add some kind of language detection to handle this, but after the previous discussions, it's pretty clear that changing the behavior depending on language would cause more confusion than it would solve. Therefore in my opinion this bug is a WONTFIX.
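To illustrate the language-independent behavior described above, here is a minimal sketch of diacritic-insensitive matching. It assumes the approach is "decompose, then drop combining marks", which is a simplification and not necessarily how the Find Backend implements it; `fold_diacritics` and `matches` are hypothetical names.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def matches(query: str, word: str) -> bool:
    """Case- and diacritic-insensitive substring test, with no language input."""
    return fold_diacritics(query).casefold() in fold_diacritics(word).casefold()

# The same rule covers the Latin and the Cyrillic case.
print(matches("n", "año"))   # n finds ñ
print(matches("и", "мой"))   # и finds й
```

Because the function never looks at a language, n/ñ and и/й are handled identically, which is the trade-off under discussion.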

Flags: needinfo?(alexhenrie24)

vtd

Reporter

Comment 10

•

5 years ago

changing the behavior depending on language would cause more confusion than it would solve.

If there is some language where й is a variant of и, that would be the case. As far as I know, no such language exists, so the behavior still looks incorrect. It is true that one can add diacritics to any symbol and produce an abomination like "2 with breve", but I doubt that anyone sane would write Й using combining Unicode characters.

Е/Ё is a different case.

Alex Henrie

Updated

•

5 years ago

Comment 11

•

5 years ago

(In reply to vtd from comment #10)

changing the behavior depending on language would cause more confusion than it would solve.

If there is some language where й is a variant of и, that would be the case. As far as I know, no such language exists, so the behavior still looks incorrect. It is true that one can add diacritics to any symbol and produce an abomination like "2 with breve", but I doubt that anyone sane would write Й using combining Unicode characters.

According to the Wikipedia entry for Й, which I realize is not an authoritative source, but nevertheless may be a useful summary:

Active use of ⟨Й⟩ (or, rather, the breve over ⟨И⟩) began around the 15th and the 16th centuries. Since the middle of the 17th century, the differentiation between ⟨И⟩ and ⟨Й⟩ is obligatory in the Russian variant of Church Slavonic orthography (used for the Russian language as well). During the alphabet reforms of Peter I, all diacritic marks were removed from the Russian writing system, but shortly after his death, in 1735, the distinction between ⟨И⟩ and ⟨Й⟩ was restored. ⟨Й⟩ was not officially considered a separate letter of the alphabet until the 1930s.

there's a definite link between Й and И. In the article about И, we find more about the relationship:

⟨И⟩ with a breve forms the letter ⟨й⟩ for the consonant /j/ or a similar semivowel, like the y in English "yes." The form has been used regularly in Church Slavonic since the 16th century, but it officially became a separate letter of alphabet much later (in Russian, only in 1918). The original name of ⟨й⟩ was I s kratkoy ('I with the short [line]'), later I kratkoye ('short I') in Russian. It is known similarly as I kratko in Bulgarian but as Yot in Ukrainian.

(interestingly, the two articles seem to disagree slightly regarding when й became a separate letter of the alphabet: 1918 or the 1930s?)

Е/Ё is a different case.

Not really. From the point of view of most (Latin-script) languages, it would be exactly equivalent to A / Ä or O / Ö, for example; yet there are languages where Ä and Ö are considered to be independent letters of the alphabet; see bug 1647335.

But while we recognize that for users of some languages, certain specific diacritic-modified characters (in Unicode terms) are understood to be entirely separate letters from their diacritic-less base forms, it's not at all clear that there is any good way to handle this in a language-sensitive way. Much web content lacks correct language tagging, so we can't rely on that; and many users may read pages in one language using a browser and/or operating system localized to a different locale, so neither of those is a reliable guide to what the user would expect to happen.
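A rough sketch of what a language-sensitive variant might look like, to show where it becomes fragile. Everything here is hypothetical: the `INDEPENDENT_LETTERS` table is deliberately tiny and incomplete, and the `lang` argument stands in for whatever language signal (page tagging, locale) would be available.

```python
import unicodedata
from typing import Optional

# Hypothetical, incomplete table of letters a language treats as independent.
INDEPENDENT_LETTERS = {
    "ru": set("йё"),
    "es": set("ñ"),
}

def fold_for_language(text: str, lang: Optional[str]) -> str:
    """Strip diacritics except on letters the given language keeps distinct."""
    keep = INDEPENDENT_LETTERS.get(lang or "", set())
    out = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in keep:
            out.append(ch)  # treated as its own letter in this language
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def contains(page_text: str, query: str, lang: Optional[str]) -> bool:
    """Substring search after language-dependent folding."""
    return fold_for_language(query, lang) in fold_for_language(page_text, lang)

# Same page text and query, different answers depending on the language signal.
print(contains("мой", "и", "ru"))   # page tagged as Russian
print(contains("мой", "и", None))   # page with no usable language tag
```

With this sketch the result of the same search depends entirely on `lang`, so an untagged or mistagged page silently changes what the user finds.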

In general, I think it's better for a search to find a bit too much -- like finding words with й when searching for и -- than to miss finding things because the user didn't know exactly how to type the search words in the precise form they occur in the text.

So I agree with Alex that this is WONTFIX at this point -- at least until someone comes up with an idea for how it could work better in a way that generalizes to all users across all languages.

Status: UNCONFIRMED → RESOLVED

Closed: 5 years ago

Resolution: --- → WONTFIX

Categories

(Core :: Find Backend, defect, P3)

People

(Reporter: greatterrible, Unassigned)

Security

(public)
