Bug 1164263 (Open): Opened 9 years ago, updated 23 days ago

Investigate why making the en-US dictionary UTF-8 results in non-Latin text being marked as misspelled

Category: Core :: Spelling checker (defect)
Reporter: ehsan.akhgari (Unassigned)
Depends on 1 open bug
Follow-up from bug 1162823, in order to investigate the root cause of the issue.
I'd like to understand this a little better.

When I use the en-US dictionary, which is ISO 8859-1 encoded, a non-Latin word, like the Korean 안녕하세요 ("Hello") or the Greek "Ευχαριστώ" ("Thanks"), does *not* get marked as misspelled.

When I use the en-GB dictionary, which is UTF-8 encoded, these words are marked as misspelled.

When I write Korean text using an English dictionary, all Korean words are by definition misspelled. When I write Spanish text using an English dictionary, most words will also be misspelled. Misspelled words should be flagged.

Why is this a problem? The spellchecker should mark words which are not in the dictionary as misspelled. And that's what it does.

IMHO the bug is that it currently doesn't mark non-Latin words with the ISO 8859-1 encoded en-US dictionary. This might be due to the fact that (quoted from https://bugzilla.redhat.com/show_bug.cgi?id=240696#c2):
  Hunspell spellchecks only in the stated character set.
That would explain why non-Latin words are just ignored if the dictionary uses ISO 8859-1.
(The ISO 8859-7 (8bit) encoded Greek dictionary behaves like the en-US dictionary.)
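For reference, the encoding is declared by the SET directive at the top of a dictionary's .aff file. A sketch of what the two dictionaries plausibly declare (illustrative, not copied from the actual files):

  # en-US.aff (8-bit build): Hunspell only considers words representable
  # in this character set, so Korean or Greek text is silently skipped.
  SET ISO8859-1

  # en-GB.aff: under UTF-8 every run of letters is a candidate word,
  # so anything not in the dictionary gets flagged.
  SET UTF-8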

More experiments:
I've installed a comparatively simple Sanskrit dictionary (UTF-8)
https://addons.mozilla.org/en-US/firefox/addon/sanskrit-spell-checker/contribute/roadblock/?src=dp-btn-primary&version=2.0
and also the Hebrew dictionary (UTF-8)
https://addons.mozilla.org/en-US/firefox/addon/hebrew-spell-checking-dictiona/contribute/roadblock/?src=dp-btn-primary&version=1.3.0.1
and both flag all English words as misspelled as expected.

I installed the Korean dictionary:
https://addons.mozilla.org/en-us/firefox/addon/korean-spellchecker/
It marks "안녕하세요요" as misspelled (the correct spelling appears above), and it does *not* mark (most) Latin text as misspelled, although English words clearly don't appear in the Korean dictionary. However, English words with an apostrophe, like "I'd" and "don't", are marked as misspelled, as are accented words like "résumé" or "mañana". Greek text is also marked as misspelled.

I reckon there is a trick in the Korean dictionary that lets all words [a-zA-Z]* pass. Sadly the Korean dictionary is a fine piece of art with 78000 (!) lines in the affix file, so I wouldn't know where to look.

So my questions are:

Ehsan: Why is it bad to show non-Latin words as misspelled?

Kevin: Is there a trick that can be done in the dictionary to ignore non-Latin words in the spellcheck? We know the Unicode ranges for CJK and other non-Latin scripts. The Koreans seem to manage to ignore (most) Latin words, but how do they do it?
Flags: needinfo?(kevin.bugzilla)
Flags: needinfo?(ehsan)
Sorry I do not know enough about Hunspell to be of help.
Flags: needinfo?(kevin.bugzilla)
OK, I've looked at the Korean stuff. It looks like they allow [0-9a-z] using a tricky configuration of suffixes: the string 0123456789abcdefghijklmnopqrstuvwxyz appears 4173 times in their suffix file. So, as described in comment #1, Latin words consisting only of [a-z] are allowed. I conclude that the Korean approach won't help us.
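For the record, here is a minimal sketch of how an affix file could whitelist arbitrary [a-z] strings via COMPOUNDRULE, modeled on the digit-spelling example in the Hunspell documentation rather than on the actual Korean files; untested:

  # .aff sketch: treat each letter as a compoundable pseudo-word and
  # accept any compound built from two or more of them.
  SET UTF-8
  COMPOUNDMIN 1
  ONLYINCOMPOUND c
  COMPOUNDRULE 1
  COMPOUNDRULE AA*

  # .dic sketch: one entry per letter, carrying the compound flag A and
  # the only-in-compound flag c (single letters on their own would still
  # be flagged and would need separate plain entries).
  26
  a/Ac
  b/Ac
  # ... one line per letter ...
  z/Ac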

I think it comes down to:
If we use UTF-8, Hunspell flags all words which are not in the dictionary as errors (as I would naïvely expect). In 8-bit mode it doesn't check words outside the given character set (as per comment #1 above).

I'd call the current behaviour a bug and back out bug 1162823 to enable UTF-8. Flagging non-Latin words when using a Latin dictionary is not a bug/regression but a natural feature.

BTW, going for UTF-8 would also allow the use of the two different single quotes: ’ ' - right?
(In reply to Jorg K (GMT+1) from comment #3)
> I'd call the current behaviour a bug and back out bug 1162823 to enable
> UTF-8. Flagging non-Latin words when using a Latin dictionary is not a
> bug/regression but a natural feature.

I do not agree, but it is not my place to decide this.

> BTW, going for UTF-8 would also allow the use of the two different single
> quotes: ’ ' - right?

Yes, it should, assuming Mozilla's implementation of Hunspell supports this.
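For illustration, the .aff file's WORDCHARS directive is how the stand-alone hunspell tool is told about extra characters that count as part of a word; whether Mozilla's own tokenizer honours it is exactly the open question. A sketch (I haven't checked what the shipped en-GB file actually declares):

  # .aff fragment (sketch): declare both apostrophes as word characters
  # so "don't" and "don’t" each tokenize as one word.  Under ISO 8859-1
  # the typographic apostrophe (U+2019) can't even be represented.
  SET UTF-8
  WORDCHARS 0123456789'’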
(In reply to Jorg K (GMT+1) from comment #1)
> Ehsan: Why is it bad to show non-Latin words as misspelled?

Because most languages in the world are written in scripts that are not based on Latin, and if you speak one of those languages, everything you type will be marked as misspelled.
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari from comment #5)
> Because most languages in the world are written in scripts that are not
> based on Latin, and if you speak one of those languages everything you type
> will be marked as misspelling.
... if you ask the program to spell check in English.

Frankly, I don't think your expectation is realistic.

If you write in Spanish and tell the computer to spell check in English, it's all wrong.
If you write in Persian and tell the computer to spell check in English, it should also come out wrong.

To fix this, 1) don't spell check, or 2) install the appropriate spell checker.

Do you want to be in the privileged situation that if you write a Persian/English mix, the Persian is unchecked, and that only the English part gets checked? Well, people who write French/English (Canadians) or any other Latin-based language mix don't have that privilege. Neither do the other UTF-8 dictionary users have this privilege (ignoring the Korean trickery). The guy who uses the Hebrew dictionary gets all his English flagged as wrong.

Anyway, that's all beside the point. As far as my research goes: if Hunspell uses UTF-8, it spell checks every string of characters and reports it if it's not in the dictionary. If Hunspell uses ISO-8859-1, it ignores anything not in that range.

What do you want to do about it? Drill open Hunspell? I can't see a solution. You'd have to define, for every language, which Unicode ranges its writing system occupies and teach Hunspell to look at that.

So if you write in Latin script, Persian, Arabic, Hebrew, CJK, etc. get ignored. If you write, for example, Hebrew, all the others get ignored.
The desired outcome of this bug is to not regress the current behavior of an en-US build (that is, to not mark non-Latin words as misspelled) while switching the en-US dictionary to UTF-8.  Since I haven't yet investigated how this needs to be fixed, I can't comment on what a good solution will be.  If that involves patching Hunspell, that's completely fine; we do that already anyway.
Depends on: 1652692

I suggest adding a code range to the dictionary: a dictionary declares the code point range it will check against, and only words composed entirely of characters within that range are spellchecked by that dictionary. The existing ISO-8859-1 behavior is a crude, implicit form of such a declaration. We can make it explicit.
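For instance (hypothetical syntax; no such directive exists in Hunspell today), the declaration could live in the .aff file:

  SET UTF-8
  # Hypothetical directive: only words made entirely of code points in
  # the declared range are checked; everything else is skipped instead
  # of flagged, mirroring what 8-bit dictionaries do implicitly today.
  CODERANGE U+0000-U+024F   # Basic Latin through Latin Extended-B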

Severity: normal → S3