Bug 1164263 (Open): Opened 9 years ago, updated 23 days ago

Investigate why making the en-US dictionary UTF-8 results in non-Latin text being marked as misspelled

Category: Core :: Spelling checker (defect)
Reporter: ehsan.akhgari (Unassigned)
Depends on 1 open bug
Follow-up from bug 1162823, in order to investigate the root cause of the issue.
I'd like to understand this a little better.

When I use the en-US dictionary, which is ISO 8859-1 encoded, a non-Latin word, like the Korean 안녕하세요 ("Hello") or the Greek "Ευχαριστώ" ("Thanks"), does *not* get marked as misspelled.

When I use the en-GB dictionary, which is UTF-8 encoded, these words are marked as misspelled.

When I write Korean text using an English dictionary, all Korean words are by definition misspelled. When I write Spanish text using an English dictionary, most words will also be misspelled. Misspelled words should be flagged.

Why is this a problem? The spellchecker should mark words which are not in the dictionary as misspelled. And that's what it does.

IMHO the bug is that it currently doesn't mark non-Latin words with the ISO 8859-1 encoded en-US dictionary. This might be due to the fact that (quoted from https://bugzilla.redhat.com/show_bug.cgi?id=240696#c2):
  Hunspell spellchecks only in the stated character set.
That would explain why non-Latin words are just ignored if the dictionary uses ISO 8859-1.
(The ISO 8859-7 (8bit) encoded Greek dictionary behaves like the en-US dictionary.)
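For reference, the encoding is declared by the SET directive at the top of a dictionary's .aff file. A sketch of what the two dictionaries plausibly declare (illustrative, not copied from the actual files):

  # en-US.aff (8-bit build): Hunspell only considers words representable
  # in this character set, so Korean or Greek text is silently skipped.
  SET ISO8859-1

  # en-GB.aff: under UTF-8 every run of letters is a candidate word,
  # so anything not in the dictionary gets flagged.
  SET UTF-8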

More experiments:
I've installed a comparatively simple Sanskrit dictionary (UTF-8)
https://addons.mozilla.org/en-US/firefox/addon/sanskrit-spell-checker/contribute/roadblock/?src=dp-btn-primary&version=2.0
and also the Hebrew dictionary (UTF-8)
https://addons.mozilla.org/en-US/firefox/addon/hebrew-spell-checking-dictiona/contribute/roadblock/?src=dp-btn-primary&version=1.3.0.1
and both flag all English words as misspelled as expected.

I installed the Korean dictionary:
https://addons.mozilla.org/en-us/firefox/addon/korean-spellchecker/
It marks "안녕하세요요" as misspelled (the correct spelling appears above), and it does *not* mark (most) Latin text as misspelled, although English words clearly don't appear in the Korean dictionary. However, English words with an apostrophe, like "I'd" and "don't", are marked as misspelled, as are accented words like "résumé" or "mañana". Greek text is also marked as misspelled.

I reckon there is a trick in the Korean dictionary that lets all words [a-zA-Z]* pass. Sadly the Korean dictionary is a fine piece of art with 78000 (!) lines in the affix file, so I wouldn't know where to look.

So my questions are:

Ehsan: Why is it bad to show non-Latin words as misspelled?

Kevin: Is there a trick that can be done in the dictionary to ignore non-Latin words in the spellcheck? We know the Unicode ranges for CJK and other non-Latin scripts. The Koreans seem to manage to ignore (most) Latin words, but how do they do it?
Flags: needinfo?(kevin.bugzilla)
Flags: needinfo?(ehsan)
Sorry I do not know enough about Hunspell to be of help.
Flags: needinfo?(kevin.bugzilla)
OK, I've looked at the Korean stuff. It looks like they allow [0-9a-z] using a tricky configuration of suffixes: the string 0123456789abcdefghijklmnopqrstuvwxyz appears 4173 times in their suffix file. So, as described in comment #1, Latin words consisting only of [a-z] are allowed. I conclude that the Korean approach won't help us.
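For the record, here is a minimal sketch of how an affix file could whitelist arbitrary [a-z] strings via COMPOUNDRULE, modeled on the digit-spelling example in the Hunspell documentation rather than on the actual Korean files; untested:

  # .aff sketch: treat each letter as a compoundable pseudo-word and
  # accept any compound built from two or more of them.
  SET UTF-8
  COMPOUNDMIN 1
  ONLYINCOMPOUND c
  COMPOUNDRULE 1
  COMPOUNDRULE AA*

  # .dic sketch: one entry per letter, carrying the compound flag A and
  # the only-in-compound flag c (single letters on their own would still
  # be flagged and would need separate plain entries).
  26
  a/Ac
  b/Ac
  # ... one line per letter ...
  z/Ac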

I think it comes down to:
If we use UTF-8, Hunspell flags all words which are not in the dictionary as errors (as I would naïvely expect). In 8-bit mode it doesn't check words outside the given character set (as per comment #1 above).

I'd call the current behaviour a bug and back out bug 1162823 to enable UTF-8. Flagging non-Latin words when using a Latin dictionary is not a bug/regression but a natural feature.

BTW, going for UTF-8 would also allow the use of the two different single quotes: ’ ' - right?
(In reply to Jorg K (GMT+1) from comment #3)
> I'd call the current behaviour a bug and back out bug 1162823 to enable
> UTF-8. Flagging non-Latin words when using a Latin dictionary is not a
> bug/regression but a natural feature.

I do not agree, but it is not my place to decide this.

> BTW, going for UTF-8 would also allow the use of the two different single
> quotes: ’ ' - right?

Yes, it should, assuming Mozilla's implementation of Hunspell supports this.
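For illustration, the .aff file's WORDCHARS directive is how the stand-alone hunspell tool is told about extra characters that count as part of a word; whether Mozilla's own tokenizer honours it is exactly the open question. A sketch (I haven't checked what the shipped en-GB file actually declares):

  # .aff fragment (sketch): declare both apostrophes as word characters
  # so "don't" and "don’t" each tokenize as one word.  Under ISO 8859-1
  # the typographic apostrophe (U+2019) can't even be represented.
  SET UTF-8
  WORDCHARS 0123456789'’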
(In reply to Jorg K (GMT+1) from comment #1)
> Ehsan: Why is it bad to show non-Latin words as misspelled?

Because most languages in the world are written in scripts that are not based on Latin, and if you speak one of those languages, everything you type will be marked as misspelled.
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari from comment #5)
> Because most languages in the world are written in scripts that are not
> based on Latin, and if you speak one of those languages everything you type
> will be marked as misspelling.
... if you ask the program to spell check in English.

Frankly, I don't think your expectation is realistic.

If you write in Spanish and tell the computer to spell check in English, it's all wrong.
If you write in Persian and tell the computer to spell check in English, it should also come out wrong.

To fix this, 1) don't spell check, or 2) install the appropriate spell checker.

Do you want to be in the privileged situation that if you write a Persian/English mix, the Persian is unchecked, and that only the English part gets checked? Well, people who write French/English (Canadians) or any other Latin-based language mix don't have that privilege. Neither do the other UTF-8 dictionary users have this privilege (ignoring the Korean trickery). The guy who uses the Hebrew dictionary gets all his English flagged as wrong.

Anyway, that's all beside the point. As far as my research goes: if Hunspell uses UTF-8, it spell checks every string of characters and reports it if it's not in the dictionary. If Hunspell uses ISO-8859-1, it ignores anything not in that range.

What do you want to do about it? Drill open Hunspell? I can't see a solution. You'd have to define, for every language, which Unicode ranges its writing system occupies and teach Hunspell to look at that.

So if you write in Latin script, Persian, Arabic, Hebrew, CJK, etc. get ignored. If you write, for example, Hebrew, all the others get ignored.
The desired outcome of this bug is to not regress the current behavior of an en-US build (that is, to not mark non-Latin words as misspelled) while switching the en-US dictionary to UTF-8.  Since I haven't yet investigated how this needs to be fixed, I can't comment on what a good solution will be.  If that involves patching Hunspell, that's completely fine; we do that already anyway.
Depends on: 1652692

I suggest adding a code range to the dictionary: a dictionary declares the code point range it will check against, and only words composed entirely of characters within that range are spellchecked by that dictionary. The existing ISO-8859-1 behavior is a crude, implicit form of such a declaration. We can make it explicit.
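For instance (hypothetical syntax; no such directive exists in Hunspell today), the declaration could live in the .aff file:

  SET UTF-8
  # Hypothetical directive: only words made entirely of code points in
  # the declared range are checked; everything else is skipped instead
  # of flagged, mirroring what 8-bit dictionaries do implicitly today.
  CODERANGE U+0000-U+024F   # Basic Latin through Latin Extended-B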

Severity: normal → S3