User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla Firefox 3 Beta 5
The tokenization of words in Firefox for spellcheck is wrong when a word contains a ZWJ/ZWNJ/ZWS.
How to Reproduce:
In Firefox 3 Beta 5 (the version I used; older versions are affected too), type any word with a ZWJ/ZWNJ/ZWS inside it. An English word is sufficient.
For example, type "catwalc" with a ZWNJ between the t and the w. The red underline appears under the fragment "walc" and not under the whole word "catwalc". I used an English word to simplify the explanation. Although joiner characters are uncommon in English words, they are present in many other languages. Indic languages, especially my mother tongue Malayalam (ml_IN), use them extensively, both in the middle of words and at the end of words. Some examples:
1. അവന് (he) There is a ZWJ at the end of word
2. വില്പന (selling) There is a ZWJ in between the word
3. കായ്കറികള് (vegetables) There is a ZWNJ in between the word.
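The behaviour described above can be modelled with a minimal Python sketch. This is not Firefox's actual tokenizer: the `tokenize` function and its `word_internal` parameter are hypothetical names used only for illustration, and the sketch assumes a tokenizer that treats any non-letter character as a word boundary.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def tokenize(text, word_internal=frozenset()):
    """Split text into words.

    Letters always belong to a word. Characters listed in word_internal
    are kept when they follow at least one letter, so a joiner in the
    middle or at the end of a word does not break the word.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or (ch in word_internal and current):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


text = "a cat" + ZWNJ + "walc here"

# Joiners treated as boundaries: the word is cut in two, so only the
# fragment "walc" would be reported as misspelled.
naive = tokenize(text)

# Joiners treated as word-internal: the whole word reaches the checker.
fixed = tokenize(text, word_internal=frozenset({ZWNJ, ZWJ}))
```

With the first call the list contains the separate fragments "cat" and "walc"; with the second call the word stays in one piece, including the ZWNJ.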
Some Bengali(bn_IN) words
If any of these words is not in the dictionary and the user adds it to the dictionary, the spelling error remains, with the red underline still shown. I will explain one case:
Suppose the word xyz<zwj> is not in the dictionary. Even after we add it to the dictionary, xyz<zwj> is still reported as misspelled. But if the word xyz is already in the dictionary, xyz<zwj> does not show any spelling error. This is because the word used for the spell check after the wrong tokenization is "xyz".
To verify whether this is a Hunspell problem, I used the same Hunspell word list in OpenOffice 2.4, and everything worked fine there. The Hunspell developers informed me that the problem might be in Firefox, and I found that to be true.
All languages that use ZWJ/ZWNJ/ZWS within words are affected. Spell check won't work for those words.
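The dictionary case can be sketched in Python as follows. The helper names `fragment_checked` and `is_flagged` are hypothetical, and the model simply assumes that the faulty tokenization cuts the word at the joiner before the lookup happens.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER


def fragment_checked(word):
    """Model of the faulty tokenization: only the part of the word
    before the first joiner is handed to the dictionary lookup."""
    return word.split(ZWJ)[0]


def is_flagged(word, dictionary):
    """Return True if the word would get a red underline."""
    return fragment_checked(word) not in dictionary


word = "xyz" + ZWJ

# The user adds the full word, joiner included, to the dictionary.
personal = {word}
still_flagged = is_flagged(word, personal)   # lookup is done on "xyz"

# A dictionary that happens to contain plain "xyz" hides the problem.
not_flagged = is_flagged(word, {"xyz"})
```

Under this assumption, adding the exact word can never clear the error, because the string that is looked up is never the string that was added.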
Fix to be done:
Correct Firefox's word tokenization for spellcheck in text fields.
In bug 455236 we have a related question: is there really a standardized spelling with those characters? At least for Telugu, it sounded like that's not the case.
I personally would consider a ZWNJ to be a spelling mistake in English words, but I don't know, CCing smontagu on that.
Nemeth, did this discussion come up elsewhere in hunspell land so far?
(In reply to comment #1)
> I personally would consider a ZWNJ to be a spelling mistake in English words,
> but I don't know, CCing smontagu on that.
I don't see that that question is relevant here. AFAICT, the issue that needs to be fixed is that we should be passing the word including ZWNJ etc to hunspell (as we currently do e.g. with SHY). Hunspell can then make the call whether the specific control characters should be transparent or meaningful or Just Wrong in the particular language.
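As a rough sketch of that division of labour, assume the tokenizer hands over the complete word and the checker applies a per-language rule. `JoinerPolicy` and `check_word` are hypothetical names for illustration only; they are not part of Hunspell's or Firefox's API.

```python
from enum import Enum

JOINERS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ


class JoinerPolicy(Enum):
    TRANSPARENT = "transparent"  # joiners are ignored during lookup
    MEANINGFUL = "meaningful"    # joiners are part of the spelling
    WRONG = "wrong"              # any joiner makes the word misspelled


def check_word(word, dictionary, policy):
    """Return True if the complete word is accepted under the policy."""
    has_joiner = any(ch in JOINERS for ch in word)
    if policy is JoinerPolicy.WRONG and has_joiner:
        return False
    if policy is JoinerPolicy.TRANSPARENT:
        word = "".join(ch for ch in word if ch not in JOINERS)
    return word in dictionary


english = {"catwalk"}
with_zwnj = "cat\u200cwalk"

accepted_if_transparent = check_word(with_zwnj, english, JoinerPolicy.TRANSPARENT)
accepted_if_wrong = check_word(with_zwnj, english, JoinerPolicy.WRONG)
accepted_if_meaningful = check_word("xyz\u200d", {"xyz\u200d"}, JoinerPolicy.MEANINGFUL)
```

Whichever policy a language needs, the decision is only possible if the word arrives with its control characters intact.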
Three years after filing the bug, I am still able to reproduce this. It affects many Indian languages, Sinhala, Arabic, etc. These languages cannot use a spellchecker extension.
Created attachment 560018
Comment on attachment 560018
Review of attachment 560018:
The code looks reasonable (modulo rebasing) but the reftest doesn't seem to be testing anything, since the default en-US dictionary is in ISO-8859-1. Do we have a way to create ad-hoc dictionaries for tests?
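One way to see the encoding concern is that ISO-8859-1 has no code point for the zero-width joiners, so a word list stored in that encoding cannot contain a word with a ZWNJ. The short Python check below illustrates only that point; `representable` is a hypothetical helper and says nothing about how the reftest harness loads dictionaries.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def representable(text, encoding):
    """Return True if every character of text can be stored in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "cat" + ZWNJ + "walk"

in_latin1 = representable(word, "iso-8859-1")  # ZWNJ has no Latin-1 byte
in_utf8 = representable(word, "utf-8")
```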
Hmm, not really. Do you want me to land the patch without the test?
Landed without the tests: https://hg.mozilla.org/integration/mozilla-inbound/rev/257d3a7e61bb