Bug 434044 - The tokenization of words for spellcheck is wrong when there is a ZWJ/ZWNJ/ZWS in the word.
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: Spelling checker
Version: 5 Branch
Platform: All Linux
Importance: -- normal with 2 votes
Target Milestone: mozilla15
Assigned To: :Ehsan Akhgari (busy, don't ask for review please)
Blocks: 257073
Reported: 2008-05-16 07:49 PDT by Santhosh Thottingal
Modified: 2012-06-02 11:53 PDT
CC List: 6 users


Attachments
Patch (v1) (5.14 KB, patch)
2011-09-13 12:57 PDT, :Ehsan Akhgari (busy, don't ask for review please)
smontagu: review+

Description Santhosh Thottingal 2008-05-16 07:49:27 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla Firefox 3 Beta 5

Bug description:
The tokenization of words in Firefox for spellcheck is wrong when there is a ZWJ/ZWNJ/ZWS in the word.


How to Reproduce:
In Firefox 3 Beta 5 (the build I used; older versions are affected too), type any word with a ZWJ/ZWNJ/ZWS inside it. An English word is sufficient.
Type "cat‌walc" with a ZWNJ between the t and the w. The red underline appears only under the fragment "walc", not under the whole word "catwalc". I used an English word to simplify the explanation. Although joiner characters are not common in English words, they occur in many other languages. Indic languages, especially my mother tongue Malayalam (ml_IN), use them extensively, both inside words and at the end of words. Some examples:
1. അവന്‍ (he)  There is a ZWJ at the end of word
2. വില്‍പന (selling) There is a ZWJ in between the word
3. കായ്‌കറികള്‍ (vegetables) There is a ZWNJ in between the word.
Some Bengali (bn_IN) words:
র‍্যান্ডম (ZWJ)
বাল্‌ব (ZWNJ)
নেক্‌স্ট (ZWNJ)
রেকোগ্‌নাইস (ZWNJ)
ভবিষ্যদ্‌দ্রষ্টা (ZWNJ)
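
The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not Firefox's actual tokenizer: the names `is_word_char` and `tokenize` and the `keep_zero_width` switch are hypothetical, and ZWS is assumed to mean ZERO WIDTH SPACE (U+200B). With the switch off, the zero-width characters act as word boundaries, which reproduces the reported fragments; with it on, the whole word would reach the spellchecker.

```python
import unicodedata

ZWSP = "\u200b"  # ZERO WIDTH SPACE (assumed meaning of "ZWS" in this report)
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
ZERO_WIDTH = {ZWSP, ZWNJ, ZWJ}


def is_word_char(ch, keep_zero_width):
    """Hypothetical character classifier for a spellcheck tokenizer."""
    if ch in ZERO_WIDTH:
        return keep_zero_width
    # Letters (L*), combining marks (M*) and digits (N*) count as word characters.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text, keep_zero_width):
    """Split text into words; anything that is not a word character is a boundary."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch, keep_zero_width):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "cat" + ZWNJ + "walc"
print(tokenize(sample, keep_zero_width=False))  # two fragments: 'cat' and 'walc'
print(tokenize(sample, keep_zero_width=True))   # one word that still contains the ZWNJ
```

Since "cat" on its own is a valid English word, only the second fragment would be underlined, which is what the report describes.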

If any of these words is not present in the dictionary and the user adds it to the dictionary, the spelling error remains, with the red underline still shown. I will explain one case:
If the word xyz<zwj> is not in the dictionary and we add it to the dictionary, xyz<zwj> is still flagged as misspelled. But if the word xyz is already in the dictionary, xyz<zwj> does not show any spelling error. This is because the word actually used for the spell check, after the wrong tokenization, is "xyz".
To verify whether this is a hunspell problem, I used the same hunspell word list in OpenOffice 2.4, and everything worked fine there. The hunspell developers told me that the problem might be in Firefox, and I found that to be true.
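
The dictionary behaviour in this case can be modelled with a toy example. It rests on two assumptions that the report implies but does not state outright: "Add to Dictionary" stores the full string including the ZWJ, while the lookup uses the truncated token. The names `token_for_check` and `is_misspelled` are hypothetical, and the tokenizer stand-in only returns the first fragment, which is enough for a word ending in ZWJ.

```python
ZERO_WIDTH = "\u200b\u200c\u200d"  # ZWSP, ZWNJ, ZWJ


def token_for_check(word):
    """Stand-in for the faulty tokenizer: the word ends at the first zero-width character."""
    for i, ch in enumerate(word):
        if ch in ZERO_WIDTH:
            return word[:i]
    return word


def is_misspelled(word, dictionary):
    """The lookup sees only the truncated token, never the full word."""
    return token_for_check(word) not in dictionary


word = "xyz\u200d"       # xyz<zwj>
dictionary = set()

dictionary.add(word)      # assumed: the full string, ZWJ included, is stored
print(is_misspelled(word, dictionary))  # still flagged: the lookup key is "xyz"

dictionary.add("xyz")     # the bare stem is present
print(is_misspelled(word, dictionary))  # no longer flagged
```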

Impact:
All languages that use ZWJ/ZWNJ/ZWS in words are affected. Spell check won't work for those words.

Fix to be done:
Correct Firefox's word tokenization for spellcheck in text fields.

Reproducible: Always

Actual Results:
Comment 1 Axel Hecht [:Pike] 2008-11-04 02:32:10 PST
In bug 455236 we have a related question: is there really a standardized spelling with those characters? At least for Telugu, it sounded like that's not the case.

I personally would consider a ZWNJ to be a spelling mistake in English words, but I don't know, CCing smontagu on that.

Nemeth, did this discussion come up elsewhere in hunspell land so far?
Comment 2 Simon Montagu :smontagu 2008-11-04 14:12:26 PST
(In reply to comment #1)

> I personally would consider a ZWNJ to be a spelling mistake in English words,
> but I don't know, CCing smontagu on that.

I don't see that that question is relevant here. AFAICT, the issue that needs to be fixed is that we should be passing the word including ZWNJ etc to hunspell (as we currently do e.g. with SHY). Hunspell can then make the call whether the specific control characters should be transparent or meaningful or Just Wrong in the particular language.
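
A rough sketch of that division of labour, for illustration only: the tokenizer hands over the word unchanged, and a per-language policy decides how the control characters are treated. The `Policy` enum and `check` function are hypothetical and are not part of hunspell's or Firefox's API.

```python
from enum import Enum

ZERO_WIDTH = "\u200b\u200c\u200d"  # ZWSP, ZWNJ, ZWJ


class Policy(Enum):
    TRANSPARENT = 1  # control characters are ignored during lookup
    MEANINGFUL = 2   # control characters are part of the spelling
    WRONG = 3        # any control character makes the word an error


def check(word, dictionary, policy):
    """Return True when the word is accepted under the given language policy."""
    has_zero_width = any(ch in ZERO_WIDTH for ch in word)
    if policy is Policy.WRONG and has_zero_width:
        return False
    if policy is Policy.TRANSPARENT:
        word = "".join(ch for ch in word if ch not in ZERO_WIDTH)
    return word in dictionary


# The tokenizer passes the word through untouched; the policy makes the call.
print(check("cat\u200cwalk", {"catwalk"}, Policy.TRANSPARENT))
print(check("cat\u200cwalk", {"catwalk"}, Policy.WRONG))
print(check("xyz\u200d", {"xyz\u200d"}, Policy.MEANINGFUL))
```

The point of the design is that none of these decisions can be made at all if the tokenizer has already split the word at the control character.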
Comment 3 Santhosh Thottingal 2011-08-06 06:48:12 PDT
Three years after filing the bug, I can still reproduce this. It affects many Indian languages, Sinhala, Arabic, etc. These languages cannot use a spellchecker extension.
Comment 4 :Ehsan Akhgari (busy, don't ask for review please) 2011-09-13 12:57:35 PDT
Created attachment 560018
Patch (v1)
Comment 5 :Ehsan Akhgari (busy, don't ask for review please) 2012-05-24 13:52:51 PDT
Simon: ping?
Comment 6 Simon Montagu :smontagu 2012-05-28 11:04:37 PDT
Comment on attachment 560018
Patch (v1)

Review of attachment 560018:
-----------------------------------------------------------------

The code looks reasonable (modulo rebasing) but the reftest doesn't seem to be testing anything, since the default en-US dictionary is in ISO-8859-1. Do we have a way to create ad-hoc dictionaries for tests?
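
One reading of this remark, sketched below: a dictionary stored in ISO-8859-1 cannot contain a word with a ZWNJ at all, so a test against it cannot exercise the case where such a word is accepted. The helper name `representable` is hypothetical.

```python
def representable(word, encoding="iso-8859-1"):
    """Return True if every character of the word can be stored in the given encoding."""
    try:
        word.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(representable("catwalk"))                  # plain ASCII letters fit
print(representable("cat\u200cwalk"))            # U+200C has no ISO-8859-1 code point
print(representable("cat\u200cwalk", "utf-8"))   # a UTF-8 dictionary could hold it
```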
Comment 7 :Ehsan Akhgari (busy, don't ask for review please) 2012-05-30 20:45:39 PDT
Hmm, not really.  Do you want me to land the patch without the test?
Comment 8 :Ehsan Akhgari (busy, don't ask for review please) 2012-06-01 07:58:22 PDT
Landed without the tests: https://hg.mozilla.org/integration/mozilla-inbound/rev/257d3a7e61bb
Comment 9 :Ehsan Akhgari (busy, don't ask for review please) 2012-06-02 11:53:45 PDT
https://hg.mozilla.org/mozilla-central/rev/257d3a7e61bb
