
The tokenization of words for spellcheck is wrong when there is a ZWJ/ZWNJ/ZWS in the word.

RESOLVED FIXED in mozilla15

Status


Core
Spelling checker
RESOLVED FIXED
9 years ago
5 years ago

People

(Reporter: Santhosh Thottingal, Assigned: Ehsan)

Tracking

5 Branch
mozilla15
All
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

9 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5) Gecko/2008032620 Firefox/3.0b5
Build Identifier: Mozilla Firefox 3 Beta 5

Bug description:
The tokenization of words in firefox for spellcheck is wrong when there is a ZWJ/ZWNJ/ZWS in the word.


How to Reproduce:
In Firefox 3 Beta 5 (the version I used; older versions are affected too), type any word with a ZWJ/ZWNJ/ZWS inside it. An English word is sufficient.
Type "cat‌walc" with a ZWNJ between t and w. The red underline appears under the word fragment walc, not under catwalc. I used an English word to simplify the explanation. Although joiner characters are uncommon in English words, they occur in many other languages. Indic languages, especially my mother tongue Malayalam (ml_IN), use them extensively, both within words and at the end of words. Some examples:
1. അവന്‍ (he)  There is a ZWJ at the end of word
2. വില്‍പന (selling) There is a ZWJ in between the word
3. കായ്‌കറികള്‍ (vegetables) There is a ZWNJ in between the word.
Some Bengali(bn_IN) words
র‍্যান্ডম (ZWJ)
বাল্‌ব (ZWNJ)
নেক্‌স্ট (ZWNJ)
রেকোগ্‌নাইস (ZWNJ)
ভবিষ্যদ্‌দ্রষ্টা (ZWNJ)

If any of these words is not in the dictionary and the user adds it, the spelling error remains, with the red underline. I will explain one case:
If the word xyz<zwj> is not in the dictionary and we add it, xyz<zwj> is still marked as misspelled. But if the word xyz is already in the dictionary, xyz<zwj> shows no spelling error. This is because, after the wrong tokenization, the word actually passed to the spell checker is "xyz".
To verify whether this is a Hunspell problem, I used the same Hunspell word list in OpenOffice 2.4, and everything worked fine there. The Hunspell developers told me the problem might be in Firefox, and I found that to be true.
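
The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Firefox's actual tokenizer: the `tokenize` and `misspelled` helpers and the `word_internal` parameter are hypothetical names, and a plain Python set stands in for the Hunspell dictionary.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def tokenize(text, word_internal=""):
    """Split text into words (hypothetical helper).

    A character continues the current word if it is alphabetic or is listed
    in word_internal; every other character ends the word.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in word_internal:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def misspelled(text, dictionary, word_internal=""):
    """Return the tokens that are not in the dictionary (a set of strings)."""
    return [w for w in tokenize(text, word_internal) if w not in dictionary]


word = "cat" + ZWNJ + "walc"

# Reported behaviour: the joiner splits the word, so only "walc" is flagged.
print(misspelled(word, {"cat"}))                      # ['walc']

# Adding "xyz<zwj>" to the dictionary does not help, because the checker
# is asked about "xyz" instead.
print(misspelled("xyz" + ZWJ, {"xyz" + ZWJ}))         # ['xyz']

# Treating the joiners as word-internal keeps the word whole.
print(tokenize(word, word_internal=ZWNJ + ZWJ) == [word])  # True
```

The sketch handles only alphabetic characters plus the listed joiners; a real tokenizer would also need to deal with combining marks, digits, and similar cases.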

Impact:
All languages that use ZWJ/ZWNJ/ZWS within words are affected. Spell check won't work for those words.

Fix to be done:
Correct Firefox's word tokenization for spellcheck in text fields.








Reproducible: Always

Actual Results:

Updated

9 years ago
Assignee: mscott → nobody

Comment 1

9 years ago
In bug 455236 we have a related question: is there really a standardized spelling with those characters? At least for Telugu, it sounded like that's not the case.

I personally would consider a ZWNJ to be a spelling mistake in English words, but I don't know, CCing smontagu on that.

Nemeth, did this discussion come up elsewhere in hunspell land so far?
(In reply to comment #1)

> I personally would consider a ZWNJ to be a spelling mistake in English words,
> but I don't know, CCing smontagu on that.

I don't see that that question is relevant here. AFAICT, the issue that needs to be fixed is that we should be passing the word including ZWNJ etc to hunspell (as we currently do e.g. with SHY). Hunspell can then make the call whether the specific control characters should be transparent or meaningful or Just Wrong in the particular language.
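
To make the three options concrete, here is a rough Python sketch of how a checker could apply a per-language policy once it receives the whole word. This is not Hunspell's API: `JoinerPolicy` and `is_correct` are hypothetical names, and a set stands in for the dictionary.

```python
from enum import Enum

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
JOINERS = {ZWNJ, ZWJ}


class JoinerPolicy(Enum):
    """Hypothetical per-language treatment of joiner characters."""
    TRANSPARENT = "transparent"  # ignore joiners when looking the word up
    MEANINGFUL = "meaningful"    # joiners are part of the spelling
    WRONG = "wrong"              # any joiner makes the word a misspelling


def is_correct(word, dictionary, policy):
    """Check a full word (joiners included) against a set-based dictionary."""
    has_joiner = any(ch in JOINERS for ch in word)
    if policy is JoinerPolicy.WRONG and has_joiner:
        return False
    if policy is JoinerPolicy.TRANSPARENT:
        word = "".join(ch for ch in word if ch not in JOINERS)
    return word in dictionary


word = "cat" + ZWNJ + "walk"
print(is_correct(word, {"catwalk"}, JoinerPolicy.TRANSPARENT))  # True
print(is_correct(word, {"catwalk"}, JoinerPolicy.MEANINGFUL))   # False
print(is_correct(word, {"catwalk"}, JoinerPolicy.WRONG))        # False
```

None of these policies can work unless the tokenizer hands over the word with its joiners intact, which is the point made above.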

Updated

8 years ago
Blocks: 257073
(Reporter)

Comment 3

6 years ago
Three years after filing this bug, I can still reproduce it. It affects many Indian languages, Sinhala, Arabic, etc. These languages cannot use a spellchecker extension.
Version: unspecified → 5 Branch
(Assignee)

Comment 4

6 years ago
Created attachment 560018 [details] [diff] [review]
Patch (v1)
Assignee: nobody → ehsan
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #560018 - Flags: review?(Olli.Pettay)

Updated

6 years ago
Attachment #560018 - Flags: review?(Olli.Pettay) → review?(smontagu)
(Assignee)

Comment 5

5 years ago
Simon: ping?
Comment on attachment 560018 [details] [diff] [review]
Patch (v1)

Review of attachment 560018 [details] [diff] [review]:
-----------------------------------------------------------------

The code looks reasonable (modulo rebasing) but the reftest doesn't seem to be testing anything, since the default en-US dictionary is in ISO-8859-1. Do we have a way to create ad-hoc dictionaries for tests?
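
As a quick illustration of the encoding concern (a sketch, not part of the patch or the reftest): ZWNJ has no representation in ISO-8859-1, so a dictionary stored in that encoding cannot contain an entry with a joiner. The `representable` helper below is a hypothetical name.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER


def representable(text, encoding):
    """Return True if text can be encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


word = "cat" + ZWNJ + "walk"
print(representable(word, "iso-8859-1"))  # False
print(representable(word, "utf-8"))       # True
```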
Attachment #560018 - Flags: review?(smontagu) → review+
(Assignee)

Comment 7

5 years ago
Hmm, not really.  Do you want me to land the patch without the test?
(Assignee)

Comment 8

5 years ago
Landed without the tests: https://hg.mozilla.org/integration/mozilla-inbound/rev/257d3a7e61bb
Target Milestone: --- → mozilla15
(Assignee)

Comment 9

5 years ago
https://hg.mozilla.org/mozilla-central/rev/257d3a7e61bb
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED