Closed Bug 257073 Opened 16 years ago Closed 6 years ago
Spelling checker marks Catalan words containing a special character as misspelled
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3

Some Catalan words that have a double l in Latin (and in English as well), for instance 'gorilla', are written as goril·la, that is, l - dot - l. This dot is the middle dot, U+00B7 in Unicode, not a punctuation full stop. I think the problem may be that in this particular case three characters make up a letter ('l·l'). There is no problem correcting misspelled words (from gorilla to goril·la), for instance.

http://www.softcatala.org/projectes/mozilla/errades/gorilla2.png

Current OpenOffice.org versions have no problems with this spelling, so I suppose this issue will be solved in the future once the spelling checker engine is updated. I just wanted to report it. Regards

Reproducible: Always
This patch for gencattable.pl makes the middle dot be treated as a letter, so the spellchecker would no longer split valid Catalan words.
This problem seems to be solved in Thunderbird 1.1 alpha builds. If I can track down the fix and it can benefit Seamonkey and other programs, I will mark the bug as fixed.
I was checking the current TB 1.5b2 and it seems to have regressed: the error appears again! Could someone help me find out what has happened?
This bug still seems to exist. Marking dependent on Bug 434044. It appears to be a very similar Unicode tokenization bug.
Depends on: 434044
I'm not sure that comments are included when searching for bugs, but if they are, then this comment will help people find this bug: In English this dot is called interpunct, interpoint, space dot, middle dot or centered dot. In Catalan it is called punt volat and it is indeed a very common feature of the language. It is also used in some orthographies of Occitan, for example in standard Aranese, where it is called punt interior. Finally, it is used in Franco-Provençal, but unfortunately I don't know what it is called there.
It seems to affect other languages, such as Sardinian. A long time ago, a collaborator from Softcatalà created some patches. I will try to recover them and see how this changes in Hunspell.
Assignee: nobody → toniher
Hi, this seems to be the same as Bug 355178. I created a patch for that, but I'm waiting for a review. With that patch the word chars are defined in the Hunspell affix file (as in OO.org), so it should fix this bug as well.
Hi, this bug makes Catalan spell checking really annoying, because some common words are split wrongly. AFAIK, bug 355178 in the end only fixes the hyphen (-) char problem, so it doesn't solve this Catalan bug related to the Unicode 00B7 char (·). I think there are two easy solutions:
1st: attachment 165521 [details] [diff] [review] patch
2nd: create a patch similar to bug 35178, checking 00B7 char when it has l or L chars at left and right.
Oops, a typo. 2nd: create a patch similar to bug 355178, checking 00B7 char when it has l or L chars at left and right.
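The second option above (accept U+00B7 only between l or L) could look roughly like the following. This is a minimal Python sketch for illustration only, not the patch from either bug; the function name `is_ela_geminada_dot` is hypothetical.

```python
MIDDLE_DOT = "\u00b7"  # '·', the Catalan punt volat


def is_ela_geminada_dot(text: str, i: int) -> bool:
    """Return True if text[i] is a middle dot directly between two l/L characters.

    Hypothetical helper: a word splitter could call this before deciding
    to break a word at position i.
    """
    if text[i] != MIDDLE_DOT:
        return False
    # A dot at the very start or end has no neighbour on one side.
    if i == 0 or i == len(text) - 1:
        return False
    return text[i - 1] in "lL" and text[i + 1] in "lL"
```

A rule this narrow only covers the Catalan l·l case; it would not cover other languages mentioned in this bug that use the middle dot between different letters.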
Simon, do you know what a good solution would be here?
I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files now? If so, adding "WORDCHARS ·" to the dictionary seems like the right solution.
(In reply to comment #11)
> I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files
> now? If so, adding "WORDCHARS ·" to the dictionary seems like the right
> solution.
Yeah, that _should_ work, I think.
Hello, I've just tested it (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success. In .aff it was:
---
SET ISO8859-1
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç·-'
---
and I changed it to:
---
SET ISO8859-1
WORDCHARS ·-'
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç
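To double-check what the modified .aff declares, a small Python sketch can pull out the WORDCHARS directive and confirm that every character fits the declared ISO8859-1 encoding. The helper name `read_wordchars` is hypothetical and this is not Hunspell's own parser; it only illustrates the directive, and says nothing about why the change had no effect in Firefox.

```python
# The header lines of the modified .aff file quoted above.
AFF_TEXT = (
    "SET ISO8859-1\n"
    "WORDCHARS \u00b7-'\n"
    "TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç\n"
)


def read_wordchars(aff_text: str) -> str:
    """Return the characters listed after WORDCHARS, or '' if the directive is absent."""
    for line in aff_text.splitlines():
        parts = line.split(None, 1)  # directive name, then the rest of the line
        if len(parts) == 2 and parts[0] == "WORDCHARS":
            return parts[1].strip()
    return ""


wordchars = read_wordchars(AFF_TEXT)
# Raises UnicodeEncodeError if a character cannot be stored in the declared encoding.
encoded = wordchars.encode("iso-8859-1")
```

Here the middle dot encodes to the single byte 0xB7, so the declared encoding is not an obstacle for this directive.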
(In reply to comment #13)
> Hello, I've just tested it
> (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success.
>
Side note, I tested it in 5 beta and the latest central nightly.
What is the behavior desired here? Which characters do you want to be counted as part of a word in Catalan?
The special character is '·'. See http://en.wikipedia.org/wiki/Interpunct . As Toni wrote in the opening description, "goril·la" is one correct Catalan word and it is supposed to be recognized as such. Here are a few more words: "instal·lació", "pel·lícula", "paral·lel".
Does somebody want to run the code under the debugger to see what's going wrong? In particular, we want to see what WordSplitState::ClassifyCharacter returns for the interpunct character.
Not to forget also the Unicode characters ŀ (U+0140, LATIN SMALL LETTER L WITH MIDDLE DOT) and Ŀ (U+013F, LATIN CAPITAL LETTER L WITH MIDDLE DOT). E.g., the following two tokens should be recognised as valid Catalan words: 1. goril·la (U+006C, U+00B7, U+006C) 2. goriŀla (U+0140, U+006C) NB: In KDE's Spanish/Catalan keyboard layout, L with middle dot is accessible via AltGr+l / AltGr+L, which makes for faster typing and a more visually appealing presentation.
Hi, we will not include the second option in spellcheck dictionaries since 'ŀ' is discouraged (http://unicode.org/charts/PDF/U0100.pdf). It spread through a few keyboard layouts (little used compared to the common ones). For Linux, it's only a matter of time before it is removed as well.
Hi, I was considering following Ehsan's suggestion when Joan Montané passed me this link, http://unicode.org/cldr/utility/breaks.jsp, where the segmentation of words such as 'goril·la' seems to be correct. As far as I've read, the ICU library has recently been integrated into Mozilla code (bug 724531 & bug 724533). So I don't know whether this (and other similar problems) should be approached differently in the near future.
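For reference, the two spellings discussed above are related by Unicode compatibility normalization: U+0140 has a compatibility decomposition to U+006C U+00B7, so NFKC turns the precomposed form into the three-character one. The Python sketch below only illustrates that relationship using the standard `unicodedata` module; nothing in this bug says the spellchecker applies it, and the `codepoints` helper is hypothetical.

```python
import unicodedata


def codepoints(s: str) -> list:
    """List the code points of s in U+XXXX notation (helper for inspection)."""
    return [f"U+{ord(c):04X}" for c in s]


three_char = "goril\u00b7la"   # l, middle dot, l
precomposed = "gori\u0140la"   # l with middle dot, then l

# NFKC applies compatibility decompositions, then recomposes canonically.
normalized = unicodedata.normalize("NFKC", precomposed)
```

After normalization both strings compare equal, so a dictionary that stores only the goril·la form could still match input typed with the precomposed character, if the caller chose to normalize first.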
Just for info, http://unicode.org/cldr/utility/breaks.jsp uses Unicode Text Segmentation algorithm, as specified in UAX #29 http://www.unicode.org/reports/tr29/
This patch for mozInlineSpellWordUtil.cpp makes the middle dot (U+00B7) be regarded as a letter only when it has a letter on each side, so the spellchecker no longer splits valid words. This behavior is consistent with the Unicode Text Segmentation standard (UAX #29) for the "·" (U+00B7) character and word boundary detection: http://www.unicode.org/reports/tr29/#Word_Boundaries
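The rule described in this comment can be sketched as follows. This is a simplified Python illustration, not the attached C++ patch: the name `split_words` is hypothetical, `str.isalpha()` stands in for the real character classifier, and it implements only the "letter on both sides" condition rather than the full UAX #29 algorithm.

```python
MIDDLE_DOT = "\u00b7"  # '·'


def split_words(text: str) -> list:
    """Split text into words, keeping a middle dot only when flanked by letters."""
    words, current = [], []
    for i, ch in enumerate(text):
        # The dot joins two word halves only if both neighbours are letters.
        joins = (
            ch == MIDDLE_DOT
            and 0 < i < len(text) - 1
            and text[i - 1].isalpha()
            and text[i + 1].isalpha()
        )
        if ch.isalpha() or joins:
            current.append(ch)
        elif current:
            # Any other character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With this rule "goril·la" stays a single token, while a dot at a word edge or next to a space still acts as a separator.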
Comment on attachment 830770 [details] [diff] [review] Proposed patch for mozInlineSpellWordUtil.cpp review patch
Attachment #830770 - Flags: review?(ehsan)
Comment on attachment 830770 [details] [diff] [review] Proposed patch for mozInlineSpellWordUtil.cpp
Review of attachment 830770 [details] [diff] [review]:
-----------------------------------------------------------------
Nice! Thanks a lot, Joan!
Attachment #830770 - Flags: review?(ehsan) → review+
Assignee: toniher → jmontane
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28