Closed
Bug 257073
Opened 20 years ago
Closed 11 years ago
Spelling checker marks words with a special character in Catalan as misspelled
Categories
(Core :: Spelling checker, defect)
Tracking
()
VERIFIED
FIXED
mozilla28
People
(Reporter: toniher, Assigned: jmontane)
References
(Depends on 1 open bug)
Details
Attachments
(2 files)
237 bytes, patch
545 bytes, patch (ehsan.akhgari: review+)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3
Some Catalan words that have a double l in Latin (and in English as well), for
instance 'gorilla', are written as goril·la, that is, l, dot, l.
This dot is the middle dot, U+00B7 in the Unicode character map, not a punctuation period.
I think the problem may be that, in this particular case, three characters make up a
single letter ('l·l').
Correcting misspelled words (for instance, from gorilla to goril·la) works without
any problem.
http://www.softcatala.org/projectes/mozilla/errades/gorilla2.png
Current OpenOffice.org versions have no problem with this spelling, so I
suppose this issue will be solved in the future, once the spelling checker
engine is updated.
I just wanted to report it.
Regards
Reproducible: Always
Comment 1•20 years ago (Reporter)
This patch for gencattable.pl makes the middle dot be regarded as a letter,
so the spellchecker would no longer split valid Catalan words.
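To illustrate the idea in comment 1, here is a minimal Python sketch. It is not the gencattable.pl patch itself. It assumes a naive tokenizer that ends a word at any character Python does not classify as alphabetic; `split_words` and its `extra_letters` parameter are hypothetical names used only for this example.

```python
MIDDLE_DOT = "\u00b7"  # '·', the Catalan punt volat


def split_words(text, extra_letters=frozenset()):
    """Split text into runs of letters; any other character ends a word.

    extra_letters is a hypothetical hook for characters that should be
    treated as letters even though str.isalpha() rejects them.
    """
    words, current = [], []
    for ch in text:
        if ch.isalpha() or ch in extra_letters:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# Default behaviour: the middle dot is punctuation, so the word is split.
print(split_words("un goril·la"))                # ['un', 'goril', 'la']
# With the middle dot treated as a letter, the word stays whole.
print(split_words("un goril·la", {MIDDLE_DOT}))  # ['un', 'goril·la']
```

Treating the dot as a letter unconditionally is the simplest fix, but it also glues a stray "·" onto neighbouring words.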
Comment 2•20 years ago (Reporter)
This problem seems to be solved in Thunderbird 1.1 alpha builds.
If I can track down the fix, and if it can benefit SeaMonkey and other programs,
I will mark the bug as fixed.
Comment 3•19 years ago (Reporter)
I was checking the current TB 1.5b2 and it seems to have regressed: the error
appears again! Could someone help me find out what has happened?
Comment 4•16 years ago
This bug still seems to exist.
Marking dependent on Bug 434044. It appears to be a very similar Unicode tokenization bug.
Depends on: 434044
Comment 5•16 years ago
I'm not sure whether comments are included when searching for bugs, but if they are, this comment will help people find this bug:
In English this dot is called interpunct, interpoint, space dot, middle dot or centered dot.
In Catalan it is called punt volat, and it is indeed a very common feature of the language.
It is also used in some orthographies of Occitan, for example in standard Aranese, where it is called punt interior.
Finally, it is used in Franco-Provençal, but unfortunately I don't know what it is called there.
Comment 6•15 years ago (Reporter)
It seems to affect other languages, such as Sardinian. A long time ago, a collaborator from Softcatalà created some patches; I will try to recover them and see how this changes in Hunspell.
Assignee: nobody → toniher
Comment 7•15 years ago
Hi,
This seems to be the same as Bug 355178. I created a patch for that, but I'm waiting for a review.
With that patch, the word characters are defined in the Hunspell affix file (as in OO.org), so it should also fix this bug.
Comment 8•14 years ago (Assignee)
Hi,
this bug makes Catalan spell checking really annoying, because some common words are split wrongly.
AFAIK, bug 355178 in the end only fixed the hyphen (-) character problem, so it doesn't solve this Catalan bug, which is related to the Unicode U+00B7 character (·).
I think there are two easy solutions:
1st: the patch in attachment 165521
2nd: create a patch similar to bug 35178, checking the 00B7 character when it has l or L characters to its left and right.
Comment 9•14 years ago (Assignee)
Oops, a typo.
2nd: create a patch similar to bug 355178, checking the 00B7 character when it has l or L characters to its left and right.
Comment 10•14 years ago
Simon, do you know what a good solution would be here?
Comment 11•14 years ago
I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files now? If so, adding "WORDCHARS ·" to the dictionary seems like the right solution.
Comment 12•14 years ago
(In reply to comment #11)
> I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files
> now? If so, adding "WORDCHARS ·" to the dictionary seems like the right
> solution.
Yeah, that _should_ work, I think.
Comment 13•14 years ago (Reporter)
Hello, I've just tested it (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi), but without success.
In the .aff file it was:
---
SET ISO8859-1
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç·-'
---
and I changed it to:
---
SET ISO8859-1
WORDCHARS ·-'
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç
---
Comment 14•14 years ago (Reporter)
(In reply to comment #13)
> Hello, I've just tested it
> (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success.
>
Side note: I tested it in 5 beta and the latest central nightly.
Comment 15•14 years ago
What is the behavior desired here? Which characters do you want to be counted as part of a word in Catalan?
Comment 16•14 years ago
The special character is '·'. See http://en.wikipedia.org/wiki/Interpunct .
As Toni wrote in the opening description, "goril·la" is one correct Catalan word and it is supposed to be recognized as such. Here are a few more words: "instal·lació", "pel·lícula", "paral·lel".
Comment 17•14 years ago
Does somebody want to run the code under the debugger to see what's going wrong? In particular, we want to see what WordSplitState::ClassifyCharacter returns for the interpunct character.
Comment 18•14 years ago
Comment 19•12 years ago
Not to forget also the Unicode characters ŀ (U+0140, LATIN SMALL LETTER L WITH MIDDLE DOT) and Ŀ (U+013F, LATIN CAPITAL LETTER L WITH MIDDLE DOT).
E.g., the following two tokens should be recognised as valid Catalan words:
1. goril·la (U+006C, U+00B7, U+006C)
2. goriŀla (U+0140, U+006C)
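The two encodings can be compared with a short Python sketch. It only uses the standard `unicodedata` module; `code_points` is a hypothetical helper name. The last step relies on U+0140 having a compatibility decomposition to U+006C U+00B7. Nothing in this thread says that Mozilla's spellchecker normalizes input this way, so treat that as one possible approach rather than the actual behaviour.

```python
import unicodedata


def code_points(word):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in word]


three_char = "goril\u00b7la"  # l, middle dot, l
two_char = "gori\u0140la"     # l-with-middle-dot, l

print(code_points(three_char)[4:7])  # ['U+006C', 'U+00B7', 'U+006C']
print(code_points(two_char)[4:6])    # ['U+0140', 'U+006C']

# The two strings are different code point sequences...
print(three_char == two_char)        # False
# ...but compatibility normalization (NFKC) maps the precomposed
# letter onto the l + middle dot sequence.
print(unicodedata.normalize("NFKC", two_char) == three_char)  # True
```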
NB: In KDE's Spanish/Catalan keyboard layout, L with middle dot is accessible via AltGr+l / AltGr+L, which makes for faster typing and a more visually appealing presentation.
Comment 20•12 years ago (Reporter)
Hi,
we will not include the second option in spellcheck dictionaries, since 'ŀ' is discouraged (http://unicode.org/charts/PDF/U0100.pdf).
It has spread to a few keyboard layouts (little used compared to the common ones).
For Linux, it's only a matter of time before it is removed as well.
Comment 21•12 years ago (Reporter)
Hi,
I was considering following Ehsan's suggestion when Joan Montané passed me this link, http://unicode.org/cldr/utility/breaks.jsp, where the segmentation of words such as 'goril·la' seems to be correct.
As far as I've read, the ICU library has recently been integrated into the Mozilla code (bug 724531 & bug 724533). So I don't know whether this (and other similar problems) should be approached differently in the near future.
Comment 22•12 years ago (Assignee)
Just for info,
http://unicode.org/cldr/utility/breaks.jsp uses the Unicode Text Segmentation algorithm, as specified in UAX #29 http://www.unicode.org/reports/tr29/
Comment 23•11 years ago (Assignee)
This patch for mozInlineSpellWordUtil.cpp makes the middle dot (U+00B7) be regarded as a letter only when it has letters on both sides, so the spellchecker would not split valid words.
This behavior is consistent with the Unicode Text Segmentation standard (UAX #29) for the "·" (U+00B7) character and word boundary detection [1]
[1] http://www.unicode.org/reports/tr29/#Word_Boundaries
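The rule described in comment 23 can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in mozInlineSpellWordUtil.cpp and not a full UAX #29 implementation: `is_word_char` and `split_words_flanked` are hypothetical names, and "letter" here simply means whatever `str.isalpha()` accepts.

```python
MIDDLE_DOT = "\u00b7"  # '·'


def is_word_char(text, i):
    """Letters always count; a middle dot counts only between two letters."""
    ch = text[i]
    if ch.isalpha():
        return True
    if ch == MIDDLE_DOT:
        return (0 < i < len(text) - 1
                and text[i - 1].isalpha()
                and text[i + 1].isalpha())
    return False


def split_words_flanked(text):
    """Split text into words using the flanked-middle-dot rule above."""
    words, start = [], None
    for i in range(len(text)):
        if is_word_char(text, i):
            if start is None:
                start = i          # a new word begins here
        elif start is not None:
            words.append(text[start:i])
            start = None
    if start is not None:
        words.append(text[start:])  # flush the final word
    return words


print(split_words_flanked("instal·lació i pel·lícula"))
# ['instal·lació', 'i', 'pel·lícula']
print(split_words_flanked("a · b"))  # ['a', 'b']
```

Compared with treating "·" as a letter everywhere, the flanking check keeps a stray or trailing dot out of the word while leaving words such as "paral·lel" intact.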
Comment 24•11 years ago (Reporter)
Comment on attachment 830770
Proposed patch for mozInlineSpellWordUtil.cpp
review patch
Attachment #830770 - Flags: review?(ehsan)
Comment 25•11 years ago
Comment on attachment 830770
Proposed patch for mozInlineSpellWordUtil.cpp
Review of attachment 830770:
-----------------------------------------------------------------
Nice! Thanks a lot, Joan!
Attachment #830770 - Flags: review?(ehsan) → review+
Comment 26•11 years ago
Assignee: toniher → jmontane
Comment 27•11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Updated•11 years ago (Assignee)
Status: RESOLVED → VERIFIED