Closed Bug 257073 Opened 16 years ago Closed 6 years ago

Spelling checker marks words containing a special character as misspelled in Catalan

Categories

(Core :: Spelling checker, defect)

x86
Linux
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla28

People

(Reporter: toniher, Assigned: jmontane)

References

(Depends on 1 open bug, )

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3

Some Catalan words that have a double l in Latin (and in English as well), for
instance 'gorilla', are written as goril·la, that is, l - dot - l.
This dot is U+00B7 (MIDDLE DOT) in the Unicode character map, not the ordinary full stop.
I think the problem may be that, in this particular case, 3 characters make up a
single letter ('l·l').
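
To illustrate the suspected cause, here is a minimal Python sketch (not Mozilla's code; the helper name naive_words is made up for this example). It assumes a splitter that treats every non-letter as a word separator. Since U+00B7 has the Unicode general category Po (other punctuation), such a splitter cuts the word in two:

```python
import unicodedata


def naive_words(text):
    """Split text into runs of letters; any non-letter ends the current word."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            current.append(ch)
        else:
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


word = "goril\u00b7la"  # l, MIDDLE DOT (U+00B7), l
print([f"U+{ord(c):04X}" for c in word])
print(unicodedata.category("\u00b7"))  # Po
print(naive_words(word))  # ['goril', 'la']
```

Each fragment ('goril', 'la') would then be looked up in the dictionary on its own, which would explain why a correct word gets flagged.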

There is no problem when correcting misspelled words (from gorilla to goril·la),
for instance.
http://www.softcatala.org/projectes/mozilla/errades/gorilla2.png

Current OpenOffice.org versions have no problems with this spelling, so I
suppose this issue will be solved in the future, once the spelling checker
engine is updated.
I just wanted to report it.

Regards

Reproducible: Always
This patch for gencattable.pl would make the middle dot be regarded as a letter.
The spellchecker would then not split valid Catalan words.
This problem seems to be solved in Thunderbird 1.1 alpha builds.
If I can track down the solution, and if it can benefit Seamonkey and other programs,
I will mark the bug as fixed.
I checked the current TB 1.5b2 and it seems to have regressed: the error
appears again! Could someone help me find out what happened?
This bug still seems to exist. 

Marking dependent on Bug 434044. It appears to be a very similar Unicode tokenization bug.
Depends on: 434044
I'm not sure whether comments are included when searching for bugs, but if they are, then this comment will help people find this bug:

In English this dot is called interpunct, interpoint, space dot, middle dot or centered dot.

In Catalan it is called punt volat and it is indeed a very common feature of this language.

It is also used in some orthographies of Occitan, for example in standard Aranese, where it is called punt interior.

Finally, it is used in Franco-Provençal, but unfortunately I don't know what it is called there.
It seems to affect other languages, such as Sardinian. A long time ago, a collaborator from Softcatalà created some patches. I will try to recover them and see how this behaves in Hunspell.
Assignee: nobody → toniher
Hi,
this seems to be the same as Bug 355178. I created a patch for that, but I'm waiting for a review.
With this patch the word characters are defined in the Hunspell affix file (as in OO.org), so that patch should also fix this bug.
Depends on: 355178
Hi,

this bug makes Catalan spell checking really annoying, because some common words are split incorrectly.

AFAIK, bug 355178 in the end only fixes the hyphen (-) character problem. So it doesn't solve this Catalan bug, which involves the Unicode character U+00B7 (·).

I think there are two easy solutions:

1st: the patch in attachment 165521
2nd: create a patch similar to bug 35178, checking 00B7 char when it has l or L chars at left and right.
Oops, a typo. Corrected:

2nd: create a patch similar to bug 355178, checking 00B7 char when it has l or L chars at left and right.
Simon, do you know what a good solution would be here?
I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files now? If so, adding "WORDCHARS ·" to the dictionary seems like the right solution.
(In reply to comment #11)
> I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files
> now? If so, adding "WORDCHARS ·" to the dictionary seems like the right
> solution.

Yeah, that _should_ work, I think.
Hello, I've just tested it (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi), but without success.

The .aff file used to contain:
---
SET ISO8859-1

TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç·-'
---
and I changed it to:
---
SET ISO8859-1

WORDCHARS ·-'

TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç
---
(In reply to comment #13)
> Hello, I've just tested it
> (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success.
> 

Side note: I tested it in the 5 beta and in the latest central nightly.
What is the behavior desired here?  Which characters do you want to be counted as part of a word in Catalan?
The special character is '·'. See http://en.wikipedia.org/wiki/Interpunct .

As Toni wrote in the opening description, "goril·la" is one correct Catalan word and it is supposed to be recognized as such. Here are a few more words: "instal·lació", "pel·lícula", "paral·lel".
Does somebody want to run the code under the debugger to see what's going wrong?  In particular, we want to see what WordSplitState::ClassifyCharacter returns for the interpunct character.
Depends on: 632977
Not to forget also the Unicode characters ŀ (U+0140, LATIN SMALL LETTER L WITH MIDDLE DOT) and Ŀ (U+013F, LATIN CAPITAL LETTER L WITH MIDDLE DOT).

E.g., the following two tokens should be recognised as valid Catalan words:

1. goril·la (U+006C, U+00B7, U+006C)
2. goriŀla (U+0140, U+006C)
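
As a side illustration, a short Python sketch (standard library only, not part of any patch in this bug) showing how the two spellings relate. U+0140 carries a compatibility decomposition to U+006C U+00B7, so NFKC normalization maps the second form onto the first. Whether a spellchecker should normalize its input this way is an assumption for discussion, not something decided here:

```python
import unicodedata

two_chars = "goril\u00b7la"  # l (U+006C) + MIDDLE DOT (U+00B7) + l (U+006C)
one_char = "gori\u0140la"    # L WITH MIDDLE DOT (U+0140) + l (U+006C)

# The raw strings differ, but NFKC folds U+0140 into U+006C U+00B7.
normalized = unicodedata.normalize("NFKC", one_char)

print(one_char == two_chars)    # False
print(normalized == two_chars)  # True
```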

NB: In KDE's Spanish/Catalan keyboard layout, L with middle dot is accessible via AltGr+l / AltGr+L, which makes for faster typing and a more visually appealing presentation.
Hi,

we will not include the second option in spellcheck dictionaries, since 'ŀ' is discouraged (http://unicode.org/charts/PDF/U0100.pdf).
It has spread to only a few keyboard layouts, which are little used compared to the common ones.
For Linux, it is only a matter of time before it is removed there as well.
Hi,

I was considering looking into the direction Ehsan suggested when Joan Montané sent me this link, http://unicode.org/cldr/utility/breaks.jsp, where the segmentation of words such as 'goril·la' seems to be correct.

As far as I've read, the ICU library has recently been integrated into the Mozilla code (bug 724531 and bug 724533). So I don't know whether this (and other similar problems) should be approached differently in the near future.
Just for info,

http://unicode.org/cldr/utility/breaks.jsp uses the Unicode Text Segmentation algorithm, as specified in UAX #29: http://www.unicode.org/reports/tr29/
This patch for mozInlineSpellWordUtil.cpp makes the middle dot (U+00B7) be regarded as a letter only when it has a letter on each side. The spellchecker then does not split valid words.

This behavior is consistent with the Unicode Text Segmentation standard (UAX #29) for the "·" (U+00B7) character and word boundary detection [1]

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries
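
The rule described above can be sketched in Python as follows. This is only an illustration of the idea, not the actual C++ change in mozInlineSpellWordUtil.cpp; the names split_words and is_letter are invented for the example, and "letter" is assumed to mean any character whose Unicode general category starts with L:

```python
import unicodedata

MIDDLE_DOT = "\u00b7"


def is_letter(ch):
    """True for characters in any Unicode letter category (Lu, Ll, Lo, ...)."""
    return unicodedata.category(ch).startswith("L")


def split_words(text):
    """Split into words, keeping U+00B7 only when a letter sits on each side."""
    words, current = [], []
    for i, ch in enumerate(text):
        joins = (
            ch == MIDDLE_DOT
            and 0 < i < len(text) - 1
            and is_letter(text[i - 1])
            and is_letter(text[i + 1])
        )
        if is_letter(ch) or joins:
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(split_words("la pel\u00b7l\u00edcula paral\u00b7lel"))
print(split_words("a \u00b7 b"))
```

A middle dot at the start or end of a token, or surrounded by spaces, still acts as a separator, so only word-internal dots are kept.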
Comment on attachment 830770
Proposed patch for mozInlineSpellWordUtil.cpp

review patch
Attachment #830770 - Flags: review?(ehsan)
Comment on attachment 830770
Proposed patch for mozInlineSpellWordUtil.cpp

Review of attachment 830770:
-----------------------------------------------------------------

Nice!  Thanks a lot, Joan!
Attachment #830770 - Flags: review?(ehsan) → review+
https://hg.mozilla.org/mozilla-central/rev/8372a7defa69
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla28
Status: RESOLVED → VERIFIED