Closed Bug 257073 Opened 20 years ago Closed 11 years ago

Spelling checker marks Catalan words containing a special character as misspelled

Status:

VERIFIED FIXED

Milestone:

mozilla28

People

(Reporter: toniher, Assigned: jmontane)

References

(Depends on 1 open bug, URL)

Attachments

(2 files)

Suggested patch for gencattable.pl, 20 years ago, Toni Hermoso Pulido, 237 bytes, patch
Proposed patch for mozInlineSpellWordUtil.cpp, 11 years ago, Joan Montané, 545 bytes, patch, ehsan.akhgari: review+

Toni Hermoso Pulido

Reporter

Description

20 years ago

User-Agent: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; ca-CA; rv:1.7) Gecko/20040803 Firefox/0.9.3

Some Catalan words that have a double l in Latin (and in English as well), for instance 'gorilla', are written as goril·la, that is, l - dot - l. This dot is the middle dot, U+00B7 in the Unicode character map, not the punctuation dot. I think the problem may be that, in this particular case, three characters ('l·l') make up one letter.

There is no problem correcting a misspelled word (from gorilla to goril·la, for instance).

http://www.softcatala.org/projectes/mozilla/errades/gorilla2.png

Current OpenOffice.org versions have no problems with this spelling, so I suppose this issue will be solved in the future once the spelling checker engine is updated. I just wanted to report it.

Regards

Reproducible: Always
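To illustrate the report: a tokenizer that accepts only letters and digits as word characters will cut 'goril·la' at the dot. The Python sketch below is an illustration only, not Mozilla's code. It assumes the dot is the middle dot U+00B7 and uses only the standard `re` and `unicodedata` modules; the variable names are made up for the example.

```python
import re
import unicodedata

MIDDLE_DOT = "\u00b7"
word = "goril" + MIDDLE_DOT + "la"

# Unicode classifies the middle dot as punctuation (category "Po"),
# not as a letter.
dot_category = unicodedata.category(MIDDLE_DOT)

# A naive tokenizer that keeps only runs of \w characters therefore
# splits the word at the dot.
naive_tokens = re.findall(r"\w+", word)

print(unicodedata.name(MIDDLE_DOT), dot_category)
print(naive_tokens)
```

Each fragment would then be checked on its own, which would explain why the complete word is underlined even though the dictionary contains it.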

Toni Hermoso Pulido

Reporter

Comment 1

20 years ago

Attached patch: Suggested patch for gencattable.pl

This patch for gencattable.pl makes the middle dot be treated as a letter, so that the spellchecker no longer splits valid Catalan words.

Toni Hermoso Pulido

Reporter

Comment 2

20 years ago

This problem seems to be solved in Thunderbird 1.1 alpha builds. If I can track down the fix, and if it can benefit SeaMonkey and other programs, I will mark the bug as fixed.

Toni Hermoso Pulido

Reporter

Comment 3

19 years ago

I was checking the current TB 1.5b2 and it seems to have regressed: the error appears again! Could someone help me find out what happened?

Matt Caywood

Comment 4

16 years ago

This bug still seems to exist. Marking dependent on Bug 434044. It appears to be a very similar Unicode tokenization bug.

Depends on: 434044

Amir Aharoni

Comment 5

16 years ago

I'm not sure whether comments are included when searching for bugs, but if they are, this comment will help people find this bug: in English this dot is called interpunct, interpoint, space dot, middle dot or centered dot. In Catalan it is called punt volat, and it is indeed a very common feature of the language. It is also used in some orthographies of Occitan, for example in standard Aranese, where it is called punt interior. Finally, it is used in Franco-Provençal, but unfortunately I don't know what it is called there.

Toni Hermoso Pulido

Reporter

Comment 6

15 years ago

It seems to affect other languages as well, such as Sardinian. A long time ago, a collaborator from Softcatalà created some patches. I will try to recover them and see how this changes with Hunspell.

Assignee: nobody → toniher

Massimeddu

Comment 7

15 years ago

Hi, this seems to be the same as Bug 355178. I created a patch for that bug, but I'm still waiting for a review. With that patch the word characters are defined in the Hunspell affix file (as in OO.org), so it should fix this bug too.

Toni Hermoso Pulido

Reporter

Updated

14 years ago

Depends on: 355178

Joan Montané

Assignee

Comment 8

14 years ago

Hi, this bug makes Catalan spell checking really annoying, because some common words are split wrongly. AFAIK, bug 355178 in the end only fixes the hyphen (-) character problem, so it doesn't solve this Catalan bug, which concerns the Unicode U+00B7 character (·).

I think there are two easy solutions:
1st: the attachment 165521 patch.
2nd: create a patch similar to bug 35178, checking the 00B7 char when it has l or L chars to its left and right.

Joan Montané

Assignee

Comment 9

14 years ago

Oops, a typo. 2nd: create a patch similar to bug 355178, checking the 00B7 char when it has l or L chars to its left and right.

(no longer active)

Comment 10

14 years ago

Simon, do you know what a good solution would be here?

Simon Montagu :smontagu

Comment 11

14 years ago

I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files now? If so, adding "WORDCHARS ·" to the dictionary seems like the right solution.

(no longer active)

Comment 12

14 years ago

(In reply to comment #11)
> I got lost rereading bug 355178. Do we support WORDCHARS in dictionary files
> now? If so, adding "WORDCHARS ·" to the dictionary seems like the right
> solution.

Yeah, that _should_ work, I think.

Toni Hermoso Pulido

Reporter

Comment 13

14 years ago

Hello, I've just tested it (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success.

In .aff it was:
---
SET ISO8859-1
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç·-'
---
and I changed it to:
---
SET ISO8859-1
WORDCHARS ·-'
TRY easirtocnlumdpgvfbqjwxyzhàèéíïòóúüç
---

Toni Hermoso Pulido

Reporter

Comment 14

14 years ago

(In reply to comment #13)
> Hello, I've just tested it
> (http://gent.softcatala.org/toniher/tmp/addon-ca-dict.xpi) but no success.

Side note: I tested it in 5 beta and the latest central nightly.

(no longer active)

Comment 15

14 years ago

What is the desired behavior here? Which characters do you want to be counted as part of a word in Catalan?

Amir Aharoni

Comment 16

14 years ago

The special character is '·'. See http://en.wikipedia.org/wiki/Interpunct . As Toni wrote in the opening description, "goril·la" is one correct Catalan word and it is supposed to be recognized as such. Here are a few more words: "instal·lació", "pel·lícula", "paral·lel".

(no longer active)

Comment 17

14 years ago

Does somebody want to run the code under the debugger to see what's going wrong? In particular, we want to see what WordSplitState::ClassifyCharacter returns for the interpunct character.

(no longer active)

Comment 18

14 years ago

The function is here: <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp#860>

Toni Hermoso Pulido

Reporter

Updated

13 years ago

Depends on: 632977

voraone

Comment 19

12 years ago

Let's not forget the Unicode characters ŀ (U+0140, LATIN SMALL LETTER L WITH MIDDLE DOT) and Ŀ (U+013F, LATIN CAPITAL LETTER L WITH MIDDLE DOT). E.g., the following two tokens should be recognised as valid Catalan words:
1. goril·la (U+006C, U+00B7, U+006C)
2. goriŀla (U+0140, U+006C)

NB: In KDE's Spanish/Catalan keyboard layout, L with middle dot is accessible via AltGr+l / AltGr+L, which makes for faster typing and a more visually appealing presentation.
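For reference, the two spellings can be compared code point by code point. The Python sketch below is only an illustration using the standard `unicodedata` module; the helper `code_points` and the variable names are made up for the example. It also shows that Unicode compatibility normalization (NFKC) maps the ŀ form onto the l + middle dot form, which is one possible way to treat both as the same word. Nothing in this bug's patches does that; it is mentioned only as an assumption-level aside.

```python
import unicodedata

spelling_with_middle_dot = "goril\u00b7la"  # l, middle dot, l
spelling_with_l_dot = "gori\u0140la"        # l-with-middle-dot, l

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# U+0140 has a compatibility decomposition to U+006C U+00B7,
# so NFKC rewrites the second spelling into the first one.
normalized = unicodedata.normalize("NFKC", spelling_with_l_dot)

print(code_points(spelling_with_middle_dot))
print(code_points(spelling_with_l_dot))
print(normalized == spelling_with_middle_dot)
```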

Toni Hermoso Pulido

Reporter

Comment 20

12 years ago

Hi, we will not include the second option in the spellcheck dictionaries, since 'ŀ' is discouraged (http://unicode.org/charts/PDF/U0100.pdf). It spread through a few keyboard layouts that are little used compared to the common ones. For Linux, it's only a matter of time before it is removed as well.

Toni Hermoso Pulido

Reporter

Comment 21

12 years ago

Hi, I was considering taking a look in the direction Ehsan pointed to when Joan Montané passed me this link, http://unicode.org/cldr/utility/breaks.jsp, where segmentation of words such as 'goril·la' seems to be correct. As far as I've read, the ICU library has recently been integrated into Mozilla code (bug 724531 & bug 724533). So I don't know whether this (and other similar problems) should be approached differently in the near future.

Joan Montané

Assignee

Comment 22

12 years ago

Just for info, http://unicode.org/cldr/utility/breaks.jsp uses the Unicode Text Segmentation algorithm, as specified in UAX #29: http://www.unicode.org/reports/tr29/

Joan Montané

Assignee

Comment 23

11 years ago

Attached patch: Proposed patch for mozInlineSpellWordUtil.cpp

This patch for mozInlineSpellWordUtil.cpp makes the middle dot (U+00B7) be treated as a letter only when it has a letter on each side, so that the spellchecker no longer splits valid words. This behavior is consistent with the standard Unicode Text Segmentation rules (UAX #29) for the "·" (U+00B7) character and word boundary detection [1].

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries
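The rule described above can be sketched as follows. This is a minimal Python illustration of the idea, not the actual C++ change to mozInlineSpellWordUtil.cpp; the function names `is_word_char` and `split_words` are hypothetical, and "letter" is approximated here with `str.isalpha()`.

```python
MIDDLE_DOT = "\u00b7"

def is_word_char(text, i):
    """Return True if text[i] should be kept inside a word."""
    ch = text[i]
    if ch.isalpha():
        return True
    if ch == MIDDLE_DOT:
        # The middle dot counts as a letter only when it sits
        # directly between two letters.
        return (0 < i < len(text) - 1
                and text[i - 1].isalpha()
                and text[i + 1].isalpha())
    return False

def split_words(text):
    """Split text into words using the rule above."""
    words, current = [], []
    for i in range(len(text)):
        if is_word_char(text, i):
            current.append(text[i])
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

print(split_words("instal\u00b7laci\u00f3 i pel\u00b7l\u00edcula"))
print(split_words("a \u00b7 b"))
```

With this rule a dot surrounded by letters stays inside the word, while a dot at the start or end of a word, or standing alone between spaces, still acts as a separator.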

Toni Hermoso Pulido

Reporter

Comment 24

11 years ago

Comment on attachment 830770: Proposed patch for mozInlineSpellWordUtil.cpp

review patch

Attachment #830770 - Flags: review?(ehsan)

(no longer active)

Comment 25

11 years ago

Comment on attachment 830770: Proposed patch for mozInlineSpellWordUtil.cpp

Review of attachment 830770:
-----------------------------------------------------------------
Nice! Thanks a lot, Joan!

Attachment #830770 - Flags: review?(ehsan) → review+

(no longer active)

Comment 26

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/8372a7defa69

Assignee: toniher → jmontane

Ryan VanderMeulen [:RyanVM]

Comment 27

11 years ago

https://hg.mozilla.org/mozilla-central/rev/8372a7defa69

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla28

Joan Montané

Assignee

Updated

11 years ago

Status: RESOLVED → VERIFIED
