355178 - spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)

Reporter

Description

•

18 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061002 Minefield/3.0a1
Build Identifier: Thunderbird version 1.5.0.7 (20060909)

"scot-free" is a common construction in English, e.g. "He escaped scot-free", even though "scot" on its own is obscure Middle English.  So I think "scot-free" should be in the English dictionaries.  I'm not sure whether you can add a hyphenated word to the dictionary, if not I guess add "scot" so the construction is OK.

Reproducible: Always

Steps to Reproduce:
1.  Enter "He escaped scot-free"


Actual Results:  
Flags "scot"

Expected Results:  
scot-free is OK and common.

guanxi

Comment 1

•

18 years ago

confirming:
scot-free is a word - http://onelook.com/?w=scot-free&ls=b

S. Page: Can you test other hyphenated words with one part that is not English?  Can you add "scot-free" to the dictionary?

It would be useful to know if this is a problem with hyphenated words or just a word missing from the dictionary.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Dan Mosedale (:dmosedale, :dmose)

Updated

•

16 years ago

Assignee: mscott → nobody

skierpage

Reporter

Comment 2

•

16 years ago

In reply to comment #1)
>  Can you add "scot-free" to the dictionary?

Sort of. I added scot-free to en-US.dic and it suggests it as a correction for scotfree.  But it still flags the "scot" part of "scot-free" as a misspelling.  I guess Hunspell checks the word before the hyphen, fails, and doesn't retry to see if the compound word is valid.

I found a different case: "Topsy-turvy" is correct English but separately each part is not English.  If I add that to en-US.dic, Firefox correctly suggests topsy-turvy as a replacement for topsyturvy, but then flags both parts of "topsy-turvy" as misspelled.

> Can you test other hyphenated words with one part that is not English?

I can't think of any others off-hand nor find a list of them.


> It would be useful to know if this is a problem with hyphenated words or just a
> word missing from the dictionary.

In general, there are no hyphenated words in the en-US.dic dictionary.  Perhaps that's why it doesn't suggest "plug-in cop-out" as corrections when you type "plugin copout"   (BTW, bug 448897 is about "plugin/plug-in".).  I found adding the hyphenated word does the right thing, but there may be some other way to indicate hyphenated compounds -- the "Hunspell features"/hunspell(4) man page doc is hard to understand.

To sum up,
* Adding a hyphenated phrase to en-US.dic makes FF3 suggest it as a spelling correction (good!).  So IMO this should be done for common phrases like cop-out and plug-in that people incorrectly spell as one word.

* Adding a hyphenated phrase to en-US.dic where either part alone is not valid English does not make the compound pass spell checking (bad!).  So adding "scot-free" and "topsy-turvy" will confuse users.

Summary: spell checker doesn't accept "scot-free" → spell checker doesn't accept "scot-free" or "topsy-turvy"

Németh László

Comment 3

•

16 years ago

This is a longstanding problem at OpenOffice.org development, see http://www.openoffice.org/issues/show_bug.cgi?id=64400.
The upcoming Hunspell 1.2.8 solves this problem by checking the whole compound word and (when necessary) all parts of the hyphenated input word. I will also extend the en_US dictionary.

To fix the reported problem, Firefox and Thunderbird have to modify their word tokenization to handle hyphen as a word character. Thanks for the report, László

Németh László

Comment 4

•

16 years ago

To solve this problem the only task is to change the tokenization at hyphens: Without any dictionary modification the newly integrated Hunspell 1.2.8 can break the input token at hyphens, like the recent tokenizator (default back compatibility), but Hunspell checks also the whole token (with hyphen, like in "scot-free") before this tokenization. (The default tokenization of Hunspell can be modified by the BREAK parameter of the Hunspell affix file.)

This is an important issue for the spell checking (for example, see the last duplicate Bug 466127). Could anyone help to find what component or who is responsible for the tokenization in Mozilla Firefox? (Handling abbreviations with dots is also an important tokenization problem in Firefox.)

Matt Caywood

Updated

•

15 years ago

Blocks: 360240

Matt Caywood

Updated

•

15 years ago

Blocks: 461142

Matt Caywood

Updated

•

15 years ago

Summary: spell checker doesn't accept "scot-free" or "topsy-turvy" → spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)

Matt Caywood

Updated

•

15 years ago

Blocks: 276748

Matt Caywood

Updated

•

15 years ago

Blocks: 318040

Matt Caywood

Comment 9

•

15 years ago

http://www.openoffice.org/issues/show_bug.cgi?id=64400

has some useful information including a list of Unicode word separators used by
OpenOffice.

R Baars

Comment 10

•

15 years ago

THis bug should not just be about the dash, but about word iteration parameterizations.

It involves the . as well, for abbreviations.
A construction like OOo would be possible, a different approach woulkd be a construction like SRX (general rules by regexp, exceptions on these rules by regexp)

tru

Comment 11

•

15 years ago

It's a giant bug in bugzilla that random people like caywood@gmail.com can add the emails of other people like me to bug report cclists without their permission.

Matt Caywood

Comment 12

•

15 years ago

tru, I didn't do a thing. 

You probably added yourself to the bug when you voted for it in Bugzilla (or possibly one of the bugs it blocks.) That's so you can be notified of fixes. 

I guess your email has been removed, since I can't see it on any of these bugs.

Magnus Melin [:mkmelin]

Comment 13

•

15 years ago

It's not a bug, and in this case happened because tru's bug got duped to this bug. (That always adds the dupe reporter to the cc list of the duped-to bug they should follow.)

Ramon Casha

Comment 14

•

15 years ago

Note that it's not just the hyphen that causes this problem. In the Maltese language, an apostrophe (or right single quote) is also part of a word. The word [bela'] is correct, while [bela] is not. 

With the current functionality, if the user correctly writes bela', the spell checker flags the letters "bela" as incorrect, and then suggests bela' with an apostrophe as an alternative.

Ideally the list of characters should be language-dependent. 

Note also some of these characters, while forming part of a word, can also be word-breaking characters. So, in Maltese, "tal-bidu" is two words, "tal-" and "bidu", but in English, "topsy-turvy" should be checked as a single word.

I think that the solution would require a language-dependent list of characters that are to be treated as alphabetic characters, AND a separate list of characters that are to be treated as word-break characters.

madtom

Comment 15

•

15 years ago

Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is underlined and the drop down alternative of etc. is chosen the etc is still underlined so th replacement etc. is still regarded as a spelling error. This is true for firefox too!

Mark Clements

Comment 16

•

15 years ago

(In reply to comment #15)
> Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is
> underlined and the drop down alternative of etc. is chosen the etc is still
> underlined so th replacement etc. is still regarded as a spelling error. This
> is true for firefox too!

This is being handled by bug 318040.

Jaska Zedlik

Comment 17

•

15 years ago

(In reply to comment #14)
> Note that it's not just the hyphen that causes this problem. In the Maltese
> language, an apostrophe (or right single quote) is also part of a word. The
> word [bela'] is correct, while [bela] is not. 

The same is in Belarusian and, I guess, Ukrainian. For example, if a word "сям’я" (without quotes, but with a U+2019 apostrophe sign) is containing in the dictionary, it nevertheless underlined as misspelled with no variant suggested. If I try to "misspell" this work in a way of typing "сям’яя", then the word "сям’я" is among the suggested variant (it is in the dictionary if suggested), but after it have been selected in the context menu to replace misspelled "сям’яя", it becomes underlined as misspelled too.

Firefox 3.5.3 under Windows.

Massimeddu

Comment 18

•

15 years ago

Hi,

I have the same problem with sardinian dictionary. I create a simple patch that can fix it. Please some Mozilla team member can review it.

Massimeddu

Comment 19

•

15 years ago

Attached patch The patch that can resolve this problem (obsolete) — Details — Splinter Review

Massimeddu

Comment 20

•

15 years ago

Attached patch Improved version of the patch (obsolete) — Details — Splinter Review

Attachment #418524 - Attachment is obsolete: true

Massimeddu

Comment 21

•

15 years ago

Attached patch Improved version of the patch (obsolete) — Details — Splinter Review

Resolved minor code style problem.

Attachment #418548 - Attachment is obsolete: true

Attachment #418549 - Flags: review+

Massimeddu

Updated

•

15 years ago

Attachment #418549 - Flags: review+ → review?(spelling-checker)

Massimeddu

Comment 22

•

15 years ago

Attached patch Improved version of the patch 2 (obsolete) — Details — Splinter Review

This version of the patch resolve also the Bug 318040, please review.

Attachment #418549 - Attachment is obsolete: true

Attachment #418984 - Flags: review?(spelling-checker)

Attachment #418549 - Flags: review?(spelling-checker)

Ryan VanderMeulen [:RyanVM]

Comment 23

•

15 years ago

Ehsan, Olli - You two are the last people I can think of that have touched spellcheck. Can one of you take a look?

(no longer active)

Assignee

Comment 24

•

15 years ago

Comment on attachment 418984 [details] [diff] [review]
Improved version of the patch 2

The patch looks sane to me, I guess Simon can review it?

Attachment #418984 - Flags: review?(spelling-checker) → review?(smontagu)

Magnus Melin [:mkmelin]

Updated

•

15 years ago

Assignee: nobody → massimeddu

Status: NEW → ASSIGNED

Massimeddu

Updated

•

15 years ago

OS: Windows XP → All

Hardware: x86 → All

Nomis101

Comment 25

•

15 years ago

I've tested, if you use osxspell in Thunderbird, than it doesn't  flag the "scot" part of "scot-free" as a misspelling (and it doesn't flag etc.). So hunspell also shouldn't flag it.

Massimeddu

Comment 26

•

15 years ago

Attached patch Fix a regression of previous patch (obsolete) — Details — Splinter Review

I'm sorry, I rewrite my patch to fix a little regression in my previous patch, where words that end with more then 1 punctuation char (for example "aword),") are flagged as wrong. Now I think it's all ok!

Attachment #418984 - Attachment is obsolete: true

Attachment #419239 - Flags: review?(spelling-checker)

Attachment #418984 - Flags: review?(smontagu)

Magnus Melin [:mkmelin]

Comment 27

•

15 years ago

Massimeddu: please request review directly from a person ( smontagu@smontagu.org )

Magnus Melin [:mkmelin]

Comment 28

•

15 years ago

.. by setting the review? field to his email address

Massimeddu

Updated

•

15 years ago

Attachment #419239 - Flags: review?(spelling-checker) → review?(smontagu)

R Baars

Comment 29

•

15 years ago

When testing, could you secure that words like: oma's are considered as one word (plural), not as oma and s separately?

Maybe this is typical for Dutch, maybe it is not. Anyway, for Dutch, scene's is wrong, while scene and s are correct, as is scenes.

Massimeddu

Comment 30

•

15 years ago

(In reply to comment #29)
> When testing, could you secure that words like: oma's are considered as one
> word (plural), not as oma and s separately?

Yes, it's considered as one word, if the hunspell dictionary is correct (for example you must use the WORDCHARS ' option into your dictionary).

R Baars

Comment 31

•

15 years ago

Where is the list of word and non-word chars documented?

Will unicode chars like ´ also be seen as part of the word?

Right now, though Hunspell is used in lots of apps, the interface of all apps is different. Opera is very different from OOo and Mozilla in this. which makes the spellchecker behave differently in different apps.

Massimeddu

Comment 32

•

15 years ago

(In reply to comment #31)
> Where is the list of word and non-word chars documented?

It's embedded into affix file of hunspell dictionary (it's different for each language).

> Will unicode chars like ´ also be seen as part of the word?

Yes

> Right now, though Hunspell is used in lots of apps, the interface of all apps
> is different. Opera is very different from OOo and Mozilla in this. which makes
> the spellchecker behave differently in different apps.

With my patch the behavior of mozilla spell checker it's very similar to the OOo 3.2 one.

R Baars

Comment 34

•

14 years ago

Would it be possible to mention the release or date the fix probably will be in released software?

(no longer active)

Assignee

Comment 35

•

14 years ago

smontagu: ping

Simon Montagu :smontagu

Comment 36

•

14 years ago

Comment on attachment 419239 [details] [diff] [review]
Fix a regression of previous patch

>-  // all other punctuation
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;

This seems wrong, because it will treat a sequence of "." of any length as part of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD for a single "."

>+  
>+  // all other punctuation (only if it is at start or end of the word)
>   if (charCategory == nsIUGenCategory::kSeparator ||
>       charCategory == nsIUGenCategory::kOther ||
>       charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+      charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
>+  }

This is also problematical: for example something like "apples/oranges/lemons" will be treated as a single word.

Attachment #419239 - Flags: review?(smontagu) → review-

Massimeddu

Comment 37

•

14 years ago

Attached patch A fixed version (obsolete) — Details — Splinter Review

(In reply to comment #36)
> (From update of attachment 419239 [details] [diff] [review])

> This seems wrong, because it will treat a sequence of "." of any length as part
> of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD
> for a single "." 

Yes, that's right, but it shouldn't a problem, because hunspell can handle that (see above).
In any case, I fixed its in my new patch.

> This is also problematical: for example something like "apples/oranges/lemons"
> will be treated as a single word.

Yes, but the new versions of hunspell can handle this with the WORDCHARS and BREAK directives (where you can specified if "/" char is (or not) a word char for the current dictionary), so hunspell should internally divide the single string "apples/oranges/lemons" into the 3 words.

In other words, the current implementation of mozilla spellchecker define for all languages what chars are word_chars. With my patch, it's defined into hunspell for each language.

I saw that the current English dictionary don't support the char "/" as a break... so it should be updated!

Attachment #419239 - Attachment is obsolete: true

Attachment #431603 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 38

•

14 years ago

Comment on attachment 431603 [details] [diff] [review]
A fixed version

Sorry, I'm not confident that I know the spellchecker code well enough to review this properly.

Attachment #431603 - Flags: review?(smontagu) → review?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Comment 39

•

14 years ago

Comment on attachment 431603 [details] [diff] [review]
A fixed version

I'm not a spellcheck peer. Maybe smaug?

Attachment #431603 - Flags: review?(ryanvm) → review?

(no longer active)

Assignee

Updated

•

14 years ago

Attachment #431603 - Flags: review? → review?(Olli.Pettay)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 40

•

14 years ago

I'll try to review this during weekend.

R Baars

Comment 41

•

14 years ago

When a dictionary is wanted to test this with, you could use the new Dutch one at www.opentaal.org : http://data.opentaal.org/opentaalbank/spellingcontrole200/woordenboek_nederlands-2.2.4-fx+tb+sm.xpi

A test page is here: 
http://www.opentaal.org/instructies

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 42

•

14 years ago

(Sorry, I'm a bit late with reviewing)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 43

•

14 years ago

Comment on attachment 431603 [details] [diff] [review]
A fixed version

>diff -r 3b49923ba8c5 extensions/spellcheck/src/mozInlineSpellWordUtil.cpp
>--- a/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 09:07:45 2010 +0100
>+++ b/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 11:48:44 2010 +0100
>@@ -867,22 +867,52 @@ WordSplitState::ClassifyCharacter(PRInt3
>       return CHAR_CLASS_SEPARATOR;
>     if (ClassifyCharacter(aIndex + 1, false) != CHAR_CLASS_WORD)
>       return CHAR_CLASS_SEPARATOR;
> 
>     // char on either side is a word, this counts as a word
>     return CHAR_CLASS_WORD;
>   }
> 
>-  // all other punctuation
>-  if (charCategory == nsIUGenCategory::kSeparator ||
>-      charCategory == nsIUGenCategory::kOther ||
>-      charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      mDOMWordText[aIndex-1] != '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;
Nit,
if (expr) {
  statement;
}

>+    
>+   if (charCategory == nsIUGenCategory::kSeparator ||
>+       charCategory == nsIUGenCategory::kOther ||
>+       charCategory == nsIUGenCategory::kPunctuation ||
>+       charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
s/true/PR_TRUE;

>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
variable names which start with 'a'+Capital letter are reserved for
parameters.

--aIndexTmp. (It is slightly faster if compiler doesn't optimize)

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
PR_FALSE

And again
if (expr) {
  statement;
}

>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
Same thing with variable name and postfix/prefix increment.

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
if (expr) {
  statement;
}


>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
And same here.

The current code seems to use true/false in few cases, but it shouldn't.
No need to change those cases in this bug though.

Some tests would be great. Perhaps based on English and reftests.

Attachment #431603 - Flags: review?(Olli.Pettay) → review-

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 44

•

14 years ago

Also, are there cases when the patch iterates over mDOMWordText many times?
Do we really need to go from aIndex to 0 every time?

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 45

•

14 years ago

..or from aIndex to end mDOMWordText.Length()

Ryan VanderMeulen [:RyanVM]

Comment 46

•

14 years ago

Massimeddu, would you be interested in updating the patch based on Olli's comments?

Massimeddu

Comment 47

•

14 years ago

(In reply to comment #46)
> Massimeddu, would you be interested in updating the patch based on Olli's
> comments?

I'm sorry, I have a lot of work these days... but I can recheck my patch during this weekend.

I can fix the code-style issues, but for now I don't know how to reduce the number of iterations...

In any case I think that the problem of the number of iterations is not critical, because these iterations are made only for the words that have into the string a kSeparator, kOther, kPunctuation or kSymbol char, and only one time at page loading or only for the word that you are current editing.

Any comment is well accepted!

R Baars

Comment 48

•

14 years ago

Is there any chance of progress on this issue? Dutch spell checking is functionally bad because of this issue!

Németh László

Comment 49

•

14 years ago

I agree, this improvement is not resource-critical, but important for the correct spell checking. By the way, hyphen (-) has to be the part of the word, but n-dash (U+2013: –) and m-dash (U+2014: —) don't, because they are always words separators. What are the kSymbol characters?

Wayne Mery (:wsmwk)

Updated

•

14 years ago

URL: http://www.openoffice.org/issues/show...

Olivier R.

Comment 50

•

14 years ago

French spell checking is also flawed due to this issue. And users are eager to have this issue solved. :)

João Neves

Comment 52

•

13 years ago

Just adding that Portuguese spell-checking is broken because of this. And the people who provide us the dictionaries for pt-PT are complaining about this and saying it looks bad on them :(.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 53

•

13 years ago

I'd be happy to review an updated patch.

R Baars

Comment 54

•

13 years ago

Olli, since you reported the issues, and Massimedu clearly has not had time for a long time, could you not change code yourselves, and have it reviewed by somenone else?

Kevin Scannell

Comment 55

•

13 years ago

This is a serious problem for Irish spell checking also.  I commented to this effect in the related (duplicate?) Bug 360240.  In short, we have correct words like "dea-scéal" where the first bit "dea" is not a correct standalone word.   We also have words like "t-anam", "n-anam" which are very common, and unfortunately the one-letter prefixes "t" and "n" are highlighted as misspelled.

Toni Hermoso Pulido

Updated

•

13 years ago

Blocks: 257073

Toni Hermoso Pulido

Comment 56

•

13 years ago

(In reply to comment #55)
> This is a serious problem for Irish spell checking also.  I commented to this
> effect in the related (duplicate?) Bug 360240.  In short, we have correct words
> like "dea-scéal" where the first bit "dea" is not a correct standalone word.  
> We also have words like "t-anam", "n-anam" which are very common, and
> unfortunately the one-letter prefixes "t" and "n" are highlighted as
> misspelled.

Also for Catalan, Occitan: bug 257073  ;P

Michael Bauer

Comment 57

•

13 years ago

And Scots Gaelic (h- t- n- highlighted). As with Irish, very common prefixes.

(no longer active)

Assignee

Comment 58

•

13 years ago

Attached patch Patch (v1) — Details — Splinter Review

Assignee: massimeddu → ehsan

Attachment #431603 - Attachment is obsolete: true

Attachment #507324 - Flags: review?(Olli.Pettay)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 59

•

13 years ago

Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So the hunspell changes are from where?

Ryan VanderMeulen [:RyanVM]

Comment 60

•

13 years ago

Adding "scot-free" to the dictionary (as well as updating the upstream diff) and telling Hunspell which characters are valid for breaking.

(no longer active)

Assignee

Comment 61

•

13 years ago

Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

(In reply to comment #60)
> Adding "scot-free" to the dictionary (as well as updating the upstream diff)
> and telling Hunspell which characters are valid for breaking.

Also, see comment 37 about the BREAK directive.

But this patch is failing this reftest <http://mxr.mozilla.org/mozilla-central/source/layout/reftests/text/wordwrap-02.html?force=1>.  I need to figure out why...

Attachment #507324 - Flags: review?(Olli.Pettay)

(no longer active)

Assignee

Comment 62

•

13 years ago

OK, here is the problem.  Before this patch, we used to break text into words before passing it to hunspell, and we used to put a maximum of 64 characters on each word <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/src/mozSpellChecker.cpp#44>, which might make sense because words longer than that are rare in most languages.  But with this patch we may end up with large sequences of characters which are going to be passed to hunspell, so we may fail that max length check often, which causes us to consider the word as misspelled.

This check has been added in bug 340050.  I need to investigate whether turning it off is safe or not, but I don't think that's safe to do for Firefox 4.

I think at this point our best bet is to add spellcheck="false" to the textarea element in wordwrap-02.html.  Smaug, what do you think?

R Baars

Comment 63

•

13 years ago

Besides the -, best option is to get all wordchars from the wordchars-clause in the affix file.
The clause BREAK specifies which char could be used as a word breaking char; default is -.
Lnaguages (mostly eastern) without word separation chars also exist. I don't know how that sould be handled.

bjoern

Comment 64

•

13 years ago

the discussion in the openoffice.org bugtracker took 3 years or so until the right fix was implemented. It would be great if the discussion here could be kept short now by simply looking at the resulting implementation in openoffice.org, which every dictionary maintainer seems to be happy with.
again: http://www.openoffice.org/issues/show_bug.cgi?id=64400

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 65

•

13 years ago

The happiness of dictionary maintainers is less important than happiness of 
our users ;)

bjoern

Comment 66

•

13 years ago

if you think so ...

I'd say that dictionary maintainers are the linguistic experts of their language from a spell-checker-technical point of view - which is most important at this place. I can tell you that the German Thunderbird users are not happy at all with the spell checker's word separation. The OOo users *are* happy since a while.

Ramon Casha

Comment 67

•

13 years ago

In some languages (at least Maltese), some characters are both word-break characters AND part of the word (in our case, the preceding word). 

For instance "tal-aħħar" should be broken into "tal-" and "aħħar". Hunspell should be able to handle the text if it is passed in as is - it will break it into two words correctly. It's a bit like the prefix "semi-", which can be applied arbitrarily to many words, except that the word "tal" without the dash would be a misspelling.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 68

•

13 years ago

I mean, if we back out bug 340050, and that causes crashes, end users certainly
won't be happy. Also, if we're going to take a fix to FF4, it must be very safe.

Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
64 is not very unreasonable. Even in Finnish one could probably construct 
longer, even vaguely valid and usable, words.

Olivier R.

Comment 69

•

13 years ago

(In reply to comment #65)
> The happiness of dictionary maintainers is less important than happiness of 
> our users ;)

Do you think we, dictionaries maintainers, don’t care of users?
They complain to us because they don’t know this is your fault. ;)

R Baars

Comment 70

•

13 years ago

I think, by now, we agree this should be fixed. That is very positive.
It will move Mozilla forward to same quality level of spell checking as OOo has.
So let's stop discussing the need for it, and get it to work.

timeless

Comment 71

•

13 years ago

sorry, the risk factor here is that the spell checker we were using at one point resulted in a scary crash. scary crashes are worth $3,000 to a friendly developer and worth tens or hundreds of thousands of dollars to malicious people.

Someone will have to spend some time to ensure there are some good crash tests for long words (first against the old version which was crashing) and then determine if the same formerly crashing long words would crash if the size check is removed from the current codebase.

We all appreciate the hard work that developers do and the work of linguists to improve the user experience for their target audience, but please understand that the *safety* of our users comes first. If someone isn't sure if a word is spelled correctly, they can use an online dictionary (or perhaps decide "oh, i know this is right"). If our software isn't safe, it's possible that the individual user loses thousands of dollars (or euros if you prefer). Users who lose lots of money are considerably less happy than users who lose a half minute checking things with an online dictionary.

R Baars, please read https://bugzilla.mozilla.org/page.cgi?id=etiquette.html
especially 'no obligation'.

If you want to write a patch, you're welcome to do so. if you want to pay someone to write a patch, you can probably find an eager contractor somewhere (don't ask me, this isn't the forum and i don't know).

Please note that a patch to remove the size check should really come with crash tests (and really that should be part one as i described above) and then an adjustment of the length check.

bjoern

Comment 72

•

13 years ago

Bugs won't get fixed because you're afaid that you introduce new bugs that you have promised to pay people thousands of dollars for. Mozilla is getting rediculous, really.

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 73

•

13 years ago

Timeless does not speak for all of Mozilla, but his point about needing some test coverage is a good one, if made quite excessively.  Once FF4 is completed, it will be easier to take such changes as well, since there will be more time to identify any fallout from it.

Before any such thing is done, though, a patch with some coverage for at least the functional changes is needed.  It sounds like there's something from OpenOffice to work from, so this is a great time for someone to ramp up on the code and produce a patch, so that more experienced people can help with reviews and shepherding it in once the decks are clear from FF4.

António Manuel Dias

Comment 74

•

13 years ago

Please people, let's try to be nice to each other.  I'd say most people here are working as a volunteer, so we should understand that there are no obligations whatsoever.  Nevertheless, if we didn't care for this application we wouldn't be following this thread would we?

Of course on-line security is *a lot* more important than accurate spell-checking (regardless of the dollars involved), event though to most users spell-checking is more visible.  As I understand (but mind that I'm not a Mozilla developer) Olli Pettay is saying that it's probably equally safe to increase the UNREASONABLE_WORD_LENGTH to a longer value -- not remove the check! -- that would satisfy most users.  Can that be done without changing the safety of the code?

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 75

•

13 years ago

I think that's likely a safe enough change for FF4, but would defer to Ehsan.

(no longer active)

Assignee

Comment 76

•

13 years ago

(In reply to comment #63)
> Besides the -, best option is to get all wordchars from the wordchars-clause in
> the affix file.
> The clause BREAK specifies which char could be used as a word breaking char;
> default is -.
> Lnaguages (mostly eastern) without word separation chars also exist. I don't
> know how that sould be handled.

So, the story with Mozilla's spell checker right now is that we tokenize words before passing them to the spell checker.  Before this patch, we used to break words on separators, punctuations, symbols, and others, according to the Unicode character classes, which caused us to pass "scot" and "free" as separate words to the spell checker, instead of "scot-free".  This is suboptimal, because word breaking rules are language dependent, and it seems like hunspell's tokenizer supports handling that using the WORDCHARS and BREAK directives.  But BREAK directives were completely ignored before this patch, *unless* you chose to broke on a letter or something (I'm not sure that's needed in any language).

With this patch, what changed is that we don't break on those four character classes, *unless* we find two or more of them in a sequence, in which case, we break on every one of them.  In other words, we'll pass "scot-free" to hunspell if we find "scot-free" in the source text, but we'll pass "scot" and "free" to hunspell if we find "scot--free" in the source text.  This still might be suboptimal for some languages, but it's a small improvement over the old behavior.

But please note that we still don't honor BREAK and WORDCHARS rules in our own tokenizer.  This patch makes the BREAK directive kind of useful, but the WORDCHARS directive is still pretty useless.  But that's the most that I would be convinced we should do for Firefox 4.

For a future version of Firefox, I think we should experiment with throwing out our own tokenizer in favor of hunspell's, and I think we should look into how we can integrate its tokenizer with our DOM word reader utils.  It's not as easy as passing raw textnode data to hunspell (because we want things like "va<span>lid</span>" to pass spell checking, for example), but I think we can get there.  Once we hand off tokenization to hunspell completely, dictionary authors can take full advantage of BREAK, WORDCHARS, and other hunspell tokenization wonders.

Nemeth: please correct me if I'm wrong, and let me know if you don't think this is the direction we should be headed forwards to.

(no longer active)

Assignee

Comment 77

•

13 years ago

(In reply to comment #64)
> the discussion in the openoffice.org bugtracker took 3 years or so until the
> right fix was implemented. It would be great if the discussion here could be
> kept short now by simply looking at the resulting implementation in
> openoffice.org, which every dictionary maintainer seems to be happy with.
> again: http://www.openoffice.org/issues/show_bug.cgi?id=64400

That discussion is kind of long, and hard to follow.  I specifically looked for links to commits, etc, which shows how OOo handles the tokenization, but couldn't find anything.  Can someone please point me to OOo's code for tokenization?  Has OOo moved in the same direction as what I laid out in comment 76?

(no longer active)

Assignee

Comment 78

•

13 years ago

(In reply to comment #68)
> I mean, if we back out bug 340050, and that causes crashes, end users certainly
> won't be happy. Also, if we're going to take a fix to FF4, it must be very
> safe.

Agreed.  I think disabling the check in bug 340050 is something we want to experiment with in the nightlies post Firefox 4, but definitely not something we'd want to try close to a release.  I filed bug 629719 about that.

> Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
> 64 is not very unreasonable. Even in Finnish one could probably construct 
> longer, even vaguely valid and usable, words.

Agreed.

So, I read <http://en.wikipedia.org/wiki/Longest_words>.  It gave me the chills, as it seems like there is at least one valid word in English with 189,819 letters!  But let's not support that for Firefox 4  :-)

Hungarian is also pretty scary, but the largest word that I think we should support without scaring me too much is the Swedish Nordöstersjökustartilleriflygspaningssimulatoranläggningsmaterielunderhållsuppföljningssystemdiskussionsinläggsförberedelsearbeten with 130 characters.  So, I propose raising the limit to 130 for Firefox 4.

(Yes, I know it's arbitrary, and it's not really a scientific research, but let's be honest, I bet the number 64 wasn't also picked after years of research too... ;-) )

We should definitely try to remove this limit in some future version of Firefox altogether.

(no longer active)

Assignee

Comment 79

•

13 years ago

(In reply to comment #73)
> Before any such thing is done, though, a patch with some coverage for at least
> the functional changes is needed.

My patch includes _some_ tests which make me feel not too worried about taking this for Firefox 4.  I also filed bug 629734 about integrating hunspell's test suite into our own.  But we ultimately need a fuzzer for spell checking (yeah, not functional tests, I know).  Bug 629716 covers that.

> It sounds like there's something from
> OpenOffice to work from, so this is a great time for someone to ramp up on the
> code and produce a patch, so that more experienced people can help with reviews
> and shepherding it in once the decks are clear from FF4.

I'd definitely like to see OOo's code here, and I'd very much like to see patches, but I'm afraid I won't personally have enough time to work on this more before Firefox 4...

(no longer active)

Assignee

Comment 80

•

13 years ago

Attached patch Part 2: raise the maximum letter count for spell checking (obsolete) — Details — Splinter Review

Attachment #507912 - Flags: review?(Olli.Pettay)

(no longer active)

Assignee

Comment 81

•

13 years ago

Attached patch Part 3: work around the reftest failure — Details — Splinter Review

Attachment #507917 - Flags: review?(roc)

(no longer active)

Assignee

Updated

•

13 years ago

Attachment #507324 - Flags: review?(Olli.Pettay)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

13 years ago

Attachment #507917 - Flags: review?(roc) → review+

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 82

•

13 years ago

Comment on attachment 507912 [details] [diff] [review]
Part 2: raise the maximum letter count for spell checking

The number of 139...
I guess you mean 130.

Attachment #507912 - Flags: review?(Olli.Pettay) → review+

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

13 years ago

Attachment #507324 - Flags: review?(Olli.Pettay) → review+

(no longer active)

Assignee

Comment 83

•

13 years ago

Attached patch Part 2: raise the maximum letter count for spell checking — Details — Splinter Review

(In reply to comment #82)
> The number of 139...
> I guess you mean 130.

Heh, of course! Sorry about the typo!

Attachment #507912 - Attachment is obsolete: true

Attachment #508164 - Flags: approval2.0?

(no longer active)

Assignee

Updated

•

13 years ago

Attachment #507324 - Flags: approval2.0?

(no longer active)

Assignee

Comment 84

•

13 years ago

CCing some drivers here for approval decision.  If we want to take this, we should take it for beta11.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 85

•

13 years ago

This seems a lot like a new feature to me and feature freeze was 4 betas ago. Am I missing something?

João Neves

Comment 86

•

13 years ago

This is a fix for breaking spell-checking in several languages. It could also be a regression bug since it works well worst than previous versions.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 87

•

13 years ago

At least on my mac version 3.6 also marks scot-free as a spelling error.

If this is a regression, please point out what worked in FF3.6 but no longer works and add the 'regression' keyword.

João Neves

Comment 88

•

13 years ago

Jonas, sorry, I shouldn't really comment on bugzilla at 3am. This is not a regression bug on Firefox (this broken behaviour is present since Firefox 2.0). What I meant is that hunspell handles this correctly and Firefox doesn't.

I understand your worries, and the need to follow the decisions, it's just that it leaves me feeling we got the worst end of the deal: we report an important problem for our users (several times, for the different languages), it gets classified as something not important, we need to raise awareness about it's importance, and when there's finally a patch, we're told it's to late to fix it and that we'll just have to deal with more user complaints when Firefox 4 is finally released. It's a PITA.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 89

•

13 years ago

Yeah, I understand that. I think given that next release is just a few months out, and that we've had this behavior for several years now without users complaining too much, it's better to play it safe.

In other words, since we can get away without taking this fix, I think we should.

Michael Bauer

Comment 90

•

13 years ago

I'm not so sure they're not "complaining too much" - maybe their complaints just get stuck with the localisation teams... gd has only been out a few months and I've had several "queries".

João Neves

Comment 91

•

13 years ago

For Portuguese it's the worst timing possible. Both pt-PT and pt-BR are going to spell changes in the language. People are paying more attention to what it's written and complaining.

We get complains from users and from the people supplying us with the dictionaries, complaining that it's a false positive for them :(.

Brendan Eich [:brendan]

Comment 92

•

13 years ago

If we like the patch and it sticks, why not ship it in Firefox 4?

The fear of regressions delaying 4 is good but I don't see it here. More like we won't be perfect yet (as if!). What am I missing?

/be

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 93

•

13 years ago

How I this different from a new feature? Why isn't there a fear of regressions just like with any other patch?

Brendan Eich [:brendan]

Comment 94

•

13 years ago

How about analyzing the risks of the patch? Generalizing as if every patch carried equal and large risk is no good. We're taking lots of fixes still, not classified as "feature changes" but arguably some are. We could stop in a brain off way (if only we could agree on "feature") but brain-on is always better.

What's the brain-on analysis of this patch's risks?

/be

(no longer active)

Assignee

Comment 95

•

13 years ago

Here is an analysis on why I picked this up, and why I think we should take this for Firefox 4. This is a very high impact fix for a lot of European languages. The bug itself is not a regression, and the fix is very low impact on en-US.

The code change itself is relatively small and safe. It doesn't change the performance characteristics of the spell checker, and it doesn't regress the spell checking behavior in en-US.

The only risk that I can think of here is that this will need dictionary authors to add the appropriate BREAK directives to their dictionary affix files, if they don't have them there already. What this means that without updated dictionary affix files, phrases such as "apples-oranges" (exactly that, without spaces before or after hyphens) will be marked as incorrectly spelled. If we decide to take this, I'm planning to make a blog post about this fix, and we can communicate this to the dictionary authors on AMO (the fix is simple, and the target audience is relatively small).

The escape plan here is also simple. There will be one patch to backout in case we have any regressions here (I will fix any regressions by a backout, because I won't have enough time to look into them anyways!). Ryan and I will monitor crash-stats post landing and will back this out in case we spot any crashes after this lands.

I believe that this analysis should be enough in order to enable the drivers to reach a conclusion here. I'd insist that *if* we want to take this, we should take it for beta11. In fact, after much thought on this, I'll a- this myself if we don't have a decision here by the b11 freeze date (TBD). :-)

Ramon Casha

Comment 96

•

13 years ago

I definitely think that this should be included in the earliest beta possible so that testing can start. This bug is a significant problem for many languages and should not be shrugged off so easily.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 97

•

13 years ago

I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the dev meeting tomorrow.

Brendan Eich [:brendan]

Comment 98

•

13 years ago

I'm going to recommend we take this one. Thanks for the work and analysis, Ehsan.

/be

R Baars

Comment 99

•

13 years ago

About adding the BREAK to the dictionary: this is not required, since BREAK has a default of - when not specified.

(no longer active)

Assignee

Comment 100

•

13 years ago

(In reply to comment #97)
> I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the
> dev meeting tomorrow.

OK, will do.

(no longer active)

Assignee

Comment 101

•

13 years ago

(In reply to comment #99)
> About adding the BREAK to the dictionary: this is not required, since BREAK has
> a default of - when not specified.

The default might not be enough, as it only breaks on hyphens.

Olivier R.

Comment 102

•

13 years ago

You should have a look on these English dictionaries:
http://extensions.services.openoffice.org/en/project/dict-en-fixed

They have been made by Németh László to fix main issues when OpenOffice.org got the new tokenization behaviour (same as what you want to do here).

(no longer active)

Assignee

Comment 103

•

13 years ago

Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So, I brought this up in the developers meeting today, and it turns out that this is going to have l10n impact, because we ship some dictionaries with localized versions of Firefox.  This is too late in the cycle to introduce such an l10n churn, so I'd recommend that given the new information, we should not take this for Firefox 4.

I will land this soon after Firefox 4, though.

Attachment #507324 - Flags: approval2.0? → approval2.0-

(no longer active)

Assignee

Updated

•

13 years ago

Attachment #508164 - Flags: approval2.0? → approval2.0-

(no longer active)

Assignee

Updated

•

13 years ago

Whiteboard: [post-2.0]

(no longer active)

Assignee

Comment 104

•

13 years ago

Note to myself: this is the list of languages with which we ship a dictionary.

(no longer active)

Assignee

Comment 105

•

13 years ago

<http://mxr.mozilla.org/l10n-central/find?text=&string=aff%24>

Brendan Eich [:brendan]

Comment 106

•

13 years ago

Is this .x fodder, or Firefox 5?

/be

Axel Hecht [:Pike]

Comment 107

•

13 years ago

It seems that dictionaries need updating, and thus this would be add-on compat. I don't think we should take this on .x. Let alone of the effort to fix and ship our own dictionaries. But that'd be easier to talk about if we knew which are affected how.

(no longer active)

Assignee

Comment 108

•

13 years ago

(In reply to comment #106)
> Is this .x fodder, or Firefox 5?

I'd say that this is wanted-next, but we don't have that flag in Bugzilla yet!

But it's definitely not something that we'd want to take in a dot release.

R Baars

Comment 109

•

13 years ago

Guys, I think you are mistaken on the dictionary issue.
Please validate your assumptions with Laszlo Nemeth.
As far as I know, any existing old dictionary should function correctly with Hunspell 1.3.x, but 1.3.x has additional features that CAN be used by the dictionary people. So new dictionaries CAN be introduced. For sure, the mext release of the Dutch dictionary, on which I worked parallel with Laszlo's code changes, wil require 1.3.2 at least. It would be a big disappointment when spelling support in Firefox will soon be quite outdated compared to the one the platform and text processor will offer, especially since FF is used more and more for text entry.

(no longer active)

Assignee

Comment 110

•

13 years ago

We're not using hunspell 1.3.x in Firefox 4.

Axel Hecht [:Pike]

Comment 111

•

13 years ago

This is the mxr query for BREAK in the affix files we ship with our localizations: http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$

It doesn't look like a lot of BREAK support made it into the dictionaries we ship.

I'm reading through this bug for the second time only, did we get a concrete pointer to "the plan" that OOo put through, beyond the 104 bug comments over there?

(no longer active)

Assignee

Comment 112

•

13 years ago

(In reply to comment #111)
> This is the mxr query for BREAK in the affix files we ship with our
> localizations:
> http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$
> 
> It doesn't look like a lot of BREAK support made it into the dictionaries we
> ship.

It doesn't matter, as those BREAK directives will be ignored until this bug is fixed.

> I'm reading through this bug for the second time only, did we get a concrete
> pointer to "the plan" that OOo put through, beyond the 104 bug comments over
> there?

I'm still waiting for a response to comment 77!

Brendan Eich [:brendan]

Comment 113

•

13 years ago

I think if we wait for someone to answer comment 77 OOo questions we will be waiting a while.

I still wonder if we wouldn't be better off with this fix now, and fix the shipped dictionaries as soon as we can. Since - already breaks, what do we make worse by shipping this bug's patch ahead of dictionaries with added BREAKs that it would respect?

Sorry if this is a dumb question.

/be

R Baars

Comment 114

•

13 years ago

Using the - as word separator was normal in OOo up to 3.1. It was only fixed for 3.2. OOo 3.2 had the - as part of a word. However, it would use the break clause from Hunspell if specified. Hunspell itself used the - as a separator when break was not specified.

So there was full backwards compatibility.
Since then, dictionary developers started to use the new functionality, having the - in a word, changing the dictionary break clause and adding appropriate words.

So by now, all existing well-maintained dictionaries should be ready for this change in Mozilla. I really don't see the problem.

In case a less well-maintained dictionary fails, it is about time it was updated. It will only fail in quite unusual words.

So actually: Mozilla spell-checkign is failing now for a lot of languages, and MIGHT fail for some after fixing this scot-free issue.

So what is worse?
Lets get it in soon. Maybe not in 4.0, but in 4.1. I see no need whatsoever (except unbased fear) for postponing it to 5.

Brendan Eich [:brendan]

Comment 115

•

13 years ago

If comment 114 is accurate (my preceding comment came down on the same side) then we should take this patch now, in Fx4.

What is the counterexample, the regression from where we are now that this patch causes without the shipped dictionaries being extended?

/be

Németh László

Comment 116

•

13 years ago

(In reply to comment #77)

OpenOffice.org has its own tokenization using IBM ICU. Six years ago I asked to add n-dash and m-dash to the Hungarian word characters (see this patch: http://www.openoffice.org/nonav/issues/showattachment.cgi/30686/breakiterator.diff, in http://www.openoffice.org/issues/show_bug.cgi?id=56347). This was a big improvement in Hungarian spell checking.

Partially thanks to this Mozilla issue, handling *n-dash*, as word character was extended for all languages: http://qa.openoffice.org/issues/show_bug.cgi?id=64400. For back-compatibility, Hunspell has a default tokenization for words with n-dashes (ie. there is no need any dictionary modification for Mozilla). Some languages, like Hungarian has different tokenization rules in OpenOffice.org (see the previous patch), so Hunspell is able to get also the words with m-dashes during Hungarian spell checking.

About dot handling

Several languages use dots after abbreviations. To spell check these forms,  OpenOffice.org Hunspell dictionaries stores these forms only with dots, eg. "dr.". Hunspell recognizes also other words with dots, eg. sentence ending words with period (checking them without dots, too). Hunspell suggestion list doesn't contain extra dots, eg.

"dR."->"dr.", "dl"

So OpenOffice.org may extend it before the text replacement to keep the period:

"dr.", "dl."

Németh László

Comment 117

•

13 years ago

I'm sorry, hyphen is the general word character in OpenOffice.org, n-dash is not.

(In reply to comment #115)
There is no need to modify the Mozilla dictionaries. The suggestion mechanism of Hunspell works for the longer bad input words with hyphens, too, eg.

$ hunspell -d en_US -1
element-siz	
& element-siz 15 12: element-six, element-size, element-sis, element-sin, element-sir,

(no longer active)

Assignee

Comment 118

•

13 years ago

This is not about whether hunspell can provide us with good suggestions.  The hyphen is the only character that hunspell can handle as the separator by default.

An example of a regression with this patch without the BREAK directives in the affix file is data:text/html,<textarea>apples.oranges</textarea>.

Without this patch, there's no spelling mistake flagged.  With this patch sans the BREAK directives, the entire text gets flagged as a spelling mistake.

Brendan Eich [:brendan]

Comment 119

•

13 years ago

(In reply to comment #118)
> An example of a regression with this patch without the BREAK directives in the
> affix file is data:text/html,<textarea>apples.oranges</textarea>.
> 
> Without this patch, there's no spelling mistake flagged.  With this patch sans
> the BREAK directives, the entire text gets flagged as a spelling mistake.

It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a space after full stop (and failed to capitalize oranges). If it's code (C, C++, JS, etc.) then it ought not be spellchecked.

Is that the worst case?

/be

(no longer active)

Assignee

Comment 120

•

13 years ago

(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

No.  data:text/html,<textarea>I don't know him/her.</textarea>

With that, "him/her" would be flagged as misspelled.  While I would argue that we shouldn't catch "apples.oranges" too, "him/her" should definitely not be flagged.

Pander

Comment 121

•

13 years ago

(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

Punctuation mistakes should be found by an add-on what communicates with LanguageTool, see:
  https://addons.mozilla.org/en-Us/thunderbird/addon/grammar-checker/
and
  http://sourceforge.net/tracker/?func=detail&aid=2943212&group_id=281835&atid=1195603

(no longer active)

Assignee

Comment 122

•

13 years ago

(In reply to comment #116)
> (In reply to comment #77)
> 
> OpenOffice.org has its own tokenization using IBM ICU.

Does that mean that we have to do something similar to that, too?  I thought that hunspell is capable of handling the tokenization based on the affix file rules...

Brendan Eich [:brendan]

Comment 123

•

13 years ago

Re: "him/her" -- good one, I give.

Still sad that we have to ship something that certain language users find so badly broken, one more time.

/be

(no longer active)

Assignee

Comment 124

•

13 years ago

I've been talking about this with Brendan a bit more, and I think we may have a good middle-ground for Firefox 4.  I'm proposing that we can only exempt the hyphen (both its ASCII and Unicode variants) from the word breaking logic.  This should have minimal risk, as those two characters are the default value for the BREAK directive in hunspell.

Brendan is going to bring it up in the triage meeting today, and if we get a go, I'll write the patch.

Németh László

Comment 125

•

13 years ago

(In reply to comment #124)
> I've been talking about this with Brendan a bit more, and I think we may have a
> good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> This should have minimal risk, as those two characters are the default value
> for the BREAK directive in hunspell.

Hunspell handles only the hyphen (U+002D), and the terminating dots, but not n-dash (U+2013).

(no longer active)

Assignee

Comment 126

•

13 years ago

(In reply to comment #125)
> (In reply to comment #124)
> > I've been talking about this with Brendan a bit more, and I think we may have a
> > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > This should have minimal risk, as those two characters are the default value
> > for the BREAK directive in hunspell.
> 
> Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> n-dash (U+2013).

So I guess we should only consider ASCII hyphens in that case.

(no longer active)

Assignee

Comment 127

•

13 years ago

Attached patch Low-risk patch for 2.0 — Details — Splinter Review

This is a low risk version of the patch for 2.0 based on comments 124-126.

Attachment #510837 - Flags: review?

Attachment #510837 - Flags: approval2.0?

(no longer active)

Assignee

Updated

•

13 years ago

Attachment #510837 - Flags: review? → review?(Olli.Pettay)

Hendrik Maryns

Comment 128

•

13 years ago

(In reply to comment #126)
> (In reply to comment #125)
> > (In reply to comment #124)
> > > I've been talking about this with Brendan a bit more, and I think we may have a
> > > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > > This should have minimal risk, as those two characters are the default value
> > > for the BREAK directive in hunspell.
> > 
> > Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> > n-dash (U+2013).
> 
> So I guess we should only consider ASCII hyphens in that case.

You mean: you should only consider the hyphen (which happens to be ASCII).  N-dash is NOT a hyphen.

Olli Pettay [:smaug][bugs@pettay.fi]

Updated

•

13 years ago

Attachment #510837 - Flags: review?(Olli.Pettay)

Attachment #510837 - Flags: review+

Attachment #510837 - Flags: feedback?(l10n)

Axel Hecht [:Pike]

Comment 129

•

13 years ago

Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

Looks good

Attachment #510837 - Flags: feedback?(l10n) → feedback+

(no longer active)

Assignee

Updated

•

13 years ago

Attachment #508164 - Flags: approval2.0- → approval2.0?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 130

•

13 years ago

Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

a=beltzner

Attachment #510837 - Flags: approval2.0? → approval2.0+

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

13 years ago

Attachment #508164 - Flags: approval2.0? → approval2.0+

(no longer active)

Assignee

Comment 131

•

13 years ago

http://hg.mozilla.org/mozilla-central/rev/1db71fcbc712
http://hg.mozilla.org/mozilla-central/rev/35437a194bbc
http://hg.mozilla.org/mozilla-central/rev/f323dcb530a2

Status: ASSIGNED → RESOLVED

Closed: 13 years ago

Flags: in-testsuite+

Resolution: --- → FIXED

Whiteboard: [post-2.0]

Target Milestone: --- → mozilla2.0b12

João Neves

Comment 132

•

13 years ago

Thank you all for all the work!

Will happily test it tomorrow when the nightly updates. Thank you very much.

(no longer active)

Assignee

Comment 133

•

13 years ago

(In reply to comment #132)
> Thank you all for all the work!
> 
> Will happily test it tomorrow when the nightly updates. Thank you very much.

Please do.  If you find any problems, please file a new bug in the Core::Spell Checking component.

Ryan VanderMeulen [:RyanVM]

Updated

•

13 years ago

Blocks: 632977

Jeff Muizelaar [:jrmuizel]

Comment 134

•

13 years ago

This seems cause a single hyphen standing alone '-' to be marked as a spelling mistake. That doesn't seem like a desired result.

Jeff Muizelaar [:jrmuizel]

Comment 135

•

13 years ago

I've filed bug 633210 for the new issue.

Jeff Muizelaar [:jrmuizel]

Updated

•

13 years ago

Depends on: 633210

João Neves

Comment 136

•

13 years ago

Verified on this textarea with: "fazê-la fá-la" and the pt-PT dictionary with no changes.

Minefield 4.0b12pre (2011-02-10)

Thank you very much!

(no longer active)

Assignee

Updated

•

13 years ago

Depends on: 634153

(no longer active)

Assignee

Updated

•

13 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Whiteboard: [post-2.0]

(no longer active)

Assignee

Comment 137

•

13 years ago

This was backed out in bug 634153 because of regressions.  I'll look at it again after Firefox 4.

João Neves

Comment 138

•

13 years ago

Leaving a comment for later then:

The code should check for "-", "foo-" and "-foo". These strings should not be passed on to the spellchecker.

Best regards,
João Miguel Neves

R Baars

Comment 139

•

13 years ago

Comment 138 is not correct for Dutch. A correct Dutch phrase is:
   mond- en klauwzeer
which is a standard shorter notation for
   mondzeer en klauwzeer.
The end-hyphen is possible in all places where a word could be compounded:
   spannings- and vervelingstest

A lot of effort has been put in having Hunspell understand this behaviour and to support it.

It now even understands spannings is wrong, and suggests both spanning and spannings-

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

13 years ago

Attachment #508164 - Flags: approval2.0+ → approval2.0-

Mike Beltzner [:beltzner, not reading bugmail]

Comment 140

•

13 years ago

Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

(bookkeeping)

Attachment #510837 - Flags: approval2.0+ → approval2.0-

Jesse Ruderman

Comment 141

•

13 years ago

Ehsan, are you going to give this another shot now that bug 620626 has landed?

(no longer active)

Assignee

Comment 142

•

13 years ago

(In reply to comment #141)
> Ehsan, are you going to give this another shot now that bug 620626 has landed?

Yes, hopefully in the Firefox 6 timeframe (tentatively).

Brendan Eich [:brendan]

Comment 143

•

13 years ago

Thanks to Ehsan for trying last time -- it was not the right time (sorry about that on my side as release-drivers intercessor), but we also learned some things, which does help for next time.

This is not a justification, more of a lemons -> lemonade point, but with more positive connotations, IMHO.

/be

(no longer active)

Assignee

Comment 144

•

13 years ago

Attached patch Fix to bug 634153 — Details — Splinter Review

This patch should be applied on top of the previously attached patches in this bug (I'm attaching it separately to ease the review).

This fixes the regression in bug 634153 which caused us to back out the patch for this bug in 2.0.

The cause of that regression was that the code for detecting conditional punctuations was not updated to take account this hunk (added as part of this bug).

+  // The dot character, if appearing at the end of a word, should
+  // be considered part of that word.  Example: "etc.", or
+  // abbreviations
+  if (aIndex > 0 &&
+      mDOMWordText[aIndex] == '.' &&
+      mDOMWordText[aIndex - 1] != '.' &&
+      ClassifyCharacter(aIndex - 1, false) != CHAR_CLASS_WORD) {
+    return CHAR_CLASS_WORD;
+  }


This patch simply checks the preceding and following charatcers to explicitly make sure that they're not dot's which are masked as word-chars.  If we find such a character, we just recognize the conditional punctuation character as a word separator (because the only case such characters can be considered as part of the word is if they appear in the middle of a word (such as "he's").

Attachment #526149 - Flags: review?(Olli.Pettay)

(no longer active)

Assignee

Comment 145

•

13 years ago

I prefer to land this on trunk as soon as possible so that we have enough time to deal with possible regressions before we branch to aurora.  If this gets onto aurora and we determine that we can't ship with it, I will deal with it by backing the patch out of aurora (the way I did for 2.0).

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 146

•

13 years ago

Comment on attachment 526149 [details] [diff] [review]
Fix to bug 634153

Can we get a testcase for this, please.

Attachment #526149 - Flags: review?(Olli.Pettay) → review+

(no longer active)

Assignee

Comment 147

•

13 years ago

(In reply to comment #146)
> Comment on attachment 526149 [details] [diff] [review]
> Fix to bug 634153
> 
> Can we get a testcase for this, please.

I already landed a test case for this as part of bug 634153: <http://mxr.mozilla.org/mozilla-central/find?text=&kind=text&string=dotafterquote>

(no longer active)

Assignee

Comment 148

•

13 years ago

http://hg.mozilla.org/mozilla-central/rev/605f6d7309fa
http://hg.mozilla.org/mozilla-central/rev/97acf2219d5a
http://hg.mozilla.org/mozilla-central/rev/a2814ef64e6d

Status: REOPENED → RESOLVED

Closed: 13 years ago → 13 years ago

Resolution: --- → FIXED

Whiteboard: [post-2.0]

Target Milestone: mozilla2.0b12 → mozilla6

(no longer active)

Assignee

Comment 149

•

13 years ago

I had mistakenly marked a test failing on Android here, here's the fix: http://hg.mozilla.org/mozilla-central/rev/281b86cd0403

bjoern

Comment 150

•

13 years ago

this is still not fixed. Compare OOo with the German dictionary which correctly accepts all of the words in "Arbeits- und Familienrecht" while Firefox still complains about "Arbeits-" as it just checked "Arbeits" without the dash.

As mentioned above, please make the tokenizer behave like the one in OpenOffice.

Can someone with apropriate rights please reopen this bug?

R Baars

Comment 151

•

13 years ago

In addition to Björns remark: it is not fixed when the hyphen is at the end (maybe start too) of the word. It is fixed for in-word positions (as far as I can tell).

Ryan VanderMeulen [:RyanVM]

Comment 152

•

13 years ago

Probably better to take the discussion of the last two comments over to bug 632977 and leave this one alone.

Kagami Rosylight [:saschanaz] (they/them)

Updated

•

2 years ago

The patch that can resolve this problem 15 years ago Massimeddu 605 bytes, patch		Details \| Diff \| Splinter Review
Improved version of the patch 15 years ago Massimeddu 1.15 KB, patch		Details \| Diff \| Splinter Review
Improved version of the patch 15 years ago Massimeddu 1.15 KB, patch		Details \| Diff \| Splinter Review
Improved version of the patch 2 15 years ago Massimeddu 1.25 KB, patch		Details \| Diff \| Splinter Review
Fix a regression of previous patch 15 years ago Massimeddu 2.47 KB, patch	smontagu : review-	Details \| Diff \| Splinter Review
A fixed version 14 years ago Massimeddu 2.60 KB, patch	smaug : review-	Details \| Diff \| Splinter Review
Patch (v1) 13 years ago (no longer active) 11.44 KB, patch	smaug : review+ ehsan.akhgari : approval2.0-	Details \| Diff \| Splinter Review
Part 2: raise the maximum letter count for spell checking 13 years ago (no longer active) 1.07 KB, patch	smaug : review+	Details \| Diff \| Splinter Review
Part 3: work around the reftest failure 13 years ago (no longer active) 1.44 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Part 2: raise the maximum letter count for spell checking 13 years ago (no longer active) 1.14 KB, patch	beltzner : approval2.0-	Details \| Diff \| Splinter Review
Low-risk patch for 2.0 13 years ago (no longer active) 9.57 KB, patch	smaug : review+ Pike : feedback+ beltzner : approval2.0-	Details \| Diff \| Splinter Review
Fix to bug 634153 13 years ago (no longer active) 1.50 KB, patch	smaug : review+	Details \| Diff \| Splinter Review