Last Comment Bug 355178 - spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)
: spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Spelling checker (show other bugs)
: Trunk
: All All
: -- minor with 22 votes (vote)
: mozilla6
Assigned To: :Ehsan Akhgari (busy, don't ask for review please)
:
Mentors:
http://www.openoffice.org/issues/show...
: 435930 466127 489515 497842 550448 624250 (view as bug list)
Depends on: 633210 634153
Blocks: 276748 360240 632977 257073 318040 461142
  Show dependency treegraph
 
Reported: 2006-10-02 17:19 PDT by skierpage
Modified: 2011-09-24 09:32 PDT (History)
43 users (show)
ehsan: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
The patch that can resolve this problem (605 bytes, patch)
2009-12-19 14:18 PST, Massimeddu
no flags Details | Diff | Review
Improved version of the patch (1.15 KB, patch)
2009-12-20 07:41 PST, Massimeddu
no flags Details | Diff | Review
Improved version of the patch (1.15 KB, patch)
2009-12-20 07:55 PST, Massimeddu
no flags Details | Diff | Review
Improved version of the patch 2 (1.25 KB, patch)
2009-12-23 03:00 PST, Massimeddu
no flags Details | Diff | Review
Fix a regression of previous patch (2.47 KB, patch)
2009-12-27 06:44 PST, Massimeddu
smontagu: review-
Details | Diff | Review
A fixed version (2.60 KB, patch)
2010-03-10 02:56 PST, Massimeddu
bugs: review-
Details | Diff | Review
Patch (v1) (11.44 KB, patch)
2011-01-26 17:49 PST, :Ehsan Akhgari (busy, don't ask for review please)
bugs: review+
ehsan: approval2.0-
Details | Diff | Review
Part 2: raise the maximum letter count for spell checking (1.07 KB, patch)
2011-01-28 11:00 PST, :Ehsan Akhgari (busy, don't ask for review please)
bugs: review+
Details | Diff | Review
Part 3: work around the reftest failure (1.44 KB, patch)
2011-01-28 11:06 PST, :Ehsan Akhgari (busy, don't ask for review please)
roc: review+
Details | Diff | Review
Part 2: raise the maximum letter count for spell checking (1.14 KB, patch)
2011-01-29 14:50 PST, :Ehsan Akhgari (busy, don't ask for review please)
mbeltzner: approval2.0-
Details | Diff | Review
Low-risk patch for 2.0 (9.57 KB, patch)
2011-02-08 16:48 PST, :Ehsan Akhgari (busy, don't ask for review please)
bugs: review+
l10n: feedback+
mbeltzner: approval2.0-
Details | Diff | Review
Fix to bug 634153 (1.50 KB, patch)
2011-04-14 16:31 PDT, :Ehsan Akhgari (busy, don't ask for review please)
bugs: review+
Details | Diff | Review

Description skierpage 2006-10-02 17:19:28 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061002 Minefield/3.0a1
Build Identifier: Thunderbird version 1.5.0.7 (20060909)

"scot-free" is a common construction in English, e.g. "He escaped scot-free", even though "scot" on its own is obscure Middle English.  So I think "scot-free" should be in the English dictionaries.  I'm not sure whether you can add a hyphenated word to the dictionary, if not I guess add "scot" so the construction is OK.

Reproducible: Always

Steps to Reproduce:
1.  Enter "He escaped scot-free"


Actual Results:  
Flags "scot"

Expected Results:  
scot-free is OK and common.
Comment 1 guanxi 2006-10-26 07:41:35 PDT
confirming:
scot-free is a word - http://onelook.com/?w=scot-free&ls=b

S. Page: Can you test other hyphenated words with one part that is not English?  Can you add "scot-free" to the dictionary?

It would be useful to know if this is a problem with hyphenated words or just a word missing from the dictionary.
Comment 2 skierpage 2008-10-16 17:19:25 PDT
In reply to comment #1)
>  Can you add "scot-free" to the dictionary?

Sort of. I added scot-free to en-US.dic and it suggests it as a correction for scotfree.  But it still flags the "scot" part of "scot-free" as a misspelling.  I guess Hunspell checks the word before the hyphen, fails, and doesn't retry to see if the compound word is valid.

I found a different case: "Topsy-turvy" is correct English but separately each part is not English.  If I add that to en-US.dic, Firefox correctly suggests topsy-turvy as a replacement for topsyturvy, but then flags both parts of "topsy-turvy" as misspelled.

> Can you test other hyphenated words with one part that is not English?

I can't think of any others off-hand nor find a list of them.


> It would be useful to know if this is a problem with hyphenated words or just a
> word missing from the dictionary.

In general, there are no hyphenated words in the en-US.dic dictionary.  Perhaps that's why it doesn't suggest "plug-in cop-out" as corrections when you type "plugin copout"   (BTW, bug 448897 is about "plugin/plug-in".).  I found adding the hyphenated word does the right thing, but there may be some other way to indicate hyphenated compounds -- the "Hunspell features"/hunspell(4) man page doc is hard to understand.

To sum up,
* Adding a hyphenated phrase to en-US.dic makes FF3 suggest it as a spelling correction (good!).  So IMO this should be done for common phrases like cop-out and plug-in that people incorrectly spell as one word.

* Adding a hyphenated phrase to en-US.dic where either part alone is not valid English does not make the compound pass spell checking (bad!).  So adding "scot-free" and "topsy-turvy" will confuse users.
Comment 3 Németh László 2008-10-22 10:19:17 PDT
This is a longstanding problem at OpenOffice.org development, see http://www.openoffice.org/issues/show_bug.cgi?id=64400.
The upcoming Hunspell 1.2.8 solves this problem by checking the whole compound word and (when necessary) all parts of the hyphenated input word. I will also extend the en_US dictionary.

To fix the reported problem, Firefox and Thunderbird have to modify their word tokenization to handle hyphen as a word character. Thanks for the report, László
Comment 4 Németh László 2009-01-09 14:07:34 PST
To solve this problem the only task is to change the tokenization at hyphens: Without any dictionary modification the newly integrated Hunspell 1.2.8 can break the input token at hyphens, like the recent tokenizator (default back compatibility), but Hunspell checks also the whole token (with hyphen, like in "scot-free") before this tokenization. (The default tokenization of Hunspell can be modified by the BREAK parameter of the Hunspell affix file.)

This is an important issue for the spell checking (for example, see the last duplicate Bug 466127). Could anyone help to find what component or who is responsible for the tokenization in Mozilla Firefox? (Handling abbreviations with dots is also an important tokenization problem in Firefox.)
Comment 5 Ryan VanderMeulen [:RyanVM] 2009-04-22 15:42:46 PDT
*** Bug 466127 has been marked as a duplicate of this bug. ***
Comment 6 Ryan VanderMeulen [:RyanVM] 2009-04-22 15:42:51 PDT
*** Bug 489515 has been marked as a duplicate of this bug. ***
Comment 7 Matt Caywood 2009-06-15 18:08:12 PDT
*** Bug 435930 has been marked as a duplicate of this bug. ***
Comment 8 Matt Caywood 2009-06-16 00:38:55 PDT
*** Bug 497842 has been marked as a duplicate of this bug. ***
Comment 9 Matt Caywood 2009-06-16 00:39:42 PDT
http://www.openoffice.org/issues/show_bug.cgi?id=64400

has some useful information including a list of Unicode word separators used by
OpenOffice.
Comment 10 R Baars 2009-06-16 00:50:53 PDT
THis bug should not just be about the dash, but about word iteration parameterizations.

It involves the . as well, for abbreviations.
A construction like OOo would be possible, a different approach woulkd be a construction like SRX (general rules by regexp, exceptions on these rules by regexp)
Comment 11 tru 2009-06-16 07:02:50 PDT
It's a giant bug in bugzilla that random people like caywood@gmail.com can add the emails of other people like me to bug report cclists without their permission.
Comment 12 Matt Caywood 2009-06-16 09:58:10 PDT
tru, I didn't do a thing. 

You probably added yourself to the bug when you voted for it in Bugzilla (or possibly one of the bugs it blocks.) That's so you can be notified of fixes. 

I guess your email has been removed, since I can't see it on any of these bugs.
Comment 13 Magnus Melin 2009-06-18 05:16:41 PDT
It's not a bug, and in this case happened because tru's bug got duped to this bug. (That always adds the dupe reporter to the cc list of the duped-to bug they should follow.)
Comment 14 Ramon Casha 2009-06-26 21:00:46 PDT
Note that it's not just the hyphen that causes this problem. In the Maltese language, an apostrophe (or right single quote) is also part of a word. The word [bela'] is correct, while [bela] is not. 

With the current functionality, if the user correctly writes bela', the spell checker flags the letters "bela" as incorrect, and then suggests bela' with an apostrophe as an alternative.

Ideally the list of characters should be language-dependent. 

Note also some of these characters, while forming part of a word, can also be word-breaking characters. So, in Maltese, "tal-bidu" is two words, "tal-" and "bidu", but in English, "topsy-turvy" should be checked as a single word.

I think that the solution would require a language-dependent list of characters that are to be treated as alphabetic characters, AND a separate list of characters that are to be treated as word-break characters.
Comment 15 madtom 2009-07-04 01:41:39 PDT
Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is underlined and the drop down alternative of etc. is chosen the etc is still underlined so th replacement etc. is still regarded as a spelling error. This is true for firefox too!
Comment 16 Mark Clements 2009-08-18 11:48:42 PDT
(In reply to comment #15)
> Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is
> underlined and the drop down alternative of etc. is chosen the etc is still
> underlined so th replacement etc. is still regarded as a spelling error. This
> is true for firefox too!

This is being handled by bug 318040.
Comment 17 Jaska Zedlik 2009-09-20 18:02:53 PDT
(In reply to comment #14)
> Note that it's not just the hyphen that causes this problem. In the Maltese
> language, an apostrophe (or right single quote) is also part of a word. The
> word [bela'] is correct, while [bela] is not. 

The same is in Belarusian and, I guess, Ukrainian. For example, if a word "сям’я" (without quotes, but with a U+2019 apostrophe sign) is containing in the dictionary, it nevertheless underlined as misspelled with no variant suggested. If I try to "misspell" this work in a way of typing "сям’яя", then the word "сям’я" is among the suggested variant (it is in the dictionary if suggested), but after it have been selected in the context menu to replace misspelled "сям’яя", it becomes underlined as misspelled too.

Firefox 3.5.3 under Windows.
Comment 18 Massimeddu 2009-12-19 14:17:43 PST
Hi,

I have the same problem with sardinian dictionary. I create a simple patch that can fix it. Please some Mozilla team member can review it.
Comment 19 Massimeddu 2009-12-19 14:18:54 PST
Created attachment 418524 [details] [diff] [review]
The patch that can resolve this problem
Comment 20 Massimeddu 2009-12-20 07:41:33 PST
Created attachment 418548 [details] [diff] [review]
Improved version of the patch
Comment 21 Massimeddu 2009-12-20 07:55:34 PST
Created attachment 418549 [details] [diff] [review]
Improved version of the patch

Resolved minor code style problem.
Comment 22 Massimeddu 2009-12-23 03:00:39 PST
Created attachment 418984 [details] [diff] [review]
Improved version of the patch 2

This version of the patch resolve also the Bug 318040, please review.
Comment 23 Ryan VanderMeulen [:RyanVM] 2009-12-23 05:51:29 PST
Ehsan, Olli - You two are the last people I can think of that have touched spellcheck. Can one of you take a look?
Comment 24 :Ehsan Akhgari (busy, don't ask for review please) 2009-12-23 11:13:29 PST
Comment on attachment 418984 [details] [diff] [review]
Improved version of the patch 2

The patch looks sane to me, I guess Simon can review it?
Comment 25 Nomis101 2009-12-27 03:38:34 PST
I've tested, if you use osxspell in Thunderbird, than it doesn't  flag the "scot" part of "scot-free" as a misspelling (and it doesn't flag etc.). So hunspell also shouldn't flag it.
Comment 26 Massimeddu 2009-12-27 06:44:55 PST
Created attachment 419239 [details] [diff] [review]
Fix a regression of previous patch

I'm sorry, I rewrite my patch to fix a little regression in my previous patch, where words that end with more then 1 punctuation char (for example "aword),") are flagged as wrong. Now I think it's all ok!
Comment 27 Magnus Melin 2009-12-27 10:14:57 PST
Massimeddu: please request review directly from a person ( smontagu@smontagu.org )
Comment 28 Magnus Melin 2009-12-27 10:15:30 PST
.. by setting the review? field to his email address
Comment 29 R Baars 2010-01-20 05:48:36 PST
When testing, could you secure that words like: oma's are considered as one word (plural), not as oma and s separately?

Maybe this is typical for Dutch, maybe it is not. Anyway, for Dutch, scene's is wrong, while scene and s are correct, as is scenes.
Comment 30 Massimeddu 2010-01-20 08:34:29 PST
(In reply to comment #29)
> When testing, could you secure that words like: oma's are considered as one
> word (plural), not as oma and s separately?

Yes, it's considered as one word, if the hunspell dictionary is correct (for example you must use the WORDCHARS ' option into your dictionary).
Comment 31 R Baars 2010-01-22 12:40:14 PST
Where is the list of word and non-word chars documented?

Will unicode chars like ´ also be seen as part of the word?

Right now, though Hunspell is used in lots of apps, the interface of all apps is different. Opera is very different from OOo and Mozilla in this. which makes the spellchecker behave differently in different apps.
Comment 32 Massimeddu 2010-01-22 14:51:43 PST
(In reply to comment #31)
> Where is the list of word and non-word chars documented?

It's embedded into affix file of hunspell dictionary (it's different for each language).

> Will unicode chars like ´ also be seen as part of the word?

Yes

> Right now, though Hunspell is used in lots of apps, the interface of all apps
> is different. Opera is very different from OOo and Mozilla in this. which makes
> the spellchecker behave differently in different apps.

With my patch the behavior of mozilla spell checker it's very similar to the OOo 3.2 one.
Comment 33 Ludovic Hirlimann [:Usul] 2010-03-09 05:40:27 PST
*** Bug 550448 has been marked as a duplicate of this bug. ***
Comment 34 R Baars 2010-03-09 09:38:34 PST
Would it be possible to mention the release or date the fix probably will be in released software?
Comment 35 :Ehsan Akhgari (busy, don't ask for review please) 2010-03-09 10:00:57 PST
smontagu: ping
Comment 36 Simon Montagu :smontagu 2010-03-09 12:25:06 PST
Comment on attachment 419239 [details] [diff] [review]
Fix a regression of previous patch

>-  // all other punctuation
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;

This seems wrong, because it will treat a sequence of "." of any length as part of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD for a single "."

>+  
>+  // all other punctuation (only if it is at start or end of the word)
>   if (charCategory == nsIUGenCategory::kSeparator ||
>       charCategory == nsIUGenCategory::kOther ||
>       charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+      charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
>+  }

This is also problematical: for example something like "apples/oranges/lemons" will be treated as a single word.
Comment 37 Massimeddu 2010-03-10 02:56:57 PST
Created attachment 431603 [details] [diff] [review]
A fixed version

(In reply to comment #36)
> (From update of attachment 419239 [details] [diff] [review])

> This seems wrong, because it will treat a sequence of "." of any length as part
> of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD
> for a single "." 

Yes, that's right, but it shouldn't a problem, because hunspell can handle that (see above).
In any case, I fixed its in my new patch.

> This is also problematical: for example something like "apples/oranges/lemons"
> will be treated as a single word.

Yes, but the new versions of hunspell can handle this with the WORDCHARS and BREAK directives (where you can specified if "/" char is (or not) a word char for the current dictionary), so hunspell should internally divide the single string "apples/oranges/lemons" into the 3 words.

In other words, the current implementation of mozilla spellchecker define for all languages what chars are word_chars. With my patch, it's defined into hunspell for each language.

I saw that the current English dictionary don't support the char "/" as a break... so it should be updated!
Comment 38 Simon Montagu :smontagu 2010-03-11 06:40:35 PST
Comment on attachment 431603 [details] [diff] [review]
A fixed version

Sorry, I'm not confident that I know the spellchecker code well enough to review this properly.
Comment 39 Ryan VanderMeulen [:RyanVM] 2010-03-11 17:13:51 PST
Comment on attachment 431603 [details] [diff] [review]
A fixed version

I'm not a spellcheck peer. Maybe smaug?
Comment 40 Olli Pettay [:smaug] 2010-03-12 01:55:18 PST
I'll try to review this during weekend.
Comment 41 R Baars 2010-03-12 02:25:45 PST
When a dictionary is wanted to test this with, you could use the new Dutch one at www.opentaal.org : http://data.opentaal.org/opentaalbank/spellingcontrole200/woordenboek_nederlands-2.2.4-fx+tb+sm.xpi

A test page is here: 
http://www.opentaal.org/instructies
Comment 42 Olli Pettay [:smaug] 2010-03-15 07:04:35 PDT
(Sorry, I'm a bit late with reviewing)
Comment 43 Olli Pettay [:smaug] 2010-03-15 07:29:08 PDT
Comment on attachment 431603 [details] [diff] [review]
A fixed version

>diff -r 3b49923ba8c5 extensions/spellcheck/src/mozInlineSpellWordUtil.cpp
>--- a/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 09:07:45 2010 +0100
>+++ b/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 11:48:44 2010 +0100
>@@ -867,22 +867,52 @@ WordSplitState::ClassifyCharacter(PRInt3
>       return CHAR_CLASS_SEPARATOR;
>     if (ClassifyCharacter(aIndex + 1, false) != CHAR_CLASS_WORD)
>       return CHAR_CLASS_SEPARATOR;
> 
>     // char on either side is a word, this counts as a word
>     return CHAR_CLASS_WORD;
>   }
> 
>-  // all other punctuation
>-  if (charCategory == nsIUGenCategory::kSeparator ||
>-      charCategory == nsIUGenCategory::kOther ||
>-      charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      mDOMWordText[aIndex-1] != '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;
Nit,
if (expr) {
  statement;
}

>+    
>+   if (charCategory == nsIUGenCategory::kSeparator ||
>+       charCategory == nsIUGenCategory::kOther ||
>+       charCategory == nsIUGenCategory::kPunctuation ||
>+       charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
s/true/PR_TRUE;

>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
variable names which start with 'a'+Capital letter are reserved for
parameters.

--aIndexTmp. (It is slightly faster if compiler doesn't optimize)

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
PR_FALSE

And again
if (expr) {
  statement;
}

>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
Same thing with variable name and postfix/prefix increment.

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
if (expr) {
  statement;
}


>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
And same here.

The current code seems to use true/false in few cases, but it shouldn't.
No need to change those cases in this bug though.

Some tests would be great. Perhaps based on English and reftests.
Comment 44 Olli Pettay [:smaug] 2010-03-15 07:44:50 PDT
Also, are there cases when the patch iterates over mDOMWordText many times?
Do we really need to go from aIndex to 0 every time?
Comment 45 Olli Pettay [:smaug] 2010-03-15 07:47:50 PDT
..or from aIndex to end mDOMWordText.Length()
Comment 46 Ryan VanderMeulen [:RyanVM] 2010-06-01 14:37:02 PDT
Massimeddu, would you be interested in updating the patch based on Olli's comments?
Comment 47 Massimeddu 2010-06-01 14:56:03 PDT
(In reply to comment #46)
> Massimeddu, would you be interested in updating the patch based on Olli's
> comments?

I'm sorry, I have a lot of work these days... but I can recheck my patch during this weekend.

I can fix the code-style issues, but for now I don't know how to reduce the number of iterations...

In any case I think that the problem of the number of iterations is not critical, because these iterations are made only for the words that have into the string a kSeparator, kOther, kPunctuation or kSymbol char, and only one time at page loading or only for the word that you are current editing.

Any comment is well accepted!
Comment 48 R Baars 2010-07-02 12:43:39 PDT
Is there any chance of progress on this issue? Dutch spell checking is functionally bad because of this issue!
Comment 49 Németh László 2010-09-08 09:58:44 PDT
I agree, this improvement is not resource-critical, but important for the correct spell checking. By the way, hyphen (-) has to be the part of the word, but n-dash (U+2013: –) and m-dash (U+2014: —) don't, because they are always words separators. What are the kSymbol characters?
Comment 50 Olivier R. 2010-10-16 07:48:34 PDT
French spell checking is also flawed due to this issue. And users are eager to have this issue solved. :)
Comment 51 António Manuel Dias 2011-01-12 14:02:03 PST
*** Bug 624250 has been marked as a duplicate of this bug. ***
Comment 52 João Neves 2011-01-24 03:56:12 PST
Just adding that Portuguese spell-checking is broken because of this. And the people who provide us the dictionaries for pt-PT are complaining about this and saying it looks bad on them :(.
Comment 53 Olli Pettay [:smaug] 2011-01-24 04:05:23 PST
I'd be happy to review an updated patch.
Comment 54 R Baars 2011-01-24 07:20:54 PST
Olli, since you reported the issues, and Massimedu clearly has not had time for a long time, could you not change code yourselves, and have it reviewed by somenone else?
Comment 55 Kevin Scannell 2011-01-24 10:24:56 PST
This is a serious problem for Irish spell checking also.  I commented to this effect in the related (duplicate?) Bug 360240.  In short, we have correct words like "dea-scéal" where the first bit "dea" is not a correct standalone word.   We also have words like "t-anam", "n-anam" which are very common, and unfortunately the one-letter prefixes "t" and "n" are highlighted as misspelled.
Comment 56 Toni Hermoso Pulido 2011-01-24 11:03:44 PST
(In reply to comment #55)
> This is a serious problem for Irish spell checking also.  I commented to this
> effect in the related (duplicate?) Bug 360240.  In short, we have correct words
> like "dea-scéal" where the first bit "dea" is not a correct standalone word.  
> We also have words like "t-anam", "n-anam" which are very common, and
> unfortunately the one-letter prefixes "t" and "n" are highlighted as
> misspelled.

Also for Catalan, Occitan: bug 257073  ;P
Comment 57 Michael Bauer 2011-01-26 06:07:35 PST
And Scots Gaelic (h- t- n- highlighted). As with Irish, very common prefixes.
Comment 58 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-26 17:49:12 PST
Created attachment 507324 [details] [diff] [review]
Patch (v1)
Comment 59 Olli Pettay [:smaug] 2011-01-27 15:18:03 PST
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So the hunspell changes are from where?
Comment 60 Ryan VanderMeulen [:RyanVM] 2011-01-27 15:34:01 PST
Adding "scot-free" to the dictionary (as well as updating the upstream diff) and telling Hunspell which characters are valid for breaking.
Comment 61 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-27 16:12:31 PST
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

(In reply to comment #60)
> Adding "scot-free" to the dictionary (as well as updating the upstream diff)
> and telling Hunspell which characters are valid for breaking.

Also, see comment 37 about the BREAK directive.

But this patch is failing this reftest <http://mxr.mozilla.org/mozilla-central/source/layout/reftests/text/wordwrap-02.html?force=1>.  I need to figure out why...
Comment 62 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-27 21:44:23 PST
OK, here is the problem.  Before this patch, we used to break text into words before passing it to hunspell, and we used to put a maximum of 64 characters on each word <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/src/mozSpellChecker.cpp#44>, which might make sense because words longer than that are rare in most languages.  But with this patch we may end up with large sequences of characters which are going to be passed to hunspell, so we may fail that max length check often, which causes us to consider the word as misspelled.

This check has been added in bug 340050.  I need to investigate whether turning it off is safe or not, but I don't think that's safe to do for Firefox 4.

I think at this point our best bet is to add spellcheck="false" to the textarea element in wordwrap-02.html.  Smaug, what do you think?
Comment 63 R Baars 2011-01-28 03:02:14 PST
Besides the -, best option is to get all wordchars from the wordchars-clause in the affix file.
The clause BREAK specifies which char could be used as a word breaking char; default is -.
Lnaguages (mostly eastern) without word separation chars also exist. I don't know how that sould be handled.
Comment 64 bjoern 2011-01-28 03:12:55 PST
the discussion in the openoffice.org bugtracker took 3 years or so until the right fix was implemented. It would be great if the discussion here could be kept short now by simply looking at the resulting implementation in openoffice.org, which every dictionary maintainer seems to be happy with.
again: http://www.openoffice.org/issues/show_bug.cgi?id=64400
Comment 65 Olli Pettay [:smaug] 2011-01-28 03:42:50 PST
The happiness of dictionary maintainers is less important than happiness of 
our users ;)
Comment 66 bjoern 2011-01-28 03:48:32 PST
if you think so ...

I'd say that dictionary maintainers are the linguistic experts of their language from a spell-checker-technical point of view - which is most important at this place. I can tell you that the German Thunderbird users are not happy at all with the spell checker's word separation. The OOo users *are* happy since a while.
Comment 67 Ramon Casha 2011-01-28 03:52:00 PST
In some languages (at least Maltese), some characters are both word-break characters AND part of the word (in our case, the preceding word). 

For instance "tal-aħħar" should be broken into "tal-" and "aħħar". Hunspell should be able to handle the text if it is passed in as is - it will break it into two words correctly. It's a bit like the prefix "semi-", which can be applied arbitrarily to many words, except that the word "tal" without the dash would be a misspelling.
Comment 68 Olli Pettay [:smaug] 2011-01-28 04:02:57 PST
I mean, if we back out bug 340050, and that causes crashes, end users certainly
won't be happy. Also, if we're going to take a fix to FF4, it must be very safe.

Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
64 is not very unreasonable. Even in Finnish one could probably construct 
longer, even vaguely valid and usable, words.
Comment 69 Olivier R. 2011-01-28 04:34:17 PST
(In reply to comment #65)
> The happiness of dictionary maintainers is less important than happiness of 
> our users ;)

Do you think we, dictionaries maintainers, don’t care of users?
They complain to us because they don’t know this is your fault. ;)
Comment 70 R Baars 2011-01-28 04:40:05 PST
I think, by now, we agree this should be fixed. That is very positive.
It will move Mozilla forward to same quality level of spell checking as OOo has.
So let's stop discussing the need for it, and get it to work.
Comment 71 timeless 2011-01-28 06:05:24 PST
sorry, the risk factor here is that the spell checker we were using at one point resulted in a scary crash. scary crashes are worth $3,000 to a friendly developer and worth tens or hundreds of thousands of dollars to malicious people.

Someone will have to spend some time to ensure there are some good crash tests for long words (first against the old version which was crashing) and then determine if the same formerly crashing long words would crash if the size check is removed from the current codebase.

We all appreciate the hard work that developers do and the work of linguists to improve the user experience for their target audience, but please understand that the *safety* of our users comes first. If someone isn't sure if a word is spelled correctly, they can use an online dictionary (or perhaps decide "oh, i know this is right"). If our software isn't safe, it's possible that the individual user loses thousands of dollars (or euros if you prefer). Users who lose lots of money are considerably less happy than users who lose a half minute checking things with an online dictionary.

R Baars, please read https://bugzilla.mozilla.org/page.cgi?id=etiquette.html
especially 'no obligation'.

If you want to write a patch, you're welcome to do so. if you want to pay someone to write a patch, you can probably find an eager contractor somewhere (don't ask me, this isn't the forum and i don't know).

Please note that a patch to remove the size check should really come with crash tests (and really that should be part one as i described above) and then an adjustment of the length check.
Comment 72 bjoern 2011-01-28 06:13:01 PST
Bugs won't get fixed because you're afaid that you introduce new bugs that you have promised to pay people thousands of dollars for. Mozilla is getting rediculous, really.
Comment 73 Mike Shaver (:shaver -- probably not reading bugmail closely) 2011-01-28 06:17:41 PST
Timeless does not speak for all of Mozilla, but his point about needing some test coverage is a good one, if made quite excessively.  Once FF4 is completed, it will be easier to take such changes as well, since there will be more time to identify any fallout from it.

Before any such thing is done, though, a patch with some coverage for at least the functional changes is needed.  It sounds like there's something from OpenOffice to work from, so this is a great time for someone to ramp up on the code and produce a patch, so that more experienced people can help with reviews and shepherding it in once the decks are clear from FF4.
Comment 74 António Manuel Dias 2011-01-28 06:29:48 PST
Please people, let's try to be nice to each other.  I'd say most people here are working as a volunteer, so we should understand that there are no obligations whatsoever.  Nevertheless, if we didn't care for this application we wouldn't be following this thread would we?

Of course on-line security is *a lot* more important than accurate spell-checking (regardless of the dollars involved), event though to most users spell-checking is more visible.  As I understand (but mind that I'm not a Mozilla developer) Olli Pettay is saying that it's probably equally safe to increase the UNREASONABLE_WORD_LENGTH to a longer value -- not remove the check! -- that would satisfy most users.  Can that be done without changing the safety of the code?
Comment 75 Mike Shaver (:shaver -- probably not reading bugmail closely) 2011-01-28 06:31:10 PST
I think that's likely a safe enough change for FF4, but would defer to Ehsan.
Comment 76 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 10:40:52 PST
(In reply to comment #63)
> Besides the -, best option is to get all wordchars from the wordchars-clause in
> the affix file.
> The clause BREAK specifies which char could be used as a word breaking char;
> default is -.
> Lnaguages (mostly eastern) without word separation chars also exist. I don't
> know how that sould be handled.

So, the story with Mozilla's spell checker right now is that we tokenize words before passing them to the spell checker.  Before this patch, we used to break words on separators, punctuations, symbols, and others, according to the Unicode character classes, which caused us to pass "scot" and "free" as separate words to the spell checker, instead of "scot-free".  This is suboptimal, because word breaking rules are language dependent, and it seems like hunspell's tokenizer supports handling that using the WORDCHARS and BREAK directives.  But BREAK directives were completely ignored before this patch, *unless* you chose to broke on a letter or something (I'm not sure that's needed in any language).

With this patch, what changed is that we don't break on those four character classes, *unless* we find two or more of them in a sequence, in which case, we break on every one of them.  In other words, we'll pass "scot-free" to hunspell if we find "scot-free" in the source text, but we'll pass "scot" and "free" to hunspell if we find "scot--free" in the source text.  This still might be suboptimal for some languages, but it's a small improvement over the old behavior.

But please note that we still don't honor BREAK and WORDCHARS rules in our own tokenizer.  This patch makes the BREAK directive kind of useful, but the WORDCHARS directive is still pretty useless.  But that's the most that I would be convinced we should do for Firefox 4.

For a future version of Firefox, I think we should experiment with throwing out our own tokenizer in favor of hunspell's, and I think we should look into how we can integrate its tokenizer with our DOM word reader utils.  It's not as easy as passing raw textnode data to hunspell (because we want things like "va<span>lid</span>" to pass spell checking, for example), but I think we can get there.  Once we hand off tokenization to hunspell completely, dictionary authors can take full advantage of BREAK, WORDCHARS, and other hunspell tokenization wonders.

Nemeth: please correct me if I'm wrong, and let me know if you don't think this is the direction we should be headed forwards to.
Comment 77 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 10:42:34 PST
(In reply to comment #64)
> the discussion in the openoffice.org bugtracker took 3 years or so until the
> right fix was implemented. It would be great if the discussion here could be
> kept short now by simply looking at the resulting implementation in
> openoffice.org, which every dictionary maintainer seems to be happy with.
> again: http://www.openoffice.org/issues/show_bug.cgi?id=64400

That discussion is kind of long, and hard to follow.  I specifically looked for links to commits, etc, which shows how OOo handles the tokenization, but couldn't find anything.  Can someone please point me to OOo's code for tokenization?  Has OOo moved in the same direction as what I laid out in comment 76?
Comment 78 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 10:51:29 PST
(In reply to comment #68)
> I mean, if we back out bug 340050, and that causes crashes, end users certainly
> won't be happy. Also, if we're going to take a fix to FF4, it must be very
> safe.

Agreed.  I think disabling the check in bug 340050 is something we want to experiment with in the nightlies post Firefox 4, but definitely not something we'd want to try close to a release.  I filed bug 629719 about that.

> Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
> 64 is not very unreasonable. Even in Finnish one could probably construct 
> longer, even vaguely valid and usable, words.

Agreed.

So, I read <http://en.wikipedia.org/wiki/Longest_words>.  It gave me the chills, as it seems like there is at least one valid word in English with 189,819 letters!  But let's not support that for Firefox 4  :-)

Hungarian is also pretty scary, but the largest word that I think we should support without scaring me too much is the Swedish Nordöstersjökustartilleriflygspaningssimulatoranläggningsmaterielunderhållsuppföljningssystemdiskussionsinläggsförberedelsearbeten with 130 characters.  So, I propose raising the limit to 130 for Firefox 4.

(Yes, I know it's arbitrary, and it's not really a scientific research, but let's be honest, I bet the number 64 wasn't also picked after years of research too... ;-) )

We should definitely try to remove this limit in some future version of Firefox altogether.
Comment 79 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 10:57:08 PST
(In reply to comment #73)
> Before any such thing is done, though, a patch with some coverage for at least
> the functional changes is needed.

My patch includes _some_ tests which make me feel not too worried about taking this for Firefox 4.  I also filed bug 629734 about integrating hunspell's test suite into our own.  But we ultimately need a fuzzer for spell checking (yeah, not functional tests, I know).  Bug 629716 covers that.

> It sounds like there's something from
> OpenOffice to work from, so this is a great time for someone to ramp up on the
> code and produce a patch, so that more experienced people can help with reviews
> and shepherding it in once the decks are clear from FF4.

I'd definitely like to see OOo's code here, and I'd very much like to see patches, but I'm afraid I won't personally have enough time to work on this more before Firefox 4...
Comment 80 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 11:00:14 PST
Created attachment 507912 [details] [diff] [review]
Part 2: raise the maximum letter count for spell checking
Comment 81 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-28 11:06:18 PST
Created attachment 507917 [details] [diff] [review]
Part 3: work around the reftest failure
Comment 82 Olli Pettay [:smaug] 2011-01-29 03:19:39 PST
Comment on attachment 507912 [details] [diff] [review]
Part 2: raise the maximum letter count for spell checking

The number of 139...
I guess you mean 130.
Comment 83 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-29 14:50:35 PST
Created attachment 508164 [details] [diff] [review]
Part 2: raise the maximum letter count for spell checking

(In reply to comment #82)
> The number of 139...
> I guess you mean 130.

Heh, of course! Sorry about the typo!
Comment 84 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-29 14:52:06 PST
CCing some drivers here for approval decision.  If we want to take this, we should take it for beta11.
Comment 85 Jonas Sicking (:sicking) 2011-01-29 17:50:07 PST
This seems a lot like a new feature to me and feature freeze was 4 betas ago. Am I missing something?
Comment 86 João Neves 2011-01-29 19:23:06 PST
This is a fix for breaking spell-checking in several languages. It could also be a regression bug since it works well worst than previous versions.
Comment 87 Jonas Sicking (:sicking) 2011-01-30 01:37:02 PST
At least on my mac version 3.6 also marks scot-free as a spelling error.

If this is a regression, please point out what worked in FF3.6 but no longer works and add the 'regression' keyword.
Comment 88 João Neves 2011-01-30 03:33:41 PST
Jonas, sorry, I shouldn't really comment on bugzilla at 3am. This is not a regression bug on Firefox (this broken behaviour is present since Firefox 2.0). What I meant is that hunspell handles this correctly and Firefox doesn't.

I understand your worries, and the need to follow the decisions, it's just that it leaves me feeling we got the worst end of the deal: we report an important problem for our users (several times, for the different languages), it gets classified as something not important, we need to raise awareness about it's importance, and when there's finally a patch, we're told it's to late to fix it and that we'll just have to deal with more user complaints when Firefox 4 is finally released. It's a PITA.
Comment 89 Jonas Sicking (:sicking) 2011-01-30 10:27:32 PST
Yeah, I understand that. I think given that next release is just a few months out, and that we've had this behavior for several years now without users complaining too much, it's better to play it safe.

In other words, since we can get away without taking this fix, I think we should.
Comment 90 Michael Bauer 2011-01-30 10:31:31 PST
I'm not so sure they're not "complaining too much" - maybe their complaints just get stuck with the localisation teams... gd has only been out a few months and I've had several "queries".
Comment 91 João Neves 2011-01-30 10:35:18 PST
For Portuguese it's the worst timing possible. Both pt-PT and pt-BR are going to spell changes in the language. People are paying more attention to what it's written and complaining.

We get complains from users and from the people supplying us with the dictionaries, complaining that it's a false positive for them :(.
Comment 92 Brendan Eich [:brendan] 2011-01-30 12:41:05 PST
If we like the patch and it sticks, why not ship it in Firefox 4?

The fear of regressions delaying 4 is good but I don't see it here. More like we won't be perfect yet (as if!). What am I missing?

/be
Comment 93 Jonas Sicking (:sicking) 2011-01-30 18:45:17 PST
How I this different from a new feature? Why isn't there a fear of regressions just like with any other patch?
Comment 94 Brendan Eich [:brendan] 2011-01-30 20:15:03 PST
How about analyzing the risks of the patch? Generalizing as if every patch carried equal and large risk is no good. We're taking lots of fixes still, not classified as "feature changes" but arguably some are. We could stop in a brain off way (if only we could agree on "feature") but brain-on is always better.

What's the brain-on analysis of this patch's risks?

/be
Comment 95 :Ehsan Akhgari (busy, don't ask for review please) 2011-01-31 21:08:00 PST
Here is an analysis on why I picked this up, and why I think we should take this for Firefox 4.  This is a very high impact fix for a lot of European languages.  The bug itself is not a regression, and the fix is very low impact on en-US.

The code change itself is relatively small and safe.  It doesn't change the performance characteristics of the spell checker, and it doesn't regress the spell checking behavior in en-US.

The only risk that I can think of here is that this will need dictionary authors to add the appropriate BREAK directives to their dictionary affix files, if they don't have them there already.  What this means that without updated dictionary affix files, phrases such as "apples-oranges" (exactly that, without spaces before or after hyphens) will be marked as incorrectly spelled.  If we decide to take this, I'm planning to make a blog post about this fix, and we can communicate this to the dictionary authors on AMO (the fix is simple, and the target audience is relatively small).

The escape plan here is also simple.  There will be one patch to backout in case we have any regressions here (I will fix any regressions by a backout, because I won't have enough time to look into them anyways!).  Ryan and I will monitor crash-stats post landing and will back this out in case we spot any crashes after this lands.

I believe that this analysis should be enough in order to enable the drivers to reach a conclusion here.  I'd insist that *if* we want to take this, we should take it for beta11.  In fact, after much thought on this, I'll a- this myself if we don't have a decision here by the b11 freeze date (TBD).  :-)
Comment 96 Ramon Casha 2011-01-31 21:21:08 PST
I definitely think that this should be included in the earliest beta possible so that testing can start. This bug is a significant problem for many languages and should not be shrugged off so easily.
Comment 97 Jonas Sicking (:sicking) 2011-01-31 21:28:10 PST
I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the dev meeting tomorrow.
Comment 98 Brendan Eich [:brendan] 2011-01-31 22:20:26 PST
I'm going to recommend we take this one. Thanks for the work and analysis, Ehsan.

/be
Comment 99 R Baars 2011-01-31 23:15:35 PST
About adding the BREAK to the dictionary: this is not required, since BREAK has a default of - when not specified.
Comment 100 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 07:41:02 PST
(In reply to comment #97)
> I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the
> dev meeting tomorrow.

OK, will do.
Comment 101 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 07:41:27 PST
(In reply to comment #99)
> About adding the BREAK to the dictionary: this is not required, since BREAK has
> a default of - when not specified.

The default might not be enough, as it only breaks on hyphens.
Comment 102 Olivier R. 2011-02-01 08:42:11 PST
You should have a look on these English dictionaries:
http://extensions.services.openoffice.org/en/project/dict-en-fixed

They have been made by Németh László to fix main issues when OpenOffice.org got the new tokenization behaviour (same as what you want to do here).
Comment 103 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 11:33:01 PST
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So, I brought this up in the developers meeting today, and it turns out that this is going to have l10n impact, because we ship some dictionaries with localized versions of Firefox.  This is too late in the cycle to introduce such an l10n churn, so I'd recommend that given the new information, we should not take this for Firefox 4.

I will land this soon after Firefox 4, though.
Comment 104 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 11:34:43 PST
Note to myself: this is the list of languages with which we ship a dictionary.
Comment 105 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 11:34:52 PST
<http://mxr.mozilla.org/l10n-central/find?text=&string=aff%24>
Comment 106 Brendan Eich [:brendan] 2011-02-01 11:41:53 PST
Is this .x fodder, or Firefox 5?

/be
Comment 107 Axel Hecht [:Pike] 2011-02-01 11:45:00 PST
It seems that dictionaries need updating, and thus this would be add-on compat. I don't think we should take this on .x. Let alone of the effort to fix and ship our own dictionaries. But that'd be easier to talk about if we knew which are affected how.
Comment 108 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 11:46:41 PST
(In reply to comment #106)
> Is this .x fodder, or Firefox 5?

I'd say that this is wanted-next, but we don't have that flag in Bugzilla yet!

But it's definitely not something that we'd want to take in a dot release.
Comment 109 R Baars 2011-02-01 12:29:45 PST
Guys, I think you are mistaken on the dictionary issue.
Please validate your assumptions with Laszlo Nemeth.
As far as I know, any existing old dictionary should function correctly with Hunspell 1.3.x, but 1.3.x has additional features that CAN be used by the dictionary people. So new dictionaries CAN be introduced. For sure, the mext release of the Dutch dictionary, on which I worked parallel with Laszlo's code changes, wil require 1.3.2 at least. It would be a big disappointment when spelling support in Firefox will soon be quite outdated compared to the one the platform and text processor will offer, especially since FF is used more and more for text entry.
Comment 110 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 13:06:58 PST
We're not using hunspell 1.3.x in Firefox 4.
Comment 111 Axel Hecht [:Pike] 2011-02-01 15:35:39 PST
This is the mxr query for BREAK in the affix files we ship with our localizations: http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$

It doesn't look like a lot of BREAK support made it into the dictionaries we ship.

I'm reading through this bug for the second time only, did we get a concrete pointer to "the plan" that OOo put through, beyond the 104 bug comments over there?
Comment 112 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-01 21:34:16 PST
(In reply to comment #111)
> This is the mxr query for BREAK in the affix files we ship with our
> localizations:
> http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$
> 
> It doesn't look like a lot of BREAK support made it into the dictionaries we
> ship.

It doesn't matter, as those BREAK directives will be ignored until this bug is fixed.

> I'm reading through this bug for the second time only, did we get a concrete
> pointer to "the plan" that OOo put through, beyond the 104 bug comments over
> there?

I'm still waiting for a response to comment 77!
Comment 113 Brendan Eich [:brendan] 2011-02-01 23:11:35 PST
I think if we wait for someone to answer comment 77 OOo questions we will be waiting a while.

I still wonder if we wouldn't be better off with this fix now, and fix the shipped dictionaries as soon as we can. Since - already breaks, what do we make worse by shipping this bug's patch ahead of dictionaries with added BREAKs that it would respect?

Sorry if this is a dumb question.

/be
Comment 114 R Baars 2011-02-01 23:25:49 PST
Using the - as word separator was normal in OOo up to 3.1. It was only fixed for 3.2. OOo 3.2 had the - as part of a word. However, it would use the break clause from Hunspell if specified. Hunspell itself used the - as a separator when break was not specified.

So there was full backwards compatibility.
Since then, dictionary developers started to use the new functionality, having the - in a word, changing the dictionary break clause and adding appropriate words.

So by now, all existing well-maintained dictionaries should be ready for this change in Mozilla. I really don't see the problem.

In case a less well-maintained dictionary fails, it is about time it was updated. It will only fail in quite unusual words.

So actually: Mozilla spell-checkign is failing now for a lot of languages, and MIGHT fail for some after fixing this scot-free issue.

So what is worse?
Lets get it in soon. Maybe not in 4.0, but in 4.1. I see no need whatsoever (except unbased fear) for postponing it to 5.
Comment 115 Brendan Eich [:brendan] 2011-02-01 23:35:17 PST
If comment 114 is accurate (my preceding comment came down on the same side) then we should take this patch now, in Fx4.

What is the counterexample, the regression from where we are now that this patch causes without the shipped dictionaries being extended?

/be
Comment 116 Németh László 2011-02-02 09:34:19 PST
(In reply to comment #77)

OpenOffice.org has its own tokenization using IBM ICU. Six years ago I asked to add n-dash and m-dash to the Hungarian word characters (see this patch: http://www.openoffice.org/nonav/issues/showattachment.cgi/30686/breakiterator.diff, in http://www.openoffice.org/issues/show_bug.cgi?id=56347). This was a big improvement in Hungarian spell checking.

Partially thanks to this Mozilla issue, handling *n-dash*, as word character was extended for all languages: http://qa.openoffice.org/issues/show_bug.cgi?id=64400. For back-compatibility, Hunspell has a default tokenization for words with n-dashes (ie. there is no need any dictionary modification for Mozilla). Some languages, like Hungarian has different tokenization rules in OpenOffice.org (see the previous patch), so Hunspell is able to get also the words with m-dashes during Hungarian spell checking.

About dot handling

Several languages use dots after abbreviations. To spell check these forms,  OpenOffice.org Hunspell dictionaries stores these forms only with dots, eg. "dr.". Hunspell recognizes also other words with dots, eg. sentence ending words with period (checking them without dots, too). Hunspell suggestion list doesn't contain extra dots, eg.

"dR."->"dr.", "dl"

So OpenOffice.org may extend it before the text replacement to keep the period:

"dr.", "dl."
Comment 117 Németh László 2011-02-02 10:00:46 PST
I'm sorry, hyphen is the general word character in OpenOffice.org, n-dash is not.

(In reply to comment #115)
There is no need to modify the Mozilla dictionaries. The suggestion mechanism of Hunspell works for the longer bad input words with hyphens, too, eg.

$ hunspell -d en_US -1
element-siz	
& element-siz 15 12: element-six, element-size, element-sis, element-sin, element-sir,
Comment 118 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-02 13:58:19 PST
This is not about whether hunspell can provide us with good suggestions.  The hyphen is the only character that hunspell can handle as the separator by default.

An example of a regression with this patch without the BREAK directives in the affix file is data:text/html,<textarea>apples.oranges</textarea>.

Without this patch, there's no spelling mistake flagged.  With this patch sans the BREAK directives, the entire text gets flagged as a spelling mistake.
Comment 119 Brendan Eich [:brendan] 2011-02-02 14:02:53 PST
(In reply to comment #118)
> An example of a regression with this patch without the BREAK directives in the
> affix file is data:text/html,<textarea>apples.oranges</textarea>.
> 
> Without this patch, there's no spelling mistake flagged.  With this patch sans
> the BREAK directives, the entire text gets flagged as a spelling mistake.

It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a space after full stop (and failed to capitalize oranges). If it's code (C, C++, JS, etc.) then it ought not be spellchecked.

Is that the worst case?

/be
Comment 120 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-02 14:59:07 PST
(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

No.  data:text/html,<textarea>I don't know him/her.</textarea>

With that, "him/her" would be flagged as misspelled.  While I would argue that we shouldn't catch "apples.oranges" too, "him/her" should definitely not be flagged.
Comment 121 Pander 2011-02-02 15:22:21 PST
(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

Punctuation mistakes should be found by an add-on what communicates with LanguageTool, see:
  https://addons.mozilla.org/en-Us/thunderbird/addon/grammar-checker/
and
  http://sourceforge.net/tracker/?func=detail&aid=2943212&group_id=281835&atid=1195603
Comment 122 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-02 15:33:34 PST
(In reply to comment #116)
> (In reply to comment #77)
> 
> OpenOffice.org has its own tokenization using IBM ICU.

Does that mean that we have to do something similar to that, too?  I thought that hunspell is capable of handling the tokenization based on the affix file rules...
Comment 123 Brendan Eich [:brendan] 2011-02-02 17:28:17 PST
Re: "him/her" -- good one, I give.

Still sad that we have to ship something that certain language users find so badly broken, one more time.

/be
Comment 124 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-08 13:03:03 PST
I've been talking about this with Brendan a bit more, and I think we may have a good middle-ground for Firefox 4.  I'm proposing that we can only exempt the hyphen (both its ASCII and Unicode variants) from the word breaking logic.  This should have minimal risk, as those two characters are the default value for the BREAK directive in hunspell.

Brendan is going to bring it up in the triage meeting today, and if we get a go, I'll write the patch.
Comment 125 Németh László 2011-02-08 13:36:49 PST
(In reply to comment #124)
> I've been talking about this with Brendan a bit more, and I think we may have a
> good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> This should have minimal risk, as those two characters are the default value
> for the BREAK directive in hunspell.

Hunspell handles only the hyphen (U+002D), and the terminating dots, but not n-dash (U+2013).
Comment 126 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-08 15:11:44 PST
(In reply to comment #125)
> (In reply to comment #124)
> > I've been talking about this with Brendan a bit more, and I think we may have a
> > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > This should have minimal risk, as those two characters are the default value
> > for the BREAK directive in hunspell.
> 
> Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> n-dash (U+2013).

So I guess we should only consider ASCII hyphens in that case.
Comment 127 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-08 16:48:37 PST
Created attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

This is a low risk version of the patch for 2.0 based on comments 124-126.
Comment 128 Hendrik Maryns 2011-02-09 02:48:51 PST
(In reply to comment #126)
> (In reply to comment #125)
> > (In reply to comment #124)
> > > I've been talking about this with Brendan a bit more, and I think we may have a
> > > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > > This should have minimal risk, as those two characters are the default value
> > > for the BREAK directive in hunspell.
> > 
> > Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> > n-dash (U+2013).
> 
> So I guess we should only consider ASCII hyphens in that case.

You mean: you should only consider the hyphen (which happens to be ASCII).  N-dash is NOT a hyphen.
Comment 129 Axel Hecht [:Pike] 2011-02-09 03:01:36 PST
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

Looks good
Comment 130 Mike Beltzner [:beltzner, not reading bugmail] 2011-02-09 13:20:03 PST
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

a=beltzner
Comment 132 João Neves 2011-02-09 14:03:32 PST
Thank you all for all the work!

Will happily test it tomorrow when the nightly updates. Thank you very much.
Comment 133 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-09 14:06:00 PST
(In reply to comment #132)
> Thank you all for all the work!
> 
> Will happily test it tomorrow when the nightly updates. Thank you very much.

Please do.  If you find any problems, please file a new bug in the Core::Spell Checking component.
Comment 134 Jeff Muizelaar [:jrmuizel] 2011-02-10 08:21:52 PST
This seems cause a single hyphen standing alone '-' to be marked as a spelling mistake. That doesn't seem like a desired result.
Comment 135 Jeff Muizelaar [:jrmuizel] 2011-02-10 08:25:01 PST
I've filed bug 633210 for the new issue.
Comment 136 João Neves 2011-02-10 09:07:03 PST
Verified on this textarea with: "fazê-la fá-la" and the pt-PT dictionary with no changes.

Minefield 4.0b12pre (2011-02-10)

Thank you very much!
Comment 137 :Ehsan Akhgari (busy, don't ask for review please) 2011-02-15 14:15:45 PST
This was backed out in bug 634153 because of regressions.  I'll look at it again after Firefox 4.
Comment 138 João Neves 2011-02-21 09:40:00 PST
Leaving a comment for later then:

The code should check for "-", "foo-" and "-foo". These strings should not be passed on to the spellchecker.

Best regards,
João Miguel Neves
Comment 139 R Baars 2011-02-21 09:54:58 PST
Comment 138 is not correct for Dutch. A correct Dutch phrase is:
   mond- en klauwzeer
which is a standard shorter notation for
   mondzeer en klauwzeer.
The end-hyphen is possible in all places where a word could be compounded:
   spannings- and vervelingstest

A lot of effort has been put in having Hunspell understand this behaviour and to support it.

It now even understands spannings is wrong, and suggests both spanning and spannings-
Comment 140 Mike Beltzner [:beltzner, not reading bugmail] 2011-02-23 10:03:11 PST
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

(bookkeeping)
Comment 141 Jesse Ruderman 2011-04-04 07:06:28 PDT
Ehsan, are you going to give this another shot now that bug 620626 has landed?
Comment 142 :Ehsan Akhgari (busy, don't ask for review please) 2011-04-06 15:34:18 PDT
(In reply to comment #141)
> Ehsan, are you going to give this another shot now that bug 620626 has landed?

Yes, hopefully in the Firefox 6 timeframe (tentatively).
Comment 143 Brendan Eich [:brendan] 2011-04-07 12:55:38 PDT
Thanks to Ehsan for trying last time -- it was not the right time (sorry about that on my side as release-drivers intercessor), but we also learned some things, which does help for next time.

This is not a justification, more of a lemons -> lemonade point, but with more positive connotations, IMHO.

/be
Comment 144 :Ehsan Akhgari (busy, don't ask for review please) 2011-04-14 16:31:12 PDT
Created attachment 526149 [details] [diff] [review]
Fix to bug 634153

This patch should be applied on top of the previously attached patches in this bug (I'm attaching it separately to ease the review).

This fixes the regression in bug 634153 which caused us to back out the patch for this bug in 2.0.

The cause of that regression was that the code for detecting conditional punctuations was not updated to take account this hunk (added as part of this bug).

+  // The dot character, if appearing at the end of a word, should
+  // be considered part of that word.  Example: "etc.", or
+  // abbreviations
+  if (aIndex > 0 &&
+      mDOMWordText[aIndex] == '.' &&
+      mDOMWordText[aIndex - 1] != '.' &&
+      ClassifyCharacter(aIndex - 1, false) != CHAR_CLASS_WORD) {
+    return CHAR_CLASS_WORD;
+  }


This patch simply checks the preceding and following charatcers to explicitly make sure that they're not dot's which are masked as word-chars.  If we find such a character, we just recognize the conditional punctuation character as a word separator (because the only case such characters can be considered as part of the word is if they appear in the middle of a word (such as "he's").
Comment 145 :Ehsan Akhgari (busy, don't ask for review please) 2011-04-14 16:35:11 PDT
I prefer to land this on trunk as soon as possible so that we have enough time to deal with possible regressions before we branch to aurora.  If this gets onto aurora and we determine that we can't ship with it, I will deal with it by backing the patch out of aurora (the way I did for 2.0).
Comment 146 Olli Pettay [:smaug] 2011-04-18 09:02:47 PDT
Comment on attachment 526149 [details] [diff] [review]
Fix to bug 634153

Can we get a testcase for this, please.
Comment 147 :Ehsan Akhgari (busy, don't ask for review please) 2011-04-18 15:41:16 PDT
(In reply to comment #146)
> Comment on attachment 526149 [details] [diff] [review]
> Fix to bug 634153
> 
> Can we get a testcase for this, please.

I already landed a test case for this as part of bug 634153: <http://mxr.mozilla.org/mozilla-central/find?text=&kind=text&string=dotafterquote>
Comment 149 :Ehsan Akhgari (busy, don't ask for review please) 2011-04-18 19:52:24 PDT
I had mistakenly marked a test failing on Android here, here's the fix: http://hg.mozilla.org/mozilla-central/rev/281b86cd0403
Comment 150 bjoern 2011-09-03 07:47:50 PDT
this is still not fixed. Compare OOo with the German dictionary which correctly accepts all of the words in "Arbeits- und Familienrecht" while Firefox still complains about "Arbeits-" as it just checked "Arbeits" without the dash.

As mentioned above, please make the tokenizer behave like the one in OpenOffice.

Can someone with apropriate rights please reopen this bug?
Comment 151 R Baars 2011-09-03 07:59:51 PDT
In addition to Björns remark: it is not fixed when the hyphen is at the end (maybe start too) of the word. It is fixed for in-word positions (as far as I can tell).
Comment 152 Ryan VanderMeulen [:RyanVM] 2011-09-03 10:43:35 PDT
Probably better to take the discussion of the last two comments over to bug 632977 and leave this one alone.

Note You need to log in before you can comment on or make changes to this bug.