Closed Bug 355178 Opened 18 years ago Closed 13 years ago

spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)

Categories

(Core :: Spelling checker, defect)

defect
Not set
minor

Tracking

()

RESOLVED FIXED
mozilla6

People

(Reporter: info, Assigned: ehsan.akhgari)

References

(Blocks 2 open bugs, )

Details

Attachments

(5 files, 7 obsolete files)

11.44 KB, patch
smaug
: review+
ehsan.akhgari
: approval2.0-
Details | Diff | Splinter Review
1.44 KB, patch
roc
: review+
Details | Diff | Splinter Review
1.14 KB, patch
beltzner
: approval2.0-
Details | Diff | Splinter Review
9.57 KB, patch
smaug
: review+
Pike
: feedback+
beltzner
: approval2.0-
Details | Diff | Splinter Review
1.50 KB, patch
smaug
: review+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a1) Gecko/20061002 Minefield/3.0a1
Build Identifier: Thunderbird version 1.5.0.7 (20060909)

"scot-free" is a common construction in English, e.g. "He escaped scot-free", even though "scot" on its own is obscure Middle English.  So I think "scot-free" should be in the English dictionaries.  I'm not sure whether you can add a hyphenated word to the dictionary, if not I guess add "scot" so the construction is OK.

Reproducible: Always

Steps to Reproduce:
1.  Enter "He escaped scot-free"


Actual Results:  
Flags "scot"

Expected Results:  
scot-free is OK and common.
confirming:
scot-free is a word - http://onelook.com/?w=scot-free&ls=b

S. Page: Can you test other hyphenated words with one part that is not English?  Can you add "scot-free" to the dictionary?

It would be useful to know if this is a problem with hyphenated words or just a word missing from the dictionary.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: mscott → nobody
In reply to comment #1)
>  Can you add "scot-free" to the dictionary?

Sort of. I added scot-free to en-US.dic and it suggests it as a correction for scotfree.  But it still flags the "scot" part of "scot-free" as a misspelling.  I guess Hunspell checks the word before the hyphen, fails, and doesn't retry to see if the compound word is valid.

I found a different case: "Topsy-turvy" is correct English but separately each part is not English.  If I add that to en-US.dic, Firefox correctly suggests topsy-turvy as a replacement for topsyturvy, but then flags both parts of "topsy-turvy" as misspelled.

> Can you test other hyphenated words with one part that is not English?

I can't think of any others off-hand nor find a list of them.


> It would be useful to know if this is a problem with hyphenated words or just a
> word missing from the dictionary.

In general, there are no hyphenated words in the en-US.dic dictionary.  Perhaps that's why it doesn't suggest "plug-in cop-out" as corrections when you type "plugin copout"   (BTW, bug 448897 is about "plugin/plug-in".).  I found adding the hyphenated word does the right thing, but there may be some other way to indicate hyphenated compounds -- the "Hunspell features"/hunspell(4) man page doc is hard to understand.

To sum up,
* Adding a hyphenated phrase to en-US.dic makes FF3 suggest it as a spelling correction (good!).  So IMO this should be done for common phrases like cop-out and plug-in that people incorrectly spell as one word.

* Adding a hyphenated phrase to en-US.dic where either part alone is not valid English does not make the compound pass spell checking (bad!).  So adding "scot-free" and "topsy-turvy" will confuse users.
Summary: spell checker doesn't accept "scot-free" → spell checker doesn't accept "scot-free" or "topsy-turvy"
This is a longstanding problem at OpenOffice.org development, see http://www.openoffice.org/issues/show_bug.cgi?id=64400.
The upcoming Hunspell 1.2.8 solves this problem by checking the whole compound word and (when necessary) all parts of the hyphenated input word. I will also extend the en_US dictionary.

To fix the reported problem, Firefox and Thunderbird have to modify their word tokenization to handle hyphen as a word character. Thanks for the report, László
To solve this problem the only task is to change the tokenization at hyphens: Without any dictionary modification the newly integrated Hunspell 1.2.8 can break the input token at hyphens, like the recent tokenizator (default back compatibility), but Hunspell checks also the whole token (with hyphen, like in "scot-free") before this tokenization. (The default tokenization of Hunspell can be modified by the BREAK parameter of the Hunspell affix file.)

This is an important issue for the spell checking (for example, see the last duplicate Bug 466127). Could anyone help to find what component or who is responsible for the tokenization in Mozilla Firefox? (Handling abbreviations with dots is also an important tokenization problem in Firefox.)
Blocks: 360240
Blocks: 461142
Summary: spell checker doesn't accept "scot-free" or "topsy-turvy" → spell checker doesn't accept "scot-free" (tokenization, Hunspell 1.2.8)
Blocks: 276748
Blocks: 318040
http://www.openoffice.org/issues/show_bug.cgi?id=64400

has some useful information including a list of Unicode word separators used by
OpenOffice.
THis bug should not just be about the dash, but about word iteration parameterizations.

It involves the . as well, for abbreviations.
A construction like OOo would be possible, a different approach woulkd be a construction like SRX (general rules by regexp, exceptions on these rules by regexp)
It's a giant bug in bugzilla that random people like caywood@gmail.com can add the emails of other people like me to bug report cclists without their permission.
tru, I didn't do a thing. 

You probably added yourself to the bug when you voted for it in Bugzilla (or possibly one of the bugs it blocks.) That's so you can be notified of fixes. 

I guess your email has been removed, since I can't see it on any of these bugs.
It's not a bug, and in this case happened because tru's bug got duped to this bug. (That always adds the dupe reporter to the cc list of the duped-to bug they should follow.)
Note that it's not just the hyphen that causes this problem. In the Maltese language, an apostrophe (or right single quote) is also part of a word. The word [bela'] is correct, while [bela] is not. 

With the current functionality, if the user correctly writes bela', the spell checker flags the letters "bela" as incorrect, and then suggests bela' with an apostrophe as an alternative.

Ideally the list of characters should be language-dependent. 

Note also some of these characters, while forming part of a word, can also be word-breaking characters. So, in Maltese, "tal-bidu" is two words, "tal-" and "bidu", but in English, "topsy-turvy" should be checked as a single word.

I think that the solution would require a language-dependent list of characters that are to be treated as alphabetic characters, AND a separate list of characters that are to be treated as word-break characters.
Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is underlined and the drop down alternative of etc. is chosen the etc is still underlined so th replacement etc. is still regarded as a spelling error. This is true for firefox too!
(In reply to comment #15)
> Using 2.0.0.22 and the English(GB) dictionary if I type etc the etc is
> underlined and the drop down alternative of etc. is chosen the etc is still
> underlined so th replacement etc. is still regarded as a spelling error. This
> is true for firefox too!

This is being handled by bug 318040.
(In reply to comment #14)
> Note that it's not just the hyphen that causes this problem. In the Maltese
> language, an apostrophe (or right single quote) is also part of a word. The
> word [bela'] is correct, while [bela] is not. 

The same is in Belarusian and, I guess, Ukrainian. For example, if a word "сям’я" (without quotes, but with a U+2019 apostrophe sign) is containing in the dictionary, it nevertheless underlined as misspelled with no variant suggested. If I try to "misspell" this work in a way of typing "сям’яя", then the word "сям’я" is among the suggested variant (it is in the dictionary if suggested), but after it have been selected in the context menu to replace misspelled "сям’яя", it becomes underlined as misspelled too.

Firefox 3.5.3 under Windows.
Hi,

I have the same problem with sardinian dictionary. I create a simple patch that can fix it. Please some Mozilla team member can review it.
Attached patch Improved version of the patch (obsolete) — Splinter Review
Attachment #418524 - Attachment is obsolete: true
Attached patch Improved version of the patch (obsolete) — Splinter Review
Resolved minor code style problem.
Attachment #418548 - Attachment is obsolete: true
Attachment #418549 - Flags: review+
Attachment #418549 - Flags: review+ → review?(spelling-checker)
Attached patch Improved version of the patch 2 (obsolete) — Splinter Review
This version of the patch resolve also the Bug 318040, please review.
Attachment #418549 - Attachment is obsolete: true
Attachment #418984 - Flags: review?(spelling-checker)
Attachment #418549 - Flags: review?(spelling-checker)
Ehsan, Olli - You two are the last people I can think of that have touched spellcheck. Can one of you take a look?
Comment on attachment 418984 [details] [diff] [review]
Improved version of the patch 2

The patch looks sane to me, I guess Simon can review it?
Attachment #418984 - Flags: review?(spelling-checker) → review?(smontagu)
Assignee: nobody → massimeddu
Status: NEW → ASSIGNED
OS: Windows XP → All
Hardware: x86 → All
I've tested, if you use osxspell in Thunderbird, than it doesn't  flag the "scot" part of "scot-free" as a misspelling (and it doesn't flag etc.). So hunspell also shouldn't flag it.
I'm sorry, I rewrite my patch to fix a little regression in my previous patch, where words that end with more then 1 punctuation char (for example "aword),") are flagged as wrong. Now I think it's all ok!
Attachment #418984 - Attachment is obsolete: true
Attachment #419239 - Flags: review?(spelling-checker)
Attachment #418984 - Flags: review?(smontagu)
Massimeddu: please request review directly from a person ( smontagu@smontagu.org )
.. by setting the review? field to his email address
Attachment #419239 - Flags: review?(spelling-checker) → review?(smontagu)
When testing, could you secure that words like: oma's are considered as one word (plural), not as oma and s separately?

Maybe this is typical for Dutch, maybe it is not. Anyway, for Dutch, scene's is wrong, while scene and s are correct, as is scenes.
(In reply to comment #29)
> When testing, could you secure that words like: oma's are considered as one
> word (plural), not as oma and s separately?

Yes, it's considered as one word, if the hunspell dictionary is correct (for example you must use the WORDCHARS ' option into your dictionary).
Where is the list of word and non-word chars documented?

Will unicode chars like ´ also be seen as part of the word?

Right now, though Hunspell is used in lots of apps, the interface of all apps is different. Opera is very different from OOo and Mozilla in this. which makes the spellchecker behave differently in different apps.
(In reply to comment #31)
> Where is the list of word and non-word chars documented?

It's embedded into affix file of hunspell dictionary (it's different for each language).

> Will unicode chars like ´ also be seen as part of the word?

Yes

> Right now, though Hunspell is used in lots of apps, the interface of all apps
> is different. Opera is very different from OOo and Mozilla in this. which makes
> the spellchecker behave differently in different apps.

With my patch the behavior of mozilla spell checker it's very similar to the OOo 3.2 one.
Would it be possible to mention the release or date the fix probably will be in released software?
smontagu: ping
Comment on attachment 419239 [details] [diff] [review]
Fix a regression of previous patch

>-  // all other punctuation
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;

This seems wrong, because it will treat a sequence of "." of any length as part of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD for a single "."

>+  
>+  // all other punctuation (only if it is at start or end of the word)
>   if (charCategory == nsIUGenCategory::kSeparator ||
>       charCategory == nsIUGenCategory::kOther ||
>       charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+      charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
>+  }

This is also problematical: for example something like "apples/oranges/lemons" will be treated as a single word.
Attachment #419239 - Flags: review?(smontagu) → review-
Attached patch A fixed version (obsolete) — Splinter Review
(In reply to comment #36)
> (From update of attachment 419239 [details] [diff] [review])

> This seems wrong, because it will treat a sequence of "." of any length as part
> of the word (etc. etc.. etc...). I think it should only return CHAR_CLASS_WORD
> for a single "." 

Yes, that's right, but it shouldn't a problem, because hunspell can handle that (see above).
In any case, I fixed its in my new patch.

> This is also problematical: for example something like "apples/oranges/lemons"
> will be treated as a single word.

Yes, but the new versions of hunspell can handle this with the WORDCHARS and BREAK directives (where you can specified if "/" char is (or not) a word char for the current dictionary), so hunspell should internally divide the single string "apples/oranges/lemons" into the 3 words.

In other words, the current implementation of mozilla spellchecker define for all languages what chars are word_chars. With my patch, it's defined into hunspell for each language.

I saw that the current English dictionary don't support the char "/" as a break... so it should be updated!
Attachment #419239 - Attachment is obsolete: true
Attachment #431603 - Flags: review?(smontagu)
Comment on attachment 431603 [details] [diff] [review]
A fixed version

Sorry, I'm not confident that I know the spellchecker code well enough to review this properly.
Attachment #431603 - Flags: review?(smontagu) → review?(ryanvm)
Comment on attachment 431603 [details] [diff] [review]
A fixed version

I'm not a spellcheck peer. Maybe smaug?
Attachment #431603 - Flags: review?(ryanvm) → review?
Attachment #431603 - Flags: review? → review?(Olli.Pettay)
I'll try to review this during weekend.
When a dictionary is wanted to test this with, you could use the new Dutch one at www.opentaal.org : http://data.opentaal.org/opentaalbank/spellingcontrole200/woordenboek_nederlands-2.2.4-fx+tb+sm.xpi

A test page is here: 
http://www.opentaal.org/instructies
(Sorry, I'm a bit late with reviewing)
Comment on attachment 431603 [details] [diff] [review]
A fixed version

>diff -r 3b49923ba8c5 extensions/spellcheck/src/mozInlineSpellWordUtil.cpp
>--- a/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 09:07:45 2010 +0100
>+++ b/extensions/spellcheck/src/mozInlineSpellWordUtil.cpp	Wed Mar 10 11:48:44 2010 +0100
>@@ -867,22 +867,52 @@ WordSplitState::ClassifyCharacter(PRInt3
>       return CHAR_CLASS_SEPARATOR;
>     if (ClassifyCharacter(aIndex + 1, false) != CHAR_CLASS_WORD)
>       return CHAR_CLASS_SEPARATOR;
> 
>     // char on either side is a word, this counts as a word
>     return CHAR_CLASS_WORD;
>   }
> 
>-  // all other punctuation
>-  if (charCategory == nsIUGenCategory::kSeparator ||
>-      charCategory == nsIUGenCategory::kOther ||
>-      charCategory == nsIUGenCategory::kPunctuation ||
>-      charCategory == nsIUGenCategory::kSymbol)
>-    return CHAR_CLASS_SEPARATOR;
>+  // the char dot is evalued by hunspell if is near to a word char
>+  // for example it can be used into abreviations
>+  if (aIndex > 0 &&
>+      mDOMWordText[aIndex] == '.' &&
>+      mDOMWordText[aIndex-1] != '.' &&
>+      ClassifyCharacter(aIndex - 1, false) == CHAR_CLASS_WORD)
>+    return CHAR_CLASS_WORD;
Nit,
if (expr) {
  statement;
}

>+    
>+   if (charCategory == nsIUGenCategory::kSeparator ||
>+       charCategory == nsIUGenCategory::kOther ||
>+       charCategory == nsIUGenCategory::kPunctuation ||
>+       charCategory == nsIUGenCategory::kSymbol) {
>+    PRBool start = true, end = true;
s/true/PR_TRUE;

>+    for (PRInt32 aIndexTmp = aIndex; start && aIndexTmp >= 0; aIndexTmp--) {
variable names which start with 'a'+Capital letter are reserved for
parameters.

--aIndexTmp. (It is slightly faster if compiler doesn't optimize)

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        start = false;
PR_FALSE

And again
if (expr) {
  statement;
}

>+    }
>+    for (PRInt32 aIndexTmp = aIndex; !start && end && aIndexTmp <= PRInt32(mDOMWordText.Length()) - 1; aIndexTmp++) {
Same thing with variable name and postfix/prefix increment.

>+      // this will classify the current character
>+      nsIUGenCategory::nsUGenCategory
>+      charCategoryTmp = mWordUtil->GetCategories()->Get(PRUint32(mDOMWordText[aIndexTmp]));
>+      if (charCategoryTmp != nsIUGenCategory::kSeparator &&
>+          charCategoryTmp != nsIUGenCategory::kOther &&
>+          charCategoryTmp != nsIUGenCategory::kPunctuation &&
>+          charCategoryTmp != nsIUGenCategory::kSymbol)
>+        end = false;
if (expr) {
  statement;
}


>+    }
>+    if (start || end)
>+      return CHAR_CLASS_SEPARATOR;
And same here.

The current code seems to use true/false in few cases, but it shouldn't.
No need to change those cases in this bug though.

Some tests would be great. Perhaps based on English and reftests.
Attachment #431603 - Flags: review?(Olli.Pettay) → review-
Also, are there cases when the patch iterates over mDOMWordText many times?
Do we really need to go from aIndex to 0 every time?
..or from aIndex to end mDOMWordText.Length()
Massimeddu, would you be interested in updating the patch based on Olli's comments?
(In reply to comment #46)
> Massimeddu, would you be interested in updating the patch based on Olli's
> comments?

I'm sorry, I have a lot of work these days... but I can recheck my patch during this weekend.

I can fix the code-style issues, but for now I don't know how to reduce the number of iterations...

In any case I think that the problem of the number of iterations is not critical, because these iterations are made only for the words that have into the string a kSeparator, kOther, kPunctuation or kSymbol char, and only one time at page loading or only for the word that you are current editing.

Any comment is well accepted!
Is there any chance of progress on this issue? Dutch spell checking is functionally bad because of this issue!
I agree, this improvement is not resource-critical, but important for the correct spell checking. By the way, hyphen (-) has to be the part of the word, but n-dash (U+2013: –) and m-dash (U+2014: —) don't, because they are always words separators. What are the kSymbol characters?
French spell checking is also flawed due to this issue. And users are eager to have this issue solved. :)
Just adding that Portuguese spell-checking is broken because of this. And the people who provide us the dictionaries for pt-PT are complaining about this and saying it looks bad on them :(.
I'd be happy to review an updated patch.
Olli, since you reported the issues, and Massimedu clearly has not had time for a long time, could you not change code yourselves, and have it reviewed by somenone else?
This is a serious problem for Irish spell checking also.  I commented to this effect in the related (duplicate?) Bug 360240.  In short, we have correct words like "dea-scéal" where the first bit "dea" is not a correct standalone word.   We also have words like "t-anam", "n-anam" which are very common, and unfortunately the one-letter prefixes "t" and "n" are highlighted as misspelled.
Blocks: 257073
(In reply to comment #55)
> This is a serious problem for Irish spell checking also.  I commented to this
> effect in the related (duplicate?) Bug 360240.  In short, we have correct words
> like "dea-scéal" where the first bit "dea" is not a correct standalone word.  
> We also have words like "t-anam", "n-anam" which are very common, and
> unfortunately the one-letter prefixes "t" and "n" are highlighted as
> misspelled.

Also for Catalan, Occitan: bug 257073  ;P
And Scots Gaelic (h- t- n- highlighted). As with Irish, very common prefixes.
Attached patch Patch (v1)Splinter Review
Assignee: massimeddu → ehsan
Attachment #431603 - Attachment is obsolete: true
Attachment #507324 - Flags: review?(Olli.Pettay)
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So the hunspell changes are from where?
Adding "scot-free" to the dictionary (as well as updating the upstream diff) and telling Hunspell which characters are valid for breaking.
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

(In reply to comment #60)
> Adding "scot-free" to the dictionary (as well as updating the upstream diff)
> and telling Hunspell which characters are valid for breaking.

Also, see comment 37 about the BREAK directive.

But this patch is failing this reftest <http://mxr.mozilla.org/mozilla-central/source/layout/reftests/text/wordwrap-02.html?force=1>.  I need to figure out why...
Attachment #507324 - Flags: review?(Olli.Pettay)
OK, here is the problem.  Before this patch, we used to break text into words before passing it to hunspell, and we used to put a maximum of 64 characters on each word <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/src/mozSpellChecker.cpp#44>, which might make sense because words longer than that are rare in most languages.  But with this patch we may end up with large sequences of characters which are going to be passed to hunspell, so we may fail that max length check often, which causes us to consider the word as misspelled.

This check has been added in bug 340050.  I need to investigate whether turning it off is safe or not, but I don't think that's safe to do for Firefox 4.

I think at this point our best bet is to add spellcheck="false" to the textarea element in wordwrap-02.html.  Smaug, what do you think?
Besides the -, best option is to get all wordchars from the wordchars-clause in the affix file.
The clause BREAK specifies which char could be used as a word breaking char; default is -.
Lnaguages (mostly eastern) without word separation chars also exist. I don't know how that sould be handled.
the discussion in the openoffice.org bugtracker took 3 years or so until the right fix was implemented. It would be great if the discussion here could be kept short now by simply looking at the resulting implementation in openoffice.org, which every dictionary maintainer seems to be happy with.
again: http://www.openoffice.org/issues/show_bug.cgi?id=64400
The happiness of dictionary maintainers is less important than happiness of 
our users ;)
if you think so ...

I'd say that dictionary maintainers are the linguistic experts of their language from a spell-checker-technical point of view - which is most important at this place. I can tell you that the German Thunderbird users are not happy at all with the spell checker's word separation. The OOo users *are* happy since a while.
In some languages (at least Maltese), some characters are both word-break characters AND part of the word (in our case, the preceding word). 

For instance "tal-aħħar" should be broken into "tal-" and "aħħar". Hunspell should be able to handle the text if it is passed in as is - it will break it into two words correctly. It's a bit like the prefix "semi-", which can be applied arbitrarily to many words, except that the word "tal" without the dash would be a misspelling.
I mean, if we back out bug 340050, and that causes crashes, end users certainly
won't be happy. Also, if we're going to take a fix to FF4, it must be very safe.

Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
64 is not very unreasonable. Even in Finnish one could probably construct 
longer, even vaguely valid and usable, words.
(In reply to comment #65)
> The happiness of dictionary maintainers is less important than happiness of 
> our users ;)

Do you think we, dictionaries maintainers, don’t care of users?
They complain to us because they don’t know this is your fault. ;)
I think, by now, we agree this should be fixed. That is very positive.
It will move Mozilla forward to same quality level of spell checking as OOo has.
So let's stop discussing the need for it, and get it to work.
sorry, the risk factor here is that the spell checker we were using at one point resulted in a scary crash. scary crashes are worth $3,000 to a friendly developer and worth tens or hundreds of thousands of dollars to malicious people.

Someone will have to spend some time to ensure there are some good crash tests for long words (first against the old version which was crashing) and then determine if the same formerly crashing long words would crash if the size check is removed from the current codebase.

We all appreciate the hard work that developers do and the work of linguists to improve the user experience for their target audience, but please understand that the *safety* of our users comes first. If someone isn't sure if a word is spelled correctly, they can use an online dictionary (or perhaps decide "oh, i know this is right"). If our software isn't safe, it's possible that the individual user loses thousands of dollars (or euros if you prefer). Users who lose lots of money are considerably less happy than users who lose a half minute checking things with an online dictionary.

R Baars, please read https://bugzilla.mozilla.org/page.cgi?id=etiquette.html
especially 'no obligation'.

If you want to write a patch, you're welcome to do so. if you want to pay someone to write a patch, you can probably find an eager contractor somewhere (don't ask me, this isn't the forum and i don't know).

Please note that a patch to remove the size check should really come with crash tests (and really that should be part one as i described above) and then an adjustment of the length check.
Bugs won't get fixed because you're afaid that you introduce new bugs that you have promised to pay people thousands of dollars for. Mozilla is getting rediculous, really.
Timeless does not speak for all of Mozilla, but his point about needing some test coverage is a good one, if made quite excessively.  Once FF4 is completed, it will be easier to take such changes as well, since there will be more time to identify any fallout from it.

Before any such thing is done, though, a patch with some coverage for at least the functional changes is needed.  It sounds like there's something from OpenOffice to work from, so this is a great time for someone to ramp up on the code and produce a patch, so that more experienced people can help with reviews and shepherding it in once the decks are clear from FF4.
Please people, let's try to be nice to each other.  I'd say most people here are working as a volunteer, so we should understand that there are no obligations whatsoever.  Nevertheless, if we didn't care for this application we wouldn't be following this thread would we?

Of course on-line security is *a lot* more important than accurate spell-checking (regardless of the dollars involved), event though to most users spell-checking is more visible.  As I understand (but mind that I'm not a Mozilla developer) Olli Pettay is saying that it's probably equally safe to increase the UNREASONABLE_WORD_LENGTH to a longer value -- not remove the check! -- that would satisfy most users.  Can that be done without changing the safety of the code?
I think that's likely a safe enough change for FF4, but would defer to Ehsan.
(In reply to comment #63)
> Besides the -, best option is to get all wordchars from the wordchars-clause in
> the affix file.
> The clause BREAK specifies which char could be used as a word breaking char;
> default is -.
> Lnaguages (mostly eastern) without word separation chars also exist. I don't
> know how that sould be handled.

So, the story with Mozilla's spell checker right now is that we tokenize words before passing them to the spell checker.  Before this patch, we used to break words on separators, punctuations, symbols, and others, according to the Unicode character classes, which caused us to pass "scot" and "free" as separate words to the spell checker, instead of "scot-free".  This is suboptimal, because word breaking rules are language dependent, and it seems like hunspell's tokenizer supports handling that using the WORDCHARS and BREAK directives.  But BREAK directives were completely ignored before this patch, *unless* you chose to broke on a letter or something (I'm not sure that's needed in any language).

With this patch, what changed is that we don't break on those four character classes, *unless* we find two or more of them in a sequence, in which case, we break on every one of them.  In other words, we'll pass "scot-free" to hunspell if we find "scot-free" in the source text, but we'll pass "scot" and "free" to hunspell if we find "scot--free" in the source text.  This still might be suboptimal for some languages, but it's a small improvement over the old behavior.

But please note that we still don't honor BREAK and WORDCHARS rules in our own tokenizer.  This patch makes the BREAK directive kind of useful, but the WORDCHARS directive is still pretty useless.  But that's the most that I would be convinced we should do for Firefox 4.

For a future version of Firefox, I think we should experiment with throwing out our own tokenizer in favor of hunspell's, and I think we should look into how we can integrate its tokenizer with our DOM word reader utils.  It's not as easy as passing raw textnode data to hunspell (because we want things like "va<span>lid</span>" to pass spell checking, for example), but I think we can get there.  Once we hand off tokenization to hunspell completely, dictionary authors can take full advantage of BREAK, WORDCHARS, and other hunspell tokenization wonders.

Nemeth: please correct me if I'm wrong, and let me know if you don't think this is the direction we should be headed forwards to.
(In reply to comment #64)
> the discussion in the openoffice.org bugtracker took 3 years or so until the
> right fix was implemented. It would be great if the discussion here could be
> kept short now by simply looking at the resulting implementation in
> openoffice.org, which every dictionary maintainer seems to be happy with.
> again: http://www.openoffice.org/issues/show_bug.cgi?id=64400

That discussion is kind of long, and hard to follow.  I specifically looked for links to commits, etc, which shows how OOo handles the tokenization, but couldn't find anything.  Can someone please point me to OOo's code for tokenization?  Has OOo moved in the same direction as what I laid out in comment 76?
(In reply to comment #68)
> I mean, if we back out bug 340050, and that causes crashes, end users certainly
> won't be happy. Also, if we're going to take a fix to FF4, it must be very
> safe.

Agreed.  I think disabling the check in bug 340050 is something we want to experiment with in the nightlies post Firefox 4, but definitely not something we'd want to try close to a release.  I filed bug 629719 about that.

> Ehsan, could we just increase UNREASONABLE_WORD_LENGTH?
> 64 is not very unreasonable. Even in Finnish one could probably construct 
> longer, even vaguely valid and usable, words.

Agreed.

So, I read <http://en.wikipedia.org/wiki/Longest_words>.  It gave me the chills, as it seems like there is at least one valid word in English with 189,819 letters!  But let's not support that for Firefox 4  :-)

Hungarian is also pretty scary, but the largest word that I think we should support without scaring me too much is the Swedish Nordöstersjökustartilleriflygspaningssimulatoranläggningsmaterielunderhållsuppföljningssystemdiskussionsinläggsförberedelsearbeten with 130 characters.  So, I propose raising the limit to 130 for Firefox 4.

(Yes, I know it's arbitrary, and it's not really a scientific research, but let's be honest, I bet the number 64 wasn't also picked after years of research too... ;-) )

We should definitely try to remove this limit in some future version of Firefox altogether.
(In reply to comment #73)
> Before any such thing is done, though, a patch with some coverage for at least
> the functional changes is needed.

My patch includes _some_ tests which make me feel not too worried about taking this for Firefox 4.  I also filed bug 629734 about integrating hunspell's test suite into our own.  But we ultimately need a fuzzer for spell checking (yeah, not functional tests, I know).  Bug 629716 covers that.

> It sounds like there's something from
> OpenOffice to work from, so this is a great time for someone to ramp up on the
> code and produce a patch, so that more experienced people can help with reviews
> and shepherding it in once the decks are clear from FF4.

I'd definitely like to see OOo's code here, and I'd very much like to see patches, but I'm afraid I won't personally have enough time to work on this more before Firefox 4...
Attachment #507912 - Flags: review?(Olli.Pettay)
Attachment #507917 - Flags: review?(roc)
Attachment #507324 - Flags: review?(Olli.Pettay)
Comment on attachment 507912 [details] [diff] [review]
Part 2: raise the maximum letter count for spell checking

The number of 139...
I guess you mean 130.
Attachment #507912 - Flags: review?(Olli.Pettay) → review+
Attachment #507324 - Flags: review?(Olli.Pettay) → review+
(In reply to comment #82)
> The number of 139...
> I guess you mean 130.

Heh, of course! Sorry about the typo!
Attachment #507912 - Attachment is obsolete: true
Attachment #508164 - Flags: approval2.0?
Attachment #507324 - Flags: approval2.0?
CCing some drivers here for approval decision.  If we want to take this, we should take it for beta11.
This seems a lot like a new feature to me and feature freeze was 4 betas ago. Am I missing something?
This is a fix for breaking spell-checking in several languages. It could also be a regression bug since it works well worst than previous versions.
At least on my mac version 3.6 also marks scot-free as a spelling error.

If this is a regression, please point out what worked in FF3.6 but no longer works and add the 'regression' keyword.
Jonas, sorry, I shouldn't really comment on bugzilla at 3am. This is not a regression bug on Firefox (this broken behaviour is present since Firefox 2.0). What I meant is that hunspell handles this correctly and Firefox doesn't.

I understand your worries, and the need to follow the decisions, it's just that it leaves me feeling we got the worst end of the deal: we report an important problem for our users (several times, for the different languages), it gets classified as something not important, we need to raise awareness about it's importance, and when there's finally a patch, we're told it's to late to fix it and that we'll just have to deal with more user complaints when Firefox 4 is finally released. It's a PITA.
Yeah, I understand that. I think given that next release is just a few months out, and that we've had this behavior for several years now without users complaining too much, it's better to play it safe.

In other words, since we can get away without taking this fix, I think we should.
I'm not so sure they're not "complaining too much" - maybe their complaints just get stuck with the localisation teams... gd has only been out a few months and I've had several "queries".
For Portuguese it's the worst timing possible. Both pt-PT and pt-BR are going to spell changes in the language. People are paying more attention to what it's written and complaining.

We get complains from users and from the people supplying us with the dictionaries, complaining that it's a false positive for them :(.
If we like the patch and it sticks, why not ship it in Firefox 4?

The fear of regressions delaying 4 is good but I don't see it here. More like we won't be perfect yet (as if!). What am I missing?

/be
How I this different from a new feature? Why isn't there a fear of regressions just like with any other patch?
How about analyzing the risks of the patch? Generalizing as if every patch carried equal and large risk is no good. We're taking lots of fixes still, not classified as "feature changes" but arguably some are. We could stop in a brain off way (if only we could agree on "feature") but brain-on is always better.

What's the brain-on analysis of this patch's risks?

/be
Here is an analysis on why I picked this up, and why I think we should take this for Firefox 4.  This is a very high impact fix for a lot of European languages.  The bug itself is not a regression, and the fix is very low impact on en-US.

The code change itself is relatively small and safe.  It doesn't change the performance characteristics of the spell checker, and it doesn't regress the spell checking behavior in en-US.

The only risk that I can think of here is that this will need dictionary authors to add the appropriate BREAK directives to their dictionary affix files, if they don't have them there already.  What this means that without updated dictionary affix files, phrases such as "apples-oranges" (exactly that, without spaces before or after hyphens) will be marked as incorrectly spelled.  If we decide to take this, I'm planning to make a blog post about this fix, and we can communicate this to the dictionary authors on AMO (the fix is simple, and the target audience is relatively small).

The escape plan here is also simple.  There will be one patch to backout in case we have any regressions here (I will fix any regressions by a backout, because I won't have enough time to look into them anyways!).  Ryan and I will monitor crash-stats post landing and will back this out in case we spot any crashes after this lands.

I believe that this analysis should be enough in order to enable the drivers to reach a conclusion here.  I'd insist that *if* we want to take this, we should take it for beta11.  In fact, after much thought on this, I'll a- this myself if we don't have a decision here by the b11 freeze date (TBD).  :-)
I definitely think that this should be included in the earliest beta possible so that testing can start. This bug is a significant problem for many languages and should not be shrugged off so easily.
I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the dev meeting tomorrow.
I'm going to recommend we take this one. Thanks for the work and analysis, Ehsan.

/be
About adding the BREAK to the dictionary: this is not required, since BREAK has a default of - when not specified.
(In reply to comment #97)
> I think beta11 freeze is tomorrow, tuesday. I recommend bringing this up at the
> dev meeting tomorrow.

OK, will do.
(In reply to comment #99)
> About adding the BREAK to the dictionary: this is not required, since BREAK has
> a default of - when not specified.

The default might not be enough, as it only breaks on hyphens.
You should have a look on these English dictionaries:
http://extensions.services.openoffice.org/en/project/dict-en-fixed

They have been made by Németh László to fix main issues when OpenOffice.org got the new tokenization behaviour (same as what you want to do here).
Comment on attachment 507324 [details] [diff] [review]
Patch (v1)

So, I brought this up in the developers meeting today, and it turns out that this is going to have l10n impact, because we ship some dictionaries with localized versions of Firefox.  This is too late in the cycle to introduce such an l10n churn, so I'd recommend that given the new information, we should not take this for Firefox 4.

I will land this soon after Firefox 4, though.
Attachment #507324 - Flags: approval2.0? → approval2.0-
Attachment #508164 - Flags: approval2.0? → approval2.0-
Whiteboard: [post-2.0]
Note to myself: this is the list of languages with which we ship a dictionary.
Is this .x fodder, or Firefox 5?

/be
It seems that dictionaries need updating, and thus this would be add-on compat. I don't think we should take this on .x. Let alone of the effort to fix and ship our own dictionaries. But that'd be easier to talk about if we knew which are affected how.
(In reply to comment #106)
> Is this .x fodder, or Firefox 5?

I'd say that this is wanted-next, but we don't have that flag in Bugzilla yet!

But it's definitely not something that we'd want to take in a dot release.
Guys, I think you are mistaken on the dictionary issue.
Please validate your assumptions with Laszlo Nemeth.
As far as I know, any existing old dictionary should function correctly with Hunspell 1.3.x, but 1.3.x has additional features that CAN be used by the dictionary people. So new dictionaries CAN be introduced. For sure, the mext release of the Dutch dictionary, on which I worked parallel with Laszlo's code changes, wil require 1.3.2 at least. It would be a big disappointment when spelling support in Firefox will soon be quite outdated compared to the one the platform and text processor will offer, especially since FF is used more and more for text entry.
We're not using hunspell 1.3.x in Firefox 4.
This is the mxr query for BREAK in the affix files we ship with our localizations: http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$

It doesn't look like a lot of BREAK support made it into the dictionaries we ship.

I'm reading through this bug for the second time only, did we get a concrete pointer to "the plan" that OOo put through, beyond the 104 bug comments over there?
(In reply to comment #111)
> This is the mxr query for BREAK in the affix files we ship with our
> localizations:
> http://mxr.mozilla.org/l10n-central/search?string=BREAK&find=aff$
> 
> It doesn't look like a lot of BREAK support made it into the dictionaries we
> ship.

It doesn't matter, as those BREAK directives will be ignored until this bug is fixed.

> I'm reading through this bug for the second time only, did we get a concrete
> pointer to "the plan" that OOo put through, beyond the 104 bug comments over
> there?

I'm still waiting for a response to comment 77!
I think if we wait for someone to answer comment 77 OOo questions we will be waiting a while.

I still wonder if we wouldn't be better off with this fix now, and fix the shipped dictionaries as soon as we can. Since - already breaks, what do we make worse by shipping this bug's patch ahead of dictionaries with added BREAKs that it would respect?

Sorry if this is a dumb question.

/be
Using the - as word separator was normal in OOo up to 3.1. It was only fixed for 3.2. OOo 3.2 had the - as part of a word. However, it would use the break clause from Hunspell if specified. Hunspell itself used the - as a separator when break was not specified.

So there was full backwards compatibility.
Since then, dictionary developers started to use the new functionality, having the - in a word, changing the dictionary break clause and adding appropriate words.

So by now, all existing well-maintained dictionaries should be ready for this change in Mozilla. I really don't see the problem.

In case a less well-maintained dictionary fails, it is about time it was updated. It will only fail in quite unusual words.

So actually: Mozilla spell-checkign is failing now for a lot of languages, and MIGHT fail for some after fixing this scot-free issue.

So what is worse?
Lets get it in soon. Maybe not in 4.0, but in 4.1. I see no need whatsoever (except unbased fear) for postponing it to 5.
If comment 114 is accurate (my preceding comment came down on the same side) then we should take this patch now, in Fx4.

What is the counterexample, the regression from where we are now that this patch causes without the shipped dictionaries being extended?

/be
(In reply to comment #77)

OpenOffice.org has its own tokenization using IBM ICU. Six years ago I asked to add n-dash and m-dash to the Hungarian word characters (see this patch: http://www.openoffice.org/nonav/issues/showattachment.cgi/30686/breakiterator.diff, in http://www.openoffice.org/issues/show_bug.cgi?id=56347). This was a big improvement in Hungarian spell checking.

Partially thanks to this Mozilla issue, handling *n-dash*, as word character was extended for all languages: http://qa.openoffice.org/issues/show_bug.cgi?id=64400. For back-compatibility, Hunspell has a default tokenization for words with n-dashes (ie. there is no need any dictionary modification for Mozilla). Some languages, like Hungarian has different tokenization rules in OpenOffice.org (see the previous patch), so Hunspell is able to get also the words with m-dashes during Hungarian spell checking.

About dot handling

Several languages use dots after abbreviations. To spell check these forms,  OpenOffice.org Hunspell dictionaries stores these forms only with dots, eg. "dr.". Hunspell recognizes also other words with dots, eg. sentence ending words with period (checking them without dots, too). Hunspell suggestion list doesn't contain extra dots, eg.

"dR."->"dr.", "dl"

So OpenOffice.org may extend it before the text replacement to keep the period:

"dr.", "dl."
I'm sorry, hyphen is the general word character in OpenOffice.org, n-dash is not.

(In reply to comment #115)
There is no need to modify the Mozilla dictionaries. The suggestion mechanism of Hunspell works for the longer bad input words with hyphens, too, eg.

$ hunspell -d en_US -1
element-siz	
& element-siz 15 12: element-six, element-size, element-sis, element-sin, element-sir,
This is not about whether hunspell can provide us with good suggestions.  The hyphen is the only character that hunspell can handle as the separator by default.

An example of a regression with this patch without the BREAK directives in the affix file is data:text/html,<textarea>apples.oranges</textarea>.

Without this patch, there's no spelling mistake flagged.  With this patch sans the BREAK directives, the entire text gets flagged as a spelling mistake.
(In reply to comment #118)
> An example of a regression with this patch without the BREAK directives in the
> affix file is data:text/html,<textarea>apples.oranges</textarea>.
> 
> Without this patch, there's no spelling mistake flagged.  With this patch sans
> the BREAK directives, the entire text gets flagged as a spelling mistake.

It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a space after full stop (and failed to capitalize oranges). If it's code (C, C++, JS, etc.) then it ought not be spellchecked.

Is that the worst case?

/be
(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

No.  data:text/html,<textarea>I don't know him/her.</textarea>

With that, "him/her" would be flagged as misspelled.  While I would argue that we shouldn't catch "apples.oranges" too, "him/her" should definitely not be flagged.
(In reply to comment #119)
> (In reply to comment #118)
> > An example of a regression with this patch without the BREAK directives in the
> > affix file is data:text/html,<textarea>apples.oranges</textarea>.
> > 
> > Without this patch, there's no spelling mistake flagged.  With this patch sans
> > the BREAK directives, the entire text gets flagged as a spelling mistake.
> 
> It's a punctuation mistake. :-/ I would want it flagged if I failed to enter a
> space after full stop (and failed to capitalize oranges). If it's code (C, C++,
> JS, etc.) then it ought not be spellchecked.
> 
> Is that the worst case?
> 
> /be

Punctuation mistakes should be found by an add-on what communicates with LanguageTool, see:
  https://addons.mozilla.org/en-Us/thunderbird/addon/grammar-checker/
and
  http://sourceforge.net/tracker/?func=detail&aid=2943212&group_id=281835&atid=1195603
(In reply to comment #116)
> (In reply to comment #77)
> 
> OpenOffice.org has its own tokenization using IBM ICU.

Does that mean that we have to do something similar to that, too?  I thought that hunspell is capable of handling the tokenization based on the affix file rules...
Re: "him/her" -- good one, I give.

Still sad that we have to ship something that certain language users find so badly broken, one more time.

/be
I've been talking about this with Brendan a bit more, and I think we may have a good middle-ground for Firefox 4.  I'm proposing that we can only exempt the hyphen (both its ASCII and Unicode variants) from the word breaking logic.  This should have minimal risk, as those two characters are the default value for the BREAK directive in hunspell.

Brendan is going to bring it up in the triage meeting today, and if we get a go, I'll write the patch.
(In reply to comment #124)
> I've been talking about this with Brendan a bit more, and I think we may have a
> good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> This should have minimal risk, as those two characters are the default value
> for the BREAK directive in hunspell.

Hunspell handles only the hyphen (U+002D), and the terminating dots, but not n-dash (U+2013).
(In reply to comment #125)
> (In reply to comment #124)
> > I've been talking about this with Brendan a bit more, and I think we may have a
> > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > This should have minimal risk, as those two characters are the default value
> > for the BREAK directive in hunspell.
> 
> Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> n-dash (U+2013).

So I guess we should only consider ASCII hyphens in that case.
This is a low risk version of the patch for 2.0 based on comments 124-126.
Attachment #510837 - Flags: review?
Attachment #510837 - Flags: approval2.0?
Attachment #510837 - Flags: review? → review?(Olli.Pettay)
(In reply to comment #126)
> (In reply to comment #125)
> > (In reply to comment #124)
> > > I've been talking about this with Brendan a bit more, and I think we may have a
> > > good middle-ground for Firefox 4.  I'm proposing that we can only exempt the
> > > hyphen (both its ASCII and Unicode variants) from the word breaking logic. 
> > > This should have minimal risk, as those two characters are the default value
> > > for the BREAK directive in hunspell.
> > 
> > Hunspell handles only the hyphen (U+002D), and the terminating dots, but not
> > n-dash (U+2013).
> 
> So I guess we should only consider ASCII hyphens in that case.

You mean: you should only consider the hyphen (which happens to be ASCII).  N-dash is NOT a hyphen.
Attachment #510837 - Flags: review?(Olli.Pettay)
Attachment #510837 - Flags: review+
Attachment #510837 - Flags: feedback?(l10n)
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

Looks good
Attachment #510837 - Flags: feedback?(l10n) → feedback+
Attachment #508164 - Flags: approval2.0- → approval2.0?
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

a=beltzner
Attachment #510837 - Flags: approval2.0? → approval2.0+
Attachment #508164 - Flags: approval2.0? → approval2.0+
http://hg.mozilla.org/mozilla-central/rev/1db71fcbc712
http://hg.mozilla.org/mozilla-central/rev/35437a194bbc
http://hg.mozilla.org/mozilla-central/rev/f323dcb530a2
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Whiteboard: [post-2.0]
Target Milestone: --- → mozilla2.0b12
Thank you all for all the work!

Will happily test it tomorrow when the nightly updates. Thank you very much.
(In reply to comment #132)
> Thank you all for all the work!
> 
> Will happily test it tomorrow when the nightly updates. Thank you very much.

Please do.  If you find any problems, please file a new bug in the Core::Spell Checking component.
This seems cause a single hyphen standing alone '-' to be marked as a spelling mistake. That doesn't seem like a desired result.
I've filed bug 633210 for the new issue.
Depends on: 633210
Verified on this textarea with: "fazê-la fá-la" and the pt-PT dictionary with no changes.

Minefield 4.0b12pre (2011-02-10)

Thank you very much!
Depends on: 634153
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [post-2.0]
This was backed out in bug 634153 because of regressions.  I'll look at it again after Firefox 4.
Leaving a comment for later then:

The code should check for "-", "foo-" and "-foo". These strings should not be passed on to the spellchecker.

Best regards,
João Miguel Neves
Comment 138 is not correct for Dutch. A correct Dutch phrase is:
   mond- en klauwzeer
which is a standard shorter notation for
   mondzeer en klauwzeer.
The end-hyphen is possible in all places where a word could be compounded:
   spannings- and vervelingstest

A lot of effort has been put in having Hunspell understand this behaviour and to support it.

It now even understands spannings is wrong, and suggests both spanning and spannings-
Attachment #508164 - Flags: approval2.0+ → approval2.0-
Comment on attachment 510837 [details] [diff] [review]
Low-risk patch for 2.0

(bookkeeping)
Attachment #510837 - Flags: approval2.0+ → approval2.0-
Ehsan, are you going to give this another shot now that bug 620626 has landed?
(In reply to comment #141)
> Ehsan, are you going to give this another shot now that bug 620626 has landed?

Yes, hopefully in the Firefox 6 timeframe (tentatively).
Thanks to Ehsan for trying last time -- it was not the right time (sorry about that on my side as release-drivers intercessor), but we also learned some things, which does help for next time.

This is not a justification, more of a lemons -> lemonade point, but with more positive connotations, IMHO.

/be
This patch should be applied on top of the previously attached patches in this bug (I'm attaching it separately to ease the review).

This fixes the regression in bug 634153 which caused us to back out the patch for this bug in 2.0.

The cause of that regression was that the code for detecting conditional punctuations was not updated to take account this hunk (added as part of this bug).

+  // The dot character, if appearing at the end of a word, should
+  // be considered part of that word.  Example: "etc.", or
+  // abbreviations
+  if (aIndex > 0 &&
+      mDOMWordText[aIndex] == '.' &&
+      mDOMWordText[aIndex - 1] != '.' &&
+      ClassifyCharacter(aIndex - 1, false) != CHAR_CLASS_WORD) {
+    return CHAR_CLASS_WORD;
+  }


This patch simply checks the preceding and following charatcers to explicitly make sure that they're not dot's which are masked as word-chars.  If we find such a character, we just recognize the conditional punctuation character as a word separator (because the only case such characters can be considered as part of the word is if they appear in the middle of a word (such as "he's").
Attachment #526149 - Flags: review?(Olli.Pettay)
I prefer to land this on trunk as soon as possible so that we have enough time to deal with possible regressions before we branch to aurora.  If this gets onto aurora and we determine that we can't ship with it, I will deal with it by backing the patch out of aurora (the way I did for 2.0).
Comment on attachment 526149 [details] [diff] [review]
Fix to bug 634153

Can we get a testcase for this, please.
Attachment #526149 - Flags: review?(Olli.Pettay) → review+
(In reply to comment #146)
> Comment on attachment 526149 [details] [diff] [review]
> Fix to bug 634153
> 
> Can we get a testcase for this, please.

I already landed a test case for this as part of bug 634153: <http://mxr.mozilla.org/mozilla-central/find?text=&kind=text&string=dotafterquote>
http://hg.mozilla.org/mozilla-central/rev/605f6d7309fa
http://hg.mozilla.org/mozilla-central/rev/97acf2219d5a
http://hg.mozilla.org/mozilla-central/rev/a2814ef64e6d
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Whiteboard: [post-2.0]
Target Milestone: mozilla2.0b12 → mozilla6
I had mistakenly marked a test failing on Android here, here's the fix: http://hg.mozilla.org/mozilla-central/rev/281b86cd0403
this is still not fixed. Compare OOo with the German dictionary which correctly accepts all of the words in "Arbeits- und Familienrecht" while Firefox still complains about "Arbeits-" as it just checked "Arbeits" without the dash.

As mentioned above, please make the tokenizer behave like the one in OpenOffice.

Can someone with apropriate rights please reopen this bug?
In addition to Björns remark: it is not fixed when the hyphen is at the end (maybe start too) of the word. It is fixed for in-word positions (as far as I can tell).
Probably better to take the discussion of the last two comments over to bug 632977 and leave this one alone.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: