Closed Bug 360240 Opened 18 years ago Closed 3 years ago

Spell checking should allow for word groups, abbreviations, hyphenated words

Categories

(Core :: Spelling checker, defect)

defect
Not set
normal

RESOLVED WORKSFORME

People

(Reporter: hendrik, Unassigned)

Details

Sometimes a word cannot occur by itself but has to be followed or preceded by another one in order to be correct. Some spell-checking word lists contain such word groups. However, FF still marks the offending word as incorrect. The same applies to common abbreviations and to words with dashes. Examples: in Dutch, ‘laag-bij-de-gronds’ is correct, and it is in the word list being assembled by the OpenTaal project (www.opentaal.org; the list can be found here: https://www.uitwisselplatform.nl/frs/download.php/178/nl_NL-pack_b3.xpi, although there still seems to be a problem with the packaging). When this language is chosen, ‘gronds’ is marked as wrong. That would be right if it stood alone, but it should be accepted in the above phrase. The same applies to English abbreviations like ‘i.e.’, where the ‘i’ is marked as wrong. Another example is ‘a priori’, where ‘priori’ should not be considered correct unless the ‘a’ is present.
laag-bij-de-gronds is an existing word in the dictionary downloaded here: https://addons.mozilla.org/firefox/3291/
Assignee: nobody → mscott
Component: Form Manager → Spelling checker
Product: Firefox → Core
QA Contact: form.manager → spelling-checker
Version: 2.0 Branch → Trunk
(In reply to comment #1)
> laag-bij-de-gronds is an existing word in the dictionary downloaded here:
> https://addons.mozilla.org/firefox/3291/
Can be, but that doesn’t change anything. It accepts ‘laag-bij-de-gronds’ because it accepts every single word in it, even ‘gronds’, which is not correct. (It does occur in Dutch pages, but mostly in the combination above, where people often (incorrectly!) leave the dashes out.)
Hardware and OS should be "all"
OS: Linux → All
Hardware: PC → All
I don't know whether this should be classified as a bug or enhancement, but I agree that it would be nice for the spellchecker to support phrases. Multiple words are not treated as a single unit even when they are added to the dictionary as such.
The fact that words with dashes are not treated as single units is a bug for Irish. We have many words like "dea-scéal" for which the "dea" prefix is not a correct word on its own, but FF/TB flag it as an error. Even more bothersome (since they are much much more common) are words like "t-ainm", "n-ainm", where the single "t" or "n" is flagged as an error. If it's too complicated to solve the general problem (compare the similar OOo bug: http://www.openoffice.org/issues/show_bug.cgi?id=64400) I'd be satisfied if the spellchecker just ignored one-letter words, as is common practice for standalone checkers like ispell/aspell.
IMO, the real solution is to stop splitting hyphenated words into separate words and to check each one as a single word. Today, it is possible to add a hyphenated word to the dictionary, but the spell checker never matches the added entry, because it never checks a hyphenated word as a whole. The name Wan-Teh frequently appears in my correspondence. The spell checker always triggers on Teh. I do not want to add Teh to the dictionary, because it is a common misspelling of The. But I do want the hyphenated word to be in the dictionary, and to stop triggering spelling errors.
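To make the difference between the two strategies concrete, here is a minimal Python sketch. It is not Mozilla's or Hunspell's actual code: the word list and the function names `check_parts` and `check_whole_first` are hypothetical and exist only for illustration. The first function mimics the behaviour described in this bug (every hyphen-separated part is checked on its own), the second tries the whole hyphenated token before falling back to its parts.

```python
# Hypothetical word list, for illustration only.
DICTIONARY = {"the", "name", "wan-teh", "laag-bij-de-gronds", "laag", "bij", "de"}


def check_parts(token, dictionary=DICTIONARY):
    """Check each hyphen-separated part alone; return the parts that are unknown."""
    return [part for part in token.split("-") if part.lower() not in dictionary]


def check_whole_first(token, dictionary=DICTIONARY):
    """Accept the token if the whole hyphenated form is known, else check its parts."""
    if token.lower() in dictionary:
        return []
    return check_parts(token, dictionary)


if __name__ == "__main__":
    for word in ("Wan-Teh", "laag-bij-de-gronds", "laag-bij-gronds"):
        print(word, check_parts(word), check_whole_first(word))
```

With the whole-word lookup first, "Wan-Teh" is accepted without ever adding "Teh" to the dictionary, while a hyphenated token that is not listed as a whole still has its unknown parts reported.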
Summary: Spell checking should allow for word groups, abbreviations, words with dashes... → Spell checking should allow for word groups, abbreviations, hyphenated words
This is a problem for Ukrainian too: words with a dash are often correct even if the component words inside them are not. Currently Hunspell can handle such Ukrainian words correctly with these lines in the affix file:
    WORDCHARS -
    BREAK 1
    BREAK -
But Firefox does not pass compound words to the spellchecker.
Assignee: mscott → nobody
Hunspell can check word groups, abbreviations and hyphenated words, so this is a tokenization and spell checker usage problem in Mozilla. The recently integrated Hunspell (version 1.2.8) has an improved BREAK method with better suggestions and with default tokenization of hyphenated words (Changelog: http://sourceforge.net/project/shownotes.php?group_id=143754&release_id=637489). The hyphen character will be a word character in OpenOffice.org (http://www.openoffice.org/issues/show_bug.cgi?id=64400), and it could be one in Mozilla too, to solve this and other issues (see Bug 355178).
I can give some English examples that still get marked with a recent nightly. (Worth noting that I'm basing this on SeaMonkey trunk)
Las Vegas -- "Las" is flagged
vice versa -- "versa" is flagged
Notre Dame -- "Notre" is flagged
Abu Dhabi
children's -- irregular plural possessive
women's -- sometimes flagged, sometimes not
Los Angeles
de facto
Children's, women's are a different issue -- lack of 's possessive rule for those nouns -- and will be fixed by bug 479334. Hyphenation issue is Bug 355178. Multi-word group issue ("Las Vegas") is slightly different so I will leave this bug open, marking it dependent on Bug 355178.
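The multi-word case could be handled along the following lines. This is only a sketch in Python under stated assumptions, not how Hunspell or Mozilla implement it: the word lists and the function name `misspelled` are hypothetical. The idea is to try the longest known word group at each position before falling back to single-word lookup, so that "priori" or "Las" is accepted only inside its group.

```python
# Hypothetical word lists, for illustration only.
SINGLE_WORDS = {"a", "is", "in", "it", "true", "vegas"}
PHRASES = {("a", "priori"), ("las", "vegas"), ("vice", "versa")}
MAX_PHRASE_LEN = max(len(phrase) for phrase in PHRASES)


def misspelled(words):
    """Return the words that are neither known alone nor part of a known word group."""
    flagged = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate group first, down to groups of two words.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 1, -1):
            if tuple(w.lower() for w in words[i:i + n]) in PHRASES:
                i += n  # skip the whole group
                matched = True
                break
        if matched:
            continue
        if words[i].lower() not in SINGLE_WORDS:
            flagged.append(words[i])
        i += 1
    return flagged


if __name__ == "__main__":
    print(misspelled("it is true a priori".split()))
    print(misspelled("priori it is true".split()))
    print(misspelled("in Las Vegas".split()))
```

Keeping the group entries separate from the single-word list means "priori" on its own is still flagged, which is the behaviour the original report asks for.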
Depends on: 355178

All the referenced bugs are fixed now.
All the examples from comment 9 work now except for "Abu Dhabi" which I assume is then just missing in the dictionary.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME