Open
Bug 632977
Opened 14 years ago
Updated 1 year ago
Investigate a "full" fix for bug 355178
Categories
(Core :: Spelling checker, enhancement, P3)
NEW
People
(Reporter: RyanVM, Unassigned)
Attachments
(1 obsolete file)
The original patch posted by Ehsan in bug 355178 added numerous BREAK symbols.
https://bugzilla.mozilla.org/attachment.cgi?id=507324
The patch was nixed due to l10n risk near the release of Fx4 and replaced with a lower-risk patch that only breaks on hyphens. Once Fx4 ships, we should revisit this decision and evaluate whether additional BREAK symbols should be added.
Comment 1•13 years ago
So, I don't really know how to move forward with this. Should we add those BREAK symbols again?
I don't know about the challenges of programming this, but what it should do is roughly this:
- get the WORDCHARS clause from the hunspell affix file (if not there, assume the old default)
- get the BREAK chars definition from hunspell (if not defined, assume the old default)
- break words at all characters that are defined as BREAK or at any non-word character
- hand the complete results to hunspell for processing
So, when - is defined as a word char and not as a break char, from -auto / auto- / test-auto, hunspell gets: -auto / auto- / test-auto
When - is defined as a break char, hunspell gets: auto / auto / test / auto
Ruud
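To make the steps above concrete, here is a minimal sketch in Python. It is not Mozilla's or hunspell's code: `parse_affix` and `tokenize` are hypothetical helper names, the fallback defaults are placeholders for the "old default", and only single-character BREAK entries are handled (hunspell also allows longer patterns and `^`/`$` anchors, which this sketch ignores).

```python
def parse_affix(aff_text, default_wordchars="", default_breaks=("-",)):
    """Read WORDCHARS and BREAK entries from the text of a hunspell .aff file.

    The default_* arguments are placeholders standing in for the "old default".
    """
    wordchars = None
    breaks = None
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "WORDCHARS":
            wordchars = parts[1]
        elif parts[0] == "BREAK":
            if breaks is None:
                breaks = []  # the first BREAK line only carries the entry count
            elif len(parts[1]) == 1:
                breaks.append(parts[1])  # sketch: single-character breaks only
    if wordchars is None:
        wordchars = default_wordchars
    if breaks is None:
        breaks = list(default_breaks)
    return wordchars, set(breaks)


def tokenize(text, wordchars, breaks):
    """Split text at BREAK characters and at any non-word character."""
    tokens, current = [], []
    for ch in text:
        is_word_char = ch.isalnum() or ch in wordchars
        if is_word_char and ch not in breaks:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


# "-" is a word char and not a break char (BREAK 0 defines no breaks).
wc, br = parse_affix("WORDCHARS -\nBREAK 0\n")
keep_hyphen = tokenize("-auto / auto- / test-auto", wc, br)

# "-" is a break char and not a word char.
wc, br = parse_affix("BREAK 1\nBREAK -\n")
split_hyphen = tokenize("-auto / auto- / test-auto", wc, br)
```

Each resulting token would then be handed to hunspell as a separate word. When a character is listed both as a word char and as a break char, this sketch lets the break definition win.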
Comment 3•13 years ago
Hmm, does this mean that the tokenization should be handled by the browser, and not by hunspell?
As far as I know, Hunspell is only able to handle words, not entire texts... So I guess tokenisation is part of the app's job.
That is unfortunate, since it leads to different interfaces. Maybe it is an option to use the OOo/LibreOffice code (very old code, I heard).
(Maybe, while at it, create space to add different language plug-ins too, like grammar checkers. It is more common these days to edit text using the browser.)
Some knowledge could maybe be acquired from the Hunspell and LanguageTool people, but I am not one of them.
Comment 5•13 years ago
(In reply to R Baars from comment #4)
> As far as I know, Hunspell is only able to handle words, not entire
> texts... So I guess tokenisation is part of the app's job.
> Unfortunately, since this leads to different interfaces. Maybe it is an
> option to use the OOo/LibreOffice code (very old code, I heard).
Do you know (or know anyone who knows someone who knows) where that code lives in the LibreOffice repository? I grabbed a copy of the code and looked around, but nothing immediately jumped out at me (their code base is huge and I'm not familiar with it).
> (Maybe, while at it....create space to add different languages plug-ins too,
> like grammar checkers. It is more common these days to edit text using the
> browser. )
That would be a different feature...
> Some knowledge could maybe be acquired from the Hunspell and Languagetool
> people, but I am not them.
OK. :-) Nemeth is CCed on this bug, so I hope he can get back to us about this.
Reporter
Comment 6•13 years ago
Nemeth or Caolan, do you have any input on this bug?
Comment 7•13 years ago
For breaking text into words in LibreOffice we basically use ICU and its word boundary break iterator, i.e. http://userguide.icu-project.org/boundaryanalysis
http://opengrok.libreoffice.org/xref/core/i18npool/source/breakiterator/breakiteratorImpl.cxx#108 is the entry point into our wrapper
We feed hunspell with those words.
It would be really nice if the tokenization were fixed one day. Caolan explained how it is done in LibreOffice, where it works very well.
We really should spend some time on this bug, because it breaks spell checking in some cases.
Priority: -- → P2
Comment 10•11 years ago
Again, I think it is not BREAK characters that should be added, but NON-break-characters, being the characters the language allows in a word.
A regular expression might be an option to express these characters.
Using the WORDCHARS from Hunspell is not safe, since there is logic built into Hunspell itself that could be in the way.
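As a rough illustration of the regular-expression idea, the Python sketch below keeps a per-language set of characters that are allowed inside a word, in addition to letters, and treats every maximal run of such characters as one word. The table `EXTRA_WORD_CHARS` and the function `split_words` are hypothetical, and the per-language entries are examples rather than vetted language data.

```python
import re

# Hypothetical per-language "non-break" characters allowed in addition to letters.
EXTRA_WORD_CHARS = {
    "en": "'",
    "nl": "'-",
}


def split_words(text, lang):
    """Return maximal runs of letters plus the language's extra in-word characters."""
    extras = re.escape(EXTRA_WORD_CHARS.get(lang, ""))
    # [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
    pattern = rf"(?:[^\W\d_]|[{extras}])+" if extras else r"[^\W\d_]+"
    return re.findall(pattern, text)


dutch = split_words("zee-egel en auto's", "nl")
english = split_words("zee-egel en auto's", "en")
```

Because the allowed characters are matched anywhere in a run, a leading or trailing hyphen stays attached to the word under the `nl` entry; restricting such characters to word-internal positions would need a stricter pattern.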
Comment 11•11 years ago
What about coding Unicode UAX #29 as the default? Maybe some tailoring will be needed (adding hyphen as a MidLetter character), but IMHO it is a good approach.
http://www.unicode.org/reports/tr29/#Word_Boundaries
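A heavily simplified Python sketch of what that tailoring means in practice: it only imitates the part of UAX #29 that refuses to break between two letters separated by a single MidLetter-style character (rules WB6/WB7), and it is not a full implementation of the standard. The function name `make_word_pattern` and the short list of mid-word characters are assumptions made for illustration.

```python
import re


def make_word_pattern(extra_midletter=""):
    """Build a regex for letter runs joined by single mid-word characters.

    The base set (apostrophes, colon, middle dot, full stop) is a small,
    illustrative subset of the MidLetter/MidNumLet characters in UAX #29.
    """
    mid = re.escape("'\u2019:\u00b7." + extra_midletter)
    letter = r"[^\W\d_]"  # any Unicode letter
    return re.compile(rf"{letter}+(?:[{mid}]{letter}+)*")


sample = "test-auto -auto auto- don't"

# Untailored: the hyphen always separates words.
untailored = make_word_pattern().findall(sample)

# Tailored: the hyphen behaves like MidLetter, so it joins letters on both sides.
tailored = make_word_pattern("-").findall(sample)
```

With the tailoring, a hyphen is kept only when it sits between two letters, so `test-auto` survives as one word while the stray hyphens in `-auto` and `auto-` are dropped.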
Comment 12•11 years ago
(In reply to comment #11)
> What about coding Unicode UAX #29 as the default? Maybe some tailoring will be
> needed (adding hyphen as a MidLetter character), but IMHO it is a good approach.
>
> http://www.unicode.org/reports/tr29/#Word_Boundaries
That makes sense to me.
Comment 13•11 years ago
So can we implement this? I still think that this is a very important bug.
Reporter
Comment 14•10 years ago
(In reply to Caolan McNamara from comment #7)
> For breaking text into words in LibreOffice we basically use icu and its
> word boundary break iterator
I wonder if we could do this now that we have ICU in-tree.
Comment 15•9 years ago
Hi all, is there any news on this issue?
Comment 16•6 years ago
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
Updated•2 years ago
Severity: normal → S3
Updated•1 year ago
Attachment #9385771 - Attachment is obsolete: true