Open Bug 632977 Opened 14 years ago Updated 4 months ago

Investigate a "full" fix for bug 355178

Categories

(Core :: Spelling checker, enhancement, P3)

People

(Reporter: RyanVM, Unassigned)

Attachments

(1 obsolete file)

The original patch posted by Ehsan in bug 355178 added numerous BREAK symbols.
https://bugzilla.mozilla.org/attachment.cgi?id=507324

The patch was nixed due to l10n risk near the release of Fx4 and replaced with a lower-risk patch that only breaks on hyphens. Once Fx4 ships, we should revisit this decision and evaluate whether additional BREAK symbols should be added.
So, I don't really know how to move forward with this. Should we add those BREAK symbols again?
I don't know about the challenges of programming this, but what it should do is roughly this:
- get the WORDCHARS clause from the hunspell affix file
- if it is not there, assume the old default
- get the BREAK chars definition from hunspell (if not defined, assume the old default)

Break words at all characters that are defined as BREAK or at any non-word character.
Hand the complete results to hunspell for processing.
So, when "-" is defined as a word char, and not as a break char,

from "-auto / auto- / test-auto", hunspell gets: -auto / auto- / test-auto

When "-" is defined as a break char, hunspell gets: auto / auto / test / auto

Ruud
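
To make the splitting rule above concrete, here is a minimal Python sketch. It is not Gecko or Hunspell code: `tokenize` and its `wordchars`/`breakchars` parameters are hypothetical names, the "old default" is assumed to be plain letters and digits, and BREAK is simplified to single characters.

```python
def tokenize(text, wordchars="", breakchars=""):
    """Split text into the words that would be handed to the spell checker.

    Assumption: a character belongs to a word if it is a letter or digit,
    or is listed in `wordchars`, unless it is also listed in `breakchars`.
    """
    words = []
    current = []
    for ch in text:
        is_word_char = (ch.isalnum() or ch in wordchars) and ch not in breakchars
        if is_word_char:
            current.append(ch)
        elif current:
            # Hit a break or non-word character: close the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


sample = "-auto / auto- / test-auto"
print(tokenize(sample, wordchars="-"))   # "-" treated as a word char
print(tokenize(sample, breakchars="-"))  # "-" treated as a break char
```

With "-" passed as a word char the sample yields -auto, auto-, test-auto; with "-" passed as a break char it yields auto, auto, test, auto, which matches the two cases described above.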
Hmm, does this mean that the tokenization should be handled by the browser, and not by hunspell?
As far as I know, Hunspell is only able to handle words, not entire texts... So I guess tokenisation is part of the app's job.
That is unfortunate, since it leads to different interfaces. Maybe it is an option to use the OOo/LibreOffice code (very old code, I heard).

(Maybe, while at it, create room to add other language plug-ins too, like grammar checkers. It is more common these days to edit text in the browser.)
Some knowledge could maybe be acquired from the Hunspell and Languagetool people, but I am not them.
(In reply to R Baars from comment #4)
> As far as I know, Hunspell is only able to handle words, not entire
> texts... So I guess tokenisation is part of the app's job.
> That is unfortunate, since it leads to different interfaces. Maybe it is an
> option to use the OOo/LibreOffice code (very old code, I heard).

Do you know (or know anyone who knows someone who knows) where that code lives in the LibreOffice repository? I grabbed a copy of the code and looked around, but nothing immediately jumped out at me (their code base is huge and I'm not familiar with it).

> (Maybe, while at it, create room to add other language plug-ins too,
> like grammar checkers. It is more common these days to edit text in the
> browser.)

That would be a different feature...

> Some knowledge could maybe be acquired from the Hunspell and Languagetool
> people, but I am not them.

OK.  :-)  Nemeth is CCed on this bug, so I hope he can get back to us about this.
Nemeth or Caolan, do you have any input on this bug?
For breaking text into words in LibreOffice we basically use icu and its word boundary break iterator, i.e. http://userguide.icu-project.org/boundaryanalysis

http://opengrok.libreoffice.org/xref/core/i18npool/source/breakiterator/breakiteratorImpl.cxx#108 is the entry point into our wrapper

we feed hunspell with those words
Blocks: 257073
Depends on: 724533
It would really be nice if the tokenization were fixed one day. Caolan explained how it is done in LibreOffice, where it works very well.
We really should spend some time on this bug, because it breaks spell checking in some cases.
Priority: -- → P2
Again, I think it is not BREAK characters that should be added, but NON-break-characters, being the characters the language allows in a word.
A regular expression might be an option to express these characters.

Using the WORDCHARS from Hunspell is not safe, since there is logic built into Hunspell itself that could be in the way.
What about coding Unicode UAX #29 as default? Maybe some tailoring will be needed (adding hyphen as a MidLetter char), but IMHO it is a good approach.

http://www.unicode.org/reports/tr29/#Word_Boundaries
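
As a rough illustration of the tailoring suggested here, the Python sketch below keeps a hyphen or apostrophe inside a word only when it has word characters on both sides. It is a regex approximation and not an implementation of UAX #29; `MID_LETTER`, `WORD_RE` and `split_words` are hypothetical names, and the choice of mid-letter characters is an assumption.

```python
import re

# Characters assumed to behave like "mid-letter" characters after tailoring:
# hyphen, ASCII apostrophe, right single quotation mark.
MID_LETTER = "-'\u2019"

# A word is a run of letters/digits, optionally continued by exactly one
# mid-letter character that is immediately followed by more letters/digits.
WORD_RE = re.compile(r"[^\W_]+(?:[" + re.escape(MID_LETTER) + r"][^\W_]+)*")


def split_words(text):
    """Return the words found in `text` under the simplified rule above."""
    return WORD_RE.findall(text)


print(split_words("-auto / auto- / test-auto"))
print(split_words("don't stop--now"))
```

Under this rule a leading or trailing hyphen is dropped (-auto and auto- both give auto), while test-auto and don't stay whole, so the dictionary decides whether the compound is valid.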
(In reply to comment #11)
> What about coding Unicode UAX #29 as default? Maybe some tailoring will be
> needed (adding hyphen as a MidLetter char), but IMHO it is a good approach.
> 
> http://www.unicode.org/reports/tr29/#Word_Boundaries

That makes sense to me.
So can we implement this? I still think that this is a very important bug.
(In reply to Caolan McNamara from comment #7)
> For breaking text into words in LibreOffice we basically use icu and its
> word boundary break iterator

I wonder if we could do this now that we have ICU in-tree.
Hi all, is there any news for this issue?
Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information
Priority: P2 → P3
Severity: normal → S3
Attachment #9385771 - Attachment is obsolete: true