632977 - Investigate a "full" fix for bug 355178

Reporter

Description

•

14 years ago

The original patch posted by Ehsan in bug 355178 added numerous BREAK symbols.
https://bugzilla.mozilla.org/attachment.cgi?id=507324

The patch was nixed due to l10n risk near the release of Fx4 and replaced with a lower-risk patch that only breaks on hyphens. Once Fx4 ships, we should revisit this decision and evaluate whether additional BREAK symbols should be added.

(no longer active)

Comment 1

•

13 years ago

So, I don't really know how to move forward with this.  Should we add those BREAK symbols again?

R Baars

Comment 2

•

13 years ago

I don't know about the challenges of programming this, but what it should do is roughly this:
- get the wordchars clause from hunspell affix file
- if not there, assume old default
- get break chars definition from hunspell (if not defined, assume old default)

Break words at all characters that are defined as BREAK or at any non-word character.
Hand the complete results to hunspell for processing.
So, when - is defined as a word char, and not as a break char,

from -auto / auto- / test-auto, hunspell gets: -auto / auto- / test-auto

When - is defined as a break char, hunspell gets: auto / auto / test / auto
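
A minimal sketch of the splitting described above, written in Python for brevity (this is not Gecko code). The function name `tokenize` and its parameters are made up for illustration. BREAK is treated here as a plain set of single characters, which is simpler than the BREAK patterns Hunspell really supports, and the "old default" is assumed to be "letters and digits only". Reading the values from the affix file is left out.

```python
def tokenize(text, wordchars="", breakchars=""):
    """Split text into tokens that would be handed to the spell checker.

    wordchars:  extra characters allowed inside a word (hypothetical parameter)
    breakchars: characters that always split a word (hypothetical parameter)
    """
    tokens = []
    current = []
    for ch in text:
        # A character belongs to a word if it is a letter/digit or listed
        # in wordchars, and it is not explicitly a break character.
        is_word_char = (ch.isalnum() or ch in wordchars) and ch not in breakchars
        if is_word_char:
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sample = "-auto auto- test-auto"
print(tokenize(sample, wordchars="-"))   # ['-auto', 'auto-', 'test-auto']
print(tokenize(sample, breakchars="-"))  # ['auto', 'auto', 'test', 'auto']
```

With - as a word char the tokens keep their hyphens; with - as a break char they are split, matching the two cases listed above.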

Ruud

(no longer active)

Comment 3

•

13 years ago

Hmm, does this mean that the tokenization should be handled by the browser, and not by hunspell?

R Baars

Comment 4

•

13 years ago

As far as I know, Hunspell is only able to handle words, not entire texts... So I guess tokenisation is part of the app's job.
Unfortunately, this leads to different interfaces. Maybe it is an option to use the OOo/LibreOffice code (very old code, I heard).

(Maybe, while at it, create space to add different language plug-ins too, like grammar checkers. It is more common these days to edit text in the browser.)
Some knowledge could maybe be acquired from the Hunspell and LanguageTool people, but I am not one of them.

(no longer active)

Comment 5

•

13 years ago

(In reply to R Baars from comment #4)
> As far as I know, Hunspell is only able to handle words, not entire
> texts... So I guess tokenisation is part of the app's job.
> Unfortunately, this leads to different interfaces. Maybe it is an
> option to use the OOo/LibreOffice code (very old code, I heard).

Do you know (or know anyone who knows someone who knows) where that code lives in the LibreOffice repository?  I grabbed a copy of the code and looked around, but nothing immediately jumped out at me (their code base is huge and I'm not familiar with it).

> (Maybe, while at it, create space to add different language plug-ins too,
> like grammar checkers. It is more common these days to edit text in the
> browser.)

That would be a different feature...

> Some knowledge could maybe be acquired from the Hunspell and LanguageTool
> people, but I am not one of them.

OK.  :-)  Nemeth is CCed on this bug, so I hope he can get back to us about this.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 6

•

13 years ago

Nemeth or Caolan, do you have any input on this bug?

Caolan McNamara

Comment 7

•

13 years ago

For breaking text into words in LibreOffice we basically use ICU and its word boundary break iterator, i.e. http://userguide.icu-project.org/boundaryanalysis

http://opengrok.libreoffice.org/xref/core/i18npool/source/breakiterator/breakiteratorImpl.cxx#108 is the entry point into our wrapper

We feed Hunspell those words.

Toni Hermoso Pulido

Updated

•

13 years ago

Blocks: 257073

(no longer active)

Updated

•

12 years ago

Depends on: 724533

bjoern

Comment 8

•

11 years ago

It would be really nice if the tokenization were fixed one day. Caolan explained how it is done in LibreOffice, where it works very well.

sjw

Comment 9

•

11 years ago

We really should spend some time on this bug, because it breaks spell checking in some cases.

Priority: -- → P2

R Baars

Comment 10

•

11 years ago

Again, I think it is not BREAK characters that should be added, but NON-break-characters, being the characters the language allows in a word.
A regular expression might be an option to express these characters.

Using the WORDCHARS from Hunspell is not safe, since there is logic built into Hunspell itself that could be in the way.
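
As a rough illustration of the regular-expression idea, here is a small Python sketch. The pattern and the choice of joiner characters (apostrophes and hyphen) are assumptions for the example only; each language would need its own set, and none of this comes from Hunspell or Gecko.

```python
import re

# Runs of letters, optionally joined by an apostrophe or hyphen that sits
# between two letters. [^\W\d_] matches any Unicode letter in Python's re.
# The joiner set below is an illustrative guess, not a real language rule.
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def find_words(text):
    """Return the substrings that count as words under WORD_RE."""
    return WORD_RE.findall(text)


print(find_words("test-auto isn't -auto auto-"))
# ['test-auto', "isn't", 'auto', 'auto']
```

Here the hyphen is kept only when it has letters on both sides, so a leading or trailing hyphen never reaches the spell checker.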

Joan Montané

Comment 11

•

11 years ago

What about coding Unicode UAX #29 as the default? Maybe some tailoring will be needed (adding hyphen as a MidLetter char), but IMHO it is a good approach.

http://www.unicode.org/reports/tr29/#Word_Boundaries
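
To make the tailoring idea concrete, here is a heavily simplified Python sketch of the UAX #29 behaviour that matters here: letters stay together, and a "mid-letter" character does not cause a break when it has a letter on both sides. It is not a full UAX #29 implementation; `DEFAULT_MID` is only a small illustrative subset of the characters the standard treats this way, and `split_words` and `extra_midletter` are hypothetical names.

```python
# Illustrative subset of characters that may sit between two letters
# without breaking the word (an assumption, not the full UAX #29 tables).
DEFAULT_MID = {":", "\u00b7", "'", "\u2019", "."}


def split_words(text, extra_midletter=""):
    """Return letter runs, keeping mid-letter characters between letters."""
    mid = DEFAULT_MID | set(extra_midletter)
    words = []
    i, n = 0, len(text)
    while i < n:
        if not text[i].isalpha():
            i += 1  # skip anything that cannot start a word
            continue
        j = i + 1
        while j < n:
            if text[j].isalpha():
                j += 1
            elif text[j] in mid and j + 1 < n and text[j + 1].isalpha():
                j += 2  # letter, mid-letter, letter: no break here
            else:
                break
        words.append(text[i:j])
        i = j
    return words


print(split_words("test-auto"))                   # ['test', 'auto']
print(split_words("test-auto", "-"))              # ['test-auto']
print(split_words("-auto auto-", "-"))            # ['auto', 'auto']
```

Passing "-" as `extra_midletter` plays the role of the tailoring: the hyphen then joins words only when it appears between letters.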

(no longer active)

Comment 12

•

11 years ago

(In reply to comment #11)
> What about coding Unicode UAX #29 as the default? Maybe some tailoring will be
> needed (adding hyphen as a MidLetter char), but IMHO it is a good approach.
> 
> http://www.unicode.org/reports/tr29/#Word_Boundaries

That makes sense to me.

sjw

Comment 13

•

11 years ago

So can we implement this? I still think that this is a very important bug.

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 14

•

9 years ago

(In reply to Caolan McNamara from comment #7)
> For breaking text into words in LibreOffice we basically use icu and its
> word boundary break iterator

I wonder if we could do this now that we have ICU in-tree.

Pander

Comment 15

•

9 years ago

Hi all, is there any news for this issue?

Sylvestre Ledru [:Sylvestre]

Comment 16

•

6 years ago

Moving to p3 because no activity for at least 1 year(s).
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage for more information

Priority: P2 → P3

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

BMO Automation

Updated

•

4 months ago

Attachment #9385771 - Attachment is obsolete: true