Closed Bug 240600 Opened 20 years ago Closed 19 years ago

merge myspell back from the openoffice original

Categories

(Core :: Spelling checker, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.8beta1

People

(Reporter: mvl, Assigned: mscott)

References

Details

(Keywords: fixed-aviary1.0)

Attachments

(1 file, 2 obsolete files)

Once upon a time the openoffice spellchecker was ported to mozilla. All cool.
But since then, development has been split. There are bugs in the mozilla
spellchecker that are not in the openoffice one. (bug 237114). Features are
missing and added back way later with lots of pain (bug 227214)

I think it would make sense to remerge if possible. If the openoffice
spellchecker is sufficient stand alone, it could be done with minimum pain. If
we don't change much (in the ideal case, nothing) merging later would be easy.

Problem (that I can think of now) that might arise are licensing and charset
conversion. I don't know under which conditions myspell was relicensed as
mpl/gpl/lgpl. The openoffice spellchecker seems to bu onder something bsd like
or so.
Charset conversion depends on how openoffice handles that. If the caller must do
the conversion, that's fine. We can write a wrapper. If myspell does it itself,
and call openoffice functions or so, we must work around that.

Ofcourse, we must contact the openoffice developers.
http://whiteboard.openoffice.org/source/browse/whiteboard/lingucomponent/source/spellcheck/myspell/CONTRIBUTORS?rev=1.3&content-type=text/vnd.viewcvs-markup
has the addresses of the developer and contributors.
this might be a dupe.

It was always our intention to maintain our fork and the open office fork but
our fork has stagnated. I don't think there are additional licensing issues with
trying to update our 2 year old fork with all the other changes that went into
the open office fork.

I've looked into it several times, just haven't had time.
Our port contains all kind of mozilla specific code, like the string classes.
This makes updating quite hard.
My idea is to put in the openoffice code, with as few changes as possible. No
string classes, no own hash tables, but just a straight copy. This makes
updating in the future easier.
So i don't want to update our code, but replace it with openoffice code and a
wrapper.
I got a straight copy of myspell from openoffice working. All i needed was a to
change mozMySpell into a wrapper calling myspell functions and doing charset
conversions. It wasn't as hard as i thought.
Now, I am still somewhat worried about the licensing. What is in the tree now
seems to be a complete rewrite by david einstein. I am putting in a straigt copy
will lots of updates. It is under a bsd-lookalike license. (I think, i don't
know bsd that well. It allows far more then gpl does, even more then mpl)
I was envisioning us doing a merge and keeping our infrastructure such as
strings, etc. i.e. take specific features straight from the myspell engine and
port them back to our fork.

What are the code footprint ramifications of just wrapping the engine in its
entiriety? Are we pulling in lots of plumbing / string classes & stuff? 

I'm really impressed by how quickly you were able to put together a wrapper! I
just need to better understand the costs.

As far as the licensing goes, last time we talked to them we were in the clear
(in fact the myspell guy worked hand in hand with david einstein when they were
getting the fork going). If we have to we can move some stuff into
other-licenses but I don't think we will. I will try to find out more.
It surprised me too :)
basicly, it is just two functions: check and suggest. The rest is syntactic
sugar. I didn't touch stuff like personal dictionary. Only
extension/spellcheck/myspell/src is changed.

I checked codesize (using size):

current myspell, fixed makefile, debug:
   text    data     bss     dec     hex
 100221    1232       8  101461   18c55
current, whouth unichar utils (dies on some tinderboxes)
  82701    1024       4   83729   14711
OO myspell, new makefile, debug:
  66917   12720       4   79641   13719

So, putting in the OpenOffice spellchecker saves code, but adds lots of data.
That is due to a unicode table it has to convert to lowercase. I think that with
some fudging we can make that use existing mozilla code, and not have a huge table.
wow! That's extremely promising news!

Can you run this test on the new spell check engine with the latest myspell stuff? :

"I noticed spellchecker does not offer suggestions on contractions. For example,
if you type shouldnt, it will not offer shouldn't as a suggestion. Same for
wouldn' couldn't, etc...."

That was one of the scenarios myspell had fixed since we forked. Kind of a
proving test case that we have some of the new features in the myspell engine...



The patch doesn't fix that bug. that bug is way easier to solve: update the .aff
file. I copies the TRY line from the en-GB affix file from the openoffice
installation i have, and with both the current and the new engine, it suggested
wouldn't. This is because the TRY line needs a ' to suggest it.

The patch solves something: bug 237114. And added code comments :)
your saying the .aff file we ship with by default (for US english) is just
missing a single TRY line in it for the contraction case to work?
No, it is missing a ' char in the TRY line. So it myspell will never try to
insert it as a forgotten char.
Attached patch patch v1 (obsolete) — Splinter Review
Attaching the patch i created. It doesn't include removing old files, to
somewhat reduce the size.
I didn't touch the licensing, so it isn't ready to go in as is, but it is just
for experiments.
Got some codesize numbers on an linux opt build:
current
   text    data     bss     dec     hex filename
  30966     888       4   31858    7c72
new
  32411   12560       4   44975    afaf

Now, the new version is somewhat larger. Mostly to blame are affixmgr.cpp and
mozMySpell.cpp. This isn't surprising, as those two files do most the work.
csutils.cpp is indeed what causes the data size. (checked with running size on
the .o files)
the change in code size seems pretty small / not important. It would be
interesting to try to look at elminating the large data size jump which I assume
is because of the unicode tables it has in the code? How hard is it to use our
own data for just that part?
The tables are used for upper <-> lowcase conversion.
I couldn't find a case conversion of random charsets in our code. It is possible
to convert to unicode, then do the conversion, and then convert back. That would
add some code, but is doable.
The other option is to convert the tables into code. Most have a very simple
rule, and could be created on-the-fly.
Blocks: 216267
I now made it so that the tables are created when needed using the
caseconversion already in mozilla. It is somewhat ugly, because the
caseconversion only works on unichar, and not on random charsets. I was wrnig
when i said it had unicode tables. It has case conversion tables for a few
charsets, like ISO-8859-*
Anyway, the date size went down (debug):
   text    data     bss     dec     hex filename
  69291    1840       4   71135   115df obj/dist/bin/components/libmyspell.so
3k of code for 10k of data.
Attached patch patch v2 (obsolete) — Splinter Review
It's been a while, but here is the updated patch. It creates the case
conversion tables when needed, so the code size reduces a lot.

Can someone with more licensing knowledge take a look at the licensing?
Attachment #146717 - Attachment is obsolete: true
Blocks: 218463
Ok, now that 1.0 is done, I've come back and revived this great patch that mvl
put together.

I've got it working again on the trunk. Now I just need to find some english
dictionary examples of things that work with the updated myspell engine from
open offfice that wouldn't work in our old forked version so I can test that
things are working.
Assignee: nobody → mscott
Blocks: 237114
There are bugs that this bugs block that should be fixed by this, but they all
involve non-english dictionaries.
Michiel, do you remember where you got the original myspell .cpp files that you
added to this patch? Somewhere on development.openoffice.org? I want to
integrate the latest files into the patch since this work last happened in May.
I'm assuming myspell has had further improvements. 
Status: NEW → ASSIGNED
Yes, that is were i got them from (using cvs). It doesn't look like there have
been changes over there...
Yeah it looks like things haven't changed in quite a while according to CVS.
They must be doing something different that I'm not seeing though. I have a
group of words that open office 1.1.3 offers suggestions for using myspell. But
we don't offer suggestions for them before and after this patch. I even tried
copying over the enlish .aff and .dic files from 1.1.3 to our apps and still
failed to get suggestions. I wonder what they are doing outside of the myspell
core....

The words I'm using include:

cemoflag 
edeosincrysy
egnomeneious
emeasureble
enouspoucious
In their cvs, csutil.cxx and .hxx disapeared, but they are still in the
makefile. I really doubt if that builds. They must have moved the spellchecked.
We could ask them :)
I also update the dictionary and .aff file for en-US to reflect the lastest
versions being shipped with open office.

I'm now getting the same set of suggestions for varous mispellings with Open
Office 1.1.3 and a build with this patch. Before this patch, we had 0
suggestions for these words:

absoolve
addvocat
afectoinate
aglomerition
allimentory
ambedixterety
culoomniate
cemoflag
dedly
decifrable
dilitrious
erthcake
eflorecscense
farenhaet
falascieous
gagetrea
gasttruintestenal
hafharted
egnomeneious
emeasureble
enouspoucious
juditioary
jureesproodentail
calieduscop
cangaru
kichenet
leccikografy
macromolicooler
magnenemous
medeocrety
memmorandem
narsicissim
neggociable
nutricious
obscekuieous
peliolithick
paralingistick
parlamentarean
quintesintial
reaprochment
reckonasance
repiticious
scritzofrenia
sintilate
scroopulocity

Thanks for getting this started mvl! This is a huge jump in effectiveness for
our spell checker.
Attachment #149682 - Attachment is obsolete: true
Just a random plea to update the other dictionaries as well
(http://www.mozilla.org/products/thunderbird/dictionaries.html)
fixed on the aviary branch and the trunk. Thanks a lot for getting this going mvl!
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Keywords: fixed-aviary1.0
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.8beta
This fix breaks the Solaris build, error messages are:

"affixmgr.hxx", line 54: Error: FILE is not defined.
"affixmgr.hxx", line 55: Error: FILE is not defined.
"affixmgr.hxx", line 56: Error: FILE is not defined.
"affentry.cpp", line 29: Error: The function "memcpy" must have a prototype.
"affentry.cpp", line 39: Error: The function "free" must have a prototype.
<snip>

a namespace std issue?
reopen to get people's attention.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
re-closing this bug is fixed.

I'm happy to check in any patches you need for this to build on ports but right
now ports shows all green. Please file a new bug....
Status: REOPENED → RESOLVED
Closed: 20 years ago19 years ago
Resolution: --- → FIXED
Depends on: 307052
This caused bug 307052 -- using NS_LossyConvertUCS2toASCII() to get a char* to pass to myspell cannot possibly work for most of the world's languages -- it will lose chars (hence Lossy).
You need to log in before you can comment on or make changes to this bug.