Closed Bug 1137544 Opened 9 years ago Closed 9 years ago

Update en-US Hunspell dictionary to 20150215 version

Categories

(Core :: Spelling checker, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox40 --- fixed

People

(Reporter: ananuti, Unassigned)

Attachments

(4 files, 4 obsolete files)

Attached file hunspell.tar.xz (obsolete) —
+++ This bug was initially created as a clone of Bug #1133363 +++


(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from bug 1133363 comment #23)
> 
> Can you please attach the chromium_en_US.dic_delta-affix-converted file that
> is left in the current working directory when you hit that error?  (BTW that
> error happens because "warn" is not a shell construct (AFAIK)
> <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/
> en-US/hunspell/dictionary-sources/merge-dictionaries#51>)  Feel free to move
> this part to a new bug if needed.

There is no chromium_en_US.dic_delta-affix-converted file.

I attached hunspell folder with new hunspell-en_US-20150215.dic and new upstream-hunspell.diff generated on top of the patch in bug 1133363.
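On the `warn` point in the quoted comment: a script can only call `warn` if it defines it first. A minimal sketch of such a helper follows; the function names and the message format are assumptions for illustration, not what merge-dictionaries actually contains.

```shell
#!/bin/sh
# "warn" is not a shell builtin, so a script has to define it before use.
# Hypothetical helper: print a message to stderr without aborting.
warn() {
    printf 'warning: %s\n' "$*" >&2
}

# Hypothetical use: complain when an expected file is missing.
check_file() {
    if [ ! -e "$1" ]; then
        warn "$1 not found"
        return 1
    fi
}
```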
Attached file merge-dictionaries-error.txt (obsolete) —
The attached file shows the error after removing ATP/7 from chromium_en_US.dic_delta and running merge-dictionaries.
Attached file hunspell-en_US-mozilla-2015.02.15.zip (obsolete) —
Attached is an updated Mozilla dictionary based on version 2015.02.15 of the upstream dictionary.  It was not created using the diff.  It was created by expanding the Mozilla dictionary to form a list of words and then using that to determine what words to add to or remove from the upstream version.  This list contains all the words that were added to the Mozilla dictionary.  The dictionary was then created using the same scripts that create the upstream dictionary.
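The wordlist comparison described here can be sketched roughly as follows. This is only an illustration, assuming both dictionaries have already been expanded to plain lists with one word per line (for example with Hunspell's unmunch utility); the function name and the +/- output format are made up for the example and are not taken from the attached scripts.

```shell
#!/bin/sh
# Compare two plain word lists (one word per line) and report
# which words were added (+) or removed (-).
wordlist_diff() {
    old=$1
    new=$2
    tmp=$(mktemp -d) || return 1
    # comm needs sorted input; LC_ALL=C keeps sort and comm consistent.
    LC_ALL=C sort -u "$old" > "$tmp/old"
    LC_ALL=C sort -u "$new" > "$tmp/new"
    # -13 keeps lines found only in the second file (added words).
    LC_ALL=C comm -13 "$tmp/old" "$tmp/new" | sed 's/^/+/'
    # -23 keeps lines found only in the first file (removed words).
    LC_ALL=C comm -23 "$tmp/old" "$tmp/new" | sed 's/^/-/'
    rm -rf "$tmp"
}

# Usage (file names are hypothetical):
#   wordlist_diff mozilla-words.txt upstream-words.txt
```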
Attached file munch-diff.pl (obsolete) —
Hi Ehsan,

This is in regard to:
https://bugzilla.mozilla.org/show_bug.cgi?id=1133363#c30

Sorry for the delay.  I created and attached a script that does a more intelligent job of doing a diff on a munched wordlist (wordlist with affix flags); however, it is still far from problem free.  Comparing munched dictionary files is problematic because I work with the lists as plain wordlists and then munch them later using a utility, so flags can change seemingly randomly depending on what words are in the list.  A far better approach is to compare wordlists like I did to create the attached dictionary.

I will be happy to create a script for you that allows you to update to a new upstream version problem free, but it will require aspell to be installed to create the dictionary and may have some other dependencies.  Is this something you think will be acceptable?  This will only be required for updating to new upstream versions, not editing the dictionary.

As far as your existing scripts go, I suggest you just remove them and edit the dictionary directly.  You should still keep "mozilla-specific.txt" and a copy of the official upstream dictionary the Mozilla dictionary is based on for reference.  I don't see the value in automatically maintaining a diff as that can easily be recreated provided you have the base upstream version available.

Let me know if you have any questions.

At the very least I suggest you just use the attached dictionary I have carefully created for you.

Thanks,
Kevin
Flags: needinfo?(ehsan)
For reference here is a link to the upstream version the attached dictionary is created from:
  http://downloads.sourceforge.net/wordlist/hunspell-en_US-2015.02.15.zip
(In reply to Kevin Atkinson from comment #4)
> Hi Ehsan,
> 
> This is in regard to:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1133363#c30
> 
> Sorry for the delay.  I created and attached a script that does a more
> intelligent job of doing a diff on a munched wordlist (wordlist with affix
> flags); however, it is still far from problem free.  Comparing munched
> dictionary files is problematic because I work with the lists as plain
> wordlists and then munch them later using a utility so flags can change
> seemingly randomly depending on what words are in the list.  A far better
> approach is to compare wordlists like I did to create the attached
> dictionary.

That's fine.  In the past we've edited the flags directly, but we can change that if needed.  I agree that working on a simple wordlist is easier and less error prone.

> I will be happy to create a script for you that allows you to update to new
> upstream version problem free, but it will require aspell be installed to
> create the dictionary and may have some other dependencies.  Is this something
> you think will be acceptable?  This will only be required for updating to
> new upstream versions, not editing the dictionary.

Yes, absolutely.  The number of people who run such scripts is really small, so we can require whatever dependencies that we need.

> As far as your existing scripts go, I suggest you just remove them and edit
> the dictionary directly.  You should still keep "mozilla-specific.txt" and
> a copy of the official upstream dictionary the Mozilla dictionary is based on
> for reference.  I don't see the value in automatically maintaining a diff as
> that can easily be recreated provided you have the base upstream version
> available.

Sure, that sounds good.

Can you please submit a patch that contains the necessary changes based on what we currently have in the tree?  I noticed that the raw word list is missing, and it's not clear to me how the script in comment 3 needs to be used (I don't know Perl!).  Also, we need a script to generate the .dic file from the wordlist.  Bonus points for a README file which describes which file is what in the new world.  Feel free to remove the stuff that is no longer needed in your patch too!

Thanks a lot!
Flags: needinfo?(ehsan)
Attached patch bug-1137544.diff
Sorry for the long wait.

Attached is a set of patches that does several things:

1) Simplifies the process of adding new words by removing the unnecessary diff generation and merge-dictionaries step.  The chrome diffs are also removed but the mozilla-specific.txt changes file is kept.

2) Adds a new set of scripts to easily update to new upstream versions.

3) Correctly upgrades you to the version 2015.02.15.

To edit the dictionary no additional dependencies are required.  

To update to a new upstream version Perl and Aspell are required.  There are no Perl scripts in this patchset but it uses some Perl scripts from SCOWL.  The scripts in this patchset are simple shell scripts that should be easy for any future maintainers to understand.

A new version of SCOWL (the source of the Hunspell dictionaries) was just released.  May I suggest that before this patchset is accepted someone attempts to follow my instructions in the README to update en-US.dic to this version (2015.04.24)?

I tried to follow the procedure for submitting patches that I could find in the devel. documentation to the best of my ability.  Please let me know if there are any problems with my changeset.

Thanks,
Kevin
Attachment #8580518 - Attachment is obsolete: true
Attachment #8580519 - Attachment is obsolete: true
Flags: needinfo?(ehsan)
Note: In case it is not obvious the attached patchset is a set of 5 changesets that I exported using:

  hg export -r "outgoing()"

Kevin
Thanks a lot, Kevin!  I will try to take a look by early next week at the latest.  :-)
Thanks again Kevin for the awesome work!  I think the patches are mostly good, except for the .hgignore change, since that would allow files to be left in people's trees without them knowing.  It should be easy to run hg purge or something similar to remove the unwanted files after a dictionary update.

(Also, a Mozilla workflow nit for the future, we usually attach one patch to Bugzilla per changeset, so you'd usually attach 5 patches to this bug for example rather than one concatenated patch.)


I did try upgrading to the latest SCOWL dictionaries, but ran into a number of issues:

* speller/make-hunspell-dict uses a comm syntax that BSD systems don't understand (comm file1 file2 -12 instead of comm -12 file1 file2).  That was trivial to work around.
* After running make-new-dict, all of the 5-mozilla* files got truncated.  I'm not sure why.
* After running install-new-dict, it seems like the en-US.dic file doesn't have the line count header:

diff --git a/extensions/spellcheck/locales/en-US/hunspell/en-US.dic b/extensions/spellcheck/locales/en-US/hunspell/en-US.dic
index bc34947..1a6b80d 100644
--- a/extensions/spellcheck/locales/en-US/hunspell/en-US.dic
+++ b/extensions/spellcheck/locales/en-US/hunspell/en-US.dic
@@ -1,4 +1,4 @@
-54728
+
 0/nm
 0th/pt
 1/n1
...

Are these issues with your scripts here, or with SCOWL's?
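For reference, the comm issue in the first bullet comes down to option placement. POSIX puts options before the file operands; GNU comm also accepts them afterwards, whereas BSD implementations generally treat a trailing -12 as an extra operand. A small sketch of the portable form (the function name is made up for the example):

```shell
#!/bin/sh
# Print the lines common to two sorted files.
# Options come before the file operands, which works on both GNU and BSD.
common_words() {
    LC_ALL=C comm -12 "$1" "$2"
}

# Non-portable form, for comparison (fails on BSD-style comm):
#   comm "$1" "$2" -12
```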
Flags: needinfo?(ehsan)
Hi Ehsan,

I have not tested these on a BSD system so there could very well be some portability issues.

Let me see if I can run this on a BSD system and get back to you.  What system are you using?

Kevin
Flags: needinfo?(ehsan)
Flags: needinfo?(ehsan)
Hi Ehsan,

There were some additional portability problems with make-hunspell-dict in SCOWL on FreeBSD.  Please apply the attached patch to that file and let me know if things work.  This change will be in the next SCOWL release.

I also attached an additional changeset (6-bug-1137544.diff) that avoided the dependency on unix2dos when calling make-hunspell-dict in make-new-dict and should be applied in addition to "bug-1137544.diff".

As far as the .hgignore changeset goes, feel free to skip it.  In my projects I would include generated files in ".gitignore" but Mozilla's policy may be different so I don't feel strongly either way.

Thanks,
Kevin

P.S: If you need me to split bug-1137544.diff into separate files I can do that manually for you.  I can't seem to find an "hg" command to automatically make one file per changeset; "hg export" wants to put them all in one file.
Flags: needinfo?(ehsan)
(In reply to Kevin Atkinson from comment #14)
> I can't seem to find an "hg" command to
> automatically make one file per changeset; "hg export" wants to put
> them all in one file.

you can use

hg exp -r changeset-a -o part1.patch
hg exp -r changeset-b -o part2.patch

...
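The per-changeset commands above can be wrapped in a small loop. This is a sketch only: the function name, the partN.patch naming, and the HG override variable (handy for a dry run) are assumptions of the example rather than Mercurial features.

```shell
#!/bin/sh
# Export each revision given as an argument to its own numbered patch file.
# Set HG=echo to print the commands instead of running them.
export_each() {
    n=1
    for rev in "$@"; do
        "${HG:-hg}" export -r "$rev" -o "part$n.patch" || return 1
        n=$((n + 1))
    done
}

# Usage (revision names are placeholders):
#   export_each changeset-a changeset-b
```

According to `hg help export`, the -o option also accepts format specifiers such as %n (a zero-padded sequence number), so something like `hg export -r "outgoing()" -o "%n.patch"` may do the same job in one command.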
(In reply to Kevin Atkinson from comment #14)
> Hi Ehsan,
> 
> There were some additional portability problems with make-hunspell-dict in
> SCOWL on FreeBSD.  Please apply the attached patch to that file and let me
> know if things work.  This change will be in the next SCOWL release.

Oops, sorry for misleading you.  I was on OSX, which is BSD based.  With attachment 8599037, now I get errors like this: https://gist.github.com/ehsan/358bef4a19b4ee639361

> I also attached an additional changeset (6-bug-1137544.diff) that avoided
> the dependency on unix2dos when calling make-hunspell-dict in make-new-dict
> and should be applied in addition to "bug-1137544.diff".

I think it's OK to depend on unix2dos.  :-)


When I tried the instructions in the readme file on Linux, it worked perfectly out of the box.  If you have access to an OSX system and want to investigate the above issues there, that's great.  Otherwise it would be reasonable to just edit the readme file to say that this only works well on Linux.  Since I don't expect a lot of people to want to update the dictionaries, I'm fine with it only working on Linux if needed.

Please let me know which option you prefer!  Thanks again.
Flags: needinfo?(ehsan)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #16)

> Oops, sorry for misleading you.  I was on OSX, which is BSD based.  With
> attachment 8599037, now I get errors like this:
> https://gist.github.com/ehsan/358bef4a19b4ee639361

You didn't really mislead me.  FreeBSD is the only BSD system I have easy access to.  I don't have access to an OSX system.  As far as the problem you are experiencing goes, it looks like aspell is not in your path.
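A quick way to rule out this kind of problem before running the update scripts is to check that the required tools are on the PATH. A minimal sketch; the function name is made up, and the tool list in the usage line simply reflects the dependencies mentioned in this bug.

```shell
#!/bin/sh
# Report every named tool that cannot be found on the PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_tools perl aspell unix2dos || exit 1
```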

> > I also attached an additional changeset (6-bug-1137544.diff) that avoided
> > the dependency on unix2dos when calling make-hunspell-dict in make-new-dict
> > and should be applied in addition to "bug-1137544.diff".
> 
> I think it's OK to depend on unix2dos.  :-)

Well it is up to you if you use that changeset. :-)

> When I tried the instructions in the readme file on Linux, it worked
> perfectly out of the box.  If you have access to an OSX system and want to
> investigate the above issues there, that's great.  Otherwise it would be
> reasonable to just edit the readme file to say that this only works well on
> Linux.  Since I don't expect a lot of people to want to update the
> dictionaries, I'm fine with it only working on Linux if needed.
> 
> Please let me know which option you prefer!  Thanks again.

Okay see above if you can fix the problem on OSX.  Otherwise a note that it is best to use it on Linux is fine with me.

After that let me know how you want to proceed to get these changes into Firefox.

Thanks,
Kevin
Flags: needinfo?(ehsan)
(In reply to Kevin Atkinson from comment #17)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #16)
> 
> > Oops, sorry for misleading you.  I was on OSX, which is BSD based.  With
> > attachment 8599037, now I get errors like this:
> > https://gist.github.com/ehsan/358bef4a19b4ee639361
> 
> You didn't really mislead me.  FreeBSD is the only BSD system I have easy
> access to.  I don't have access to an OSX system.  As far as the problem you
> are experiencing goes, it looks like aspell is not in your path.

Oh, I feel dumb.  :-)  You're right.  That was the issue.  Fixing that enabled me to repeat the successful result on OSX as well.

> > > I also attached an additional changeset (6-bug-1137544.diff) that avoided
> > > the dependency on unix2dos when calling make-hunspell-dict in make-new-dict
> > > and should be applied in addition to "bug-1137544.diff".
> > 
> > I think it's OK to depend on unix2dos.  :-)
> 
> Well it is up to you if you use that changeset. :-)

I think I won't use this one!

> > When I tried the instructions in the readme file on Linux, it worked
> > perfectly out of the box.  If you have access to an OSX system and want to
> > investigate the above issues there, that's great.  Otherwise it would be
> > reasonable to just edit the readme file to say that this only works well on
> > Linux.  Since I don't expect a lot of people to want to update the
> > dictionaries, I'm fine with it only working on Linux if needed.
> > 
> > Please let me know which option you prefer!  Thanks again.
> 
> Okay see above if you can fix the problem on OSX.  Otherwise a note that it
> is best to use it on Linux is fine with me.
> 
> After that let me know how you want to proceed to get these changes into
> Firefox.

Now we have everything that we needed here.  I'll check in your patches plus a successful update of the dictionary based on SCOWL 2015.04.24 soon.

Thanks so much for your help here, I'm so excited to be able to pick up changes from SCOWL from now on.  :-)
Flags: needinfo?(ehsan)
BTW, please feel free to take a look at the words in 5-mozilla-added after your patches and see if it makes sense to integrate them into SCOWL.  It would be nice for these Mozilla additions to finally find their way to upstream.  :-)

(And let me know if you'd like me to submit our future additions to SCOWL occasionally, we do get bug reports for missing words from time to time and keep making gradual updates to our dictionary.)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #19)
> BTW, please feel free to take a look at the words in 5-mozilla-added after
> your patches and see if it makes sense to integrate them into SCOWL.  It
> would be nice for these Mozilla additions to finally find their way to
> upstream.  :-)

I think someone should go through the list first to catch problems in your dictionary.  I saw a lot of stuff that I would never accept upstream, for example possessive forms of non-nouns.

> (And let me know if you'd like me to submit our future additions to SCOWL
> occasionally, we do get bug reports for missing words from time to time and
> keep making gradual updates to our dictionary.)

That will be helpful.

Please first look up the word at http://app.aspell.net/lookup to help determine why it was not added or if it was recently added.
(In reply to Kevin Atkinson from comment #20)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #19)
> > BTW, please feel free to take a look at the words in 5-mozilla-added after
> > your patches and see if it makes sense to integrate them into SCOWL.  It
> > would be nice for these Mozilla additions to finally find their way to
> > upstream.  :-)
> 
> I think someone should go through the list first to catch problems in your
> dictionary.  I saw a lot of stuff that I would never accept upstream, for
> example possessive forms of non-nouns.

Sure.  I'll find a native speaker who can help me with this (don't trust my own English quite that much!)

> > (And let me know if you'd like me to submit our future additions to SCOWL
> > occasionally, we do get bug reports for missing words from time to time and
> > keep making gradual updates to our dictionary.)
> 
> That will be helpful.
> 
> Please first look up the word at http://app.aspell.net/lookup to help
> determine why it was not added or if it was recently added.

Will do!
Attached is a fix for edit-dictionary on BSD-like systems (including Mac OS X).  Without this fix the resulting dictionary may not be formatted correctly.

Also, please be sure to test the new edit-dictionary before committing to catch any additional problems I may have overlooked. :)
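One cheap sanity check after running the scripts is to confirm that the .dic file still starts with a numeric word-count line, since that header went missing in an earlier test run on this bug. A rough sketch; the function name and output format are made up, and it only reports the number of remaining lines for comparison rather than insisting that the two numbers match.

```shell
#!/bin/sh
# Verify that the first line of a .dic file is a non-empty decimal number,
# then print it next to the number of remaining lines.
check_dic_header() {
    dic=$1
    # Strip a possible carriage return in case the file has CRLF line endings.
    header=$(head -n 1 "$dic" | tr -d '\r')
    case $header in
        ''|*[!0-9]*)
            echo "bad header: '$header'" >&2
            return 1
            ;;
    esac
    entries=$(( $(wc -l < "$dic") - 1 ))
    echo "header=$header entries=$entries"
}

# Usage:
#   check_dic_header en-US.dic
```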
Flags: needinfo?(ehsan)
With this final patch I don't think there is anything more for me to do.

If my scripts break, feel free to contact me and I will try to help if I can find the time.  You can reach me directly at kevina@gnu.org if I don't seem to be responding to any Mozilla bug request.

If a change in SCOWL breaks the upgrade process, submit a bug directly at https://github.com/kevina/wordlist/issues; this is also where you should submit any new words you think should be added.

Kevin
Thanks again, Kevin!
Flags: needinfo?(ehsan)
Attachment #8570258 - Attachment is obsolete: true
Attachment #8570262 - Attachment is obsolete: true
ni?ehsan per comment 29.
Flags: needinfo?(ehsan)
My script does replace the affix file.  If you submit patches I will be happy to look into incorporating any changes to the affix file upstream.
Submitted a PR upstream: https://github.com/kevina/wordlist/pull/109
Flags: needinfo?(ehsan)
Depends on: 1162823
Depends on: 1168802
Something has gone wrong here:
https://hg.mozilla.org/mozilla-central/rev/bcb133a3cdca

BTW, my FF hangs trying to open https://hg.mozilla.org/mozilla-central/rev/bcb133a3cdca, I have to use Chrome to open it (see bug 1235321).

In this changeset a lot of correct and useful words got lost, like:
relict residuary enforceability.

So have these words really disappeared from upstream? Hard to believe, although it appears that way:
https://hg.mozilla.org/mozilla-central/diff/bcb133a3cdca/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/orig/en_US.dic

Or was there some error in the processing? The .dic file shrank from 610 KB to 586 KB.

And while we're here, can you explain why "advisor" has disappeared from the dictionary (and we have a bug to add it in, bug 1183512), when the add-on (https://addons.mozilla.org/en-US/firefox/addon/united-states-english-spellche/, last updated in March 2013) has:
adviser/M
advisor/S
advisor's
advisory/S

Equally in bug 1198052 someone asked for "infeasible" which was in the add-on once:
feasible/UI
feasibly/U

In fact, the change in this bug did:
-feasible/IU
-feasibly/U
+feasible/U
+feasibly

Seems crazy that Mozilla people add words back in that got removed upstream.
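To check what a given stem currently looks like in a .dic file, a lookup along these lines can help. It is a sketch with a made-up function name. As far as I can tell the I and U flags in the en-US affix file are the "in-" and "un-" prefix rules, which would explain why dropping I from feasible/IU loses "infeasible", but that is worth confirming against the .aff file itself.

```shell
#!/bin/sh
# Print the .dic line for an exact stem, with or without affix flags.
# $1 = stem (plain letters assumed), $2 = path to the .dic file.
dic_entry() {
    # Remove carriage returns first so the match also works on CRLF files.
    tr -d '\r' < "$2" | grep -E "^$1(/|\$)"
}

# Usage:
#   dic_entry feasible en-US.dic
```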
Flags: needinfo?(kevin.bugzilla)
I looked up these words in the upstream tool to see why a word is not in the dictionary: http://app.aspell.net/lookup?dict=en_US&words=relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0Ainfeasible

"advisor" is a spelling variant of adviser and per my policy I generally only include one variant of a spelling to promote consistent spelling in the official dictionary.  If Mozilla wants to include common variants this can likely be fixed with a little effort in the build scripts.

The other words are included in the larger dictionary size.  I would not recommend this size as it is not as carefully checked for errors as the normal size and it also includes less common words that could be confused with a similarly spelled word.

The word "infeasible" should likely be added; I am not sure why it was not already included.

You are free to file an issue upstream at https://github.com/kevina/wordlist/issues.  Please note that I am very busy with other projects so it may be a while before I address the issue.

Thanks,
Kevin
Flags: needinfo?(kevin.bugzilla)