Closed Bug 1133363 Opened 9 years ago Closed 9 years ago

Please update en-US Hunspell dictionary to latest version

Categories

(Core :: Spelling checker, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: kevin.bugzilla, Assigned: ananuti)

Attachments

(4 files)

Please update the en-US Hunspell dictionary to the latest upstream version, which you can find at http://wordlist.aspell.net/dicts/.

The latest version includes a large number of new words, including many neologisms (newly invented words) such as "selfie" and "smartwatch", and some not so new words that should have been added a long time ago such as "inbox".

It would also be nice if you could review your list of additional words and submit any that are not already included in the latest version using the GitHub issue tracker at https://github.com/kevina/wordlist/issues.

Thanks,
Kevin Atkinson
Upstream maintainer of the en_US Hunspell dictionary.
Ekanan, is this something you'd be interested in working on by chance? :)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(ananuti)
Yeah, I think I can pull it off.
Flags: needinfo?(ananuti)
Assignee: nobody → ananuti
Status: NEW → ASSIGNED
Ryan, 

Ehsan is AFK. Can you review this?
Flags: needinfo?(ryanvm)
Unfortunately, I'm not a peer for this code :(. He'll be back this week, though. Thanks for taking this!
Flags: needinfo?(ryanvm)
Attachment #8565340 - Flags: review?(ehsan)
Attachment #8565341 - Flags: review?(ehsan)
Comment on attachment 8565340 [details] [diff] [review]
Part 1 - add missing /M suffix to en-US dictionary

Review of attachment 8565340 [details] [diff] [review]:
-----------------------------------------------------------------

It's a bit sad to lose the current model of storing our local modifications on top of the upstream version.  I originally wanted to ask you to reconstruct a new upstream-hunspell.diff on top of the new baseline, but thinking more about this, that would require some work which you may or may not want to do.  (If you're up to do that, I would really like to review a separate patch, but that can happen in its own bug!)

But can you please add a new hunspell-en_US-20150215.dic file as a snapshot of the new upstream dictionary?
Attachment #8565340 - Flags: review?(ehsan) → review+
Attachment #8565341 - Flags: review?(ehsan) → review+
Ekanan, can you please generate a unified version of upstream-hunspell.diff with all of our modifications on top of the new upstream version (which includes the old Chromium modifications that we took years ago as well) and submit it to upstream as per comment 1?  I think Kevin would be interested in having a look at it, and see which ones he would like to take upstream.  That should hopefully help benefit upstream from our work in our fork.  :-)

Thanks!
Flags: needinfo?(ananuti)
Attached file upstream-hunspell.diff
diff -ru hunspell-en_US-2015.02.15/en-US.dic extensions/spellcheck/locales/en-US/hunspell/en-US.dic
Flags: needinfo?(ananuti)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #8)
> Comment on attachment 8565340 [details] [diff] [review]
> Part 1 - add missing /M suffix to en-US dictionary
> 
> Review of attachment 8565340 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It's a bit sad to lose the current model of storing our local modifications
> on top of the upstream version.  I originally wanted to ask you to
> reconstruct a new upstream-hunspell.diff on top of the new baseline, but
> thinking more about this, that would require some work which you may or may
> not want to do.  (If you're up to do that, I would really like to review a
> separate patch, but that can happen in its own bug!)
> 
> But can you please add a new hunspell-en_US-20150215.dic file as a snapshot
> of the new upstream dictionary?

I tried adding a new hunspell-en_US-20150215.dic file into dictionary-sources and modified edit-dictionary and merge-dictionaries and generated upstream-hunspell.diff by |diff -r hunspell-en_US-20150215.dic ../en-US.dic > upstream-hunspell.diff|.

When I run sh merge-dictionaries, a lot of dupe words are found. Maybe I did something wrong. :(
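
For reference, the diff step described above could also be reproduced without the external diff tool. What follows is only a sketch in Python using the standard difflib module; the function name dictionary_diff and the sample entries are hypothetical and are not part of the Mozilla scripts.

```python
import difflib


def dictionary_diff(upstream_lines, local_lines, upstream_name, local_name):
    """Return unified-diff lines describing how local_lines differ from upstream_lines."""
    return list(
        difflib.unified_diff(
            upstream_lines,
            local_lines,
            fromfile=upstream_name,
            tofile=local_name,
            lineterm="",  # the inputs carry no trailing newlines
            n=0,          # no context lines, so only changed entries are listed
        )
    )


# Hypothetical sample data: the first line of a .dic file is the entry count.
upstream = ["2", "inbox/MS", "selfie/SM"]
local = ["3", "inbox/MS", "lolcat/S", "selfie/SM"]

diff_lines = dictionary_diff(
    upstream, local, "hunspell-en_US-20150215.dic", "en-US.dic"
)
for line in diff_lines:
    print(line)
```

In a real run the two lists would come from reading the snapshot and the in-tree en-US.dic line by line.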
Flags: needinfo?(ehsan)
Keywords: checkin-needed
Sorry, I had to back this out for test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=6956225&repo=mozilla-inbound
Flags: needinfo?(ananuti)
Now, lolcat is not an invalid word.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d78c3ed22c93
Flags: needinfo?(ehsan)
Flags: needinfo?(ananuti)
Attachment #8569138 - Flags: review?(ehsan)
Comment on attachment 8569138 [details] [diff] [review]
Part 3 - Fix tests

Review of attachment 8569138 [details] [diff] [review]:
-----------------------------------------------------------------

Hahaha :)
Attachment #8569138 - Flags: review?(ehsan) → review+
Keywords: checkin-needed
(In reply to Ekanan Ketunuti from comment #11)
> I tried adding a new hunspell-en_US-20150215.dic file into
> dictionary-sources and modified edit-dictionary and merge-dictionaries and
> generated upstream-hunspell.diff by |diff -r hunspell-en_US-20150215.dic
> ../en-US.dic > upstream-hunspell.diff|.
> 
> When I run sh merge-dictionaries, a lot of dupe words are found. Maybe I did
> something wrong. :(

Are they words that appear in <https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/chromium_en_US.dic_delta>?  If yes, you should remove them from the chromium dictionary.
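
One way to check this is to compare the bare words in the two files. The snippet below is a minimal sketch in Python, assuming both files hold one entry per line in the usual Hunspell word/FLAGS form; the helper names and the sample entries are hypothetical and do not come from merge-dictionaries.

```python
def read_words(lines):
    """Return the set of bare words from Hunspell .dic lines, with affix flags dropped."""
    words = set()
    for index, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        # The first line of a .dic file is normally an entry count, not a word.
        if index == 0 and entry.isdigit():
            continue
        # Simplification: everything before the first "/" is treated as the word.
        words.add(entry.split("/", 1)[0])
    return words


def duplicated_in_delta(upstream_lines, delta_lines):
    """Return the sorted words that appear in both the upstream list and the delta."""
    return sorted(read_words(upstream_lines) & read_words(delta_lines))


# Hypothetical sample data standing in for the two files.
upstream = ["3", "inbox/MS", "selfie/SM", "smartwatch/MS"]
delta = ["inbox", "lolcat/S", "selfie/S"]

print(duplicated_in_delta(upstream, delta))  # ['inbox', 'selfie']
```

Any word this reports is already provided by the new upstream dictionary, so it would be a candidate for removal from the Chromium delta.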
Uh, I think it came from chromium. I will try again tomorrow.
Keywords: checkin-needed
I hit an error when merging the dictionaries:

> sh merge-dictionaries
> Patching Chromium dictionary
> patching file chromium_en_US.dic_delta-patched
> Hunk #1 FAILED at 1.
> 1 out of 1 hunk FAILED -- saving rejects to file chromium_en_US.dic_delta-patched.rej
> Patching Hunspell dictionary
> patching file hunspell-en_US-20150215.dic-patched
> Updating Chromium affixes
> licence//101
> licences/14
> licencing/14
> merge-dictionaries: 51: merge-dictionaries: warn: not found

This is harder than I thought. I'd rather not invest time looking into it, so I'm giving up on that solution. :(
Keywords: checkin-needed
I am a bit confused about what is going on.

Is this still a work in progress or have the changes been committed?

I would also really hate to lose the "current model of storing our local modifications on top of the upstream version".  It would be much easier for me to review additions if I knew the reason they were added.
(In reply to Kevin Atkinson from comment #20)
> I am a bit confused about what is going on.
> 
> Is this still a work in progress or have the changes been committed?
> 
> I would also really hate to lose the "current model of storing our local
> modifications on top of the upstream version".  It would be much easier for
> me to review additions if I knew the reason they were added.

The changes in comment #19 have been committed.
I know nothing about Mozilla development.  But if it will help, I am willing to invest some time to fix merge-dictionaries so that you won't have to lose the current model of storing the local modifications on top of the upstream version.  It should not be that difficult for me to get it working again.

Does this sound like something you want?  If so, please give me about a week to get this done.

Thanks,
Kevin
(In reply to Ekanan Ketunuti from comment #18)
> I found an error when merge dictionary:
> 
> > sh merge-dictionaries
> > Patching Chromium dictionary
> > patching file chromium_en_US.dic_delta-patched
> > Hunk #1 FAILED at 1.
> > 1 out of 1 hunk FAILED -- saving rejects to file chromium_en_US.dic_delta-patched.rej
> > Patching Hunspell dictionary
> > patching file hunspell-en_US-20150215.dic-patched
> > Updating Chromium affixes
> > licence//101
> > licences/14
> > licencing/14
> > merge-dictionaries: 51: merge-dictionaries: warn: not found
> 
> This is harder than I thought. I'd rather not invest time looking into it,
> so I'm giving up on that solution. :(

Can you please attach the chromium_en_US.dic_delta-affix-converted file that is left in the current working directory when you hit that error?  (BTW that error happens because "warn" is not a shell construct (AFAIK) <http://mxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/merge-dictionaries#51>)  Feel free to move this part to a new bug if needed.
Flags: needinfo?(ananuti)
Hi Kevin!

Firstly, the changes have been committed now as Carsten mentioned.

(In reply to Kevin Atkinson from comment #22)
> I know nothing about Mozilla development.  But if it will help I am willing
> to invest some time to fix merge-dictionaries so that you won't have to
> lose the current model of storing the local modifications on top of the
> upstream version.  It should not be that difficult for me to get it working
> again.
> 
> Does this sound like something you want?  If so, please give me about a week
> to get this done.

Yes, this is very helpful!  Right now this diff <https://hg.mozilla.org/integration/mozilla-inbound/raw-file/d594db3c891b/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/upstream-hunspell.diff> is based on the old hunspell-en_US-20081205.dic dictionary, and I would very much like to have a diff based on the latest en-US dictionary (and then work towards getting that diff to become empty eventually so that we just use the pristine upstream version!).

It seems Ekanan got stuck doing that as per the discussion above, so it would be great if you guys can coordinate so that both of you don't end up doing the same work.  :-)  The main issue is that we have traditionally used two diffs on top of upstream, one from the chromium project (chromium_en_US.dic_delta) and one ours (upstream-hunspell.diff).  As the chromium diff has not been updated in years, perhaps it's best to just merge those words into upstream-hunspell.diff, as there is no real value to maintain both separately.

Please let me know if you need help!
Hi Ehsan,

I will have a look.  As I see it, I would rather not merge chromium_en_US.dic_delta into yours because if I am going to consider words for addition it is very helpful to know where they originated from.

Using diff is not really the best way to manage wordlist changes.  A list of additions and subtractions (if any) from the upstream dictionary seems like a more reasonable thing to do to avoid the problems of conflicts.  Writing a Perl script to manage these lists should not be too hard (I notice the shell script already uses Perl so I assume Perl is an acceptable requirement).  Would that be okay?
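
To make the idea concrete, here is a rough sketch of how such lists could be applied. It is written in Python purely for illustration, it is not the script proposed above, and the function names and sample entries are hypothetical. It assumes entries in the usual word/FLAGS form and keeps at most one entry per word, which is a simplification since a real dictionary may list the same word more than once with different flags.

```python
def word_of(entry):
    """Return the bare word of a word/FLAGS entry (simplified: split on the first '/')."""
    return entry.split("/", 1)[0]


def apply_word_lists(upstream_entries, additions, removals):
    """Return upstream minus removals plus additions, sorted, with one entry per word."""
    removed = {word_of(entry) for entry in removals}
    merged = {
        word_of(entry): entry
        for entry in upstream_entries
        if word_of(entry) not in removed
    }
    for entry in additions:
        # setdefault keeps the upstream entry (and its flags) when the word already exists.
        merged.setdefault(word_of(entry), entry)
    return [merged[word] for word in sorted(merged)]


def to_dic(entries):
    """Render entries as .dic text: the entry count on the first line, then one entry per line."""
    return "\n".join([str(len(entries))] + entries) + "\n"


# Hypothetical sample data.
upstream = ["inbox/MS", "junkword", "selfie/SM"]
additions = ["lolcat/S", "inbox"]
removals = ["junkword"]

result = apply_word_lists(upstream, additions, removals)
print(to_dic(result), end="")
```

Because additions and removals are plain lists rather than patch hunks, a new upstream release cannot make them fail to apply; an addition that upstream later adopts simply becomes a no-op.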

In any case I will first try to fix the script and the conflict before I try anything more involved.

Ekanan: Let me know if you are going to put any additional time into this.

Kevin
Blocks: 1137544
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #23)
> Feel free to move
> this part to a new bug if needed.

Filed 1137544


(In reply to Kevin Atkinson from comment #25)
> 
> 
> Ekanan: Let me know if you are going to put any additional time into this.
> 
> Kevin

I don't really know how to handle this Chromium shit. I'd rather spend energy on the other stuff.
Flags: needinfo?(ananuti)
Hi Ekanan,

I am very confused about what is going on here and why this bug report was closed and a new one created.

It looks like Ekanan mainly added words and didn't really update the source list.  This is unfortunate as I also removed a number of junk entries.

I am working on a better solution for you but am now confused about where to file it.

BTW: As far as the Chromium list goes, I admit it can likely go and at this point is just adding complexity.

Kevin
(In reply to Kevin Atkinson from comment #28)
> Hi Ekanan,

Sorry Ekanan, I addressed the wrong person; I meant to address Ehsan.
Flags: needinfo?(ehsan)
(In reply to Kevin Atkinson from comment #25)
> Hi Ehsan,
> 
> I will have a look.  As I see it, I would rather not merge
> chromium_en_US.dic_delta into yours because if I am going to consider words
> for addition it is very helpful to know where they originated from.
> 
> Using diff is not really the best way to manage wordlist changes.  A list of
> additions and subtractions (if any) from the upstream dictionary seems like a
> more reasonable thing to do to avoid the problems of conflicts.  Writing a
> Perl script to manage these lists should not be too hard (I notice the shell
> script already uses Perl so I assume Perl is an acceptable requirement). 
> Would that be okay?

Sure.  We have a slight preference for Python over Perl if you can use it, but if not, Perl is fine too, as these scripts will only be run by people who want to update the dictionary on their local machines.

(In reply to Kevin Atkinson from comment #28)
> Hi Ekanan,
> 
> I am very confused about what is going on here and why this bug report was
> closed and a new one created.

Just to follow the usual Mozilla process.  We usually use new bugs for new code changes, and close bugs when the corresponding patches for them land.  But I don't want to overburden you with all of the stuff!  Please continue the discussion either here or in bug 1137544 (the latter would be better for us).  I'm watching both places.  :-)

> It looks like Ekanan mainly added words and didn't really update the source
> list.  This is unfortunate as I also removed a number of junk entries.

Yes, he intentionally left hunspell-en_US-20081205.dic untouched, in order to keep merge-dictionaries working for now.  You should create a new file probably called hunspell-en_US-20150215.dic with the new source list.

> I am working on a better solution for you but am now confused where to file
> it.

Preferably in bug 1137544, but don't worry too much about our process details.  :-)

> BTW: As far as the Chromium list goes, I admit it can likely go and at this
> point is just adding complexity.

Sounds great!
Flags: needinfo?(ehsan)
For reference here is a link to the upstream version the attached dictionary is created from: http://downloads.sourceforge.net/wordlist/hunspell-en_US-2015.02.15.zip