Closed Bug 479334 Opened 11 years ago Closed 10 years ago

Firefox and Thunderbird need a better English language dictionary

Categories

(Core :: Spelling checker, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla1.9.3a5
Tracking Status
blocking2.0 --- -

People

(Reporter: cbartley, Assigned: caywood)

References

Details

Attachments

(2 files, 12 obsolete files)

908.03 KB, patch
Gavin
: review+
Details | Diff | Splinter Review
1.93 KB, application/x-zip
Details
In the last few minutes Thunderbird's spellchecker has particularly annoyed me.  Words that it flagged that I would like it to accept include:

hmm, refactor, UI, prioritization

OK, "hmm" may be debatable, and "refactor" is fairly recent usage.  But "UI" has probably been in wide usage for at least two decades.  It's probably pretty mainstream these days, what with several hundred million English speakers online.

And "prioritization"? You've got to be kidding me.

Hmm, the Firefox 3.0.6 spellchecker is cool with "prioritization" but the Thunderbird 2.0.0.19 spellchecker is not.  So maybe Thunderbird *is* kidding me.

Nevertheless, I think there's some room for improvement here.

Sdwilsh pointed me to http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html, describing an effort by Google to improve the open source Hunspell dictionaries.  Perhaps we could leverage that effort.

If nothing else, it would be nice to have the option of a regular (US/UK) English and a Geek-enhanced English dictionary.
Flags: blocking1.9.1?
BTW: Thunderbird2 is using myspell (?) while we fixed for Gecko1.9 to huntspell.
That's the reason for the different results.

And wyh only english, they generated additional files for 19 other languages..
(In reply to comment #1)

> And wyh only english, they generated additional files for 19 other languages..

That's a good point.  My bias was that I knew for sure the English dictionary was deficient.  But if we've got better dictionaries for many other languages, then we definitely should want to use them.
Shawn, unless I'm missing something very important I don't see how this can be a blocker.
Flags: blocking1.9.1? → blocking1.9.1-
(In reply to comment #3)
> Shawn, unless I'm missing something very important I don't see how this can be
> a blocker.
I frequently see people complaining about Firefox not having some common word X in the dictionary, so I think this would be a big usability win, and the cost should be pretty low.
(In reply to comment #4)
> (In reply to comment #3)
> > Shawn, unless I'm missing something very important I don't see how this can be
> > a blocker.
> I frequently see people complaining about Firefox not having some common word X
> in the dictionary, so I think this would be a big usability win, and the cost
> should be pretty low.

I'm assuming that Shawn meant that we should pull in those new Hunspell dictionaries mentioned in the blog post (http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html) if at all possible.

I don't know if it is possible for 1.9.1, but if so, it might indeed give us an easy usability win.
Flags: blocking1.9.1- → blocking1.9.1?
I'm not saying we shouldn't do it. I'm saying I wouldn't hold the release at all for it.
Flags: blocking1.9.1? → blocking1.9.1-
Can someone (Ryan? Shawn? Benjamin?) tell me if it's still possible to get this in for Firefox 3.5?

Google's Chromium changes to the wordlist include some incredibly frequent words, some of which occur nearly 1 BILLION TIMES in Google's database (YouTube, Skype, Obama).

And they've even provided a delta from the standard hunspell dictionary. 
It seems to be autogenerated, so there are a few strange words in it that could be taken out ("deze").

http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/en_US.dic_delta

Since Mozilla's 1.9.1 en-US.dic seems to be the newest version around, I think the best plan would be to add Google's list to mozilla_words.diff, then use that to patch en-US.dic. That would be more robust in case OpenOffice or Hunspell upgrade their dictionary.

I would also add the words from open dictionary suggestion bugs including:
434915, 494254, 355178, 448897, 436526, 488881, 495643, 460344, 460048
461142, 461143, 473675, 443029, 410970, 470750, 480295,

all of which are pretty simple wordlist changes.

If no one is currently driving this, and it will make it into the release, I'll make a patch in this bug over the next couple of days. Please let me know.
If you want to do that, go ahead. I'm not going to be available to do it for about a week.
I think it's too late for 3.5 final, but if it bakes on trunk, it can probably be nominated to for 1.9.1.x
Attached file My diff fixing Chromium's delta (obsolete) —
Chromium's autogenerated wordlist patch is a useful starting point, but it was contaminated by lots of the same things as Google: code, spammers, citation & social bookmarking services, other languages whose speakers use English (Dutch, Finnish, etc). 

Also, their original wordlist is a different version of the myspell dictionary, so there were some words that are already in Mozilla's en-US.dic.

I fixed as many of those problems as I could. Next patch will consolidate these and incorporate mozilla_words.diff and other bugs.
would there be a way to automate this so when they update, we can update quickly too?
I agree that option should be open. The idea was that the diff I just posted would be easy to apply to any new delta Chromium releases.

If we want to allow both en_US.dic and chromium-delta to be updated, we need four lists:

en_US.dic (mozilla's, from hunspell)
mozilla_words.diff (all other additions)

chromium-delta
chromium-delta.diff

then we patch chromium-delta with chromium-delta.diff, patch en_US.dic with mozilla_words.diff, and combine (and sort) the results.
includes the words from Ryan's latest patch in https://bugzilla.mozilla.org/show_bug.cgi?id=434915
This is just a checkpoint since I have to stop working on this for a bit -- will apply these patches and test locally as the next step.
Attached patch patch to en-US.dic (obsolete) — Splinter Review
OK, this patch contains the appropriate changes from Chromium plus other patches and bugs that were floating around.

It's been tested using a local installation of hunspell and my local firefox build. Correctness, incorrectness, suggestions, and nosuggest have all been tested.

I thought about adding parts of multiwords (vice versa, Abu Dhabi, Los Angeles etc.) but I skipped it with one or two exceptions (Rhode Island) because it seemed like it would just fill the dictionary with stuff which will have to be removed anyway. There's already another bug about upgrading hunspell and the tokenizer to deal with multiwords.

The partial diffs attached above have been obsoleted. I'll upload the final partial diffs once I get this patch accepted.
Attachment #381376 - Attachment is obsolete: true
Attachment #381379 - Attachment is obsolete: true
Attachment #381400 - Attachment is obsolete: true
Attachment #381402 - Attachment is obsolete: true
Attachment #381477 - Flags: review?(sdwilsh)
BTW there are a couple of duplicate words -- that's because Mozilla's using a newer base version of en-US.dic than Chromium. They don't do any harm but I can delete them if necessary.
Assignee: nobody → caywood
Status: NEW → ASSIGNED
Attachment #381477 - Flags: review?(sdwilsh)
Comment on attachment 381477 [details] [diff] [review]
patch to en-US.dic

(In reply to comment #18)
> BTW there are a couple of duplicate words -- that's because Mozilla's using a
> newer base version of en-US.dic than Chromium. They don't do any harm but I can
> delete them if necessary.
I would prefer it if you delete them.  Can you also make a proper hg patch for this please?
Blocks: 494254
Blocks: 434915
Blocks: 448897
Blocks: 379246
Blocks: 410970
Blocks: 479654
Blocks: 480295
Blocks: 383557
Blocks: 400849
Blocks: 402051
Blocks: 426485
Blocks: 452795
Blocks: 460344
Blocks: 461142
Blocks: 473675
Blocks: 474802
Blocks: 494888
Blocks: 495349
Blocks: 400750
Blocks: 436526
Blocks: 443029
Blocks: 460351
Blocks: 350007
Comment #16 on Bug 254814 From  Caolan McNamara

FWIW personally I see http://wordlist.sourceforge.net/ as the canonical
upstream for the en_US dictionaries.

----

This looks like a good upstream dictionary, although I'd be more convinced if the site maintainer could spell "offical" [sic]. The affix file hasn't changed, which is good. I'll compare the dictionaries to see what changes have been made.
See also http://sourceforge.net/tracker/?func=browse&group_id=10079&atid=1014602

for recent upstream bug fixes.
Blocks: 488881
Blocks: 495643
Blocks: 460048
Blocks: 237921
Attached patch Hg patch of dictionary directory (obsolete) — Splinter Review
OK, this patch is free of duplicates. Dictionary and affixes have been tested in local firefox and in command-line hunspell against a few hundred test words.

If you're going over the dictionary, I recommend diffing the first 1/3 of en-US.dic (uppercase first letter, mostly proper nouns) against the old Mozilla dictionary since that's where 99% of it came from, and it's far more inclusive for proper names than the Hunspell dictionary.

For the other 2/3 of en-US.dic (lowercase first letter), please look at the very end of upstream-hunspell.diff which is a diff against the newest Hunspell dictionary. This shows where I've changed and added words. 

I'll try to upstream those changes to once this patch gets accepted.
Attachment #381477 - Attachment is obsolete: true
Attachment #383642 - Flags: review?(sdwilsh)
Blocks: 412638
Comment on attachment 383642 [details] [diff] [review]
Hg patch of dictionary directory

Please specify the attachment number in README_mozilla.txt for the script(s) to use.  Giving instructions on how to use them would be useful here too.

r=sdwilsh with those changes.  I guess gavin should take a look at this as well once you upload a new patch.
Attachment #383642 - Flags: review?(sdwilsh) → review+
This was used to generate the new dictionary from upstream sources and patches.
Some test words, both valid and invalid, and a windows script (for inspiration only). 

To be used with a standalone build of Hunspell.
As requested, this includes a reference to the merge scripts, and directions for using them in README_mozilla.txt
Attachment #383642 - Attachment is obsolete: true
Attachment #384222 - Flags: review?(gavin.sharp)
Shawn, exactly what do you want Gavin to look at? 

It's been 3 weeks and he hasn't responded to my review request or an email. 

It might help if we were clearer about what we wanted from him.
Whiteboard: [need review gavin]
Version: unspecified → Trunk
I'm not a browser or toolkit peer, but gavin is, which is why he needs to review it.
I've spent a bit of time looking at the patch, and I'm finding it hard to make a judgement call as to whether this is the right approach, since the whole situation is rather complicated - multiple sources+patches, etc. Here's a brief summary based mostly on the contents of the README in the patch, just to make sure we're on the same page:

There are three main wordlists with your proposed patch applied: a) Hunspell, b) Google's, and c) Mozilla-specific wordlist.

1) Mozilla has historically made changes to a) (upstream-hunspell.diff)
2) You have made changes to b) to remove duplicates and "junk" from Google's wordlist (upstream-chromium.diff)

The README suggests that the final dictionary we use is the result of applying diffs 1) and 2) on a) and b) respectively, and then merging the result and weeding out duplicates. Assuming that's all correct, that leaves me with a few questions. I apologize if any of these have already been asked/answered.

- Where does the Mozilla-specific wordlist come into play in this process?
- What is the nature of the changes in upstream-hunspell.diff? Could they potentially be upstreamed into Hunspell?
  (How does the Mozilla-specific wordlist differ from upstream-hunspell.diff?)
- How did you decide which of the changes to make in upstream-chromium.diff?

Another potential issue is licensing. Have we made sure that all these files are appropriately licensed? I'm not a license expert by any means, but Gerv and Axel have some experience dealing with licenses of dictionaries specifically, so maybe they can help. If I'm reading http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt says that some of Chromium files are LGPL licensed, and I think that's been a problem in the past?
Let me CC the players of bug 397150, when we fixed the licensing issues and the upstream dictionaries the last time.

Not fully reading this bug nor the other, but this sounds like a fork of an old version of the dictionary. Some of the changes suggested in the bugs here should obviously be fixed upstream, and then we'd take them downstream again in an orderly fashion.
Hi Gavin, 

Thanks for the review. Your understanding of the approach is basically right. 

- Where does the Mozilla-specific wordlist come into play in this process?
- (How does the Mozilla-specific wordlist differ from upstream-hunspell.diff?)

The Mozilla-specific wordlist (which I generated) is also merged in the last step, at the time the chromium and hunspell lists are merged.

This wordlist supersedes some of the old patch to upstream (which was called mozilla-words.diff, IIRC, but had a lot more words than just mozilla products in it). 

I think it makes sense to keep this wordlist separate because judging from previous bugs the Mozilla community seems to want to include certain words that the upstream dictionaries are unlikely to take. The upstream dictionaries already include words such as Firefox, but I'm weeding these few duplicates out manually.

- What is the nature of the changes in upstream-hunspell.diff? Could they
potentially be upstreamed into Hunspell?

There are two types of changes. 

1) a ton of proper nouns and names that were in the old mozilla dictionary which I am reluctant to remove without further consideration -- I have filed Bug 499444 about trying to figure out what to do with these. 

2) Many words and conjugations that were simply left out of the upstream dictionary. I'm happy to try to upstream as many as possible of these changes into hunspell, and I'm planning to do so once this patch gets accepted. 

- How did you decide which of the changes to make in upstream-chromium.diff?

It wasn't very hard -- most of the words were either obviously valid, popular internet products (YouTube etc.) or obvious autogenerated junk (as I say above: web code, drug spam, prolific citation & social bookmarking services, language names such as Svenska (found on English pages from Sweden to switch page language)). There were only a few borderline cases, mostly citation and social bookmarking services, and I tried to pick the ones that I had heard of or had a relatively large number of news references.

It's a reasonably short diff, so it shouldn't be too hard to maintain.

- Have we made sure that all these files are appropriately licensed?

Yes. 

Chromium words are now tri-licensed. http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html
Hunspell's license hasn't changed.
My changes and mozilla-words are all tri-licensed, obviously.

That's all the word lists. So there are no licensing problems.
Axel,

- Not fully reading this bug nor the other, but this sounds like a fork of an old version of the dictionary.

No, that's not right at all. As Shawn and I discussed above, I'm actually making an effort to get the dictionary unforked, by allowing us to take updates from upstream in as orderly a manner as possible.

- Some of the changes suggested in the bugs here should obviously be fixed upstream, and then we'd take them downstream again in an orderly fashion.

Respectfully, I strongly disagree that we should upstream everything before fixing the dictionary. AFAIK the en-US dictionary has gone 4-5 years since being patched, and the priority should be to make it work for users, then upstream the most obvious changes. That shouldn't add much difficulty because of the work I've put in to allow us to take upstream updates.
...and by "4-5 years since being patched" I really mean "1 1/2 years since Bug 397150 updated the dictionary to Hunspell's dictionary, which was itself 2-3 years out of date regarding sociopolitical or internet developments."
OK, it's been another month. Besides the answers I gave above, is there any other information Gavin or anyone else needs to approve this patch?

To summarize:

There are zero licensing issues with this patch. 

The major question seems to be whether it's the right approach or not. I believe it's both better than the existing approach, and flexible in case someone wants to change in the future. Please don't make the perfect the enemy of the good.
Comment on attachment 384222 [details] [diff] [review]
Hg dictionary directory patch with corrections

gavin is on vacation, but mconnor can probably look at this on Friday.
Attachment #384222 - Flags: review?(gavin.sharp) → review?(mconnor)
(In reply to comment #29)
> http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt
> says that some of Chromium files are LGPL licensed

I'm not confident that we've addressed this concern appropriately. I'm assuming the chromium blog post that mentions tri-licensings refers to the "affix file" mentioned in that README? What about the original dictionary file that is, according to that README, only LGPL licensed? Is there updated licensing information available that isn't reflected in that README?

I'm sorry to be putting up roadblocks here, and I can understand that the process of trying to get this landed may be frustrating given the delays in getting a response that you've experienced, but I think we need to be confident about the licensing situation before we get this landed.
Gavin, please reread the Chromium blog entry. I'm absolutely certain they're talking about their automatically generated list of new words and the affixes.

"The Google translation team used their language models to generate a sorted list of the most popular words in each language. This was cross-checked with the Hunspell dictionaries to generate a list of the top 1000 words not present in each dictionary....We hope that by using the the existing GPL/LGPL/MPL tri-license for our addition, our work can be picked up by other users of Hunspell."

The license of Chrome's original dictionary is not relevant to us because we're not using any of it. 

We're using the existing Mozilla dictionary (license already OK) and the current Hunspell updates which are BSD licensed (http://wordlist.sourceforge.net/hunspell-readme). 

(In any case, Chrome's using an old version of the Hunspell dictionary with old licensing, if they updated to the current version we're using, http://wordlist.sourceforge.net/hunspell-readme, it would be BSD not LGPL.)
To make things more concrete, 

http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt

is irrelevant because we're not using en-US.aff or en-US.dic from that source.

We are only using the file

http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/en_US.dic_delta

which is one file containing both words and affixes. It's the only file Google added to the original Hunspell dictionary and is therefore exactly the set of changes Brett W is talking about on the Chromium blog post.
I just checked in
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/third_party/hunspell/dictionaries/README.dic_delta?revision=23723&view=markup
to clarify the licensing of the .dic_delta files (it's BSD-like, not LGPL).
Thanks for clearing that up, guys.

One minor question, but otherwise I think we should get this landed:

upstream-chromium.diff seems to remove some valid words, e.g. acknowledgement/cancelling/mitochondrial . Why is that? Are these just duplicates?
Yes, those words were in the newer version of the Hunspell dictionary and the chromium patch would just add duplicates. They are included in the final merged dictionary patch.
I just tried applying upstream-chromium.diff to chromium_en_US.dic_delta (using files from that patch):

patch chromium_en_US.dic_delta upstream-chromium.diff

and it failed to apply cleanly. Am I doing something wrong?

Applying upstream-hunspell.diff to a downloaded copy of en-US.dic from http://wordlist.sourceforge.net/ worked. Should we check in a copy of the original hunspell-en_US-20081205.dic just for reference?
From my reading, looks like you guys cleared this up on your own. CC me again if you need me :-)

Gerv
Whiteboard: [need review gavin] → [need review mconnor]
If somebody can address comment 42, I will get this landed ASAP and maybe we can get it in for 1.9.2 (Firefox 3.6).
Sorry about the delay -- I was finishing and defending my dissertation. I'll figure out what's going on with the patching failure reported by Gavin possibly tonight or Thursday at the latest.
Attached patch Hg patch to dictionary directory (obsolete) — Splinter Review
This dictionary patch should work. It's been tested in my local Firefox install and with the test scripts attached above. I've also added the words from Bug 499593.
Attachment #384222 - Attachment is obsolete: true
Attached file New dictionary merging scripts (obsolete) —
Gavin, I think the problem was that I sucked at using patch. I've changed the approach slightly so the merge-dictionaries script now takes care of all the dictionary generation, including patching upstream dictionaries and should do the right thing and generate a new dictionary when applied to the dictionary directory in the previous patch.
Attachment #384220 - Attachment is obsolete: true
Gavin - has comment 42 been addressed to your satisfaction?
Whiteboard: [need review mconnor] → [needs input gavin]
OK, so a very small patch just got pushed in Bug 434915 adding a few random words like "Barack", but it's missing the majority of words in Chromium's list as well as the words in this patch.

I tried to write a principled general fix, but this approach doesn't seem to be getting me anything but frustration...and now I'm going to have to update my patch again because the dictionary has changed again.

Should I just start over and submit a few dozen small patches adding random words to the dictionary? Or should I just walk away?
(In reply to comment #49)
> Should I just start over and submit a few dozen small patches adding random
> words to the dictionary? Or should I just walk away?
Submitting a new set of patches would be appreciated.  I'll prod gavin once he's back from vacation.  I think we really really want to take your changes.
My suggestion to submit a few dozen small patches was facetious. It's a bit hard for me to make new patches right now due to travel and my build environment not being set up. I can if I have to, but there will probably be more delay.

Based on the last exchanges of comments I think my last patch here should be satisfactory, although it's up to Gavin. 

The easiest and quickest fix would be to revert Bug 434915 and apply my last patch, since mine includes and supersedes Ryan's work in that bug.
Please add patches to bug 499593, which is the correct place for additions (we can´t create a new bug and-or a new patch every time a single word is added).
(In reply to comment #52)
> Please add patches to bug 499593, which is the correct place for additions (we
> can´t create a new bug and-or a new patch every time a single word is added).
This bug adds a ton of words, and has a bunch of patches already that are almost ready to land.  It doesn't make sense at this time to suddenly jump to a different bug.
Looking into this again. Here's what I did:

- applied attachment 404003 [details] [diff] [review] to current m-c tip (ignoring failed hunks to en-US.dic and mozilla_words.diff)
- from that tree, copied chromium_en_US.dic_delta, upstream-chromium.diff, upstream-hunspell.diff, and mozilla-specific.txt to a temp working directory
- extracted attachment 404005 [details] to that same directory
- downloaded en-US.dic from http://sourceforge.net/projects/wordlist/ , renamed it to hunspell-en_US-20081205.dic
- ran merge-dictionaries

Output was:
==========
patching file chromium_en_US.dic_delta
patching file hunspell-en_US-20081205.dic
sed: 1: "hunspell-patched-stripped": extra characters at the end of h command

Sorted 57432 lines in ascending alphabetical order

Duplicated entries:
None!
delete blank lines at the front,
and double check the end of file for junk.
the line count should agree with this number:
   57428
add this to the beginning of the file.
==========

(not sure what the sed warning is about)

I diffed the output from that (en-US.dic) to our current in-tree en-US.dic. diffstat on that gave:

 en-US.dic | 1912 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 1643 insertions(+), 269 deletions(-)

Reviewing that diff, I saw a single change that threw me off: accoutrement is replaced by accoutrements/M . Most of the other changes went the other way (simplified to the root word and extra flags added). Not sure if this matters either way, but thought I'd mention it.

I'm a bit confused about the output of merge-dictionaries - it seems like the "strip leading newlines and add the line count" part is supposed to be done manually after running the script? Any reason not to have the script do that automatically?

README_mozilla.txt refers to en-US.aff having two additional rules compared to the upstream version, but says they "aren't important". I'd like to understand that better - I'm not sure why this worth having in the README. If they're really not important seems like we should just remove them and the comment. Or if they are, make sure they're included in upstream-hunspell.diff, I guess?

We should probably also update that readme to avoid references to manual patching of the files in favor of just using the script (which is already mentions at the top) - people who care about the details can just read the script. I'd also reword the description of upstream-hunspell to be something like "Mozilla-specific additions to the upstream Hunspell dictionary. Some of these changes should be upstreamed, and others should probably just be removed (bug 499444)."

All that aside, I think this is ready to land. We need to re-generate an hg patch against mozilla-central, but I can do that. Not sure what the best place is for the scripts themselves to live. We could just put them in extensions/spellcheck/locales/en-US/hunspell/ with all the other files (that's pretty much where they expect to be run), but I wonder if having them under locales/en-US will be problematic from an l10n point of view?
Whiteboard: [needs input gavin]
Hi Gavin,

Sorry this took so long; I was traveling a lot and didn't have a build environment set up.

This is a better patch for the following reasons:

1. Scripts are more concise
2. Dictionary sources are in their own directory now
3. Updated with new words from Bug 499593.

Regarding your comments from last time:

(not sure what the sed warning is about)

-- Fixed.

I diffed the output from that (en-US.dic) to our current in-tree en-US.dic.
diffstat on that gave:

 en-US.dic | 1912
+++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 1643 insertions(+), 269 deletions(-)

Reviewing that diff, I saw a single change that threw me off: accoutrement is
replaced by accoutrements/M . Most of the other changes went the other way
(simplified to the root word and extra flags added). Not sure if this matters
either way, but thought I'd mention it.

-- The wordlist with affixes is autogenerated, and sometimes the
affixes work for maximum compression rather than linguistic logic. In
this particular case "accoutrement" gets derived from accoutre/L so
accoutrements (and its possessive) get specified independently.
So it's not a bug, just something that's not easily human-readable.

I'm a bit confused about the output of merge-dictionaries - it seems like the
"strip leading newlines and add the line count" part is supposed to be done
manually after running the script? Any reason not to have the script do that
automatically?

-- Nope, fixed. I was having line ending problems before, but I fixed them.

README_mozilla.txt refers to en-US.aff having two additional rules compared to
the upstream version, but says they "aren't important". I'd like to understand
that better - I'm not sure why this worth having in the README. If they're
really not important seems like we should just remove them and the comment. Or
if they are, make sure they're included in upstream-hunspell.diff, I guess?

-- Those two additional rules were REP alot a_lot and REP sitted sat. 
Hunspell now handles the alot case correctly, although myspell didn't,
so that rule is obsolete. "Sitted" seems like a rare mistake which is unlikely
to be upstreamed, so I dropped it. I'll drop the text from the README.

We should probably also update that readme to avoid references to manual
patching of the files in favor of just using the script (which is already
mentions at the top) - people who care about the details can just read the
script. I'd also reword the description of upstream-hunspell to be something
like "Mozilla-specific additions to the upstream Hunspell dictionary. Some of
these changes should be upstreamed, and others should probably just be removed
(bug 499444)."

-- OK, done.

All that aside, I think this is ready to land. We need to re-generate an hg
patch against mozilla-central, but I can do that. Not sure what the best place
is for the scripts themselves to live. We could just put them in
extensions/spellcheck/locales/en-US/hunspell/ with all the other files (that's
pretty much where they expect to be run), but I wonder if having them under
locales/en-US will be problematic from an l10n point of view?

-- Since the scripts themselves are specific to en-US dictionary
providers it probably makes sense for them to live in a localized directory.
I don't really see how that could cause l10n problems, but you know more about
this sort of thing than I do.
I've put scripts (and the data sources) in extensions/spellcheck/locales/en-US/hunspell/dictionary-sources

Please help me get this one in! I promise to be responsive to comments (at least for the next couple weeks!)
Attachment #404003 - Attachment is obsolete: true
Attachment #404005 - Attachment is obsolete: true
Attachment #439276 - Flags: superreview?(gavin.sharp)
Fixes problem with affix file in previous patch.

I'm currently running with this one; no problems.
Attachment #439276 - Attachment is obsolete: true
Attachment #440971 - Flags: superreview?(gavin.sharp)
Attachment #439276 - Flags: superreview?(gavin.sharp)
Runs some quick dictionary tests on mac/linux systems as well as on windows. 

You will need to build hunspell to run these.
Attachment #384221 - Attachment is obsolete: true
I think we should block on this for Firefox 4.0 because it will provide a much better user experience to our users.
blocking2.0: --- → ?
Whiteboard: [needs review gavin]
Gavin, I can probably help with the review here as well if you're busy with other stuff.
Shawn and Ehsan, that's great news. I'm still here if anything further needs to be done with the patch.
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell

Let's get this landed - thanks again for your work and patience, M.
Attachment #440971 - Flags: superreview?(gavin.sharp) → review+
Whiteboard: [needs review gavin]
http://hg.mozilla.org/mozilla-central/rev/e227a12b97a5
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a5
No longer blocks: 237921
No longer blocks: 434915
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell

This might be the most useful patch I've ever seen!  Let's take this on branches as well.
Attachment #440971 - Flags: approval1.9.2.6?
Attachment #440971 - Flags: approval1.9.1.11?
Blocks: 237921
Is this something that could be ported over to the Canadian dictionary?? (en-CA)  I would hate to miss out on the goodness this patch provides only because I live north of the border. :(
We don't have an en-CA dictionary (or an en-CA localized build...). Unless you've somehow gone out and made your own, you're presumably using the en-US build, so you won't be missing out on anything.
DRAT!  Sorry, I'm using the Canadian English Dictionary Add-on at https://addons.mozilla.org/en-US/firefox/addon/3653/ which I had assumed was maintained by mozilla.org.  

Guess I'll have to pass on a request to the author of that extension.
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell

approval minus for the branches, seems way out of scope.
Attachment #440971 - Flags: approval1.9.2.6?
Attachment #440971 - Flags: approval1.9.2.6-
Attachment #440971 - Flags: approval1.9.1.11?
Attachment #440971 - Flags: approval1.9.1.11-
blocking2.0: ? → -
(In reply to Matt Caywood from comment #55)

> -- Those two additional rules were REP alot a_lot and REP sitted sat. 
> Hunspell now handles the alot case correctly, although myspell didn't,
> so that rule is obsolete.

Hunspell as shipped in Firefox 4 through Firefox nightly and Seamonkey 2.3.1 doesn't, and so it appears this check-in broke bug 340634, which I reopened; its change to en_US.aff needs to be reapplied.
Attachment #381402 - Attachment is patch: true
Attachment #381402 - Attachment mime type: application/octet-stream → text/plain
You need to log in before you can comment on or make changes to this bug.