Closed
Bug 479334
Opened 16 years ago
Closed 14 years ago
Firefox and Thunderbird need a better English language dictionary
Categories
(Core :: Spelling checker, defect)
Core
Spelling checker
Tracking
()
RESOLVED
FIXED
mozilla1.9.3a5
Tracking | Status | |
---|---|---|
blocking2.0 | --- | - |
People
(Reporter: cbartley, Assigned: caywood)
References
Details
Attachments
(2 files, 12 obsolete files)
908.03 KB,
patch
|
Gavin
:
review+
dveditz
:
approval1.9.2.7-
dveditz
:
approval1.9.1.11-
|
Details | Diff | Splinter Review |
1.93 KB,
application/x-zip
|
Details |
In the last few minutes Thunderbird's spellchecker has particularly annoyed me. Words that it flagged that I would like it to accept include:
hmm, refactor, UI, prioritization
OK, "hmm" may be debatable, and "refactor" is fairly recent usage. But "UI" has probably been in wide usage for at least two decades. It's probably pretty mainstream these days, what with several hundred million English speakers online.
And "prioritization"? You've got to be kidding me.
Hmm, the Firefox 3.0.6 spellchecker is cool with "prioritization" but the Thunderbird 2.0.0.19 spellchecker is not. So maybe Thunderbird *is* kidding me.
Nevertheless, I think there's some room for improvement here.
Sdwilsh pointed me to http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html, describing an effort by Google to improve the open source Hunspell dictionaries. Perhaps we could leverage that effort.
If nothing else, it would be nice to have the option of a regular (US/UK) English and a Geek-enhanced English dictionary.
Updated•16 years ago
|
Flags: blocking1.9.1?
Comment 1•16 years ago
|
||
BTW: Thunderbird2 is using myspell (?) while we fixed for Gecko1.9 to huntspell.
That's the reason for the different results.
And wyh only english, they generated additional files for 19 other languages..
Reporter | ||
Comment 2•16 years ago
|
||
(In reply to comment #1)
> And wyh only english, they generated additional files for 19 other languages..
That's a good point. My bias was that I knew for sure the English dictionary was deficient. But if we've got better dictionaries for many other languages, then we definitely should want to use them.
Comment 3•16 years ago
|
||
Shawn, unless I'm missing something very important I don't see how this can be a blocker.
Flags: blocking1.9.1? → blocking1.9.1-
Comment 4•16 years ago
|
||
(In reply to comment #3)
> Shawn, unless I'm missing something very important I don't see how this can be
> a blocker.
I frequently see people complaining about Firefox not having some common word X in the dictionary, so I think this would be a big usability win, and the cost should be pretty low.
Reporter | ||
Comment 5•16 years ago
|
||
(In reply to comment #4)
> (In reply to comment #3)
> > Shawn, unless I'm missing something very important I don't see how this can be
> > a blocker.
> I frequently see people complaining about Firefox not having some common word X
> in the dictionary, so I think this would be a big usability win, and the cost
> should be pretty low.
I'm assuming that Shawn meant that we should pull in those new Hunspell dictionaries mentioned in the blog post (http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html) if at all possible.
I don't know if it is possible for 1.9.1, but if so, it might indeed give us an easy usability win.
Flags: blocking1.9.1- → blocking1.9.1?
Comment 6•16 years ago
|
||
I'm not saying we shouldn't do it. I'm saying I wouldn't hold the release at all for it.
Flags: blocking1.9.1? → blocking1.9.1-
Comment 7•16 years ago
|
||
Assignee | ||
Comment 8•15 years ago
|
||
Can someone (Ryan? Shawn? Benjamin?) tell me if it's still possible to get this in for Firefox 3.5?
Google's Chromium changes to the wordlist include some incredibly frequent words, some of which occur nearly 1 BILLION TIMES in Google's database (YouTube, Skype, Obama).
And they've even provided a delta from the standard hunspell dictionary.
It seems to be autogenerated, so there are a few strange words in it that could be taken out ("deze").
http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/en_US.dic_delta
Since Mozilla's 1.9.1 en-US.dic seems to be the newest version around, I think the best plan would be to add Google's list to mozilla_words.diff, then use that to patch en-US.dic. That would be more robust in case OpenOffice or Hunspell upgrade their dictionary.
I would also add the words from open dictionary suggestion bugs including:
434915, 494254, 355178, 448897, 436526, 488881, 495643, 460344, 460048
461142, 461143, 473675, 443029, 410970, 470750, 480295,
all of which are pretty simple wordlist changes.
If no one is currently driving this, and it will make it into the release, I'll make a patch in this bug over the next couple of days. Please let me know.
Comment 9•15 years ago
|
||
If you want to do that, go ahead. I'm not going to be available to do it for about a week.
Comment 10•15 years ago
|
||
I think it's too late for 3.5 final, but if it bakes on trunk, it can probably be nominated to for 1.9.1.x
Assignee | ||
Comment 11•15 years ago
|
||
Assignee | ||
Comment 12•15 years ago
|
||
Chromium's autogenerated wordlist patch is a useful starting point, but it was contaminated by lots of the same things as Google: code, spammers, citation & social bookmarking services, other languages whose speakers use English (Dutch, Finnish, etc).
Also, their original wordlist is a different version of the myspell dictionary, so there were some words that are already in Mozilla's en-US.dic.
I fixed as many of those problems as I could. Next patch will consolidate these and incorporate mozilla_words.diff and other bugs.
Comment 13•15 years ago
|
||
would there be a way to automate this so when they update, we can update quickly too?
Assignee | ||
Comment 14•15 years ago
|
||
I agree that option should be open. The idea was that the diff I just posted would be easy to apply to any new delta Chromium releases.
If we want to allow both en_US.dic and chromium-delta to be updated, we need four lists:
en_US.dic (mozilla's, from hunspell)
mozilla_words.diff (all other additions)
chromium-delta
chromium-delta.diff
then we patch chromium-delta with chromium-delta.diff, patch en_US.dic with mozilla_words.diff, and combine (and sort) the results.
Assignee | ||
Comment 15•15 years ago
|
||
includes the words from Ryan's latest patch in https://bugzilla.mozilla.org/show_bug.cgi?id=434915
Assignee | ||
Comment 16•15 years ago
|
||
This is just a checkpoint since I have to stop working on this for a bit -- will apply these patches and test locally as the next step.
Assignee | ||
Comment 17•15 years ago
|
||
OK, this patch contains the appropriate changes from Chromium plus other patches and bugs that were floating around.
It's been tested using a local installation of hunspell and my local firefox build. Correctness, incorrectness, suggestions, and nosuggest have all been tested.
I thought about adding parts of multiwords (vice versa, Abu Dhabi, Los Angeles etc.) but I skipped it with one or two exceptions (Rhode Island) because it seemed like it would just fill the dictionary with stuff which will have to be removed anyway. There's already another bug about upgrading hunspell and the tokenizer to deal with multiwords.
The partial diffs attached above have been obsoleted. I'll upload the final partial diffs once I get this patch accepted.
Attachment #381376 -
Attachment is obsolete: true
Attachment #381379 -
Attachment is obsolete: true
Attachment #381400 -
Attachment is obsolete: true
Attachment #381402 -
Attachment is obsolete: true
Attachment #381477 -
Flags: review?(sdwilsh)
Assignee | ||
Comment 18•15 years ago
|
||
BTW there are a couple of duplicate words -- that's because Mozilla's using a newer base version of en-US.dic than Chromium. They don't do any harm but I can delete them if necessary.
Updated•15 years ago
|
Assignee: nobody → caywood
Status: NEW → ASSIGNED
Updated•15 years ago
|
Attachment #381477 -
Flags: review?(sdwilsh)
Comment 19•15 years ago
|
||
Comment on attachment 381477 [details] [diff] [review]
patch to en-US.dic
(In reply to comment #18)
> BTW there are a couple of duplicate words -- that's because Mozilla's using a
> newer base version of en-US.dic than Chromium. They don't do any harm but I can
> delete them if necessary.
I would prefer it if you delete them. Can you also make a proper hg patch for this please?
Assignee | ||
Comment 20•15 years ago
|
||
Comment #16 on Bug 254814 From Caolan McNamara
FWIW personally I see http://wordlist.sourceforge.net/ as the canonical
upstream for the en_US dictionaries.
----
This looks like a good upstream dictionary, although I'd be more convinced if the site maintainer could spell "offical" [sic]. The affix file hasn't changed, which is good. I'll compare the dictionaries to see what changes have been made.
Assignee | ||
Comment 21•15 years ago
|
||
See also http://sourceforge.net/tracker/?func=browse&group_id=10079&atid=1014602
for recent upstream bug fixes.
Assignee | ||
Comment 22•15 years ago
|
||
OK, this patch is free of duplicates. Dictionary and affixes have been tested in local firefox and in command-line hunspell against a few hundred test words.
If you're going over the dictionary, I recommend diffing the first 1/3 of en-US.dic (uppercase first letter, mostly proper nouns) against the old Mozilla dictionary since that's where 99% of it came from, and it's far more inclusive for proper names than the Hunspell dictionary.
For the other 2/3 of en-US.dic (lowercase first letter), please look at the very end of upstream-hunspell.diff which is a diff against the newest Hunspell dictionary. This shows where I've changed and added words.
I'll try to upstream those changes to once this patch gets accepted.
Attachment #381477 -
Attachment is obsolete: true
Attachment #383642 -
Flags: review?(sdwilsh)
Comment 23•15 years ago
|
||
Comment on attachment 383642 [details] [diff] [review]
Hg patch of dictionary directory
Please specify the attachment number in README_mozilla.txt for the script(s) to use. Giving instructions on how to use them would be useful here too.
r=sdwilsh with those changes. I guess gavin should take a look at this as well once you upload a new patch.
Attachment #383642 -
Flags: review?(sdwilsh) → review+
Assignee | ||
Comment 24•15 years ago
|
||
This was used to generate the new dictionary from upstream sources and patches.
Assignee | ||
Comment 25•15 years ago
|
||
Some test words, both valid and invalid, and a windows script (for inspiration only).
To be used with a standalone build of Hunspell.
Assignee | ||
Comment 26•15 years ago
|
||
As requested, this includes a reference to the merge scripts, and directions for using them in README_mozilla.txt
Attachment #383642 -
Attachment is obsolete: true
Assignee | ||
Updated•15 years ago
|
Attachment #384222 -
Flags: review?(gavin.sharp)
Assignee | ||
Comment 27•15 years ago
|
||
Shawn, exactly what do you want Gavin to look at?
It's been 3 weeks and he hasn't responded to my review request or an email.
It might help if we were clearer about what we wanted from him.
Updated•15 years ago
|
Whiteboard: [need review gavin]
Version: unspecified → Trunk
Comment 28•15 years ago
|
||
I'm not a browser or toolkit peer, but gavin is, which is why he needs to review it.
Comment 29•15 years ago
|
||
I've spent a bit of time looking at the patch, and I'm finding it hard to make a judgement call as to whether this is the right approach, since the whole situation is rather complicated - multiple sources+patches, etc. Here's a brief summary based mostly on the contents of the README in the patch, just to make sure we're on the same page:
There are three main wordlists with your proposed patch applied: a) Hunspell, b) Google's, and c) Mozilla-specific wordlist.
1) Mozilla has historically made changes to a) (upstream-hunspell.diff)
2) You have made changes to b) to remove duplicates and "junk" from Google's wordlist (upstream-chromium.diff)
The README suggests that the final dictionary we use is the result of applying diffs 1) and 2) on a) and b) respectively, and then merging the result and weeding out duplicates. Assuming that's all correct, that leaves me with a few questions. I apologize if any of these have already been asked/answered.
- Where does the Mozilla-specific wordlist come into play in this process?
- What is the nature of the changes in upstream-hunspell.diff? Could they potentially be upstreamed into Hunspell?
(How does the Mozilla-specific wordlist differ from upstream-hunspell.diff?)
- How did you decide which of the changes to make in upstream-chromium.diff?
Another potential issue is licensing. Have we made sure that all these files are appropriately licensed? I'm not a license expert by any means, but Gerv and Axel have some experience dealing with licenses of dictionaries specifically, so maybe they can help. If I'm reading http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt says that some of Chromium files are LGPL licensed, and I think that's been a problem in the past?
Comment 30•15 years ago
|
||
Let me CC the players of bug 397150, when we fixed the licensing issues and the upstream dictionaries the last time.
Not fully reading this bug nor the other, but this sounds like a fork of an old version of the dictionary. Some of the changes suggested in the bugs here should obviously be fixed upstream, and then we'd take them downstream again in an orderly fashion.
Assignee | ||
Comment 31•15 years ago
|
||
Hi Gavin,
Thanks for the review. Your understanding of the approach is basically right.
- Where does the Mozilla-specific wordlist come into play in this process?
- (How does the Mozilla-specific wordlist differ from upstream-hunspell.diff?)
The Mozilla-specific wordlist (which I generated) is also merged in the last step, at the time the chromium and hunspell lists are merged.
This wordlist supersedes some of the old patch to upstream (which was called mozilla-words.diff, IIRC, but had a lot more words than just mozilla products in it).
I think it makes sense to keep this wordlist separate because judging from previous bugs the Mozilla community seems to want to include certain words that the upstream dictionaries are unlikely to take. The upstream dictionaries already include words such as Firefox, but I'm weeding these few duplicates out manually.
- What is the nature of the changes in upstream-hunspell.diff? Could they
potentially be upstreamed into Hunspell?
There are two types of changes.
1) a ton of proper nouns and names that were in the old mozilla dictionary which I am reluctant to remove without further consideration -- I have filed Bug 499444 about trying to figure out what to do with these.
2) Many words and conjugations that were simply left out of the upstream dictionary. I'm happy to try to upstream as many as possible of these changes into hunspell, and I'm planning to do so once this patch gets accepted.
- How did you decide which of the changes to make in upstream-chromium.diff?
It wasn't very hard -- most of the words were either obviously valid, popular internet products (YouTube etc.) or obvious autogenerated junk (as I say above: web code, drug spam, prolific citation & social bookmarking services, language names such as Svenska (found on English pages from Sweden to switch page language)). There were only a few borderline cases, mostly citation and social bookmarking services, and I tried to pick the ones that I had heard of or had a relatively large number of news references.
It's a reasonably short diff, so it shouldn't be too hard to maintain.
- Have we made sure that all these files are appropriately licensed?
Yes.
Chromium words are now tri-licensed. http://blog.chromium.org/2009/02/spell-check-dictionary-improvements.html
Hunspell's license hasn't changed.
My changes and mozilla-words are all tri-licensed, obviously.
That's all the word lists. So there are no licensing problems.
Assignee | ||
Comment 32•15 years ago
|
||
Axel,
- Not fully reading this bug nor the other, but this sounds like a fork of an old version of the dictionary.
No, that's not right at all. As Shawn and I discussed above, I'm actually making an effort to get the dictionary unforked, by allowing us to take updates from upstream in as orderly a manner as possible.
- Some of the changes suggested in the bugs here should obviously be fixed upstream, and then we'd take them downstream again in an orderly fashion.
Respectfully, I strongly disagree that we should upstream everything before fixing the dictionary. AFAIK the en-US dictionary has gone 4-5 years since being patched, and the priority should be to make it work for users, then upstream the most obvious changes. That shouldn't add much difficulty because of the work I've put in to allow us to take upstream updates.
Assignee | ||
Comment 33•15 years ago
|
||
...and by "4-5 years since being patched" I really mean "1 1/2 years since Bug 397150 updated the dictionary to Hunspell's dictionary, which was itself 2-3 years out of date regarding sociopolitical or internet developments."
Assignee | ||
Comment 34•15 years ago
|
||
OK, it's been another month. Besides the answers I gave above, is there any other information Gavin or anyone else needs to approve this patch?
To summarize:
There are zero licensing issues with this patch.
The major question seems to be whether it's the right approach or not. I believe it's both better than the existing approach, and flexible in case someone wants to change in the future. Please don't make the perfect the enemy of the good.
Comment 35•15 years ago
|
||
Comment on attachment 384222 [details] [diff] [review]
Hg dictionary directory patch with corrections
gavin is on vacation, but mconnor can probably look at this on Friday.
Attachment #384222 -
Flags: review?(gavin.sharp) → review?(mconnor)
Comment 36•15 years ago
|
||
(In reply to comment #29)
> http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt
> says that some of Chromium files are LGPL licensed
I'm not confident that we've addressed this concern appropriately. I'm assuming the chromium blog post that mentions tri-licensings refers to the "affix file" mentioned in that README? What about the original dictionary file that is, according to that README, only LGPL licensed? Is there updated licensing information available that isn't reflected in that README?
I'm sorry to be putting up roadblocks here, and I can understand that the process of trying to get this landed may be frustrating given the delays in getting a response that you've experienced, but I think we need to be confident about the licensing situation before we get this landed.
Assignee | ||
Comment 37•15 years ago
|
||
Gavin, please reread the Chromium blog entry. I'm absolutely certain they're talking about their automatically generated list of new words and the affixes.
"The Google translation team used their language models to generate a sorted list of the most popular words in each language. This was cross-checked with the Hunspell dictionaries to generate a list of the top 1000 words not present in each dictionary....We hope that by using the the existing GPL/LGPL/MPL tri-license for our addition, our work can be picked up by other users of Hunspell."
The license of Chrome's original dictionary is not relevant to us because we're not using any of it.
We're using the existing Mozilla dictionary (license already OK) and the current Hunspell updates which are BSD licensed (http://wordlist.sourceforge.net/hunspell-readme).
(In any case, Chrome's using an old version of the Hunspell dictionary with old licensing, if they updated to the current version we're using, http://wordlist.sourceforge.net/hunspell-readme, it would be BSD not LGPL.)
Assignee | ||
Comment 38•15 years ago
|
||
To make things more concrete,
http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/README_en_US.txt
is irrelevant because we're not using en-US.aff or en-US.dic from that source.
We are only using the file
http://src.chromium.org/svn/trunk/src/chrome/third_party/hunspell/dictionaries/en_US.dic_delta
which is one file containing both words and affixes. It's the only file Google added to the original Hunspell dictionary and is therefore exactly the set of changes Brett W is talking about on the Chromium blog post.
Comment 39•15 years ago
|
||
I just checked in
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/third_party/hunspell/dictionaries/README.dic_delta?revision=23723&view=markup
to clarify the licensing of the .dic_delta files (it's BSD-like, not LGPL).
Comment 40•15 years ago
|
||
Thanks for clearing that up, guys.
One minor question, but otherwise I think we should get this landed:
upstream-chromium.diff seems to remove some valid words, e.g. acknowledgement/cancelling/mitochondrial . Why is that? Are these just duplicates?
Assignee | ||
Comment 41•15 years ago
|
||
Yes, those words were in the newer version of the Hunspell dictionary and the chromium patch would just add duplicates. They are included in the final merged dictionary patch.
Comment 42•15 years ago
|
||
I just tried applying upstream-chromium.diff to chromium_en_US.dic_delta (using files from that patch):
patch chromium_en_US.dic_delta upstream-chromium.diff
and it failed to apply cleanly. Am I doing something wrong?
Applying upstream-hunspell.diff to a downloaded copy of en-US.dic from http://wordlist.sourceforge.net/ worked. Should we check in a copy of the original hunspell-en_US-20081205.dic just for reference?
Comment 43•15 years ago
|
||
From my reading, looks like you guys cleared this up on your own. CC me again if you need me :-)
Gerv
Updated•15 years ago
|
Whiteboard: [need review gavin] → [need review mconnor]
Attachment #384222 -
Flags: review?(mconnor) → review+
Comment 44•15 years ago
|
||
If somebody can address comment 42, I will get this landed ASAP and maybe we can get it in for 1.9.2 (Firefox 3.6).
Assignee | ||
Comment 45•15 years ago
|
||
Sorry about the delay -- I was finishing and defending my dissertation. I'll figure out what's going on with the patching failure reported by Gavin possibly tonight or Thursday at the latest.
Assignee | ||
Comment 46•15 years ago
|
||
This dictionary patch should work. It's been tested in my local Firefox install and with the test scripts attached above. I've also added the words from Bug 499593.
Attachment #384222 -
Attachment is obsolete: true
Assignee | ||
Comment 47•15 years ago
|
||
Gavin, I think the problem was that I sucked at using patch. I've changed the approach slightly so the merge-dictionaries script now takes care of all the dictionary generation, including patching upstream dictionaries and should do the right thing and generate a new dictionary when applied to the dictionary directory in the previous patch.
Attachment #384220 -
Attachment is obsolete: true
Comment 48•15 years ago
|
||
Gavin - has comment 42 been addressed to your satisfaction?
Whiteboard: [need review mconnor] → [needs input gavin]
Assignee | ||
Comment 49•15 years ago
|
||
OK, so a very small patch just got pushed in Bug 434915 adding a few random words like "Barack", but it's missing the majority of words in Chromium's list as well as the words in this patch.
I tried to write a principled general fix, but this approach doesn't seem to be getting me anything but frustration...and now I'm going to have to update my patch again because the dictionary has changed again.
Should I just start over and submit a few dozen small patches adding random words to the dictionary? Or should I just walk away?
Comment 50•15 years ago
|
||
(In reply to comment #49)
> Should I just start over and submit a few dozen small patches adding random
> words to the dictionary? Or should I just walk away?
Submitting a new set of patches would be appreciated. I'll prod gavin once he's back from vacation. I think we really really want to take your changes.
Assignee | ||
Comment 51•15 years ago
|
||
My suggestion to submit a few dozen small patches was facetious. It's a bit hard for me to make new patches right now due to travel and my build environment not being set up. I can if I have to, but there will probably be more delay.
Based on the last exchanges of comments I think my last patch here should be satisfactory, although it's up to Gavin.
The easiest and quickest fix would be to revert Bug 434915 and apply my last patch, since mine includes and supersedes Ryan's work in that bug.
Comment 52•15 years ago
|
||
Please add patches to bug 499593, which is the correct place for additions (we can´t create a new bug and-or a new patch every time a single word is added).
Comment 53•15 years ago
|
||
(In reply to comment #52)
> Please add patches to bug 499593, which is the correct place for additions (we
> can´t create a new bug and-or a new patch every time a single word is added).
This bug adds a ton of words, and has a bunch of patches already that are almost ready to land. It doesn't make sense at this time to suddenly jump to a different bug.
Comment 54•15 years ago
|
||
Looking into this again. Here's what I did:
- applied attachment 404003 [details] [diff] [review] to current m-c tip (ignoring failed hunks to en-US.dic and mozilla_words.diff)
- from that tree, copied chromium_en_US.dic_delta, upstream-chromium.diff, upstream-hunspell.diff, and mozilla-specific.txt to a temp working directory
- extracted attachment 404005 [details] to that same directory
- downloaded en-US.dic from http://sourceforge.net/projects/wordlist/ , renamed it to hunspell-en_US-20081205.dic
- ran merge-dictionaries
Output was:
==========
patching file chromium_en_US.dic_delta
patching file hunspell-en_US-20081205.dic
sed: 1: "hunspell-patched-stripped": extra characters at the end of h command
Sorted 57432 lines in ascending alphabetical order
Duplicated entries:
None!
delete blank lines at the front,
and double check the end of file for junk.
the line count should agree with this number:
57428
add this to the beginning of the file.
==========
(not sure what the sed warning is about)
I diffed the output from that (en-US.dic) to our current in-tree en-US.dic. diffstat on that gave:
en-US.dic | 1912 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 1643 insertions(+), 269 deletions(-)
Reviewing that diff, I saw a single change that threw me off: accoutrement is replaced by accoutrements/M . Most of the other changes went the other way (simplified to the root word and extra flags added). Not sure if this matters either way, but thought I'd mention it.
I'm a bit confused about the output of merge-dictionaries - it seems like the "strip leading newlines and add the line count" part is supposed to be done manually after running the script? Any reason not to have the script do that automatically?
README_mozilla.txt refers to en-US.aff having two additional rules compared to the upstream version, but says they "aren't important". I'd like to understand that better - I'm not sure why this worth having in the README. If they're really not important seems like we should just remove them and the comment. Or if they are, make sure they're included in upstream-hunspell.diff, I guess?
We should probably also update that readme to avoid references to manual patching of the files in favor of just using the script (which is already mentions at the top) - people who care about the details can just read the script. I'd also reword the description of upstream-hunspell to be something like "Mozilla-specific additions to the upstream Hunspell dictionary. Some of these changes should be upstreamed, and others should probably just be removed (bug 499444)."
All that aside, I think this is ready to land. We need to re-generate an hg patch against mozilla-central, but I can do that. Not sure what the best place is for the scripts themselves to live. We could just put them in extensions/spellcheck/locales/en-US/hunspell/ with all the other files (that's pretty much where they expect to be run), but I wonder if having them under locales/en-US will be problematic from an l10n point of view?
Updated•15 years ago
|
Whiteboard: [needs input gavin]
Assignee | ||
Comment 55•15 years ago
|
||
Hi Gavin,
Sorry this took so long; I was traveling a lot and didn't have a build environment set up.
This is a better patch for the following reasons:
1. Scripts are more concise
2. Dictionary sources are in their own directory now
3. Updated with new words from Bug 499593.
Regarding your comments from last time:
(not sure what the sed warning is about)
-- Fixed.
I diffed the output from that (en-US.dic) to our current in-tree en-US.dic.
diffstat on that gave:
en-US.dic | 1912
+++++++++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 1643 insertions(+), 269 deletions(-)
Reviewing that diff, I saw a single change that threw me off: accoutrement is
replaced by accoutrements/M . Most of the other changes went the other way
(simplified to the root word and extra flags added). Not sure if this matters
either way, but thought I'd mention it.
-- The wordlist with affixes is autogenerated, and sometimes the
affixes work for maximum compression rather than linguistic logic. In
this particular case "accoutrement" gets derived from accoutre/L so
accoutrements (and its possessive) get specified independently.
So it's not a bug, just something that's not easily human-readable.
I'm a bit confused about the output of merge-dictionaries - it seems like the
"strip leading newlines and add the line count" part is supposed to be done
manually after running the script? Any reason not to have the script do that
automatically?
-- Nope, fixed. I was having line ending problems before, but I fixed them.
README_mozilla.txt refers to en-US.aff having two additional rules compared to
the upstream version, but says they "aren't important". I'd like to understand
that better - I'm not sure why this worth having in the README. If they're
really not important seems like we should just remove them and the comment. Or
if they are, make sure they're included in upstream-hunspell.diff, I guess?
-- Those two additional rules were REP alot a_lot and REP sitted sat.
Hunspell now handles the alot case correctly, although myspell didn't,
so that rule is obsolete. "Sitted" seems like a rare mistake which is unlikely
to be upstreamed, so I dropped it. I'll drop the text from the README.
We should probably also update that readme to avoid references to manual
patching of the files in favor of just using the script (which is already
mentions at the top) - people who care about the details can just read the
script. I'd also reword the description of upstream-hunspell to be something
like "Mozilla-specific additions to the upstream Hunspell dictionary. Some of
these changes should be upstreamed, and others should probably just be removed
(bug 499444)."
-- OK, done.
All that aside, I think this is ready to land. We need to re-generate an hg
patch against mozilla-central, but I can do that. Not sure what the best place
is for the scripts themselves to live. We could just put them in
extensions/spellcheck/locales/en-US/hunspell/ with all the other files (that's
pretty much where they expect to be run), but I wonder if having them under
locales/en-US will be problematic from an l10n point of view?
-- Since the scripts themselves are specific to en-US dictionary
providers it probably makes sense for them to live in a localized directory.
I don't really see how that could cause l10n problems, but you know more about
this sort of thing than I do.
I've put scripts (and the data sources) in extensions/spellcheck/locales/en-US/hunspell/dictionary-sources
Please help me get this one in! I promise to be responsive to comments (at least for the next couple weeks!)
Assignee | ||
Comment 56•15 years ago
|
||
Attachment #404003 -
Attachment is obsolete: true
Attachment #404005 -
Attachment is obsolete: true
Attachment #439276 -
Flags: superreview?(gavin.sharp)
Assignee | ||
Comment 57•15 years ago
|
||
Fixes problem with affix file in previous patch.
I'm currently running with this one; no problems.
Attachment #439276 -
Attachment is obsolete: true
Attachment #440971 -
Flags: superreview?(gavin.sharp)
Attachment #439276 -
Flags: superreview?(gavin.sharp)
Assignee | ||
Comment 58•15 years ago
|
||
Runs some quick dictionary tests on mac/linux systems as well as on windows.
You will need to build hunspell to run these.
Attachment #384221 -
Attachment is obsolete: true
Comment 59•14 years ago
|
||
I think we should block on this for Firefox 4.0 because it will provide a much better user experience to our users.
blocking2.0: --- → ?
Updated•14 years ago
|
Whiteboard: [needs review gavin]
Comment 60•14 years ago
|
||
Gavin, I can probably help with the review here as well if you're busy with other stuff.
Assignee | ||
Comment 61•14 years ago
|
||
Shawn and Ehsan, that's great news. I'm still here if anything further needs to be done with the patch.
Comment 62•14 years ago
|
||
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell
Let's get this landed - thanks again for your work and patience, M.
Attachment #440971 -
Flags: superreview?(gavin.sharp) → review+
Updated•14 years ago
|
Whiteboard: [needs review gavin]
Updated•14 years ago
|
Keywords: checkin-needed
Comment 63•14 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a5
Comment 64•14 years ago
|
||
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell
This might be the most useful patch I've ever seen! Let's take this on branches as well.
Attachment #440971 -
Flags: approval1.9.2.6?
Attachment #440971 -
Flags: approval1.9.1.11?
Comment 65•14 years ago
|
||
Is this something that could be ported over to the Canadian dictionary?? (en-CA) I would hate to miss out on the goodness this patch provides only because I live north of the border. :(
Comment 66•14 years ago
|
||
We don't have an en-CA dictionary (or an en-CA localized build...). Unless you've somehow gone out and made your own, you're presumably using the en-US build, so you won't be missing out on anything.
Comment 67•14 years ago
|
||
DRAT! Sorry, I'm using the Canadian English Dictionary Add-on at https://addons.mozilla.org/en-US/firefox/addon/3653/ which I had assumed was maintained by mozilla.org.
Guess I'll have to pass on a request to the author of that extension.
Comment 68•14 years ago
|
||
Comment on attachment 440971 [details] [diff] [review]
Patch to /extensions/spellcheck/locales/en-US/hunspell
approval minus for the branches, seems way out of scope.
Attachment #440971 -
Flags: approval1.9.2.6?
Attachment #440971 -
Flags: approval1.9.2.6-
Attachment #440971 -
Flags: approval1.9.1.11?
Attachment #440971 -
Flags: approval1.9.1.11-
Updated•14 years ago
|
blocking2.0: ? → -
Comment 69•13 years ago
|
||
(In reply to Matt Caywood from comment #55)
> -- Those two additional rules were REP alot a_lot and REP sitted sat.
> Hunspell now handles the alot case correctly, although myspell didn't,
> so that rule is obsolete.
Hunspell as shipped in Firefox 4 through Firefox nightly and Seamonkey 2.3.1 doesn't, and so it appears this check-in broke bug 340634, which I reopened; its change to en_US.aff needs to be reapplied.
Updated•13 years ago
|
Attachment #381402 -
Attachment is patch: true
Attachment #381402 -
Attachment mime type: application/octet-stream → text/plain
You need to log in
before you can comment on or make changes to this bug.
Description
•