en-US dictionary: Additional Mozilla words need to be cleaned up. Other issues discussed: See comment #10 and comment #12.

RESOLVED FIXED in Firefox 45

Status

Core
Spelling checker
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Jorg K (GMT+2), Assigned: Jorg K (GMT+2))

Tracking

Trunk
mozilla46
Points:
---

Firefox Tracking Flags

(firefox45 fixed, firefox46 fixed)

Details

Attachments

(4 attachments, 10 obsolete attachments)

3.43 KB, text/plain
3.89 KB, text/plain
57.68 KB, text/plain
76.25 KB, patch (Jorg K (GMT+2): review+)
(Assignee)

Description

2 years ago
Firstly, in bug 1137544 the Mozilla maintained en-US dictionary got refreshed and many words previously contained got removed causing bugs:
Bug 1183512, bug 1198052.
These bugs wouldn't have been caused, had SCOWL's en_US-large been used:
http://app.aspell.net/lookup?dict=en_US-large&words=relict%0D%0Aresiduary%0D%0Aenforceability%0D%0Aadvisor%0D%0Ainfeasible%0D%0Aclich%E9%0D%0ABogot%E1%0D%0Ainfeasible%0D%0Aunfeasible
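The lookup links in this bug pass the words as one CRLF-separated list, with accented characters percent-encoded as ISO-8859-1 (é becomes %E9). A minimal sketch of building such a link follows; the helper name `build_lookup_url` is hypothetical, and the query format is only inferred from the URLs quoted here, not from any documented API.

```python
from urllib.parse import urlencode

# Endpoint copied from the links quoted in this bug.
LOOKUP_URL = "http://app.aspell.net/lookup"


def build_lookup_url(words, dictionary="en_US-large"):
    """Return a lookup URL with the words joined by CRLF, as in the links above."""
    query = urlencode(
        {"dict": dictionary, "words": "\r\n".join(words)},
        encoding="latin-1",  # assumption: the links encode é as %E9 (ISO-8859-1)
    )
    return f"{LOOKUP_URL}?{query}"


print(build_lookup_url(["relict", "residuary", "cliché"]))
```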

Secondly, there seems to be a problem with Mozilla's update process, since words that don't exist in the upstream dictionary ("Aurthur", bug 301712) seem to hang around for ten years:
http://app.aspell.net/lookup?dict=en_US-large&words=Aurthur

It appears less than optimal to add single words only to en-US.dic, as seems to be the practice:
https://hg.mozilla.org/mozilla-central/log/tip/extensions/spellcheck/locales/en-US/hunspell/en-US.dic

Instead, they should be fed upstream and/or kept in a separate file "Mozilla knows better than Aspell.net", like Fukushima:
http://app.aspell.net/lookup?dict=en_US-large&words=Fukushima
so they can easily be added on the next merge.

Thirdly, we should ensure that accented words are included. Many are already included in en_US-large. There was talk of switching to UTF-8 in bug 1144254 to be able to include "naïve", which is not in the upstream dictionary, but that doesn't seem to be necessary, since the charset is already ISO8859-1 and that includes accented characters including ï.
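The charset claim is easy to check: a word fits the existing dictionary encoding if every character can be encoded as ISO-8859-1. The sketch below only tests encodability; it says nothing about how the spell checker itself handles such words. The helper name `fits_charset` is hypothetical, and "Dvořák" is an illustrative counterexample that does not come from this bug.

```python
def fits_charset(word, charset="iso-8859-1"):
    """Return True if every character of word can be encoded in charset."""
    try:
        word.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# "Dvořák" contains ř (U+0159), which lies outside ISO-8859-1.
for word in ["naïve", "cliché", "Bogotá", "Dvořák"]:
    print(word, fits_charset(word))
```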

BTW, "naïve", if we want it, would go into the "Mozilla knows better" file.
(Assignee)

Updated

2 years ago
Blocks: 1183512
(Assignee)

Updated

2 years ago
Blocks: 1198052
(Assignee)

Updated

2 years ago
Blocks: 1144254
(Assignee)

Updated

2 years ago
Blocks: 301712
(Assignee)

Comment 1

2 years ago
I'd like to hear from Ehsan what he thinks about this.
Flags: needinfo?(ehsan)
OS: Unspecified → All
Hardware: Unspecified → All
(Assignee)

Comment 2

2 years ago
I'd start with this one:
http://sourceforge.net/projects/wordlist/files/speller/2015.08.24/hunspell-en_US-large-2015.08.24.zip/download

I can see the processing that is done in
extensions/spellcheck/locales/en-US/hunspell/dictionary-sources.
There's a README and a script to do the merge (make-new-dict).

Without looking at the script in great detail, it appears to merge the existing Mozilla dictionary with the upstream one and to generate a list of words which only Mozilla has (5-mozilla-added) and Mozilla doesn't want (??) (5-mozilla-removed: currently only "ABCs" and "megadeaths").

The "Mozilla only" words are basically 12000 proper names (uppercase) (including "Aurthur", bug 301712) and about 1000 "Mozilla knows better" words (lowercase), some technical like "webdesign", "wildcard" or "widescreen", or some Mozilla specific terms, like "XUL" or "Firefox". Some of these "know better" words are actually already included in the en_US-large dictionary, like "Firefox" or "assignee". 

So my suggestion is to simplify the process by merging two files:
The upstream file and the "Mozilla knows better" file.
We should never remove words from the upstream file.

Or alternatively a few "Mozilla knows better" files:
1) Proper names 2) Mozilla terms ("XUL") 3) "general know better". Using three files would have the advantage of being able to request the content of the "general" file upstream.

If a bug comes in for a word to be added, and we agree to add it, we add to the "Mozilla knows better" file and if it's of general interest, request it upstream, like "autocomplete" or "datasheet".
If a bug comes in for a word to be removed, we only ever remove from the "knows better" file, otherwise it's a "wontfix" and we refer it to the maintainer.
After every change we regenerate the Mozilla dictionary.
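As a rough sketch of the regeneration step, assuming all inputs are plain expanded word lists (one word per entry, no affix flags): the generated dictionary is the union of the upstream list and the separately maintained addition lists, so nothing from upstream is ever dropped. The function name and the sample words chosen for each list are illustrative only.

```python
def generate_word_list(upstream_words, *addition_lists):
    """Union the upstream words with every additions list; upstream is never reduced."""
    merged = set(upstream_words)
    for additions in addition_lists:
        merged.update(additions)
    return sorted(merged)


upstream = ["assignee", "calendar", "Firefox"]
proper_names = ["Fukushima"]
mozilla_terms = ["XUL", "Firefox"]
general = ["autocomplete", "datasheet"]

print(generate_word_list(upstream, proper_names, mozilla_terms, general))
```

Because the output is derived entirely from its inputs, a word's origin can always be traced back to one of the source lists.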

The question is what to do with "know better" words which start to appear in the upstream file, for example "Firefox" and "assignee" are in the en_US-large file. These could be moved to a "retired know better" file or deleted altogether; or left alone so they are guaranteed to never disappear. In case that we maintain three "know better" files, there could be more specific processing: Never remove Mozilla terms but remove "general know better" words since they most likely got included due to the upstream request.
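One way to make that decision visible is to split an additions list into words that are still missing upstream and words that upstream has since picked up. This is a sketch under the same plain-word-list assumption; what to do with the second group (retire, delete, or keep) remains a policy choice, and the helper name is hypothetical.

```python
def split_additions(additions, upstream_words):
    """Return (still_needed, now_upstream) for an additions list."""
    upstream = set(upstream_words)
    unique = set(additions)
    still_needed = sorted(w for w in unique if w not in upstream)
    now_upstream = sorted(w for w in unique if w in upstream)
    return still_needed, now_upstream


upstream = ["Firefox", "assignee", "calendar"]
additions = ["Firefox", "assignee", "autocomplete", "datasheet"]

still_needed, now_upstream = split_additions(additions, upstream)
print(still_needed)   # candidates to keep and request upstream
print(now_upstream)   # candidates for a "retired" list
```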

One word about the "Mozilla know better" file. We should take a hard look at the 1000 extra terms, since there are many duds, especially of the form "<verb>'s" like: "relabel's", "remind's", etc. While "relabel" and "remind" are good to have, the possessive forms are just plain wrong. Also among the proper names there appear an awful lot of duds like "Brnaba", "Brnaby", "Brittne", "Brittaney", just to pick a few at letter "B".

Altogether, a messy problem.

Comment 3

2 years ago
I suggest you first start by familiarizing yourself with the current process: <https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/README>  Lots of the things you're discovering here are documented in that file.

About simplifying the process, the process for adding single words is pretty much as simple as it gets.  Please see the documentation.  The process for merging upstream changes is also pretty straightforward.  It may not be the simplest thing in the world but since these merges happen very infrequently it's enough for one person to know how to do that.

About switching to the en_US-large wordlist, I think that SCOWL recommends against that <http://wordlist.aspell.net/scowl-readme/>, so I would like to know what Kevin thinks about doing that.
Flags: needinfo?(ehsan)
(Assignee)

Comment 4

2 years ago
(In reply to :Ehsan Akhgari from comment #3)
> About switching to the en_US-large wordlist, I think that SCOWL recommends
> against that <http://wordlist.aspell.net/scowl-readme/>, so I would like to
> know what Kevin thinks about doing that.
I can't see any recommendation against the larger list, so Kevin, can you please comment. A few issues we have (bug 1183512, bug 1198052) would not have happened with the larger dictionary.

Ehsan: You are right, I am in the process of familiarizing myself with the current process, but sadly you missed the point of my comment. I'll try again.

For single word additions/removals/corrections we currently edit
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/en-US.dic
(you call this: as simple as it gets).

For upstream merges, the script in question
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict
takes the existing Mozilla dictionary as the master and makes adjustments.

This processing is flawed. Mozilla's intellectual property and asset is not the merged Mozilla dictionary, but the additions it makes to the SCOWL data. I call this "Mozilla knows better" files. These additions make the dictionary Mozilla special and these additions should be finely administered.

My suggestion is to turn the process around and treat en-US.dic as a generated product. That will make single word additions/removals/corrections more complicated (change "knows better" file, regenerate), but will make things better in the long run.

It is important to know where a word comes from to take appropriate action if a complaint comes in. I suggest to distinguish between SCOWL/upstream words and "Mozilla knows better" words which are not in the upstream data (repeating):
===
If a bug comes in for a word to be added, and we agree to add it, we add to the/a "Mozilla knows better" file and if it's of general interest, request it upstream, like "autocomplete" or "datasheet".
If a bug comes in for a word to be removed, we only ever remove from the "knows better" file, otherwise it's a "wontfix" and we refer it to the maintainer.
*After every change we regenerate the Mozilla dictionary.*
===

The "Mozilla knows better words" fall into three classes:
1) Proper names, with an awful lot of duds, like "Brnaba", "Brnaby", "Brittne", "Brittaney"
2) Mozilla specific terms, which are cool to have
3) General terms, mostly internet related, like "autocomplete" or "datasheet".
   Again, an awful lot of duds in there: "<verb>'s" like: "relabel's", "remind's"

The upstream import would also be changed. I suggest merging the SCOWL data with the separately maintained "knows better" files. I repeat: Mozilla is not in the business of maintaining a dictionary. Mozilla is in the business of adding finely tuned additions to data which comes from elsewhere, with the aim of minimising individual adjustments.

Overall, the Mozilla maintained dictionary is of poor quality, because words are missing but also because a whole lot of duds are added and these duds are maintained until kingdom come or until someone complains (bug 301712).

According to my comment #2 I think the best way forward is the following:
1) Use a larger word set to start with, that is, en_US-large.
2) Be very careful about the Mozilla specific additions we make and also review the
   "knows better" words from class 1) and class 3) carefully.
3) Reverse the current process, make the SCOWL data and the "knows better" files the master
   and generate en-US.dic.

If we keep the current process, we will promote the mess forever and keep delivering a poor quality dictionary. With en-US.dic as the master file, we simply can't tell where a word came from and have no chance to clean up the mess.

Personally, I don't care, since I'm using a well-maintained en-GB dictionary, but for the reputation of Mozilla I think it's essential that we deliver a decent dictionary. I can feel some resistance from your comment above, but I think I've shown in the spellcheck dictionary selection work in Gecko 43/44 that a lot can be improved if you allow a fresh approach.

Slightly off topic: Recently I went to a Mozilla meeting and heard a talk about telemetry. One of the astounding results mentioned there is that 99.9% or more of Indian users use the US version of Firefox. An Indian attendee of the meeting explained that due to the many dialects in India, English is the "lingua franca". We owe these people a decent dictionary and not one that lets them type "this remind's me of you".

===

The proposed way forward is
- to start with the en_US-large dictionary,
- create what is currently called
  https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-added
- split this into three files:
  moz-proper-names
  moz-terms (that's the mozilla-specific.txt/5-mozilla-specific)
  moz-general
- clean up moz-proper-names and moz-general thoroughly
- create a new process that merges SCOWL and those three files.

A new single word change would affect one of the moz-* files and regenerate; a new upstream import would replace the SCOWL data and regenerate (and perhaps strip words from the moz-general file which are no longer required).
Flags: needinfo?(kevina)

Comment 5

2 years ago
The issue here is that make-new-dict doesn't handle the case where a word is removed from SCOWL properly, I think.  We need to drop those words by default and add them back manually if desired.
Summary: Refresh en-US dictionary from SCOWL's en_US-large and streamline update from upstream → make-new-dict keeps words removed from SCOWL upstream

Comment 6

2 years ago
(Kevin, please see https://groups.google.com/forum/#!topic/mozilla.dev.platform/3q7AATMNrsQ for the context around this bug.  Thanks!)

Comment 7

2 years ago
(In reply to :Ehsan Akhgari from comment #5)
> The issue here is that make-new-dict doesn't handle the case where a word is
> removed from SCOWL properly, I think.  We need to drop those words by
> default and add them back manually if desired.

What makes you say that?  I just tested my script and if a word is removed upstream it will also be removed in the Mozilla dictionary.

Comment 8

2 years ago
(In reply to Jorg K (GMT+1) from comment #4)

> This processing is flawed. Mozilla's intellectual property and asset is not
> the merged Mozilla dictionary, but the additions it makes to the SCOWL data.
> I call this "Mozilla knows better" files. These additions make the
> dictionary Mozilla special and these additions should be finely administered.
>
> My suggestion is to turn the process around and treat en-US.dic as a
> generated product. That will make single word additions/removals/corrections
> more complicated (change "knows better" file, regenerate), but will makes
> things better in the long run.

The reason for this is technical.  Merging two munched dictionaries (i.e. dictionaries that use "artifact/SM" instead of listing the expanded words "artifact", "artifacts" and "artifact's") is problematic when both dictionaries contain the same base word.  For example, if one dictionary contains "artifact/S" and the other contains "artifact/M", simply merging the affix list to "/SM" does not work when both prefixes and suffixes are used.  Thus it is easier to expand the munched dictionaries to simple word lists and then re-munch the lists.  The re-munching part requires a special tool (aspell) to be installed.
 
> It is important to know where a word comes from to take appropriate action
> if a complaint comes in. I suggest to distinguish between SCOWL/upstream
> words and "Mozilla knows better" words which are not in the upstream data

Please note that I personally find the phrase "Mozilla knows better" insulting.

Comment 9

2 years ago
(In reply to Jorg K (GMT+1) from comment #4)
> (In reply to :Ehsan Akhgari from comment #3)
> > About switching to the en_US-large wordlist, I think that SCOWL recommends
> > against that <http://wordlist.aspell.net/scowl-readme/>, so I would like to
> > know what Kevin thinks about doing that.
> I can't see any recommendation against the larger list, so Kevin, can you
> please comment. A few issues we have (bug 1183512, bug 1198052) would not
> have happened with the larger dictionary.

Sorry, the SCOWL README needs to be updated.  I do not recommend using the large dictionary as a base for two reasons: 1) It contains less common words that might mask misspellings of more common words (for example "calender"/"calendar" and "ort"/"or") and 2) It has not been as carefully checked for errors.
(Assignee)

Comment 10

2 years ago
First my apologies to Kevin: "Mozilla knows better" was not meant to be insulting. It was simply meant to express "Mozilla has more" or "Mozilla considers this word important". There was the discussion about the word "naïve", bug 1144254. This word has its merits, and if Mozilla were to add it, well, it would just say: We know that SCOWL doesn't have the word, but we consider it important.

Let me put on the broken record again:
My approach to fix the Mozilla en-US dictionary has two strategies:
1) Use a sufficiently large base data set, I suggested the "large" dictionary
   so that we don't have to deal with bugs like bug 1183512 or bug 1198052.
2) Maintain Mozilla additions in three files: proper names, Mozilla terms, "general"
   additions that should be requested upstream.
The idea is that Mozilla should basically accept SCOWL data and just add a tightly controlled set of words.

This addresses all the problems I've encountered. If you can fix these problems faster than with my method, please do.

Problem 1:
At the beginning of May 2015 words got removed, see:
https://hg.mozilla.org/mozilla-central/diff/bcb133a3cdca/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/orig/en_US.dic
(don't open in Firefox, it will hang, bug 1235321):
-enrobed
-extremal
-relict
-residuary
-enforceability
(all still included in the "large" dataset).
According to Ehsan's strategy of maintaining "good" words, these should not have been removed.
Note: I've just picked some removals at random, I don't have an exact list.

Problem 2:
The current merge process is faulty:
Bad data like
derail's
derange's
deride's
desalt's
descale's
describe's
deserve's
deskill's
despoil's
detest's
dethrone's
detract's
devalue's
devote's
seems to be carried forward. (This is obviously *not* in SCOWL.) Mozilla allows you to write to your girlfriend:
"This flower remind's me of you", which might be the end of the relationship ;-)

I can't understand how the current system could manage SCOWL removals yet not remove words Mozilla specifically added. How does it know that a word came from SCOWL and can be removed or it didn't come from SCOWL and should be maintained? Broken record: Maintaining Mozilla words differently (three lists) would fix this. 

Problem 3:
The SCOWL en-US "normal" dataset is too small, we're getting bugs like bug 1183512 or bug 1198052. Imagine that carefully: We are getting a bug with all its overhead for one single word. IMHO, we're crazy dealing with it this way.
Broken record: Mozilla shouldn't have to deal with this, they should use a sufficiently large dataset and only add very specific words.

Problem 3 revisited:
The SCOWL en-US "normal" dataset is too small. Personally I believe that words taken from the "large" dataset
Bogotá/M
cliché/MS
née
should be in the Mozilla dictionary. Ehsan's comment on the mailing list was:
Wonderful! If you have a list of words using these types of characters (*) that we need to add, please file a bug, and let's do that! 
(*) referring to accents.
My answer to that was: No. We should start with a sufficiently large dataset. We should live with whatever SCOWL provides. However, if we're not happy, we could add those words to our third list which I'm proposing, the list of "general" useful words and then request the content of the list upstream. It makes no sense to request all Mozilla additions upstream, we have 12000 proper names and 37 Mozilla specific terms which we should *not* request upstream. We also have 1000 extra "general" words, including the errors mentioned above, which we should carefully clean up and then submit to SCOWL.

The Mozilla additions:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-added
are currently generated as part of the upstream import process. My suggestion is to change the process and merge SCOWL data with pure Mozilla additions.

Problem 4:
The current Mozilla process of not classifying additions and not feeding them back to SCOWL hinders the improvement of SCOWL data. A nice example is "children's" which is a Mozilla-added word, SCOWL only carries "children" with no affix at all.
Summary: make-new-dict keeps words removed from SCOWL upstream → make-new-dict (sometimes) keeps words removed from SCOWL upstream, and many more problems, see comment #10.
(Assignee)

Comment 11

2 years ago
Further to problem 1 and 3:
https://hg.mozilla.org/mozilla-central/diff/bcb133a3cdca/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/orig/en_US.dic
(don't open in Firefox, it will hang, bug 1235321)
also did:
-feasible/IU
-feasibly/U
+feasible/U
+feasibly
causing bug 1198052.
So IMHO following SCOWL's smaller dataset, especially without reviewing the removals, is not a good idea.

(not sure how u40726@disabled.tld got NI'ed here.)
Flags: needinfo?(u40726)
(Assignee)

Comment 12

2 years ago
Finally the penny dropped. The make-new-dict script compares the new upstream data with the previous upstream data (in "orig") and therefore knows the movements of the SCOWL data:
#comm -23 1-base.txt 3-upstream.txt > 3-upstream-rem
#comm -13 1-base.txt 3-upstream.txt > 3-upstream-add

It can therefore distinguish between data from SCOWL which was there before and additional Mozilla data.

It can therefore do corrections and deletions which might be unwanted (Problem 1).
Faulty additional Mozilla data won't be touched (Problem 2).
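The two `comm` calls amount to set differences between the previous and the new upstream list. The sketch below restates them in Python and then applies the movements to a Mozilla word list. The `apply_movements` step is an assumption about what the script effectively does, inferred from the behaviour described in this comment rather than read from make-new-dict itself, and the sample words are illustrative.

```python
def upstream_movements(previous_upstream, new_upstream):
    """Return (removed, added) between two upstream word lists."""
    previous, new = set(previous_upstream), set(new_upstream)
    removed = sorted(previous - new)  # like: comm -23 1-base.txt 3-upstream.txt
    added = sorted(new - previous)    # like: comm -13 1-base.txt 3-upstream.txt
    return removed, added


def apply_movements(mozilla_words, removed, added):
    """Assumed effect on the Mozilla list: drop upstream removals, add upstream additions."""
    return sorted((set(mozilla_words) - set(removed)) | set(added))


previous = ["enrobed", "feasible", "relict"]
new = ["feasible", "feasibly"]
mozilla = ["enrobed", "feasible", "relict", "remind's"]

removed, added = upstream_movements(previous, new)
print(removed, added)
print(apply_movements(mozilla, removed, added))
```

In this toy run "enrobed" and "relict" disappear without review (Problem 1), while the Mozilla-only "remind's" is never touched because it was in neither upstream list (Problem 2).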

Taking the liberty to change the summary again: What I really wanted to put is:

1) SCOWL import removes useful words from Mozilla without review (could be fixed by using large dataset),
2) "normal" SCOWL data not rich enough causing complaints (could be fixed by using large dataset),
3) additional Mozilla words need to be cleaned up and categorised for easier feedback to the SCOWL system.

For the categorising of additional Mozilla words which are reported in 5-mozilla-added I suggested to maintain three separate files: Proper names, Mozilla terms (37 at the moment), "general" additions. Maybe the three files are not necessary since they can be derived:

Proper names = All uppercase additional words minus the Mozilla terms in mozilla-specific.txt.
Mozilla terms = 37 words in mozilla-specific.txt
"general" additions = all lowercase words.

Especially the last item needs heavy cleanup.
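The derivation above can be sketched as a simple classifier over the words in 5-mozilla-added. Two assumptions are made here: "uppercase" is taken to mean "starts with an uppercase letter", and a word listed in mozilla-specific.txt is always treated as a Mozilla term regardless of case. The function name and sample words are illustrative.

```python
def categorise(added_words, mozilla_terms):
    """Split Mozilla-added words into proper names, Mozilla terms and general additions."""
    terms = set(mozilla_terms)
    categories = {"proper_names": [], "mozilla_terms": [], "general": []}
    for word in sorted(set(added_words)):
        if word in terms:
            categories["mozilla_terms"].append(word)
        elif word[:1].isupper():
            categories["proper_names"].append(word)
        else:
            categories["general"].append(word)
    return categories


added_words = ["Aurthur", "XUL", "Firefox", "autocomplete", "relabel's"]
mozilla_terms = ["XUL", "Firefox"]
print(categorise(added_words, mozilla_terms))
```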

So this bug still has two parts:
1) Ensure a rich word base. My reply and the reply of the maintainer of the/a GB dictionary is:
   Use more words, in this case the en_US-large dictionary.
2) Fix the horrible mess of added Mozilla words and feed useful ones back to SCOWL,
   including "children's" ;-)
Summary: make-new-dict (sometimes) keeps words removed from SCOWL upstream, and many more problems, see comment #10. → 1) SCOWL import removes useful words from Mozilla, 2) "normal" SCOWL data not rich enough, 3) additional Mozilla words need to be cleaned up. See comment #10 and comment #12.
(Assignee)

Comment 13

2 years ago
Created attachment 8702985 [details]
331 wrong dictionary entries of the form "<verb>'s".

I'm making a start here cleaning up Mozilla-added words.

I took the words from 5-mozilla-added and extracted the lowercase ones with apostrophe "s" at the end.
Legal ones are:
"<noun>'s" like "popup's", as in "the popup's colour is green".

Illegal ones are:
"<verb>'s" like "this remind's me of you", oh horror!

In the legal list we also find "children's" which I have fed back to SCOWL since they don't have it.
https://github.com/kevina/wordlist/issues/135

Enclosed a list of 331 words that must be removed since they allow fatal misspellings.

We have a total of 1018 more or less useful "extra" lowercase words. Removing the 331 errors, we will be left with 687 which we may review further and submit to SCOWL, as I have already done for "children's".
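The extraction step can be sketched as a filter for lowercase words ending in "'s". Whether the base word is a noun or a verb cannot be decided mechanically, so the sketch assumes a hand-reviewed set of verb bases (`reviewed_verbs`, hypothetical) to split the candidates into legal and illegal forms.

```python
def possessive_candidates(added_words):
    """Return the lowercase words ending in 's, sorted."""
    return sorted(w for w in set(added_words) if w.endswith("'s") and w[:1].islower())


added_words = ["popup's", "remind's", "relabel's", "children's", "Brittaney", "webdesign"]
reviewed_verbs = {"remind", "relabel"}  # assumption: result of a manual review

candidates = possessive_candidates(added_words)
illegal = [w for w in candidates if w[:-2] in reviewed_verbs]
legal = [w for w in candidates if w[:-2] not in reviewed_verbs]

print(illegal)
print(legal)
```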

===

While digging through the extra words I found very common words like
acknowledgement
anecdotally
assignee
capita
(just to pick a few)
which are not in the SCOWL "normal" dataset, which I find surprising. They are in the "large" dataset, so (broken record), I'm still in favour of using the large dataset to reduce the amount of words Mozilla has to add.

My approach would be:
1) remove the 331 errors
2) update with the en_US-large dataset (broken record: fixes other issues, too)
3) review additions again.
(Assignee)

Comment 14

2 years ago
Oh, another issue, Problem 5: I believe Mozilla shouldn't censor any words. Currently Mozilla, most likely by accident, removes two words (5-mozilla-removed): "ABCs" and "megadeaths".

We should fix the dictionary and the SCOWL merge process to ensure that no words are removed.
(Assignee)

Comment 15

2 years ago
Example: Word "remind's" in Mozilla only:

SCOWL:
mind's
mind/ADRSZG

Mozilla:
mind/AMDRSZG

So Mozilla permits "mind" with suffix "'s" (that's the M) and prefix "re" (that's the A).

Probably the best way to fix this is to expand the Mozilla dictionary, remove the wrong words and then munch it again.
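A toy expansion shows how the extra word arises. The sketch models only the three flags discussed here (A as prefix "re", M as suffix "'s", S as a plain "s" suffix), ignores all other flags, and assumes that prefixes and suffixes combine freely. The real affix file has more rules and conditions, so this is an illustration, not Hunspell's algorithm.

```python
# Toy affix rules: heavily simplified, other flags (D, R, Z, G) are ignored.
PREFIXES = {"A": "re"}
SUFFIXES = {"M": "'s", "S": "s"}


def expand(entry):
    """Expand 'word/FLAGS' into the set of words the toy rules generate."""
    word, _, flags = entry.partition("/")
    prefixes = [""] + [PREFIXES[f] for f in flags if f in PREFIXES]
    suffixes = [""] + [SUFFIXES[f] for f in flags if f in SUFFIXES]
    # Every prefix is combined with every suffix (cross product).
    return {p + word + s for p in prefixes for s in suffixes}


scowl = expand("mind's") | expand("mind/ADRSZG")
mozilla = expand("mind/AMDRSZG")

print(sorted(mozilla - scowl))

# Cleanup on the expanded list: drop the wrong words before re-munching.
cleaned = mozilla - {"remind's"}
```

Keeping "mind's" as a separate entry, as SCOWL does, avoids the cross product of the "re" prefix with the "'s" suffix.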

Comment 16

2 years ago
Jorg K: Thank you for working on this.

As the maintainer of SCOWL I fully recognize that the normal dictionary may be a bit too small.  SCOWL is a combination of lists from many other sources.  The normal dictionary is small because it
only includes sources that I am fairly confident 1) do not contain errors or 2) do not contain words that might cause problems (such as "ort").  The larger dictionary loosens these restrictions a bit.
Thus, I do not recommend using it as a base as it will likely introduce its own set of problems by including invalid or problematic words.

At some point in the future I am likely to expand the normal dictionary to include a number of words in the larger size once I find a way to avoid the previously mentioned problems.  Using stats from
Google books could help with this.

The issue of including variants and words with accents can be addressed separately.  To see what is possible I suggest you use the tool http://app.aspell.net/create to create a custom dictionary.
Note, that until Mozilla switches the dictionary to UTF-8, including non-ASCII characters (i.e., accents) could create some additional complications (see bug 1162823).

Please do submit words not found in the normal dictionary that you think should be included upstream.  It is okay if you simply list them all in one issue request.  Words that are already in the larger size
and are not problematic are very likely to be included in the smaller size.

Thanks,
Kevin

BTW: A number of "useful" words were likely removed in bug 1137544 due to this change in SCOWL (From Revision 6 to 7 (December 27, 2010)):

  Moved frequently class 0 from Brian Kelk's Wordlist from 
  level 60 to 70, and also filter it with level 80 due to, too many
  misspellings.

60 is the normal size and 70 is the large.  The removal of otherwise useful words is unfortunate, however I think it is more important not to include invalid words.
(Assignee)

Comment 17

2 years ago
(In reply to Kevin Atkinson from comment #16)
> Please do submit words not found in the normal dictionary that you think
> should be included upstream.  It is okay if you simply list them all in one
> issue request.  Words that are already in the larger size
> and are not problematic are very likely to be included in the smaller size.

OK, well, these words would have three sources:

First, I'd really have to check what got removed in April/May 2015 in bug 1137544. I already pointed out a few words I spotted just by browsing the changeset. A more thorough analysis is required to see what was removed and which part of that should be re-added. (See comment #10, problem 1).

The second source of words for inclusion in the "smaller size" are the ones that Mozilla have added and which can be found in the "larger size". (See comment #13, bottom).

A third source of words would be my own experience: I would move a bunch of frequently used words with accents to the "smaller size", like: Bogotá/M, cliché/MS, née but also résumé/MS and naïve(+affixes) which you don't have at all. (See comment #10, problem 3 revisited).

I've just remembered bug 1137544 comment #34 about "advisor". We have bug 1183512 to add it, but you already rejected this for the small size, since in there you want to have unique spellings. So even if I help you with the small size, I can't win. Same goes for "résumé" and "naïve".

Altogether I was hoping not to do any of this since I'm really a programmer working on Thunderbird bugs. I just accidentally found a few bugs about missing words and started to look at the issue.

I started this bug and the discussion on dev-platform with the aim of reducing Mozilla involvement in maintaining the dictionary. Now you're confirming that but at the same time you're asking to contribute to the upstream project. Well, that won't get my Mozilla work done.

Warning: The following may offend, but is not intended to ;-)

I still think using the large dictionary gives the best bang for the buck for Mozilla. It has all the words I spotted that were missing, and Mozilla IMHO will live with a few misspellings. Hey, I found 331 absolutely horrible grammar errors - "remind's", so how bad can your large dataset be compared to that?

I also don't follow your other reasoning against the large dataset. Spellcheck is not meant to censor but to help *where possible*. If the user types "calender" instead of "calendar" and it doesn't get flagged, then that's totally OK. "Calender" is a valid word and a spellcheck is not a grammar or semantic check. It should also not take the position of a teacher that addresses an audience with an assumed low level of vocabulary. BTW, there are more horrible mistakes than the "calender"/"calendar" issue, namely "it's" and "its", "than" and "then" and "no", "now" and "not" - which I frequently get wrong distorting the meaning completely. And spellcheck doesn't help me there. Oops, I've just corrected "work" to "word".

About contributing to the small dictionary:

Doesn't your website go under the banner SCOWL (and Friends)? You're not a one man project, surely(?). Surely you have people to catch the "useful" words which got removed from the small set with the Revision 6 to 7 (2010) change. Surely you have other "customers" complaining that "residuary enforceability" is now no longer correct.

Sorry, I am German (and Australian), and I have already offended you with the "knows better" term. I really don't mean to offend but I hoped Mozilla could get a suitable ready-made dictionary from elsewhere to which just a few (mainly computer and internet related) terms needed to get added. (I don't want to mention the 6000 proper names Mozilla add which is another headache (6000 names plus the /M comes to 12000 words as I stated before).)

-- End of possibly offensive part.

I'm still thinking of how we can get to a deal here. I'd have to look at it. I'd have to see how hard it is to figure out what was removed in April/May 2015 if you don't have any other friend who can analyse the consequences of the Revision 6 to 7 (2010) change. BTW, where can I read about your "level system" 60, 70, 80? I know nothing about it.

Another remark: Like Ehsan, my mother tongue is not English (although I have dual citizenship), however, native speakers have praised my vocabulary. I go by the rule: If I know the word, it should be in the (Mozilla) dictionary. What I'm trying to say is that amongst your native speaker "friends" you should have more qualified people than me for the job of finding missing "useful" words.

Comment 18

2 years ago
(In reply to Jorg K (GMT+1) from comment #17)

> OK, well, these words would have three sources:

Okay.  Once you (or someone else, as I understand if you don't want to do the work) figure out new words to be added I will be happy to review them.

> I've just remembered bug 1137544 comment #34 about "advisor". We have bug
> 1183512 to add it, but you already rejected this for the small size, since
> in there you want to have unique spellings. So even if I help you with the
> small size, I can't win. Same goes for "résumé" and "naïve".

As I tried to state before if Mozilla does not agree with this choice this can be fixed.  It is possible to generate a "normal" sized dictionary with common variants (advisor) and accented words (résumé).

> Warning: The following may offend, but is not intended to ;-)

It is not.  We have a difference in opinion.  I have given you my recommendation, but understand others may not agree, that is why I provide the larger size to begin with. :)

> I also don't follow your other reasoning against the large dataset.
> Spellcheck is not meant to censor but to help *where possible*. If the user
> types "calender" instead of "calendar" and it doesn't get flagged, then
> that's totally OK. "Calender" is a valid word and a spellcheck is not a
> grammar or semantic check. It should also not take the position of a teacher
> that addresses an audience with an assumed low level of vocabulary. BTW,
> there are more horrible mistakes than the "calender"/"calendar" issue,
> namely "it's" and "its", "than" and "then" and "no", "now" and "not" - which
> I frequently get wrong distorting the meaning completely. And spellcheck
> doesn't help me there. Oops, I've just corrected "work" to "word".

I do not agree with this opinion, but will not force my view on others. Really at this point it is up to someone like Ehsan to decide what to do.

The only point I would like to add is that it is very easy to add missing words to your personal dictionary.  It is not so easy to remove words.

> Doesn't your website go under the banner SCOWL (and Friends)? You're not a
> one man project, surely(?). Surely you have people to catch the "useful"
> words which got removed from the small set with the Revision 6 to 7 (2010)
> change. Surely you have other "customers" complaining that "residuary
> enforceability" is now no longer correct.

I mostly combine word lists from others and do not have a lot of SCOWL specific help.  I get some help from Alan Beale in determining if new words suggested by others should be added.  The "and Friends" does not mean other people but rather the other related projects such as VarCon and 12 Dicts.

> I'm still thinking of how we can get to a deal here. I'd have to look at it.
> I'd have to see how hard it is to figure out what was removed in April/May
> 2015 if you don't have any other friend who can analyse the consequences of
> the Revision 6 to 7 (2010) change.

The easiest thing to do is expand and sort the old and new and then compare the two word lists.  I can do this fairly easily, but it will likely be sometime next week before I get to it.
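
The comparison described above can be sketched roughly as follows. This is only an illustration, not the script actually used for SCOWL: the function name and the two file arguments are hypothetical, and each file is assumed to hold one expanded word per line.

```shell
#!/bin/sh
# Sketch only: compare_lists and its arguments are hypothetical names.
# compare_lists OLD NEW: report how many words were removed from and added
# to a word list. OLD and NEW hold one expanded word per line.
compare_lists() {
  old_sorted=$(mktemp) && new_sorted=$(mktemp) || return 1
  # comm needs both inputs sorted in the same collation order, so pin the locale.
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  # comm -23 keeps lines only in OLD, comm -13 keeps lines only in NEW.
  # The $(( ... )) wrapper strips the padding some wc implementations emit.
  removed=$(( $(LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | wc -l) ))
  added=$(( $(LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | wc -l) ))
  rm -f "$old_sorted" "$new_sorted"
  echo "removed: $removed added: $added"
}

# Example: compare_lists old-words.txt new-words.txt
```

Dropping the "wc -l" from the first pipeline gives the actual list of removed words instead of the count.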

> BTW, where can I read about your "level system" 60, 70, 80? I know
> nothing about it.

See the SCOWL README.  http://wordlist.aspell.net/scowl-readme/.  Note some parts of it (like the comment about "ort") are out of date and need to be fixed but it will give you a good idea.

At this point I will leave it up to Ehsan on what to do.  After all I maintain SCOWL and do not want to dictate what Mozilla does.
Flags: needinfo?(ehsan)
(Assignee)

Comment 19

2 years ago
OK, looks like we have a deal here ;-)

1) I'll help you add words to your small set. I'll see what got removed in April/May 2015.
2) I'll submit a cleaned up version of the Mozilla added words with recommendations as a Github issue in the next few days (see "acknowledgement").
(I'm happy to invest a few hours since I have already invested many hours in the discussion.)

Independently we/Ehsan decides what we want to use as "base" data: the small set (after improving it), the large set, or some custom set (which might complicate things), as you said:
  "normal" sized dictionary with common variants.

Broken record 1: I'd still go for the large set.
Broken record 2: My aim is to provide a rich dictionary and minimise people's complaints about missing words.

Also independently we clean up the horrible 331 "<verb>'s" errors. I wonder how they got added. Maybe the munch was faulty at some stage and was munching
mind's
remind
reminds
to
mind/AMS ??
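
To illustrate why such a munch would be harmful, here is a small sketch that expands a stem carrying the flags A, M and S. It assumes, as a simplification of the real en-US.aff, that A adds the prefix "re", M appends "'s" and S appends a plain "s", and that prefixes and suffixes may be combined. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch only: expand_ams is a hypothetical helper, and the flag meanings
# (A = prefix "re", M = suffix "'s", S = suffix "s") are a simplification
# of the real affix file.
expand_ams() {
  for prefix in "" re; do
    for suffix in "" "'s" s; do
      # Every prefix is combined with every suffix (cross product).
      printf '%s%s%s\n' "$prefix" "$1" "$suffix"
    done
  done
}

# Example: expand_ams mind
```

Under these assumptions "mind/AMS" expands to six words, including "remind's", which was never in the input.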
(In reply to Kevin Atkinson from comment #18)
> > I've just remembered bug 1137544 comment #34 about "advisor". We have bug
> > 1183512 to add it, but you already rejected this for the small size, since
> > in there you want to have unique spellings. So even if I help you with the
> > small size, I can't win. Same goes for "résumé" and "naïve".
> 
> As I tried to state before if Mozilla does not agree with this choice this
> can be fixed.  It is possible to generate a "normal" sized dictionary with
> common variants (advisor) and accented words (résumé).

Hmm, that's probably a good idea.  How should we do that?

> > I also don't follow your other reasoning against the large dataset.
> > Spellcheck is not meant to censor but to help *where possible*. If the user
> > types "calender" instead of "calendar" and it doesn't get flagged, then
> > that's totally OK. "Calender" is a valid word and a spellcheck is not a
> > grammar or semantic check. It should also not take the position of a teacher
> > that addresses an audience with an assumed low level of vocabulary. BTW,
> > there are more horrible mistakes than the "calender"/"calendar" issue,
> > namely "it's" and "its", "than" and "then" and "no", "now" and "not" - which
> > I frequently get wrong distorting the meaning completely. And spellcheck
> > doesn't help me there. Oops, I've just corrected "work" to "word".
> 
> I do not agree with this opinion, but will not force my view on others.
> Really at this point it is up to someone like Ehsan to decide what to do.
> 
> The only point I would like to add is that it is very easy to add missing words
> to your personal dictionary.  It is not so easy to remove words.

I second Kevin's opinion here.  Jorg's goal seems to be to minimize bug reports about missing words, but those bug reports are pretty easy to deal with.  In my opinion, the goal must be to provide good spelling suggestions, so that if you typed something like "ort", you'd get a spelling error.  (That is a crucial difference between a spell checking dictionary and a normal dictionary.)

As such, switching to the large en-US word set is WONTFIX (Jorg, I know that makes you unhappy.  We have to agree to disagree.)

But I would still like to include common variants from SCOWL.

(In reply to Jorg K (GMT+1) from comment #19)
> 1) I'll help you add words to your small set. I'll see what got removed in
> April/May 2015.
> 2) I'll submit a cleaned up version of the Mozilla added words with
> recommendations as a Github issue in the next few days (see
> "acknowledgement").

Both of these are wonderful contributions!

> Also independently we clean up the horrible 331 "<verb>'s" errors. I wonder
> how they got added. Maybe the munch was faulty at some stage and was munching
> mind's
> remind
> reminds
> to
> mind/AMS ??

I'm honestly more interested in removing them from our word list than to figure out what caused them to be added in the first place...
Flags: needinfo?(ehsan)
(Assignee)

Comment 21

2 years ago
(In reply to :Ehsan Akhgari from comment #20)
> Hmm, that's probably a good idea.  How should we do that?
Kevin would know, he suggested it.

> As such, switching to the large en-US word set is WONTFIX (Jorg, I know that
> makes you unhappy.  We have to agree to disagree.)
> But I would still like to include common variants from SCOWL.
I'm using the en-GB dictionary, so I don't care as long as we deliver something more useful than we currently have.

> (In reply to Jorg K (GMT+1) from comment #19)
> > 1) I'll help you add words to your small set. I'll see what got removed in
> > April/May 2015.
> > 2) I'll submit a cleaned up version of the Mozilla added words with
> > recommendations as a Github issue in the next few days (see
> > "acknowledgement").
> 
> Both of these are wonderful contributions!
Working on 2) right now. Sadly I don't have a working "comm" on Windows, but I'll cope.

> I'm honestly more interested in removing them from our word list than to
> figure out what caused them to be added in the first place...
They'll go, don't worry ;-)

You can start making up your mind about 6000 proper names in bug 301712 ;-)
(Assignee)

Comment 22

2 years ago
Kevin can you please give me a hand here.

I've taken the 1018 Mozilla-added words and subtracted the erroneous verb forms, leaving 687 words.
Before classifying them, I wanted to munch them to reduce three lines of "word", "word's" and "words" to "word/MS". I thought that would make my life easier and also help me spot errors I missed.

For example I deliberately maintained stuff like "retry's", "rework's", "unfaithful's", "unfriendly's", "unnatural's", "unworthy's", since it can be argued that these words can be used as nouns, as in:
There were two men, one friendly, one unfriendly; the unfriendly's name was John.
Most likely those words originate from the same munching error I mentioned at the bottom of comment #19 and should perhaps be removed. The en-GB I use while I type here marks all these as errors. I'll take your advice on this.

For munching, I downloaded and installed aspell from http://aspell.net/win32/ (Full installer). I now have binaries aspell.exe and also word-list-compress.exe, but don't know how to use them.
I read here: http://aspell.net/man-html/Working-With-Affix-Info-in-Word-Lists.html and tried
aspell.exe munch-list < mozilla-added-valid.txt but the program says: "Error: Unknown Action: munch-list".

Also, aspell.exe --help gives
Aspell 0.50.3 alpha.  Copyright 2000 by Kevin Atkinson.
The list of commands that follows doesn't contain "munch" or "munch-list".

Please help.
Flags: needinfo?(kevin.bugzilla)

Comment 23

2 years ago
The official Aspell binary for Windows is unfortunately very old because I have not been able to find someone to maintain it.  You need to use Aspell 0.60.  I would suggest you use Cygwin as it contains the latest version of Aspell, will give you comm and sort, and will also help avoid mixing Unix and Windows EOL.

I can also do some of the final steps for you if you let me know what needs to be done.  It may be next week before I get to it.
Flags: needinfo?(kevin.bugzilla)

Comment 24

2 years ago
(In reply to Jorg K (GMT+1) from comment #21)
> (In reply to :Ehsan Akhgari from comment #20)
> > Hmm, that's probably a good idea.  How should we do that?
> Kevin would know, he suggested it.

Ehsan, I can likely do this for you sometime next week.  I will add the common variants but not the accented words due to encoding problems.

If you want accented words then the best thing is to figure out why a UTF-8 dictionary causes problems with Mozilla (see bug 1162823).

Just let me know what you want.
Flags: needinfo?(ehsan)
(In reply to Kevin Atkinson from comment #24)
> (In reply to Jorg K (GMT+1) from comment #21)
> > (In reply to :Ehsan Akhgari from comment #20)
> > > Hmm, that's probably a good idea.  How should we do that?
> > Kevin would know, he suggested it.
> 
> Ehsan, I can likely do this for you sometime next week.  I will add the
> common variants but not the accented words due to encoding problems.

That sounds great, thanks!

> If you want accented words then the best thing is to figure out why a UTF-8
> dictionary causes problems with Mozilla (see bug 1162823).

Bug 1164263 is on file to figure out what causes that, but I haven't had the time to look at it (yet!).  The accented words can wait until we find a proper way to support them (or if they can be encoded in ISO-8859-1 in a way that doesn't cause similar problems.)
Flags: needinfo?(ehsan)
(Assignee)

Comment 26

2 years ago
(In reply to Kevin Atkinson from comment #24)
> Ehsan, I can likely do this for you sometime next week.  I will add the
> common variants but not the accented words due to encoding problems.
Wouldn't it make more sense to add back the "useful" words first which were removed in the Revision 6 to 7 (2010) transition and which subsequently got removed in bug 1137544, before adding the common variants?

I'm also still working on providing a list of Mozilla-added words for consideration, including common words such as http://app.aspell.net/lookup?dict=en_US&words=acknowledgement
http://app.aspell.net/lookup?dict=en_US&words=flyer
http://app.aspell.net/lookup?dict=en_US&words=grey
http://app.aspell.net/lookup?dict=en_US&words=hijab
http://app.aspell.net/lookup?dict=en_US&words=jewellery
just to name a few.
This won't be complete before the 2nd of Jan. 2016.

Comment 27

2 years ago
(In reply to Jorg K (GMT+1) from comment #26)

> Wouldn't it make more sense to add back the "useful" words first which were
> removed in the Revision 6 to 7 (2010) transition and which subsequently got
> removed in bug 1137544, before adding the common variants?

Adding common variants is just a matter of generating a custom dictionary from SCOWL.  See http://app.aspell.net/create.
(Assignee)

Comment 28

2 years ago
Great, but we do this after fixing the normal/small size with the words I repeatedly mentioned, right?

Comment 29

2 years ago
The order doesn't matter that much, but I am in no rush. :)
Yeah, these two things can happen in either order.
(Assignee)

Comment 31

2 years ago
Created attachment 8703252 [details]
Words Mozilla adds to the SCOWL data (unreviewed)

(In reply to Kevin Atkinson from comment #23)
> I would suggest you use Cygwin as it contains the latest version of
> Aspell, will give you comm and sort, and will also help avoid mixing Unix
> and Windows EOL.
It's been a while since I used Cygwin, I find single executables more useful in a DOS environment. However, I did install Cygwin and got a decent aspell, comm and sort. Nice to have.

For aspell, this is what I've done:

Downloaded scowl-2015.08.24.zip from 
http://sourceforge.net/projects/wordlist/files/SCOWL/2015.08.24/

extracted en.dat, en_affix.dat and en_phonet.dat from speller/aspell and stored into
C:\cygwin64\lib\aspell-0.60
(since previously I got an error complaining about not finding en_US.dat). 

I typed
aspell munch-list < mozilla-added-valid.txt
I get this error:
.cset" can not be opened for reading./iso-8859-1
(I don't want to comment on the quality of the error message)

Fortunately I found a hint on Google that I had to convert the line endings in the aforementioned .dat files to Unix.
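
For anyone hitting the same error, the conversion can be sketched like this. The function name is made up for this example, and it simply deletes every carriage return, which is presumably what garbled the error message above.

```shell
#!/bin/sh
# Sketch only: to_unix_eol is a hypothetical helper name.
# to_unix_eol FILE...: convert CRLF line endings to LF, in place.
to_unix_eol() {
  for f in "$@"; do
    tmp=$(mktemp) || return 1
    # tr deletes every CR byte; that is fine for plain-text data files.
    # Writing back with cat keeps the original file's permissions.
    tr -d '\r' < "$f" > "$tmp" && cat "$tmp" > "$f"
    rm -f "$tmp"
  done
}

# Example: to_unix_eol en.dat en_affix.dat en_phonet.dat
```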

So now I have a list of 384 Mozilla added words which I will review and then submit to you for consideration, in fact, I'll attach them here in case you want to look.

There are some of the form "<noun>'s", like "children's" (already submitted as an issue) and "deconstructionist's". "deconstructionist" is in SCOWL, but you're missing the /M:
http://app.aspell.net/lookup?dict=en_US&words=deconstructionist, well for some reason it's only in the larger size (95). Why would you do this?

> I can also do some of the final steps for you if you let me know what needs
> to be done.  It may be next week before I get to it.
Very kind, but I need to set myself up correctly. For example I need to expand the Mozilla dictionary, remove the errors and munch again.
Assignee: nobody → mozilla
Status: NEW → ASSIGNED
(Assignee)

Comment 32

2 years ago
Created attachment 8703253 [details]
340 wrong dictionary entries of the form "<verb>'s" or "<adjective>'s"

Added some more to the previous list.
Attachment #8702985 - Attachment is obsolete: true
(Assignee)

Comment 33

2 years ago
OK, these are also errors to be removed:
readdress's
inactive's
deice's (as in, to de-ice)
proactive's
rejigger's
That brings the count to 345 errors to be removed.
(Assignee)

Comment 34

2 years ago
Created attachment 8703259 [details]
336 wrong dictionary entries of the form "<verb>'s" or "<adjective>'s"

Updating. Removed the 5 mentioned and re-added 9 "non-errors": 340+5-9=336.
Attachment #8703253 - Attachment is obsolete: true
(Assignee)

Comment 35

2 years ago
Created attachment 8703260 [details]
387 Mozilla-added words (unreviewed)

I updated the list of words Mozilla add to the SCOWL data, 387 in total.

Of these, there are 40 of the form "<noun>'s" where "<noun>" is already in the SCOWL dataset. In other words, these 40 are SCOWL errors since there should not be a noun at one level and the possessive form at a higher level. Sure, that can happen when basing inclusion on occurrence in other sources, since the possessive form is less used.
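
A rough sketch of how such cases can be found, assuming two plain word lists with one word per line. The function and file names are hypothetical; this is not necessarily how the count of 40 was produced.

```shell
#!/bin/sh
# Sketch only: the function name and both file arguments are hypothetical.
# possessives_with_known_base ADDED BASE: print each "<word>'s" entry of
# ADDED whose "<word>" is listed in BASE.
possessives_with_known_base() {
  grep "'s\$" "$1" | while IFS= read -r entry; do
    # Strip the trailing two characters ('s) to get the base word.
    base=${entry%??}
    # -F fixed string, -x whole-line match, -q quiet, -e guards a leading dash.
    if grep -Fxq -e "$base" "$2"; then
      printf '%s\n' "$entry"
    fi
  done
}

# Example: possessives_with_known_base mozilla-added.txt scowl-words.txt
```

This runs one grep per candidate, which is slow in general but perfectly adequate for a list of a few hundred words.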
Attachment #8703252 - Attachment is obsolete: true
(Assignee)

Comment 36

2 years ago
I submitted this: https://github.com/kevina/wordlist/issues/136

I only reported 34 instead of 40 cases, since I decided that these six are illegal words which fall into the "<verb>'s" or "<adjective>'s" category:
concurrent's
decommission's
recommission's
recondition's
repartition's
reposition's

That brings the number of words to be removed from the Mozilla dictionary to 336+6 = 342.
(Assignee)

Comment 37

2 years ago
Created attachment 8703294 [details]
342 wrong dictionary entries of the form "<verb>'s" or "<adjective>'s"
Attachment #8703259 - Attachment is obsolete: true
(Assignee)

Comment 38

2 years ago
Created attachment 8703300 [details]
347 Mozilla-added words (reviewed), 10 to be removed.

Taking the 1018 lowercase Mozilla-added words and subtracting the 342 wrong ones, we get 676 words. Munching these, we're left with 381, subtracting the 34 possessive forms that should be handled as per the previous comment #36, I get to 347 extra words (attached).

Of these the following 10 seem to be wrong:

antiviruses (no such plural, should be antivirus programs)
carnitas (Mexican dish)
conmans (illegal plural of conman)
declaimable (no such word)
megadeathes (should be megadeaths)
mySimon (shopping website)
proactives  (illegal plural)
recurrents (no such noun, no such plural)
unclassifieds (illegal plural)
uncoloreds (illegal plural)

I have requested the remaining 337 words for promotion to level 60 or inclusion:
https://github.com/kevina/wordlist/issues/137
Attachment #8703260 - Attachment is obsolete: true
(Assignee)

Comment 39

2 years ago
Created attachment 8703323 [details]
List of 5670 words that got removed in bug 1137544

This is a list of all the 5670 words that were included in Mozilla 31 but are no longer included in Mozilla 45. The removal happened in bug 1137544.

Many of these are "useful" commonly used words. Sadly they will now be flagged as misspelled to Mozilla users.

I have requested an analysis and promotion of these words back to level 60 (default/"normal"), the SCOWL level which is used by Mozilla.

See https://github.com/kevina/wordlist/issues/138
(Assignee)

Comment 40

2 years ago
Created attachment 8703349 [details] [diff] [review]
354 dictionary corrections

This patch removes the 342 erroneous words listed in attachment 8703294 [details].
It also removes the erroneous 10 words from comment #38.
It also removes "indoor's" as suggested by Kevin in
https://github.com/kevina/wordlist/issues/136#issuecomment-168349946
and adds "megadeaths" (replacing erroneous "megadeathes", see comment #38) which was a specific erroneous Mozilla removal.

This concludes the removal of erroneous words.

The patch was created by
1) expanding the Mozilla dictionary like this:
===
#!/bin/sh
expand() {
  grep -v '^[0-9]\+$' | ./munch-list expand $1 | sort -u
}
expand en-US.aff < en-US.dic > mozilla.txt
===
2) removing the erroneous words with the "comm" command.
3) munching it back like this:
===
#!/bin/sh
expand() {
  grep -v '^[0-9]\+$' | ./munch-list expand $1 | sort -u
}
expand en.aff < en.dic.supp > 0-special
cat mozilla-corrected.txt | comm -23 - 0-special | ./make-hunspell-dict -one en_US-mozilla /dev/null
===
The steps are all copied from the existing script "make-new-dict".
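
Step 2 is not spelled out above; it can be sketched along these lines. The function name and "errors.txt" are hypothetical, while mozilla.txt and mozilla-corrected.txt are the files used in steps 1 and 3.

```shell
#!/bin/sh
# Sketch only: remove_words and errors.txt are hypothetical names.
# remove_words WORDS ERRORS: print WORDS without any entry listed in ERRORS.
remove_words() {
  errors_sorted=$(mktemp) || return 1
  # Both inputs must be sorted in the same collation order for comm.
  LC_ALL=C sort -u "$2" > "$errors_sorted"
  # "-" makes comm read the sorted word list from standard input;
  # -23 keeps only the lines that are not in the error list.
  LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$errors_sorted"
  rm -f "$errors_sorted"
}

# Example: remove_words mozilla.txt errors.txt > mozilla-corrected.txt
```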

Note to the reviewer:
To review this, I suggest expanding the existing dictionary, applying the patch locally and expanding the new dictionary. Then compare the two word lists: There are 342+10+1 removals and 1 addition.

Further work:
=============

Questionable proper names are covered by bug 301712.

Recovering 5670 words lost in bug 1137544 (attachment 8703323 [details]) should be fixed by
https://github.com/kevina/wordlist/issues/138
They should come back to Mozilla when the SCOWL classification is done and the dictionary is refreshed from SCOWL.

Adding Mozilla-added words (possessive forms and 347-10=337 extra words, attachment 8703300 [details]) into SCOWL is covered by
https://github.com/kevina/wordlist/issues/136
https://github.com/kevina/wordlist/issues/137
They might be added to SCOWL and will then not appear as Mozilla-added words in "5-mozilla-added" when next refreshing from upstream.

We should refresh the dictionary from SCOWL as soon as a new version is available. This refresh should also include common variants as discussed in comment #18.
Attachment #8703349 - Flags: review?(ehsan)
(Assignee)

Comment 41

2 years ago
I've just compared my corrected version of the dictionary (with the 342+1 erroneous possessive forms removed) with the AMO version of the dictionary from 2007 (before it was refreshed today).

Result: My corrected dictionary still contains invalid possessive forms like
above's
her's
him's
give's
get's
These come from SCOWL.

I've raised this ticket: https://github.com/kevina/wordlist/issues/141

This observation might explain the origin of the 342+1 wrong possessive forms removed in this bug. They were most likely introduced into the SCOWL data after 2007 and made their way into the Mozilla dictionary. They were then carried forward. The new process to update the Mozilla dictionary from the SCOWL data was only implemented in April 2015 in bug 1137544. I don't know how the refresh was done before and whether the previous refresh would have removed words that were removed upstream.

In any case, landing my patch will remove 342+1 invalid possessive forms from the Mozilla dictionary which don't appear in SCOWL. After SCOWL repair their data and we refresh, more invalid forms will automatically be removed from the Mozilla dictionary.
(Assignee)

Comment 42

2 years ago
I've just found out that the 6000 doubtful proper names reported in bug 301712 (most likely) come from SCOWL. They appear to be in the SCOWL data in 2007.

Let me repeat what I said in bug 301712:

Personally, I think we should make a fresh start here:
- Scrap the Mozilla dictionary
- Generate a new custom dictionary from SCOWL, "size 60"/"normal",
  add accented words, hacker terms and Roman numerals, see:
  http://app.aspell.net/create
- Add the 37 Mozilla terms and call it a day.

This would solve the following problems in one hit:
- It contains the 354 corrections proposed in my patch
  (which are errors in Mozilla not contained in SCOWL).
- It removes the 6000 doubtful names, fixing bug 301712.
- It gives us common variants, (hopefully) fixing bug 1183512 and bug 1198052.

The only downside:
We'd lose 337 extra words and possessive forms which we have, see attachment 8703300 [details], but which have been requested at SCOWL:
https://github.com/kevina/wordlist/issues/136
https://github.com/kevina/wordlist/issues/137

Or, to maintain the current status, we could add those words together with the 37 Mozilla terms, so they don't get lost.

Ehsan, how do you want to proceed?
Flags: needinfo?(ehsan)
(Assignee)

Comment 43

2 years ago
Correction: Kevin tells us in bug 301712 comment #20 that there is no SCOWL data from 2007. So wherever the 6000 doubtful names came from, it's time to get rid of them.

Those names are contained in the 2007-05-04 dictionary in the add-on that was published on AMO, just as they are in the Mozilla-maintained dictionary.
Clearing the needinfo since that discussion is happening elsewhere.
Flags: needinfo?(ehsan)
Comment on attachment 8703349 [details] [diff] [review]
354 dictionary corrections

I can't apply this patch cleanly on trunk...
Attachment #8703349 - Flags: review?(ehsan)
(Assignee)

Comment 46

2 years ago
(In reply to :Ehsan Akhgari from comment #44)
> Clearing the needinfo since that discussion is happening elsewhere.
Please read comment #42 carefully. The full discussion is happening here.

Do I really have to repeat it? My suggestion is:
- Scrap the Mozilla dictionary
- Generate a new custom dictionary from SCOWL, "size 60"/"normal",
  add accented words, hacker terms and Roman numerals, see:
  http://app.aspell.net/create
- Add the 37 Mozilla terms.
- Optionally, if you don't want to lose the words you added manually over the years, those 337,
  plus the possessive forms, they can also be added in.

From the other bug:

> What I care about is the words that we have added on top of SCOWL (which
> you can find out in the said history) to not get lost in this process.
They can be maintained, see "optionally" above.

> ship a dictionary with the SCOWL wordlist
Yes.
> + mozilla-specific.txt
Yes.
Plus the 337 + possessive forms not being in SCOWL.

> Let me restart what I think we should do here: We need to get rid of the
> incorrect words in our wordlist.
That's the patch.

But to get rid of the 6000 funny names, we need a different approach, the one stated above.

> Also, ideally, we should try to upstream the legitimate additions that we have
> made on top of the SCOWL wordlist for Kevin's consideration.
Done already, see comment #42.

> If such additions are approved for SCOWL, during the next merge from upstream,
> they will get removed from the list of Mozilla's additions.
Yes, that's why I've already suggested them, see comment #42.

As for not being able to apply the patch, I don't know what you're doing. Here is what I'm doing:

jorgk@SAPO ~/mozilla-central
$ hg pull -u
pulling from https://hg.mozilla.org/mozilla-central
searching for changes
no changes found

jorgk@SAPO ~/mozilla-central
$ hg qpush 1235506.patch
applying 1235506.patch
now at: 1235506.patch

Works.

So let's summarise here:

Two options, both maintain manual additions made:

1) Only apply the patch, that will remove the 354 errors.
2) Start from scratch, get SCOWL data with variants, add Mozilla terms, add Mozilla-added words.
   That cleans up the 6000 doubtful proper names and provides the variants, like "advisor".

Please decide.
Flags: needinfo?(ehsan)
(Assignee)

Comment 47

2 years ago
Comment on attachment 8703349 [details] [diff] [review]
354 dictionary corrections

Maybe you can work out why you can't apply the patch. This represents the minimal change that you're looking for, only removing obviously wrong words.
Attachment #8703349 - Flags: review?(ehsan)
(Assignee)

Comment 48

2 years ago
In case you really can't apply the patch, you can find a copy of the modified dictionary here:
http://www.jorgk.com/misc/en-US.zip
Comment on attachment 8703349 [details] [diff] [review]
354 dictionary corrections

Review of attachment 8703349 [details] [diff] [review]:
-----------------------------------------------------------------

Hmm, this time when I tried to apply the patch, it worked perfectly!  Not sure what went wrong the previous time.  Sorry about that.

This patch removes carnitas, added in bug 718253.  Please add it back.  Other than that, it looks good.  Thanks!

r=me with the above fixed.
Attachment #8703349 - Flags: review?(ehsan) → review+
(In reply to Jorg K (GMT+1) from comment #46)
> So let's summarise here:
> 
> Two options, both maintain manual additions made:
> 
> 1) Only apply the patch, that will remove the 354 errors.

Let's please land this patch here and move the rest to bug 301712.  This bug is already too confusing.

> 2) Start from scratch, get SCOWL data with variants, add Mozilla terms, add
> Mozilla-added words.
>    That cleans up the 6000 doubtful proper names and provides the variants,
> like "advisor".
> 
> Please decide.

I understand what you're saying, no need to repeat it over and over again!

As I said in bug 301712 comment 22, it's difficult to determine what the result of what you're proposing is going to be, since you're starting from a different baseline.  Right now, make-new-dict uses the en-US.dic file produced by running |./make-hunspell-dict -all|.  You're proposing to start from what http://app.aspell.net/create generates.  Now, I have *no idea* what is the difference between the two baseline dictionaries, so it is difficult for me to say yes or no to your specific proposal.  That being said, parts of your proposal are definitely good and wanted, such as removing those incorrect proper names, or providing the variants.

But there may be other less invasive solutions that are easier to evaluate.  For example, for providing variants such as "advisor", if there is a way to do something similar to make-hunspell-dict but to get to include those in the resulting en_US.dic, it may be much easier to modify our make-new-dict script accordingly and just reimport SCOWL.  (I'm not sure if that's something that can be done from the command line, but perhaps Kevin can help with that part.)

Also in order to remove the incorrect proper names, can you please explain how you separate them out from other wanted Mozilla specific additions (such as the 337 words that I think are ones we have manually added to our en-US dictionary)?  Is it possible to unmunch our dictionary, and remove just those incorrect proper names?

Basically, the farther away your proposal gets from how we currently generate the dictionary, the harder it gets to evaluate its impact.  I'd hate to say yes and have you spend time on doing what you're proposing only for the resulting patch to be rejected in the end, which is why I'm trying to understand all of the implications.  Hope this makes sense!
Flags: needinfo?(ehsan)
(Assignee)

Comment 51

2 years ago
Created attachment 8705363 [details] [diff] [review]
353 dictionary corrections

Carrying forward Ehsan's r+.

I added "carnitas" back in:
https://en.wikipedia.org/wiki/Carnitas
Attachment #8703349 - Attachment is obsolete: true
Attachment #8705363 - Flags: review+
(Assignee)

Comment 52

2 years ago
Dear Sheriff,

this patch affects the en-US spell check dictionary only. I don't expect this to affect anything, so please combine with suitable other patches when landing.

Thank you.
Keywords: checkin-needed

Comment 53

2 years ago
https://hg.mozilla.org/integration/fx-team/rev/45e3fc0e8816
Keywords: checkin-needed

Comment 54

2 years ago
(In reply to :Ehsan Akhgari from comment #50)

> As I said in bug 301712 comment 22, it's difficult to determine what the
> result of what you're proposing is going to be, since you're starting from a
> different baseline.  Right now, make-new-dict uses the en-US.dic file
> produced by running |./make-hunspell-dict -all|.  You're proposing to start
> from what http://app.aspell.net/create generates.  Now, I have *no idea*
> what is the difference between the two baseline dictionaries, so it is
> difficult for me to say yes or no to your specific proposal.

By design, http://app.aspell.net/create will produce the same dictionary that make-hunspell-dict does when you choose the same settings; for example, http://app.aspell.net/create?defaults=en_US will create the same en_US dictionary that make-hunspell-dict will create.  The only difference is the README and the fact that it will use the latest version from git rather than a release.

> But there may be other less invasive solutions that are easier to evaluate. 
> For example, for providing variants such as "advisor", if there is a way to
> do something similar to make-hunspell-dict but to get to include those in
> the resulting en_US.dic, it may be much easier to modify our make-new-dict
> script accordingly and just reimport SCOWL.  (I'm not sure if that's
> something that can be done from the command line, but perhaps Kevin can help
> with that part.)

Yes, it is something that can be done from the command line but it might mean having to patch make-hunspell-dict.  Any changes to make-hunspell-dict will end up in the next release, so I might end up first creating a new release and then submitting a patch to create a custom version.
Flags: needinfo?(ehsan)
(Assignee)

Updated

2 years ago
Summary: 1) SCOWL import removes useful words from Mozilla, 2) "normal" SCOWL data not rich enough, 3) additional Mozilla words need to be cleaned up. See comment #10 and comment #12. → en-US dictionary: Additional Mozilla words need to be cleaned up. Other issues discussed: See comment #10 and comment #12.

Comment 55

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/45e3fc0e8816
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox46: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
(Assignee)

Comment 56

2 years ago
Comment on attachment 8705363 [details] [diff] [review]
353 dictionary corrections

Approval Request Comment
[Feature/regressing bug #]: No regression.
[User impact if declined]: Low, wrong words in dictionary.
[Describe test coverage new/current, TreeHerder]: N/A.
[Risks and why]: No risk, en-US dictionary change only.
[String/UUID change made/needed]: None.

It would be good to include the dictionary corrections in ESR 45 (also, dare I say it, for the benefit of Thunderbird users).

(Clearing NI for Ehsan since this bug is done.)
Flags: needinfo?(ehsan)
Attachment #8705363 - Flags: approval-mozilla-aurora?
(Assignee)

Updated

2 years ago
No longer blocks: 301712, 1144254, 1183512, 1198052

Comment 57

2 years ago
Created attachment 8706083 [details] [diff] [review]
add-variants.patch

Comment 58

2 years ago
Created attachment 8706084 [details] [diff] [review]
add-variants.patch
Attachment #8706083 - Attachment is obsolete: true

Comment 59

2 years ago
I am very confused as to what is going on where, so I just decided to reply in what seems like the most logical place. Feel free to move this to a new issue if that would be more appropriate.

(In reply to Kevin Atkinson from comment #54)
> (In reply to :Ehsan Akhgari from comment #50)
>
> > But there may be other less invasive solutions that are easier to evaluate. 
> > For example, for providing variants such as "advisor", if there is a way to
> > do something similar to make-hunspell-dict but to get to include those in
> > the resulting en_US.dic, it may be much easier to modify our make-new-dict
> > script accordingly and just reimport SCOWL.  (I'm not sure if that's
> > something that can be done from the command line, but perhaps Kevin can help
> > with that part.)
> 
> Yes, it is something that can be done from the command line, but it might mean
> having to patch make-hunspell-dict.  Any changes to make-hunspell-dict will
> end up in the next release, so I might end up first creating a new release
> and then submitting a patch to create a custom version.

Attached is a patch (add-variants.patch) that does this.  I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL.

It adds common variants and accented words.

Note that there is now a mixture of iso-8859-1 and utf-8 encoding.  I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words.
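
To make the encoding warning concrete, here is a rough sketch of a check that reports which lines of a word list contain non-ASCII bytes and whether those bytes decode as UTF-8 or have to be treated as ISO-8859-1. It is not part of the patch; the function names are invented for illustration, and the check is a heuristic, because some ISO-8859-1 byte sequences also happen to be valid UTF-8.

```python
def classify_line(raw: bytes) -> str:
    """Classify one raw line as 'ascii', 'utf-8' or 'latin-1' (heuristic)."""
    try:
        raw.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Every byte sequence decodes as ISO-8859-1, so this is the fallback.
        return "latin-1"


def find_mixed_encoding(data: bytes) -> dict:
    """Map each non-ASCII encoding to the 1-based numbers of lines using it."""
    report = {"utf-8": [], "latin-1": []}
    for lineno, raw in enumerate(data.split(b"\n"), start=1):
        kind = classify_line(raw)
        if kind != "ascii":
            report[kind].append(lineno)
    return report


# Hypothetical sample: line 1 is ISO-8859-1, line 2 is UTF-8, line 3 is ASCII.
sample = "naïve\n".encode("latin-1") + "café\n".encode("utf-8") + b"plain\n"
report = find_mixed_encoding(sample)
```

If both lists in the report are non-empty for a single file, that file mixes encodings and the newly added words are worth a second look.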

Also the SCOWL dictionary got renamed from en_US to en_US-custom.  The final Mozilla dictionary is still named en-US.
Flags: needinfo?(mozilla)
Flags: needinfo?(ehsan)

Comment 60

2 years ago
Created attachment 8706086 [details] [diff] [review]
add-variants.patch
Attachment #8706084 - Attachment is obsolete: true

Comment 61

2 years ago
Created attachment 8706090 [details] [diff] [review]
add-variants.patch
Attachment #8706086 - Attachment is obsolete: true
(Assignee)

Comment 62

2 years ago
Comment on attachment 8706090 [details] [diff] [review]
add-variants.patch

Thank you!
This bug is done.

Further work needs to move to bug 1238031.
Attachment #8706090 - Attachment is obsolete: true
Flags: needinfo?(mozilla)
Flags: needinfo?(ehsan)
status-firefox45: --- → affected
Comment on attachment 8705363 [details] [diff] [review]
353 dictionary corrections

Sure, should not have impact on the release, taking it.
Attachment #8705363 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Assignee)

Comment 64

2 years ago
Great, thanks. More of this is coming in bug 1238031 and bug 301712 ;-)

Comment 65

2 years ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/8464f03a7801
status-firefox45: affected → fixed