Remove "Aurthur" from the en-US dictionary. In fact: review about 6000 proper names in the Mozilla dictionary which are not in SCOWL.

RESOLVED FIXED in Firefox 45

Status


Core
Spelling checker
RESOLVED FIXED
12 years ago
a year ago

People

(Reporter: abj6957, Assigned: Jorg K (GMT+2))

Tracking

unspecified
mozilla46
x86
Windows 2000
Points:
---

Firefox Tracking Flags

(firefox45 fixed, firefox46 fixed)

Details

Attachments

(4 attachments, 2 obsolete attachments)

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

My name is Arthur, and I noticed that 'Aurthur' is in the dictionary. When
performing a spell check, 'Aurthur' was suggested as a replacement for the
misspelled word 'furthur' in my message. I checked my personal dictionary and it
wasn't in there.

Reproducible: Always

Steps to Reproduce:
1. Compose an email with 'furthur' in it.
2. Run the spell checker
3.

Actual Results:  
The Check Spelling dialog box contains 'Aurthur' and 'further' in the list of
suggested replacements.

Expected Results:  
'Aurthur' is an incorrect spelling of 'Arthur' and should not be in the dictionary.

Comment 1

12 years ago
"Aurthur", albeit a misspelling of your name, appears 33,800 times when I Google
it. There's plenty of people whose name is Aurthur -- probably too many for us
to remove it from the dictionary.
(Reporter)

Comment 2

12 years ago
I encourage you to do a little more research before assuming that plenty of
people spell their name 'Aurthur' just because it got 33,800 hits on google.
Google even asks, "did you mean 'arthur'?" when you search for 'aurthur'. You
can get all kinds of hits for misspelled words on google. And, the proper
spelling of 'Arthur' gets 4.3 million hits on google.
QA Contact: general
still bad
Status: UNCONFIRMED → NEW
Ever confirmed: true
Component: General → Spelling checker
Product: Thunderbird → Core
QA Contact: general → spelling-checker

Updated

9 years ago
Assignee: mscott → nobody

Comment 4

8 years ago
Aurthur is a surprisingly common first and last name. 

http://www.google.com/webhp?hl=en#hl=en&q=Aurthur+-Arthur

97,200 hits. Majority of these are not misspellings. 

--> wontfix.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 5

8 years ago
Here are some more interesting Google statistics:

"Arthur" gets 158,000,000 hits.
"Aurthur" at 97,200 hits is just .06% of "Arthur"'s hits.
"Caywood" gets 263,000 hits (2.7 times more than "Aurthur") and isn't in the dictionary.

Comment 6

8 years ago
OK...that's a reasonable point, and I should not have resolved this WONTFIX. Reopening.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Comment hidden (obsolete)
(Comment 7 is incorrect, as discussed in bug 1183512 comment 2. Reopening & tagging that comment as obsolete.)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---

Comment 9

a year ago
Created attachment 8702198 [details] [diff] [review]
Remove Aurthur from the en-US dictionary
Assignee: nobody → ananuti
Attachment #8702198 - Flags: review?(ehsan)
(Assignee)

Comment 10

a year ago
Let's fix all the issues in bug 1235506.
Depends on: 1235506

Updated

a year ago
Attachment #8702198 - Flags: review?(ehsan)
(Assignee)

Comment 11

a year ago
I've been looking at the dictionary content in the last few days, mainly in the context of bug 1235506 (in which I identified at least 331 really wrong entries in the dictionary).

The fact of the matter is that Mozilla has about 6000 proper names in the Mozilla dictionary which aren't in the SCOWL data and which can be seen here:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-added
About 12000 words, half of them possessive forms.

"Aurthur" is one of them. "Aurthur" is a, perhaps uncommon, proper name with about 345,000 hits on Google:
https://www.google.com.au/?gws_rd=ssl#q=Aurthur&nfpr=1

In principle one should ask why Mozilla has 6000 proper names in the dictionary, amongst them "Brittaney" (453,000), "Brittani" (887,000), "Britteny" (393,000) and "Brittne" (159,000) - Google hits in parentheses.
They are all possible misspellings of "Britney" (86,400,000).

So if we removed "Aurthur" from the dictionary, we should take a hard look at the remaining 6000 proper names and treat them all equally. They all go, they all stay or they all get some processing, for example, let them stay if more than one million Google hits, or something. I repeat: They are all *not* in the SCOWL data.
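
The "more than one million Google hits" idea could be sketched as a small filter. This is only an illustration: the `keep_popular` function name and the two-column "name hits" input format are assumptions, and the figures are the Google hit counts quoted in this comment.

```shell
#!/bin/sh
# Hypothetical helper: keep names whose hit count exceeds a threshold ($1).
# Input format (an assumption): one "name hits" pair per line on stdin.
keep_popular() {
  awk -v min="$1" '$2 + 0 > min + 0 { print $1 }'
}

# Hit counts quoted in the comment above.
hits='Aurthur 345000
Brittaney 453000
Brittani 887000
Britteny 393000
Brittne 159000
Britney 86400000'

printf '%s\n' "$hits" | keep_popular 1000000
```

With these figures only "Britney" clears the one-million line, which shows how blunt a single cut-off would be.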
Summary: There is an invalid entry in the English/United States dictionary of 'Aurthur'. → Remove "Aurthur" from the en-US dictionary. In fact: review about 6000 proper names in the Mozilla dictionary which are not in SCOWL.
(Assignee)

Comment 12

a year ago
Comment on attachment 8702198 [details] [diff] [review]
Remove Aurthur from the en-US dictionary

Removing one uncommon proper name does not solve the problem of the other 6000 proper names that Mozilla adds to the SCOWL data. Many of them are also uncommon. We need a unified approach to deal with this.
Attachment #8702198 - Flags: feedback-

Updated

a year ago
Assignee: ananuti → nobody

Updated

a year ago
Attachment #8702198 - Attachment is obsolete: true
Comment on attachment 8702198 [details] [diff] [review]
Remove Aurthur from the en-US dictionary

r=me on landing this.

While I agree that we should review other proper names in the dictionary, the perfect should not be the enemy of the good.

Jorg, I'd appreciate if you have the time to do such a review.  If not, that's OK too.  Either way, that work can happen in a follow-up.
Attachment #8702198 - Attachment is obsolete: false
Attachment #8702198 - Flags: review+
Keywords: checkin-needed
(Assignee)

Comment 14

a year ago
(In reply to :Ehsan Akhgari from comment #13)
> Jorg, I'd appreciate if you have the time to do such a review.  If not,
> that's OK too.  Either way, that work can happen in a follow-up.
It's not a matter of "review". No one can review 6000 (silly) names.
It's a matter of taking a decision: Keep *all*, remove *all* or process *all* somehow.
What you call "good" is a 1/6000 = 0.01667% improvement, which IMHO is not worth having, since you sacrifice consistency for questionable negligible improvement.
(I am really at a loss to understand why you're doing these individual changes when a consistent approach is called for).

Kevin, do you have an idea how to handle this? You have "Britney" in the SCOWL data, so you must have a method to decide which proper names to include and which not to include.
Flags: needinfo?(kevin.bugzilla)

Comment 15

a year ago
I don't tend to review proper names.  I get a list of proper names from Alan.  I might add or remove a few that come to my attention.

My vote would be to just remove them.  If you attach the list I can do a very quick review based on frequency data from Google Books.
(Assignee)

Comment 16

a year ago
Well, as mentioned previously, the Mozilla-added words come out of the process:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-added

The uppercase words up to line 12050 are a mix of
- some abbreviations ("API", "NGO", "CEO", "DVDs"?, etc.),
- 37 Mozilla-terms which are also stored here
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/mozilla-specific.txt
- and mostly proper English names of doubtful origin and quality.

Since they are all Mozilla-added, these words were not in the SCOWL dataset used at the time.

No reason to rush this, the branch date for Firefox 47 is on the 25th of January.
(Assignee)

Updated

a year ago
Keywords: leave-open

Comment 17

a year ago
Created attachment 8703241 [details] [diff] [review]
[updated patch] Remove Aurthur from the en-US dictionary
Attachment #8702198 - Attachment is obsolete: true
Attachment #8703241 - Flags: review+
(Assignee)

Comment 18

a year ago
I'm requesting to land this after bug 1235506. I'm preparing a patch with 354 changes over there and I'd prefer this one line change here not to rot it. I can refresh the patch here later. Besides, it's waited for ten years, so one week more or less doesn't make a difference ;-)

Also, I'm still hoping that we can do a better clean-up here with Kevin's help.
Keywords: checkin-needed, leave-open
(Assignee)

Comment 19

a year ago
More insight into this problem. I've just analysed the dictionary from 2007 which was provided as add-on here:
https://addons.mozilla.org/en-US/firefox/addon/united-states-english-spellche/
before the add-on owner refreshed it today with the original SCOWL dictionary of 2015.08.24.

The 2007 version also appears to be a "pure" SCOWL dictionary, for example it does not contain "SpiderMonkey" or "Fennec", but it has all these:
Line 7217: Britta/M
Line 7218: Brittaney/M
Line 7219: Brittani/M
Line 7220: Brittan/M
Line 7221: Brittany/MS
Line 7222: Britte/M
Line 7223: Britten/M
Line 7224: Britteny/M

It also has heaps of accented words like
Line 58: abbé/S
Line 2786: appliquéd
Line 2787: appliqué/MSG
Line 3695: attaché/S
Line 6060: blasé
Line 7957: café/MS
Line 8218: canapé/S
Line 10506: clichéd
Line 10507: cliché/SM
while still being ISO 8859-1 encoded.

So those 6000 doubtful proper names seem to have come from SCOWL.

Kevin, can you tell us whether those words were in fact included in the 2007 SCOWL data?

Personally, I think we should make a fresh start here:
- Scrap the Mozilla dictionary
- Generate a new custom dictionary from SCOWL, "size 60"/"normal",
  add accented words, hacker terms and Roman numerals, see:
  http://app.aspell.net/create
- Add the 37 Mozilla terms and call it a day.
Flags: needinfo?(ehsan)

Comment 20

a year ago
There is no 2007 version.  For all the SCOWL releases see http://sourceforge.net/projects/wordlist/files/SCOWL/ and for the official Hunspell releases see http://sourceforge.net/projects/wordlist/files/speller/.

Anything else is unlikely to have been provided by me.
Flags: needinfo?(kevin.bugzilla)
(Assignee)

Comment 21

a year ago
Thanks for the quick reply. I didn't see older (2007) files at SourceForge.

So if Ehsan agrees, I'd like to start afresh using SCOWL data only and adding only certain things to it, see bug 1235506 comment #42. Basically we don't know where the legacy data came from and it's time to get rid of it.
(In reply to Jorg K (GMT+1) from comment #19)
> Kevin, can you tell us whether those words were in fact included in the 2007
> SCOWL data?

I'm not sure how we constructed the original dictionary, it definitely predates my involvement.  If you're curious, please see bug 319778.

> Personally, I think we should make a fresh start here:
> - Scrap the Mozilla dictionary
> - Generate a new custom dictionary from SCOWL, "size 60"/"normal",
>   add accented words, hacker terms and Roman numerals, see:
>   http://app.aspell.net/create
> - Add the 37 Mozilla terms and call it a day.

Please look at the history of en-US.dic.  What I care about is that the words we have added on top of SCOWL (which you can find in the said history) do not get lost in this process.  It's really difficult to determine what the exact result of the change that you're proposing is going to be (but as far as I understand you are proposing to throw away that work, and ship a dictionary with the SCOWL wordlist + https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/mozilla-specific.txt).  That is not OK, as I have stated several times.

Let me restate what I think we should do here: We need to get rid of the incorrect words in our wordlist.  Also, ideally, we should try to upstream the legitimate additions that we have made on top of the SCOWL wordlist for Kevin's consideration.  If such additions are approved for SCOWL, during the next merge from upstream, they will get removed from the list of Mozilla's additions.
Flags: needinfo?(ehsan)
(Assignee)

Comment 23

a year ago
OK, let's move the discussion back here:

(In reply to :Ehsan Akhgari from comment bug 1235506 comment #50)
> Also in order to remove the incorrect proper names, can you please explain
> how you separate them out from other wanted Mozilla specific additions (such
> as the 337 words that I think are ones we have manually added to our en-US
> dictionary)?  Is it possible to unmunch our dictionary, and remove just
> those incorrect proper names?

Yes, we can unmunch the dictionary. I stated that in bug 1235506 comment #40. Did you read that ;-)
How do you think I removed "remind's" from the compressed mind/AMDRSZG ? Surely not manually for 342 wrong compressions of /M with other affixes.

SCOWL:
mind's
mind/ADRSZG

Previous Mozilla:
mind/AMDRSZG

I expanded the dictionary, subtracted the errors and munched it again. I'm fully set up with aspell, etc.
(And I document every step I take. Yes, it gets lengthy, but it gives a complete record.)
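
The "subtract the errors" step could look roughly like the sketch below. It assumes the dictionary has already been expanded to one word per line; `subtract_words` is a hypothetical helper name, the toy data is modelled on the mind/AMDRSZG example rather than taken from the real expansion, and the final re-munching step is not shown.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of word list $1 that are not in list $2.
# Both lists are sorted in the C locale first, because comm needs sorted input.
subtract_words() {
  tmp_all=$(mktemp) && tmp_bad=$(mktemp) || return 1
  LC_ALL=C sort -u "$1" > "$tmp_all"
  LC_ALL=C sort -u "$2" > "$tmp_bad"
  # comm -23 keeps the lines that appear only in the first file.
  LC_ALL=C comm -23 "$tmp_all" "$tmp_bad"
  rm -f "$tmp_all" "$tmp_bad"
}

# Toy data: an expanded list containing the wrong form "remind's".
work=$(mktemp -d)
printf '%s\n' mind "mind's" minded remind "remind's" > "$work/expanded.txt"
printf '%s\n' "remind's" > "$work/errors.txt"
subtract_words "$work/expanded.txt" "$work/errors.txt"
rm -rf "$work"
```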

Coming back to the proper names:
They are not "incorrect", they are uncommon. "Aurthur" is an uncommon surname, and for example Britteny might mask bad spelling for Britney (Spears, haha).

Kevin voted for the removal of *all* common names which aren't in SCOWL, see comment #15. I haven't received the frequency data he promised.

Now to remove them without doing damage is a piece of cake: They are all uppercase and they are all in
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/5-mozilla-added

It's easy to remove all uppercase words from the dictionary as long as they are not Mozilla-terms (37 of them) and are not abbreviations, like "API" or "ATPase" (not [A-Z][a-z]*).

The 337 other useful added words are all lowercase, so they remain untouched.

If you want, I can remove them. Please let me know.
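
As a rough sketch of that selection, assuming one word per line: the filter below keeps entries of the shape [A-Z][a-z]* (optionally with a possessive 's, which is an added assumption) and drops anything listed in a file of Mozilla terms. `removal_candidates` is a hypothetical name and the term list here is a toy stand-in for mozilla-specific.txt.

```shell
#!/bin/sh
# Hypothetical helper: from a word list on stdin, print capitalised entries
# matching [A-Z][a-z]* (optionally followed by 's), excluding every word
# listed in the file given as $1.
removal_candidates() {
  LC_ALL=C grep -E "^[A-Z][a-z]*('s)?\$" | LC_ALL=C grep -v -x -F -f "$1"
}

terms=$(mktemp)
printf '%s\n' Fennec SpiderMonkey > "$terms"   # toy list of Mozilla terms
printf '%s\n' API ATPase Aurthur "Aurthur's" Fennec SpiderMonkey advisor |
  removal_candidates "$terms"
rm -f "$terms"
```

Abbreviations such as "API" and "ATPase" fall through because they do not match the pattern, and lowercase additions are never selected.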

===

Change of subject:
One day SCOWL will issue a new dictionary, I gave them a few issues to work on:
https://github.com/kevina/wordlist/issues
In one of them they mention a 2016-01 release (see: https://github.com/kevina/wordlist/issues/135).

At that stage we should not use the "normal"/"size 60" pre-canned dictionary but instead create a custom one here including common variants and accented words which *can* be encoded in ISO 8859-1.
http://app.aspell.net/create
That will create a "normal" .dic/.aff dictionary or a wordlist.

> For example, for providing variants such as "advisor", if there is a way to
> do something similar to make-hunspell-dict but to get to include those in
> the resulting en_US.dic, it may be much easier to modify our make-new-dict
> script accordingly and just reimport SCOWL.

The aim would be to modify make-new-dict, so instead of building all the dictionaries from wordlists here:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict#33
it would start with the downloaded wordlist as 3-upstream.txt here:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict#42
I'd have to check whether the word list created at SCOWL's "create" page already tags the offensive and profane words. I have the impression that yes, since the plain pre-canned SCOWL dictionaries already have "!". I'd have to check this with Kevin.

We should move all this into a different bug.

The question is whether we want to do the processing now based on SCOWL 2015.08.24 data or wait for their next release. 

Sorry, NI again ;-) Two issues: Remove proper names or not. Reimport SCOWL now or later.
Flags: needinfo?(ehsan)
(Assignee)

Comment 24

a year ago
Kevin, can you please comment. Sorry, you've already commented in bug 1235506 comment #54 where you said that you'd give us a special version of make-hunspell-dict (if I understood you correctly). I guess that would be one that also creates the input dictionary we need, ie. with variants and accented characters.
Flags: needinfo?(kevin.bugzilla)

Comment 25

a year ago
(In reply to Jorg K (GMT+1) from comment #24)

> I guess that would be
> one that also creates the input dictionary we need, ie. with variants and
> accented characters.

Yes that is correct.  It will be ready sometime within the next two weeks.
Flags: needinfo?(kevin.bugzilla)
(Assignee)

Comment 26

a year ago
Thanks, Kevin!

(In reply to Jorg K (GMT+1) from comment #23)
> Two issues: Remove proper names or not. Reimport SCOWL now or later.

So we can remove the doubtful proper names now and in another bug reimport SCOWL data when Kevin provides the modified script.

Sounds good?
(Assignee)

Updated

a year ago
No longer depends on: 1235506
(Assignee)

Comment 27

a year ago
I've raised bug 1238031 for the refresh with common variants and accented words.
(In reply to Jorg K (GMT+1) from comment #23)
> OK, let's move the discussion back here:
> 
> (In reply to :Ehsan Akhgari from comment bug 1235506 comment #50)
> > Also in order to remove the incorrect proper names, can you please explain
> > how you separate them out from other wanted Mozilla specific additions (such
> > as the 337 words that I think are ones we have manually added to our en-US
> > dictionary)?  Is it possible to unmunch our dictionary, and remove just
> > those incorrect proper names?
> 
> Yes, we can unmunch the dictionary. I stated that in bug 1235506 comment
> #40. Did you read that ;-)

I did, but it doesn't contain the answer to my question.  :-)  The below text does though!

> Coming back to the proper names:
> They are not "incorrect", they are uncommon. "Aurthur" is an uncommon
> surname, and for example Britteny might mask bad spelling for Britney
> (Spears, haha).
> 
> Kevin voted for the removal of *all* common names which aren't in SCOWL, see
> comment #15. I haven't received the frequency data he promised.
> 
> Now to remove them without doing damage is a piece of cake: They are all
> uppercase and they are all in
> https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/
> en-US/hunspell/dictionary-sources/5-mozilla-added
> 
> It's easy to remove all uppercase words from the dictionary as long as they
> are not Mozilla-terms (37 of them) and are not abbreviations, like "API" or
> "ATPase" (not [A-Z][a-z]*).
> 
> The 337 other useful added words are all lowercase, so they remain untouched.
> 
> If you want, I can remove them. Please let me know.

So here is the problem.  There are names in that list that we should definitely not remove.  For example, with the example of "Britteny", the existence of that name in the word list is bad as it may cause Hunspell to suggest it which is probably not what the user wants.  But other names in that list, such as "Geoff" and "Geoffrey", are actual proper names which don't have this problem, so we should keep them.

To be honest, I won't even be able to review such a change that leaves names such as "Geoff" but removes ones such as "Britteny" since I'm not a native speaker.  But if you're suggesting that we should remove *all* of these names wholesale, I'm not sure if I agree that's a net benefit.

> Change of subject:
> One day SCOWL will issue a new dictionary, I gave them a few issues to work
> on:
> https://github.com/kevina/wordlist/issues
> In one of them they mention a 2016-01 release (see:
> https://github.com/kevina/wordlist/issues/135).
> 
> At that stage we should not use the "normal"/"size 60" pre-canned dictionary
> but instead create a custom one here including common variants and accented
> words which *can* be encoded in ISO 8859-1.
> http://app.aspell.net/create
> That will create a "normal" .dic/.aff dictionary or a wordlist.

With the new make-new-dict script that Kevin offered help with, that should be no problem!

> > For example, for providing variants such as "advisor", if there is a way to
> > do something similar to make-hunspell-dict but to get to include those in
> > the resulting en_US.dic, it may be much easier to modify our make-new-dict
> > script accordingly and just reimport SCOWL.
> 
> The aim would be to modify make-new-dict, so instead of building all the
> dictionaries from wordlists here:
> https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/
> en-US/hunspell/dictionary-sources/make-new-dict#33
> it would start with the downloaded wordlist as 3-upstream.txt here:
> https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/
> en-US/hunspell/dictionary-sources/make-new-dict#42
> I'd have to check whether the word list created at SCOWL's "create" page
> already tags the offensive and profane words. I have the impression that
> yes, since the plain pre-canned SCOWL dictionaries already have "!". I'd
> have to check this with Kevin.
> 
> We should move all this into a different bug.
> 
> The question is whether we want to do the processing now based on SCOWL
> 2015.08.24 data or wait for their next release. 

I think we can use the 2015-08-24 dictionary as a baseline for now and switch to the new SCOWL when it comes out.  No need to block on that.

> Sorry, NI again ;-) Two issues: Remove proper names or not. Reimport SCOWL
> now or later.

No problem, thanks for your perseverance so far.  :-)
Flags: needinfo?(ehsan)
(Assignee)

Comment 29

a year ago
(In reply to :Ehsan Akhgari from comment #28)
> So here is the problem.  There are names in that list that we should
> definitely not remove.  For example, with the example of "Britteny", the
> existence of that name in the word list is bad as it may cause Hunspell to
> suggest it which is probably not what the user wants.  But there are other
> names in that list such as "Geoff" and "Geoffrey" are actual proper names
> which don't have this problem, so we should keep them.
> 
> To be honest, I won't even be able to review such a change that leaves names
> such as "Geoff" but removes ones such as "Britteny" since I'm not a native
> speaker.  But if you're suggesting that we should remove *all* of these
> names wholesale, I'm not sure if I agree that's a net benefit.
OK, so in summary: You'd like to keep more "useful" names and remove less "useful" ones that could lead to undesired suggestions. That's a fair enough opinion and I can't even contradict ;-)

The problem is that I can't review 6000 names.

My investigations have shown that those names were included in the 2006/2007 dictionary (previously published on AMO) which pre-dates the Hunspell transition in bug 319778 in Aug. 2007. Mind you, this bug here is from 2005, so "Aurthur" and many of his equally unfortunate friends were in the dictionary in 2005.

Without Kevin's frequency data I can't make a choice. So I will attach a list of the names and ask Kevin for help again.

===

The issue of including common variants:

> > The question is whether we want to do the processing now based on SCOWL
> > 2015.08.24 data or wait for their next release. 
> I think we can use the 2015-08-24 dictionary as a baseline for now and
> switch to the new SCOWL when it comes out.  No need to block on that.
This issue has moved to bug 1238031. Kevin promised us a modified "make-hunspell-dict" script in his release, so we'll wait for that.
(Assignee)

Comment 30

a year ago
Created attachment 8705809 [details]
6000+ proper names from the dictionary

I took all the uppercase Mozilla-added words, removed Mozilla-specific terms and abbreviations and other names like "Wikipedia". I'm left with about 6000 proper names for which we're considering removal.

Kevin, as suggested in comment #15, can you please give us some frequency data.

Thank you in advance.
Flags: needinfo?(kevin.bugzilla)

Comment 31

a year ago
Jorg K: Sorry for not getting back to you.

For frequency data just copy and paste the list into the app at http://app.aspell.net/lookup-freq.  Don't worry it can handle very large lists although it might take a little while.  You can speed things up by selecting the "Original Words" or "Neither" option under "Also Report".

You will notice that words such as "the" will be at the top of the list; this is not a bug, as the word "The" is in the list.  The lookup is case-insensitive and the most likely capitalization of a word is displayed.  The 6000+ proper names list contains a large number of words whose lowercase version is already in the dictionary, so I am not sure how helpful my tool will be until these entries are cleaned up.
Flags: needinfo?(kevin.bugzilla)
(Assignee)

Comment 32

a year ago
Created attachment 8705844 [details]
Frequency report

OK, I did as suggested:
"Neither" and "Normal Report". Here is the result.

Some words are spat out in lowercase since they are already included in SCOWL in lowercase. This shows the terrible quality of the Mozilla data, for example Mozilla adds:
"The" "Don't" "Any" "Star", etc.

So I propose this:
Step 1: Remove everything which has two stars or less, like "Aurthur".
        All the silly "Britt*" names have one star.
        There are some terrible names down there with one star:
        Barbabra, Auroora, Justinn
        That will kill about 3000 names.
        Oops, I spotted Mozillian(s) with one star, so we keep those.
Step 2: Remove all the uppercase versions of the lowercase words
        where the lowercase word is already in SCOWL,
        so "The" "Had" "My" "See" "Any" "Even".
        That's another 330 words.

Sounds like a net gain? ;-)

I think so. Should I go ahead and prepare a patch?
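
Step 2 could be sketched as follows, assuming two plain word lists with one word per line. `capitalised_duplicates` is a hypothetical helper, and the two toy files merely stand in for the SCOWL word list and the Mozilla-added list.

```shell
#!/bin/sh
# Hypothetical helper: print the capitalised words from file $2 whose
# lowercase form is present in file $1.
capitalised_duplicates() {
  awk 'NR == FNR { known[$0] = 1; next }
       tolower($0) != $0 && (tolower($0) in known)' "$1" "$2"
}

work=$(mktemp -d)
printf '%s\n' the had my see any even > "$work/scowl.txt"   # toy SCOWL words
printf '%s\n' The Had Geoff Any Aurthur > "$work/added.txt" # toy added words
capitalised_duplicates "$work/scowl.txt" "$work/added.txt"
rm -rf "$work"
```

Here "The", "Had" and "Any" are reported, while "Geoff" and "Aurthur" are left for the star-based decision in Step 1.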
Flags: needinfo?(ehsan)
(Assignee)

Comment 33

a year ago
BTW, "Geoff" has four stars and would be maintained.
Thanks for your help with this so far!

(In reply to Jorg K (GMT+1) from comment #32)
> So I propose this:
> Step 1: Remove everything which has two stars or less, like "Aurthur".
>         All the silly "Britt*" name have one star.
>         There are some terrible names down there with one star:
>         Barbabra, Auroora, Justinn
>         That will kill about 3000 names.
>         Oops, I spotted Mozillian(s) with one star, so we keep those.

I'm sure this was mentioned in one of the bugs before, but I think I've missed it, sorry about that:

Before I can give a useful answer, can you please refresh my memory on what these stars mean?  i.e., what words will have only 2 stars and what words will have 4?

(I noticed that Geoff has 4 stars as you mentioned, but Geoffry has 2 stars.  I won't be surprised if after this change we'd get some bug reports and add a few proper names that didn't make the cut-off back, such as Geoffry, but that seems better than curating this big list by hand!)

> Step 2: Remove all the uppercase versions of the lowercase words
>         where the lowercase word is already in SCOWL,
>         so "The" "Had" "My" "See" "Any" "Even".
>         That's another 330 words.

Yeah, these definitely need to go!
Flags: needinfo?(ehsan)
(Assignee)

Comment 35

a year ago
(In reply to :Ehsan Akhgari from comment #34)
> Thanks for your help with this so far!

Well, we're doing well. 353 bad entries removed and variants and accented words added in two other bugs. So this is the last step to getting a clean slate.

> I'm sure this was mentioned in one of the bugs before, but I think I've
> missed it, sorry about that:
> 
> Before I can give a useful answer, can you please refresh my memory on what
> these stars mean?  i.e., what words will have only 2 stars and what words
> will have 4?

No, the star business wasn't mentioned before.

> (I noticed that Geoff has 4 stars as you mentioned, but Geoffry has 2 stars.
> I won't be surprised if after this change we'd get some bug reports and add
> a few proper names that didn't make the cut-off back, such as Geoffry, but
> that seems better than curating this big list by hand!)

I don't know what they mean exactly, I pasted the output from http://app.aspell.net/lookup-freq straight into the attachment, including the column headers. It's just some sort of ranking system. The page says:

===
A word that is already included is labeled as "incl.". A word with 5 stars should most likely be included unless there is a good reason not to. A word with 3 stars (***) is still worth considering and a word with 1 star (*) should most likely not be considered. 
===

To me it's just a way to draw the line somewhere.

As for Geoffry: I'd kick him out for sure since the real Geoffry is Geoffrey!

See this detailed report:

Word                 |  Adj. Freq   Newness Rank | Normal dict | Large dict
  similar words      |  (per million)            | should incl | should incl
---------------------|---------------------------|-------------|-------------
Geoffry              |       0.0472  1.1  186232 |    **       |    ***  
  Geoffrey           |    146.9x     1.0   10902 |    incl.    |    incl.
  Geoffroy           |      5.9x     1.0   70118 |    ***      |    ***  
  Geffroy            |      1.4x     0.5  153618 |    **       |    ***  
  Geffrey            |      1.0x     1.3  184147 |    **       |    ***  
  Jeoffry            |      0.7x     2.1  223858 |    **       |    **   
  Geoffroi           |      0.7x     1.0  230550 |    **       |    **   
  geoffroyi          |      0.6x     1.1  249143 |    **       |    **   
  Geoffery           |      0.5x     1.1  268804 |    **       |    **   
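
A report in this layout could be filtered mechanically. The sketch below is an assumption about how one might do it, not how the lists in this bug were actually produced: `low_star_words` is a hypothetical helper that prints each looked-up word (the unindented rows) with fewer than three stars in the "Normal dict" column.

```shell
#!/bin/sh
# Hypothetical helper: read a frequency report on stdin and print the
# looked-up words rated with one or two stars in the third column.
low_star_words() {
  awk -F'|' '
    $1 !~ /^[ \t]/ {                 # skip the indented "similar words" rows
      stars = gsub(/\*/, "", $3)     # count stars in the "Normal dict" column
      if (stars >= 1 && stars < 3) { # header and "incl." rows have no stars
        sub(/ +$/, "", $1)
        print $1
      }
    }'
}

low_star_words <<'EOF'
Word                 |  Adj. Freq   Newness Rank | Normal dict | Large dict
---------------------|---------------------------|-------------|-------------
Geoffry              |       0.0472  1.1  186232 |    **       |    ***
  Geoffrey           |    146.9x     1.0   10902 |    incl.    |    incl.
  Geoffroy           |      5.9x     1.0   70118 |    ***      |    ***
EOF
```

For the excerpt above this prints only "Geoffry", the one looked-up word below the three-star line.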

Shall I proceed?
OK, sure, let's use a minimum of three stars as the cut-off point!
(Assignee)

Comment 37

a year ago
Will do. I'll wait for bug 1238031 to land first.
(Assignee)

Comment 38

a year ago
Comment on attachment 8703241 [details] [diff] [review]
[updated patch] Remove Aurthur from the en-US dictionary

We really don't need this patch any more since we're doing a more comprehensive clean-up.
Attachment #8703241 - Attachment is obsolete: true
(Assignee)

Comment 39

a year ago
Created attachment 8708730 [details] [diff] [review]
Patch to remove 564 +  6301 words.

Notes for review:

Convert the existing en-US.dic to UTF-8:
iconv -f ISO-8859-1 -t UTF-8 en-US.dic > en-US-utf8.dic

Expand the en-US.dic using this script expand-dict:
./expand-dict en-US-utf8.dic > wordlist.txt
Script:
===
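#!/bin/sh
# Expand the .dic file given as the first argument into a sorted word list.
expand() {
  # Drop the leading word-count line, then expand the affix flags
  # using the .aff file passed to this function.
  grep -v '^[0-9]\+$' | ./munch-list expand "$1" | sort -u
}
expand en-US.aff < "$1"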
===
(munch-list is in the "speller" directory).
Note: I had no luck expanding the ISO-8859-1 encoded dictionary, so I converted it to UTF-8. It's not important, as long as it can be expanded.
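
As a small illustration of what the conversion does to an accented entry, the sketch below feeds one ISO-8859-1 encoded word through iconv. `latin1_to_utf8` is just a hypothetical wrapper name; the iconv options are the ones used above.

```shell
#!/bin/sh
# Hypothetical wrapper around the iconv call used above.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# In ISO-8859-1 the "é" of "café" is the single byte 0xE9 (octal 351);
# after conversion it becomes the two-byte UTF-8 sequence 0xC3 0xA9.
printf 'caf\351/MS\n' | latin1_to_utf8
```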

This patch removes all the 564 words in to-remove-upper.txt (enclosed): Example:
The Has My Honey Don't.
The equivalent lowercase words are in to-remove-lower.txt (enclosed) for checking purposes.

It also removes 6301 words in two-star-remove-upper.txt (enclosed). These are proper names with a rating of fewer than three stars.

Do some sanity checks for the 564 words:
sort 5-mozilla-added > 5-mozilla-added-sorted.txt
comm -12 5-mozilla-added-sorted.txt to-remove-lower.txt
Result: None, that means, none of the 564 lowercase words are Mozilla-added.

comm -13 wordlist.txt to-remove-lower.txt
Result: None, that means all the 564 lowercase words are in the Mozilla dictionary,
since they aren't Mozilla added, they must come from SCOWL.
Therefore the uppercase equivalent 564 can be removed from the Mozilla dictionary.

comm -12 5-mozilla-added-sorted.txt to-remove-upper.txt | wc -l
Result: 564, that means that all 564 uppercase words are added by Mozilla,
despite the fact that they are already in lowercase in SCOWL.

Check the two-star names:
comm -12 5-mozilla-added-sorted.txt two-star-remove-upper.txt | wc -l
Result: 6301, that means that all 6301 uppercase words are added by Mozilla.
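
For readers unfamiliar with comm, here is a toy run showing what the two option sets used in these checks select. The file names and contents are made up for the illustration.

```shell
#!/bin/sh
# Toy lists, already sorted in the C locale as comm requires.
work=$(mktemp -d)
printf '%s\n' Any Aurthur The > "$work/added.txt"
printf '%s\n' Any Geoff The   > "$work/to-remove.txt"

# comm -12: lines present in both files (the intersection).
in_both=$(LC_ALL=C comm -12 "$work/added.txt" "$work/to-remove.txt")
# comm -13: lines present only in the second file.
only_second=$(LC_ALL=C comm -13 "$work/added.txt" "$work/to-remove.txt")

printf 'in both:\n%s\n' "$in_both"
printf 'only in second:\n%s\n' "$only_second"
rm -rf "$work"
```

So an empty `comm -13` result means every line of the second file also occurs in the first, and a `comm -12` count equal to the size of the second file means the same thing.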

wc -l wordlist.txt
Result: 136205 words in the expanded dictionary.

Now apply the patch, convert and expand the dictionary again:
iconv -f ISO-8859-1 -t UTF-8 en-US.dic > en-US-utf8.dic
./expand-dict en-US-utf8.dic > wordlist-new.txt

wc -l wordlist-new.txt: Result: 129340 = 136205 - 564 - 6301.
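
The line-count arithmetic can be scripted so the comparison is not done by eye. This is only a sketch using the figures quoted above; the variable names are mine, and the comparison step runs only if wordlist-new.txt exists in the current directory:

```shell
#!/bin/sh
# Line-count check using the figures quoted in this comment.
before=136205        # wc -l wordlist.txt
removed_case=564     # words in to-remove-upper.txt
removed_names=6301   # words in two-star-remove-upper.txt

expected_after=$((before - removed_case - removed_names))
echo "expected lines after patch: $expected_after"

# Compare against the real file when it is available.
if [ -f wordlist-new.txt ]; then
  actual=$(wc -l < wordlist-new.txt | tr -d ' ')
  if [ "$actual" -eq "$expected_after" ]; then
    echo "line count matches"
  else
    echo "line count differs: $actual"
  fi
fi
```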

===

Looking at the patch you will see 138 lines which are added, and you may ask yourself how that is possible if the patch only removes words.

Each of these additions has a reason. Take
+Dorris.
This goes together with
-Dorri/SM.
So Dorri and Dorri's get removed since Dorri has one star, but Dorris remains, since she has three stars.

Take:
+Witty's
This goes with
-Witty/M
So Witty and Witty's got removed, then Witty's got added back. Why? Well, witty is already in SCOWL, so there is no need for Witty.

Take:
+Amber/M
This goes with
-Amber/MY
Amberly('s) got removed, leaving Amber('s).

Last one:
-Any/M
+Any's
Any got removed, since "any" is already in the SCOWL dictionary, leaving Any's.
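
To make the effect of the affix flags concrete, here is a toy expansion. It is a deliberate simplification and not a replacement for munch-list: only three flags are modelled (M adds 's, S adds s, Y adds ly), the real rules live in en-US.aff, and expand_entry is a hypothetical helper name.

```shell
#!/bin/sh
# Toy expansion of dictionary entries of the form stem/FLAGS.
# Assumption: M -> stem's, S -> stems, Y -> stemly; every other flag is ignored.
expand_entry() {
  printf '%s\n' "$1" | awk -F/ '{
    print $1                                  # the stem itself is always a word
    for (i = 1; i <= length($2); i++) {
      f = substr($2, i, 1)
      if (f == "M")      print $1 "\047s"     # \047 is the apostrophe
      else if (f == "S") print $1 "s"
      else if (f == "Y") print $1 "ly"
    }
  }' | LC_ALL=C sort -u
}

old=$(mktemp); new=$(mktemp)
expand_entry 'Dorri/SM' > "$old"   # entry removed by the patch
expand_entry 'Dorris'   > "$new"   # entry added by the patch
# Words that actually disappear: only in the old expansion.
net_removed=$(LC_ALL=C comm -23 "$old" "$new")
echo "$net_removed"
rm -f "$old" "$new"
```

Under these assumptions the old entry expands to Dorri, Dorri's and Dorris, the new entry to Dorris alone, so the net removal is Dorri and Dorri's. The same reasoning applies to Amber/MY versus Amber/M.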
Attachment #8708730 - Flags: review?(ehsan)
(Assignee)

Comment 40

a year ago
Created attachment 8708732 [details]
Wordlists that can be useful to review the patch.

Here are the word lists mentioned in the previous comment.
Assignee: nobody → mozilla
Status: REOPENED → ASSIGNED
(Assignee)

Comment 41

a year ago
(In reply to Jorg K (GMT+1) from comment #39)
> Note: I had no luck expanding the ISO-8859-1 encoded dictionary, so I
> converted it to UTF-8. It's not important, as long as it can be expanded.
I've been doing the processing using Cygwin on Windows. Kevin told me in a private message that I should use "export LC_ALL=C". With this set, my "expand-dict" script works on the ISO-8859-1 encoded Mozilla dictionary, so there is no need for the "iconv".

(In reply to Jorg K (GMT+1) from comment #40)
> Here are the word lists mentioned in the previous comment.
Due to the fact that I did all the intermediate processing in UTF-8, these lists are encoded in UTF-8 and also not sorted in LC_ALL=C order. I won't resubmit the lists, you can "iconv" and sort them.
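
A possible way to do that conversion and sorting, as a sketch only: normalize_list is a hypothetical helper name, and it assumes every word in the list can be represented in ISO-8859-1 (iconv stops with an error otherwise).

```shell
#!/bin/sh
# Convert a UTF-8 word list to ISO-8859-1 and sort it in byte (LC_ALL=C) order.
normalize_list() {
  iconv -f UTF-8 -t ISO-8859-1 "$1" | LC_ALL=C sort -u
}

# Example call (file name taken from the ZIP, output name is made up):
#   normalize_list two-star-remove-upper.txt > two-star-remove-upper-c.txt

# In the C locale, uppercase letters sort before lowercase ones:
printf '%s\n' apple Zoe Amber | LC_ALL=C sort
```

The last command prints Amber, Zoe, apple. A locale-aware sort would typically put apple between the two names, which is why lists sorted under different settings confuse comm.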

So I suggest the following review process:
Create expand-dict as follows in the speller directory:
===
#!/bin/sh
expand() {
  grep -v '^[0-9]\+$' | ./munch-list expand $1 | sort -u
}
cat $1 | expand en-US.aff
===

Use it to create wordlist.txt before and wordlist-new.txt after applying the patch.
Check the line count: Before: 136205, after: 129340, difference: 6865 = 564 + 6301.
You can then check that the removals are what I provided in the ZIP file (after sorting and converting, sorry).
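
One way to extract the removals from the two expanded lists, sketched with toy data. The helper names removed_words and added_words are mine, and the files under $tmp stand in for wordlist.txt and wordlist-new.txt:

```shell
#!/bin/sh
# List the differences between two expanded word lists.
# Both inputs must already be sorted in LC_ALL=C order.
removed_words() {
  LC_ALL=C comm -23 "$1" "$2"    # lines only in the first (pre-patch) list
}
added_words() {
  LC_ALL=C comm -13 "$1" "$2"    # lines only in the second (post-patch) list
}

# Toy data, invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' Amber Amberly Dorri Dorris witty > "$tmp/before.txt"
printf '%s\n' Amber Dorris witty               > "$tmp/after.txt"

removed=$(removed_words "$tmp/before.txt" "$tmp/after.txt")
added=$(added_words "$tmp/before.txt" "$tmp/after.txt")
echo "removed:"; echo "$removed"
echo "added (should be empty for a removal-only patch): $added"

rm -rf "$tmp"
```

On the real files, the output of removed_words should match the concatenated, sorted contents of to-remove-upper.txt and two-star-remove-upper.txt, and added_words should print nothing.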

You can also do the "sanity checks", but please keep in mind that you'd need to convert and sort the files from the ZIP file.

Sorry about all the conversion issues. Please let me know if anything is unclear.

===

I'm just looking through the patch and found another weird one:
-Anglicize/GDS
+Anglicize

Looking at 5-mozilla-added, I see:
Anglicized (D suffix)
Anglicizes (S suffix)
Anglicizing (G suffix, eating the "e")
Since these words have lower case equivalents in SCOWL they were removed.

"Anglicize" is in SCOWL and was consequently not removed. Looks like a SCOWL error to carry the lower case and upper case word at the same time: https://github.com/kevina/wordlist/issues/143
(Assignee)

Updated

a year ago
Severity: trivial → normal
(Sorry for the lag, I was out sick yesterday.  Will try my best to review this tomorrow!)
(Assignee)

Comment 43

a year ago
Thanks. This is terrible pea-counting business. The two other dictionary improvement bugs were uplifted to Aurora, so it would be good if this one could follow to complete the set.
Comment on attachment 8708730 [details] [diff] [review]
Patch to remove 564 +  6301 words.

Review of attachment 8708730 [details] [diff] [review]:
-----------------------------------------------------------------

r=me.  Thanks a lot!
Attachment #8708730 - Flags: review?(ehsan) → review+
(Assignee)

Updated

a year ago
Keywords: checkin-needed

Comment 45

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/ae02da0ca244
Keywords: checkin-needed
(Assignee)

Comment 46

a year ago
Comment on attachment 8708730 [details] [diff] [review]
Patch to remove 564 +  6301 words.

Sorry to request this before the patch has landed on M-C but the branch date is looming.

Approval Request Comment
[Feature/regressing bug #]: No regression.
[User impact if declined]: Dictionary with very uncommon words leading to bad spelling suggestions.
[Describe test coverage new/current, TreeHerder]: N/A.
[Risks and why]: No risk, en-US dictionary change only.
[String/UUID change made/needed]: None.

It would be good to include a cleaned-up dictionary in ESR 45 (also, dare I say it, for the benefit of Thunderbird users).

This is the third and final bug for the dictionary clean-up. For the other two see bug 1235506 comment #63 and bug 1238031 comment #21.
Attachment #8708730 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/ae02da0ca244
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
(Assignee)

Comment 48

a year ago
Sorry about the NI, it'd be good to uplift before the branch date.
Flags: needinfo?(sledru)
Comment hidden (obsolete)
(Assignee)

Comment 50

a year ago
Correction:
When uplifting, please uplift this *before* bug 1240916, otherwise the patch won't apply.
status-firefox45: --- → affected
Flags: needinfo?(sledru)
Attachment #8708730 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 51

a year ago
bugherder uplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/002e22c85b33
status-firefox45: affected → fixed