1808872 - Add new words to en-US dictionary (post 20230105 update)

Francesco Lodolo [:flod]

Assignee

Description

2 years ago

Edited

Using this bug to track and discuss requests for new words in the Mozilla en-US dictionary.

Try to provide information on the terms you want to add, in particular references to external sources that confirm the usage of the term (e.g. Merriam-Webster or Oxford online dictionaries).
Include all possible forms, e.g. plural and genitive for nouns, different tenses for verbs.

Francesco Lodolo [:flod]

Assignee

Comment 1

2 years ago

Wi-Fi seems to be missing (we have WiFi). Missed bug 1224488 because of the title when updating the dictionary.

Words to add:

Wi-Fi
Wi-Fi's

Francesco Lodolo [:flod]

Assignee

Updated

2 years ago

Duplicate of this bug: 1224488

Francesco Lodolo [:flod]

Assignee

Comment 3

2 years ago

https://www.merriam-webster.com/dictionary/bouldering

boulderer 
bouldering

Francesco Lodolo [:flod]

Assignee

Comment 4

2 years ago

From the old metabug.

Scunthorpe
https://en.wikipedia.org/wiki/Scunthorpe_problem

Francesco Lodolo [:flod]

Assignee

Comment 5

2 years ago

https://www.oxfordlearnersdictionaries.com/definition/american_english/flexor
(flexor muscles, or flexors)

flexor 
flexors 
extensor 
extensors

Gregory Pappas [:gregp]

Comment 6

2 years ago

Fibromyalgia
https://en.wikipedia.org/wiki/Fibromyalgia

Francesco Lodolo [:flod]

Assignee

Comment 7

2 years ago

(In reply to Gregory Pappas [:gregp] from comment #6)

Fibromyalgia

For completeness
https://www.merriam-webster.com/dictionary/fibromyalgia

fibromyalgia
fibromyalgic
fibromyalgics

Francesco Lodolo [:flod]

Assignee

Comment 8

2 years ago

Attached file words.txt (obsolete) — Details

I'll keep a file attached and update it periodically, so there's no need to go through the entire list of comments.

Still not sure how frequently it makes sense to land updates, probably once per cycle if there are enough words.

Francesco Lodolo [:flod]

Assignee

Comment 9

2 years ago

transcreation

I can't find a definition in dictionaries, but this is a common word in the localization industry
https://en.wikipedia.org/wiki/Transcreation

Francesco Lodolo [:flod]

Assignee

Comment 10

2 years ago

https://www.merriam-webster.com/dictionary/AIDER

aider

Teal Dulcet [:tdulcet] (TB Council)

Comment 11

2 years ago

Attached file Wiktionary words.tsv (obsolete) — Details

(In reply to Francesco Lodolo [:flod] from comment #4)

Scunthorpe
https://en.wikipedia.org/wiki/Scunthorpe_problem

To attempt to be more systematic, I attached a list of 22,963 words that are not currently in the Mozilla en-US dictionary, which includes several from the above comments. They are a very small subset of the over 1.3 million English words in the Wiktionary online dictionary. Specifically, they are the words that are notable enough to have one or more Wikipedia pages about them, since based on comments 4, 6, 9, etc., this seems to meet the criteria for inclusion.

The data was gathered from Wiktionary using the open source Wiktextract utility (GitHub), which I downloaded from https://kaikki.org. I then converted the raw JSON data into a simple TSV file, while only including words with at least one Wikipedia page. I further reduced the list by removing all words from several parts of speech categories (such as names and phrases) and those with any non-alphabetic characters (except for the apostrophe).

The attached TSV (tab-separated values) file is formatted like this:

<word>	<comma separated part(s) of speech>	<comma separated Wikipedia page(s)>

Just add the https://en.wikipedia.org/wiki/ prefix to each quoted Wikipedia page to generate URL(s) for the required reference. For example (from comment 3):

bouldering	noun,verb	'bouldering'

The Wikipedia page is then: https://en.wikipedia.org/wiki/bouldering.
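As a rough illustration of the format above, this Python sketch splits one row and builds the reference URLs. The function name `parse_row` is hypothetical and is not taken from the script that generated the attachment. It assumes page titles never contain commas or tabs.

```python
WIKIPEDIA_PREFIX = "https://en.wikipedia.org/wiki/"


def parse_row(line):
    """Split one three-column TSV row into (word, parts_of_speech, urls)."""
    word, pos_field, pages_field = line.rstrip("\n").split("\t")
    parts_of_speech = pos_field.split(",")
    # Page titles are wrapped in single quotes; remove them before prefixing.
    # Assumption: no title contains a comma, so a plain split is enough.
    pages = [page.strip().strip("'") for page in pages_field.split(",")]
    # Assumption: spaces in a title map to underscores in the URL.
    urls = [WIKIPEDIA_PREFIX + page.replace(" ", "_") for page in pages]
    return word, parts_of_speech, urls


if __name__ == "__main__":
    print(parse_row("bouldering\tnoun,verb\t'bouldering'"))
```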

I am of course not suggesting that all of these words be added to the Mozilla dictionary and some may be offensive or otherwise unsuitable, but there are still a lot of good candidates to consider... Any corrections to the data should be made directly on the Wiktionary website.

I would be happy to share the scripts I wrote to generate this, if anyone would like to reproduce the results, adjust the parameters or adapt it for another language. Long term, it would be great if Mozilla could set up a method to automatically update their dictionaries with new words (similar to bug 1618282 for updating regular dependencies).

(In reply to Francesco Lodolo [:flod] from comment #1)

Wi-Fi seems to be missing (we have WiFi).

I believe the issue is that there are currently no hyphenated words in the Mozilla dictionary. I also excluded them from the attached list, as I was unsure if this was intentional.

Teal Dulcet [:tdulcet] (TB Council)

Updated

2 years ago

Attachment #9312190 - Attachment mime type: application/octet-stream → text/plain

Francesco Lodolo [:flod]

Assignee

Comment 12

2 years ago

(In reply to Teal Dulcet [:tdulcet] from comment #11)

To attempt to be more systematic, I attached a list of 22,963 words that are not currently in the Mozilla en-US dictionary, which includes several from the above comments.

Thanks for taking the time to create this list. I truly appreciate the effort, and I'll try to go through the list over time, but I don't think automation should be the approach to manage dictionary updates going forward.

The main reason is that there's no way someone can safely review a bulk update with so many words (22k, or even "just" 1k), and the focus should be on quality more than quantity at this point.

For example, I looked at terms with at least 6 characters in your list, since there are a lot of acronyms in there (there are smarter ways to filter them out directly in the script, e.g. ignore words with just uppercase characters). That leaves about 16.8k words.
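A minimal sketch of the two filters mentioned here, a minimum length and skipping all-uppercase words. The function and parameter names are hypothetical, and either filter can be used on its own.

```python
def filter_candidates(words, min_length=0, skip_all_caps=True):
    """Keep words that are long enough and are not written entirely in uppercase."""
    kept = []
    for word in words:
        if len(word) < min_length:
            continue
        # str.isupper() is True only when every cased character is uppercase,
        # which is a cheap heuristic for acronyms such as "IETF".
        if skip_all_caps and word.isupper():
            continue
        kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["NASA", "UNESCO", "bouldering", "flexor", "yeet", "Scunthorpe"]
    print(filter_candidates(sample, min_length=6))
```

The uppercase check also drops short acronyms that a length cutoff alone would miss, while keeping short lowercase words when `min_length` is left at 0.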

A lot of them are reported as errors in the macOS dictionary, but maybe a start is to include at least those that are considered OK, since clearly they have been reviewed and approved for that dictionary.

If you could put the script somewhere (e.g. GitHub, even just a Gist), it might be useful for others.

Wi-Fi seems to be missing (we have WiFi).

I believe the issue is that there are currently no hyphenated words in the Mozilla dictionary. I also excluded them from the attached list, as I was unsure if this was intentional.

I couldn't find any history about excluding hyphenated words and, based on my experience, Hunspell doesn't have issues with multiple valid words joined by hyphen (e.g. wire-less isn't marked as error, even if not explicitly listed, but Wi-Fi is because there's no Wi and Fi).
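The behavior described above can be modeled roughly as follows. This is a simplified sketch of the observed result, not Hunspell's actual algorithm: it ignores affix rules and case handling, and the function name is hypothetical.

```python
def accepts_word(word, dictionary):
    """Simplified model: accept a listed word, or a hyphenated word whose parts are all listed."""
    if word in dictionary:
        return True
    parts = word.split("-")
    # A word without hyphens has a single part and was already rejected above.
    return len(parts) > 1 and all(part in dictionary for part in parts)


if __name__ == "__main__":
    known = {"wire", "less", "WiFi"}
    print(accepts_word("wire-less", known))  # both parts are listed
    print(accepts_word("Wi-Fi", known))      # neither "Wi" nor "Fi" is listed
```

Under this model, listing "Wi-Fi" explicitly is enough to make it pass without adding "Wi" or "Fi" as separate entries.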

Teal Dulcet [:tdulcet] (TB Council)

Comment 13

2 years ago

Attached file Wiktionary words.tsv (obsolete) — Details

(In reply to Francesco Lodolo [:flod] from comment #12)

The main reason is that there's no way someone can safely review a bulk update with so many words (22k, or even "just" 1k), and the focus should be on quality more than quantity at this point.

Most dictionaries only add a few dozen words per year, so if Mozilla is now going to update their dictionary "once per cycle" per comment 8, which I presume means once per Firefox release or approximately once per month, it seems like the burden would be easily manageable. From a quick search, it looks like the issue is that the dictionary has not been systematically updated since bug 479334 in 2010, so there is a lot of catching up to do...

For example, I looked at terms with at least 6 characters in your list, since there are a lot of acronyms in there (there are smarter ways to filter them out directly in the script, e.g. ignore words with just uppercase characters). That leaves about 16.8k words.

The Mozilla dictionary actually already includes 694 acronyms, which is why I included them in my word list before. However, I attached an updated list with just 7,519 words, which removes those acronyms as requested. I also moved the forms to be on the same line as the base word, instead of separate lines as before, so they are no longer included in the above word count.

The attached TSV file is now formatted like this:

<word>	[comma separated form(s)]	<comma separated part(s) of speech>	<comma separated Wikipedia page(s)>

A lot of them are reported as errors in the macOS dictionary, but maybe a start is to include at least those that are considered OK, since clearly they have been reviewed and approved for that dictionary.

Great idea! If only this could be easily automated, as then one could check the full 1.3 million English words on Wiktionary and also test other dictionaries... I was thinking someone could also write a simple scraper for the Merriam-Webster or Oxford dictionaries you suggested in comment 0, but I am not sure if there would be copyright problems.

If you could put the script somewhere (e.g. GitHub, even just a Gist), it might be useful for others.

OK, I created a new Gist here: https://gist.github.com/tdulcet/75f80d6a9da049b8378ca8ce9339f77f. On Linux, just run the commands listed at the top of the script (on lines 10-20) to reproduce the attached list. It just requires that Hunspell and Python 3 are installed.

I couldn't find any history about excluding hyphenated words and, based on my experience, Hunspell doesn't have issues with multiple valid words joined by hyphen (e.g. wire-less isn't marked as error, even if not explicitly listed, but Wi-Fi is because there's no Wi and Fi).

My understanding from reading bug 1224488 is that hyphenated words are not (currently) supported, so yes, it appears one would need to add separate entries in the dictionary for Wi and Fi in order for Wi-Fi to work as expected. However, there are clearly several bugs here, as all of the suggestions it currently gives me for "Wi-Fi" such as WI-fi and Wu-Fi are still marked as an error after selecting them.

Attachment #9312190 - Attachment is obsolete: true

Francesco Lodolo [:flod]

Assignee

Comment 14

2 years ago

Attached file words.txt (obsolete) — Details

Only updating the initial list for now.

Attachment #9311193 - Attachment is obsolete: true

Francesco Lodolo [:flod]

Assignee

Comment 15

2 years ago

After looking at the list of words from Wiktionary, I think it requires some more thinking. Still worth going through it and picking obvious terms, at least as a first step, then iterate.

The list can be narrowed down further by initially removing proper nouns (starting with uppercase letter), but that still leaves 15k+ words. I checked a couple of random words that I've never seen, and I found no results in Google n-gram (American English, 1990-2019). Some other words are already covered in the current dictionary too (the script seems to look at the SCOWL dictionary, not current Mozilla).

One potential approach would be to slowly go through Google n-gram API and add words above a threshold, to cover the most common missing terms.

Side note: the current dictionary is based on SCOWL 60. The list of words could be easily expanded by going to level 70, for example, but IIRC even the maintainer suggested not to do that, because that level would require more scrutiny.

Francesco Lodolo [:flod]

Assignee

Comment 16

2 years ago

Attached file words.txt (obsolete) — Details

Other issues I've noticed with the list from Wiktionary:

British spelling.
Spelling as a single word when the accepted form is hyphenated.
A lot of uncountable nouns with plural form.

I've extracted about 200 words, and every time I go back to the list I remove some when in doubt.

The other thing that stands out is how strict Oxford is vs Merriam-Webster. The dictionary used in macOS seems closer to the former (e.g. "yeet" as a verb is marked as an error, but available in M-W).

Attachment #9312329 - Attachment is obsolete: true

Teal Dulcet [:tdulcet] (TB Council)

Comment 17

2 years ago

(In reply to Francesco Lodolo [:flod] from comment #15)

Some other words are already covered in the current dictionary too (the script seems to look at the SCOWL dictionary, not current Mozilla).

Oh, thanks for catching this. I updated the commands in my script to use the correct Mozilla dictionary and it reduced my previous attachment from 7,519 to 7,429 words. I also added an optional command to remove all words with any uppercase letters, which further reduces the list to 5,437 words (not including the forms).

One potential approach would be to slowly go through Google n-gram API and add words above a threshold, to cover the most common missing terms.

I downloaded the raw Google Ngram version 3 data for 1-grams, but it is extremely noisy with misspellings and British word variants, so I am not sure how useful it would be for creating a word list.

Another idea is that the Wiktionary data also provides detailed information about the senses (meanings) for each word. I am not sure if this is helpful, but here is an updated list which includes a new column with the number of senses for each word. I had to upload it to Google Drive, as the file size is above the BMO limit for attachments. The list is also now sorted by this column, where the words with the most senses (24) are on top. I removed the requirement for a Wikipedia page in this list, as some of the words with many senses did not have any Wikipedia pages listed, although they are still provided if present. There are now 647,101 words total, although only 53K have more than one sense, 11K have more than two, just 4,002 have more than three senses and it continues to decrease exponentially from there...
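A minimal sketch of how such a list could be read, sorted and tallied. It assumes five tab-separated columns (word, number of senses, forms, parts of speech, Wikipedia pages), treats "-" or an empty field as "no forms", and allows the last column to be missing. All function names are hypothetical and not taken from the Gist.

```python
def parse_senses_row(line):
    """Parse one row: word, senses, [forms], parts of speech, [Wikipedia pages]."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (5 - len(fields))  # pad a missing trailing Wikipedia column
    word, senses, forms, pos, pages = fields[:5]
    return {
        "word": word,
        "senses": int(senses),
        "forms": [] if forms in ("", "-") else forms.split(","),
        "pos": pos.split(","),
        "pages": [page.strip("'") for page in pages.split(",") if page],
    }


def sort_by_senses(rows):
    """Most senses first; ties are broken alphabetically by word."""
    return sorted(rows, key=lambda row: (-row["senses"], row["word"]))


def count_with_more_senses_than(rows, threshold):
    """Count rows whose number of senses exceeds the threshold."""
    return sum(1 for row in rows if row["senses"] > threshold)


if __name__ == "__main__":
    lines = [
        "name-drop\t1\tname-drops,name-dropping,name-dropped\tverb",
        "name-dropping\t2\tname-droppings\tnoun,verb\t'name-dropping'",
    ]
    rows = sort_by_senses(parse_senses_row(line) for line in lines)
    print([row["word"] for row in rows], count_with_more_senses_than(rows, 1))
```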

For reference, the TSV file is now formatted like this:

<word>	<number of senses>	[comma separated form(s)]	<comma separated part(s) of speech>	[comma separated Wikipedia page(s)]

(In reply to Francesco Lodolo [:flod] from comment #16)

Spelling as a single word when the accepted form is hyphenated.

I added another optional command to allow hyphenated words, which would add 482 words to my previous attachment (when still allowing uppercase letters), including "Wi-Fi". However, only at most 144 of these new words have components that are not already in the dictionary, so the rest would already be supported.

I've extracted about 200 words

Nice, thanks for doing that! This will be a huge improvement to the Mozilla dictionary.

Francesco Lodolo [:flod]

Assignee

Comment 18

2 years ago

Spelling as a single word when the accepted form is hyphenated.

I added another optional command to allow hyphenated words, which would add 482 words to my previous attachment (when still allowing uppercase letters), including "Wi-Fi". However, only at most 144 of these new words have components that are not already in the dictionary, so the rest would already be supported.

To be clear, the problem is that Wiktionary considers valid a compound noun, when dictionaries only include separate words (e.g. wirehouse vs wire house) or hyphenated form (namedropping vs name-dropping).

In some cases there are things that I would call plain errors (bicepses or buffalos are not the correct plural form for biceps and buffalo).

As mentioned, I plan to periodically go back to this list and try to extract "safe" words, but there are still a lot of manual checks needed to avoid introducing mistakes.

Teal Dulcet [:tdulcet] (TB Council)

Comment 19

2 years ago

Attached file Wiktionary words.tsv — Details

(In reply to Francesco Lodolo [:flod] from comment #18)

To be clear, the problem is that Wiktionary considers valid a compound noun, when dictionaries only include separate words (e.g. wirehouse vs wire house) or hyphenated form (namedropping vs name-dropping).

For many of these, Wiktionary actually does include both forms, but because I removed all words with non-alphabetic characters, they are not on any of the above lists:

$ grep 'name-\?dropping' wiktionary.tsv
name-drop       1       name-drops,name-dropping,name-dropped   verb
name-dropping   2       name-droppings  noun,verb       'name-dropping'
name-droppings  1       -       noun
namedrop        1       namedropped,namedropping,namedrops      verb
namedropping    2       namedroppings   noun,verb
namedroppings   1       -       noun

I attached an updated (and hopefully my final Wiktionary) list, which expands on my second list from comment 13 by including these hyphenated words. The file has 7,894 words (479 are hyphenated) and is in the format of my third list from comment 17, but I reinstated the Wikipedia page requirement and it is again sorted/alphabetized by the word column. This list includes name-dropping (and "Wi-Fi"), but not namedropping because it does not have a Wikipedia page listed.

Because I was previously using the wrong Mozilla dictionary, I was also incorrect in comment 13 when I said there were no hyphenated words. In fact, there are currently 23 hyphenated words, including most notably "add-ons". There are also 27,263 words with at least one uppercase letter.

Note that the Microsoft en-US dictionary does accept both namedropping and buffalos.

As mentioned, I plan to periodically go back to this list and try to extract "safe" words, but there are still a lot of manual checks needed to avoid introducing mistakes.

Thanks again for doing this! Your time is greatly appreciated by the millions of Firefox/Thunderbird users.

I am sure the quality of the Wiktionary data will also improve over time, especially when more organizations like Mozilla start using it.

Attachment #9312319 - Attachment is obsolete: true

Loren Pechtel

Comment 20

2 years ago

Kyiv -- Updated spelling of the capital of Ukraine. Kiev should probably not be removed.

Loren Pechtel

Comment 21

2 years ago

fissionables -- while "fissionable" is not normally a noun, the plural form normally refers to elements capable of sustaining fission reactions--uranium and plutonium.

Francesco Lodolo [:flod]

Assignee

Comment 22

2 years ago

Attached file words.txt (obsolete) — Details

Attachment #9312347 - Attachment is obsolete: true

Dave Townsend [:mossop]

Comment 23

2 years ago

anymore
unrideable

Francesco Lodolo [:flod]

Assignee

Comment 24

2 years ago

(In reply to Dave Townsend [:mossop] from comment #23)

anymore
unrideable

anymore should be working (works for me)
https://searchfox.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/en-US.dic#15097

Adding unrideable
https://www.merriam-webster.com/dictionary/unrideable

Loren Pechtel

Comment 25

2 years ago

unattenuated -- something which has not been attenuated or reduced. https://www.merriam-webster.com/dictionary/unattenuated

Francesco Lodolo [:flod]

Assignee

Comment 26

2 years ago

Attached file words.txt — Details

Going to add a patch for this batch of words, and move the alias over to a new bug.

Attachment #9312607 - Attachment is obsolete: true

Francesco Lodolo [:flod]

Assignee

Updated

2 years ago

Alias: enus-dictionary

Francesco Lodolo [:flod]

Assignee

Comment 27

2 years ago

Attached file Bug 1808872 - Add new words to en-US dictionary, r=bolsson! — Details

Phabricator Automation

Updated

2 years ago

Assignee: nobody → francesco.lodolo

Status: NEW → ASSIGNED

Francesco Lodolo [:flod]

Assignee

Updated

2 years ago

Attachment #9312471 - Attachment is obsolete: true

Francesco Lodolo [:flod]

Assignee

Updated

2 years ago

Attachment #9312471 - Attachment is obsolete: false

Teal Dulcet [:tdulcet] (TB Council)

Comment 28

2 years ago

Attached file Google Ngram words.tsv — Details

(In reply to Teal Dulcet [:tdulcet] from comment #17)

(In reply to Francesco Lodolo [:flod] from comment #15)

One potential approach would be to slowly go through Google n-gram API and add words above a threshold, to cover the most common missing terms.

I downloaded the raw Google Ngram version 3 data for 1-grams, but it is extremely noisy with misspellings and British word variants, so I am not sure how useful it would be for creating a word list.

Before this bug is closed, as suggested by Francesco in comment 15, I attached a list of the top 100,000 "words" from the Google Ngram American English 1-grams data that are not currently in the Mozilla en-US dictionary. Specifically, this is the top 100K from my resulting full list of 8,590,525 "words" sorted based on their number of occurrences/matches, where the words with the most occurrences are on top. However, as I mentioned above, the vast majority of these "words" are obviously from misspellings, OCR errors, British word variants, Unicode homoglyphs or other errors, so significant care would likely need to be taken before considering any of them for inclusion. Google Ngram also has a different definition for a word than Mozilla (see this note from their FAQ):

In English, contractions become two words (they're becomes the bigram they 're, we'll becomes we 'll, and so on). The possessive 's is also split off, but R'n'B remains one token. Negations (n't) are normalized so that don't becomes do not.

As I mentioned above, I downloaded the Google Ngram version 3 data for American English 1-grams. I then filtered the raw TSV data by removing all words without a valid part of speech category and those with any non-alphabetic characters (except for the apostrophe and hyphen). I also combined the data for the various casing forms and summed the counts for each year. The casing for the Google Ngram data is rather arbitrary, so I had to convert both this and the Mozilla lists to lowercase when comparing them, otherwise it would produce a lot of words in title and other casing styles that are currently lowercase in the Mozilla dictionary.

The attached TSV (tab-separated values) file is formatted like this:

<word>	<ranking>	[comma separated casing form(s)]	<comma separated part(s) of speech>	<occurrences count>	<books count>

For example (from comment 3):

bouldering	303769	bouldering,Bouldering,BOULDERING	ADJ,NOUN,VERB	21443	6087

This means "bouldering" was the 303,769th most frequent word (before removing words already in the Mozilla dictionary) and it occurred 21,443 times, in 6,087 different books. The casing forms are also sorted by occurrence, so the all-lowercase form of "bouldering" on the left occurred most often. The word column is lowercase if one of the casing forms is all lowercase, otherwise it is the form with the most occurrences. (Note that this example is actually number 163,893 in the resulting list, so it did not make the top 100K cut for the attached file, although "bouldery" at number 76,207 did.)
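The casing rules described above can be sketched as below. This is an illustrative reconstruction, not the code from the Gist. It takes hypothetical `(token, year, occurrences)` records, sums occurrences only (book counts and part-of-speech tags are left out), and the counts in the usage example are toy values rather than real Ngram data.

```python
from collections import defaultdict


def combine_casing_forms(records):
    """Merge casing variants of each token and sum their occurrences over all years.

    Returns {word: (forms sorted by occurrences, total occurrences)}.
    """
    groups = defaultdict(dict)
    for token, _year, occurrences in records:
        forms = groups[token.lower()]
        forms[token] = forms.get(token, 0) + occurrences
    combined = {}
    for lowered, forms in groups.items():
        # Most frequent casing form first; ties are broken alphabetically.
        ordered = sorted(forms, key=lambda form: (-forms[form], form))
        # Use the lowercase form as the word when it was observed,
        # otherwise fall back to the most frequent casing form.
        word = lowered if lowered in forms else ordered[0]
        combined[word] = (ordered, sum(forms.values()))
    return combined


if __name__ == "__main__":
    toy_records = [  # toy counts, not real data
        ("bouldering", 2018, 3), ("bouldering", 2019, 2),
        ("Bouldering", 2019, 3), ("BOULDERING", 2019, 1),
        ("Kyiv", 2019, 9), ("KYIV", 2019, 1),
    ]
    print(combine_casing_forms(toy_records))
```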

I believe my Wiktionary word list in comment 19 (attachment 9312471 [details]) is a much better source of potential words to consider, as it has significantly fewer errors, but this list could still be useful... One idea would be to look at words that are on both the Wiktionary list and rank high on this list. Of the 137,647 words (131,674 unique) currently in the Mozilla dictionary, only 231 of them or 0.175% are not in this list, so while it may be noisy, it is also very complete.
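One way to combine the two lists, sketched under the assumption that the Ngram ranks are available as a mapping from lowercase word to rank. The function name, the cutoff parameter and the toy words are all hypothetical.

```python
def shortlist(wiktionary_words, ngram_ranks, max_rank):
    """Return Wiktionary candidates ranked at or better than max_rank, best rank first."""
    ranked = []
    for word in wiktionary_words:
        # The Ngram side is compared in lowercase, as described for that list.
        rank = ngram_ranks.get(word.lower())
        if rank is not None and rank <= max_rank:
            ranked.append((rank, word))
    return [word for _rank, word in sorted(ranked)]


if __name__ == "__main__":
    toy_ranks = {"wordone": 10, "wordtwo": 500, "wordthree": 90}  # placeholder data
    print(shortlist(["wordtwo", "wordone", "missing", "Wordthree"], toy_ranks, 100))
```

A lower cutoff gives a shorter list that is easier to review by hand, at the cost of skipping rarer but valid words.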

I created a new Gist for the script I used to generate this here: https://gist.github.com/tdulcet/b54041bbe532341617099bf1d26af093. On Linux, just run the commands listed at the top of the script (on lines 7-24) to reproduce the attached list. As before, it just requires that Hunspell and Python 3 are installed. Note that the raw data is around 27 GiB uncompressed (compared to only 1.6 GiB for Wiktionary). While my script of course does not store most of that, I would still recommend at least 32 GiB of RAM.

Teal Dulcet [:tdulcet] (TB Council)

Comment 29

2 years ago

Attached file Chromium words.txt — Details

(In reply to Francesco Lodolo [:flod] from comment #12)

the macOS dictionary, but maybe a start is to include at least those that are considered OK, since clearly they have been reviewed and approved for that dictionary.

I am not sure if this belongs here or in the new bug, but I compared the Chromium/Chrome and Mozilla en-US dictionaries and attached a list of the 412 words that were in the former, but not in the latter. This includes a few words like "Kyiv" (comment 20) that were suggested above. It also includes several words that are already in both dictionaries, but with different casing. Since all of these words have already been reviewed by Google, I suspect based on comment 12 that most of them could be included in the Mozilla dictionary without any controversy. There are a few words that appear to be misspellings, but presumably Chromium had a reason for including them.

If anyone wants to reproduce the attached list, just run these commands on Linux:

wget https://hg.mozilla.org/mozilla-central/raw-file/tip/extensions/spellcheck/locales/en-US/hunspell/en-US.aff
wget https://hg.mozilla.org/mozilla-central/raw-file/tip/extensions/spellcheck/locales/en-US/hunspell/en-US.dic
unmunch en-US.dic en-US.aff > mozilla.txt
# Convert 'mozilla.txt' to UTF-8

# Download en_US.dic, en_US.dic_delta and en_US.aff from: https://source.chromium.org/chromium/chromium/src/+/main:third_party/hunspell_dictionaries/
unmunch <(cat en_US.dic en_US.dic_delta) en_US.aff > chromium.txt
# Convert 'chromium.txt' to UTF-8

comm -13 <(sort -u mozilla.txt) <(sort -u chromium.txt) > 'Chromium words.txt'

This just requires that Hunspell is installed. Chromium of course also uses SCOWL 60, so almost all of these new words are from their en_US.dic_delta file, which seems to serve the same purpose as Mozilla's 5-mozilla-added.txt file.
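For anyone without `comm` available, the final comparison step can be approximated in Python. This is a sketch with hypothetical function names, and it assumes both word lists are already UTF-8 with one word per line. Python sorts by code point, so the output order may differ from a locale-aware `sort -u`.

```python
def only_in_second(first_lines, second_lines):
    """Return the sorted unique lines that appear in the second list but not the first."""
    first = {line.rstrip("\n") for line in first_lines}
    second = {line.rstrip("\n") for line in second_lines}
    return sorted(second - first)


def compare_files(first_path, second_path, encoding="utf-8"):
    """File-based wrapper, comparable to: comm -13 <(sort -u first) <(sort -u second)."""
    with open(first_path, encoding=encoding) as first, \
            open(second_path, encoding=encoding) as second:
        return only_in_second(first, second)


if __name__ == "__main__":
    mozilla = ["anymore\n", "Kiev\n"]
    chromium = ["Kyiv\n", "anymore\n", "Kiev\n", "localhost\n"]
    print(only_in_second(mozilla, chromium))
```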

Teal Dulcet [:tdulcet] (TB Council)

Comment 30

2 years ago

Attached file technical words.txt — Details

I also compared the LibreOffice and Mozilla en-US dictionaries, although there were only three words in the former that were not in the latter, which are the same three words that have been purposely excluded by the 5-mozilla-removed.txt file. If anyone wants to reproduce this result, just run the above commands to download the Mozilla dictionary and then run this:

wget https://raw.github.com/LibreOffice/dictionaries/master/en/en_US.aff
wget https://raw.github.com/LibreOffice/dictionaries/master/en/en_US.dic
unmunch en_US.dic en_US.aff > libreoffice.txt
# Convert 'libreoffice.txt' to UTF-8

comm -13 <(sort -u mozilla.txt) <(sort -u libreoffice.txt) > 'LibreOffice words.txt'

However, LibreOffice also includes a separate "Technical" dictionary. I attached a list of the 269 words that were in this, but not in the Mozilla dictionary. The dictionary includes 62 words that end with an equal sign, which I suspect means that the 's form of the word is also supported. To avoid confusion, I stripped all those equal signs from the attached list. This list includes a lot of software and company names that I was surprised were not already in the Mozilla dictionary. It also includes the full lowercase Greek alphabet for some reason, which can be ignored. As with the Chromium words, since all of these words have already been reviewed by the Document Foundation, I suspect based on comment 12 that most of them could be included in the Mozilla dictionary without any controversy.

If anyone wants to reproduce the attached list, just run the above commands to download the Mozilla dictionary and then run this:

wget https://raw.github.com/LibreOffice/core/master/extras/source/wordbook/technical.dic
comm -13 <(sort -u mozilla.txt) <(tail -n +5 technical.dic | tr -d '=' | sort -u) > 'technical words.txt'

Some of these technical words may be good candidates for the 5-mozilla-specific.txt file.

Francesco Lodolo [:flod]

Assignee

Comment 31

2 years ago

To be clear, all the .txt files are generated
https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html#info-about-the-included-scripts

As mentioned before, I'm not sure we should add too many proper nouns.

The Chrome diff is definitely "interesting", from a quick look I couldn't spot anything useful to save (I mean, besides the British spellings and technical jargon like const or nowrap, seleccionar?? That can't be right…)

Geoff Kuenning

Comment 32

2 years ago

One thing to consider when adding words is whether they'll hide typos of more common words. For example, "ort" is a word well known to crossword solvers, but if it is added to the dictionary it will hide transpositions of "rot", rollover additions of t to "or", and missing letters from "fort", "port", "sort", "tort", and "wort". (And one can argue that "wort" shouldn't be there, either.)
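This check can be partly automated. The sketch below lists the common words that are a single typo (deletion, transposition, substitution or insertion) away from a candidate. Substitution is included for completeness even though it is not one of the cases named above, the function names are hypothetical, and the choice of which words count as "common" is left to the caller.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """All strings one deletion, transposition, substitution or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


def words_masked_by(candidate, common_words):
    """Common words for which a single typo could produce the candidate."""
    return sorted(single_edit_variants(candidate) & set(common_words))


if __name__ == "__main__":
    common = {"rot", "or", "fort", "port", "sort", "tort", "wort", "cat"}
    print(words_masked_by("ort", common))
```

A long result for a rare candidate suggests that adding it would hide more typos than it would fix.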

Major proper nouns should be included, especially if they are common or difficult to spell. Thus, Samuel and Kazakhstan should be present. But uncommon spellings of proper names should arguably be omitted.

As far as validating a large list of words, dividing it into small pieces and plugging away actually works. When I validated the ispell dictionaries, I did several hundred words a night (or more; I really don't remember), using a paper dictionary to look up anything that I was even slightly unsure of. It took me months, but was well worth it; I found a lot of errors. The task would obviously go much faster if there were multiple volunteers who could divvy the job up in parallel.

Teal Dulcet [:tdulcet] (TB Council)

Comment 33

2 years ago

Attached file Ispell words.txt — Details

(In reply to Francesco Lodolo [:flod] from comment #31)

As mentioned before, I'm not sure we should add too many proper nouns.

Yes, although as I mentioned in comment 19, there are already 27,263 words with at least one uppercase letter in the Mozilla dictionary and almost all of them are proper nouns. For consistency, it would probably be best to include all of the common and notable ones, such as most of those listed in comment 30 (attachment 9313542 [details]).

The Chrome diff is definitely "interesting", from a quick look I couldn't spot anything useful to save (I mean, besides the British spellings and technical jargon like const or nowrap, seleccionar?? That can't be right…)

Yeah, "seleccionar" seems to be a Spanish word for "to select"; I am not sure why Google has it in their English dictionary. However, a quick look at the list reveals several potentially useful words to add, such as Ctrl, Reddit, init, localhost, mailto, IETF, etc., as well as other names of places like "Kyiv"...

(In reply to Geoff Kuenning from comment #32)

When I validated the ispell dictionaries, I did several hundred words a night (or more; I really don't remember), using a paper dictionary to look up anything that I was even slightly unsure of.

Thank you Geoff for this comment and for creating those dictionaries. I compared your Ispell small and medium American English dictionaries to the Mozilla en-US dictionary and attached a list of the 4,506 words that were in the former, but not in the latter. Even without the proper nouns and acronyms, there are still 3,426 words to consider. Compared to the many other word lists I have checked, this looks to be very high quality and I am surprised most of the words are not already in the Mozilla dictionary. Since Geoff has already manually checked all of these words, I suspect that most if not all of them could be included in the Mozilla dictionary.

If anyone wants to reproduce the attached list, just run the above commands in comment 29 to download the Mozilla dictionary and then run this (Bash syntax):

wget https://www.cs.hmc.edu/~geoff/tars/ispell-3.4.05.tar.gz
tar -xzvf ispell-3.4.05.tar.gz
unmunch <(cat ispell-3.4.05/languages/english/{american.[01],english.[01]}) ispell-3.4.05/languages/english/english.aff > ispell.txt

comm -13 <(sort -u mozilla.txt) <(sort -u ispell.txt) > 'Ispell words.txt'

As before, this just requires that Hunspell is installed. I thought this comment in the Ispell README was pertinent (lines 112-116):

The English-language dictionary comes in four sizes: small, medium, large, and extra-large. ... The small and medium dictionaries have been hand-checked against a paper dictionary to improve their accuracy.

Pulsebot

Comment 34

2 years ago

Pushed by flodolo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1947e23c9db9 Add new words to en-US dictionary, r=bolsson

Geoff Kuenning

Comment 35

2 years ago

Two things. First, unmunch comes from hunspell and I'm not sure it will give correct results for an ispell affix file. Here's an alternative that I know works: cat ispell-3.4.05/languages/english/{american.[01],english.[01]} | ispell -e | tr ' ' '\012' | sort -u

(I should really write a one-line script that automates this; I use it surprisingly often.)

Second, FWIW I have figured out a way to at least semi-automate checking the larger ispell dictionaries so that the workload of validating them would be manageable, but I won't have time to do it until after I retire...which fortunately will be this summer.

Cristian Tuns

Comment 36

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1947e23c9db9

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox111: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 111 Branch

words.txt 2 years ago Francesco Lodolo [:flod] 147 bytes, text/plain
Wiktionary words.tsv 2 years ago Teal Dulcet [:tdulcet] (TB Council) 737.37 KB, text/plain
Wiktionary words.tsv 2 years ago Teal Dulcet [:tdulcet] (TB Council) 344.28 KB, text/plain
words.txt 2 years ago Francesco Lodolo [:flod] 167 bytes, text/plain
words.txt 2 years ago Francesco Lodolo [:flod] 1.77 KB, text/plain
Wiktionary words.tsv 2 years ago Teal Dulcet [:tdulcet] (TB Council) 380.81 KB, text/plain
words.txt 2 years ago Francesco Lodolo [:flod] 1.79 KB, text/plain
words.txt 2 years ago Francesco Lodolo [:flod] 1.81 KB, text/plain
Bug 1808872 - Add new words to en-US dictionary, r=bolsson! 2 years ago Francesco Lodolo [:flod] 48 bytes, text/x-phabricator-request
Google Ngram words.tsv 2 years ago Teal Dulcet [:tdulcet] (TB Council) 7.44 MB, text/plain
Chromium words.txt 2 years ago Teal Dulcet [:tdulcet] (TB Council) 3.67 KB, text/plain
technical words.txt 2 years ago Teal Dulcet [:tdulcet] (TB Council) 1.89 KB, text/plain
Ispell words.txt 2 years ago Teal Dulcet [:tdulcet] (TB Council) 44.49 KB, text/plain