Closed Bug 1811451 Opened 2 years ago Closed 8 months ago

Track new words and corrections to en-US dictionary

Categories

(Core :: Spelling Checker: en-US Dictionary, enhancement)

RESOLVED FIXED
128 Branch
Tracking Status
firefox128 --- fixed

People

(Reporter: flod, Assigned: flod)

Details

Attachments

(2 files, 1 obsolete file)

Using this bug to track and discuss requests for new words in the Mozilla en-US dictionary.

  • Try to provide information on the terms you want to add, in particular references to external sources that confirm the usage of the term (e.g. Merriam-Webster or Oxford online dictionaries).
  • Include all possible forms, e.g. plural and genitive for nouns, different tenses for verbs.

A list of terms extracted from Wiktionary and ispell is also available in bug 1808872, and can be used as a reference for new words.

There is a large list of words in the previous bug, but I'd like to focus on everyday cases first.

Here's one way to help:

  • Use Nightly, so you have the latest version of the dictionary (the dictionary from bug 1808872 should be in tomorrow's Nightly)
  • Check your personal dictionary, i.e. misspelled words that you have added to the dictionary over time. Open about:profiles, identify your current profile and click the button to open the root folder. Inside the profile folder, there is a persdict.dat file (that's a text file with the words you added).
  • Make a copy of persdict.dat and empty the original file, so you will rely only on the built-in dictionary (without your exceptions).
  • Open the copy in a text editor, and copy the list of words to the clipboard.
  • Paste the list in a comment. Then remove words that are not relevant (overly specific, known misspelling, embarrassing topics, etc.), and the ones that are not marked as an error by the dictionary. If you can find definitions online (e.g. Oxford or Merriam-Webster) to confirm that spelling, even better.
  • Restore the original persdict.dat (or an edited version without the words that are now covered by the updated built-in dictionary).
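
The list-preparation part of the steps above can be partly scripted. This is only a minimal sketch, assuming the copy of persdict.dat is a plain text file with one word per line, as described above. The function name `load_personal_words` and the file name `persdict-copy.dat` are hypothetical, not part of Firefox.

```python
from pathlib import Path


def load_personal_words(path):
    """Return the sorted, de-duplicated words from a persdict.dat copy."""
    text = Path(path).read_text(encoding="utf-8")
    # Assumption: one word per line; blank lines are skipped.
    words = {line.strip() for line in text.splitlines() if line.strip()}
    return sorted(words)


if __name__ == "__main__":
    # "persdict-copy.dat" is a hypothetical name for the copy made earlier.
    copy = Path("persdict-copy.dat")
    if copy.exists():
        print("\n".join(load_personal_words(copy)))
```

The output still needs the manual review described above (removing irrelevant words and the ones the built-in dictionary already accepts) before it is pasted in a comment.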
Attached file Ispell words 2.txt

[BMO unfortunately no longer allows users without special permission to comment on closed bugs, so I will respond to Geoff from bug 1808872 comment 35 here:]

First, unmunch comes from hunspell and I'm not sure it will give correct results for an ispell affix file. Here's an alternative that I know works

Thanks for catching this. I used your correct command to generate the Ispell dictionary and after comparing that result again with the Mozilla dictionary, it increased the size of the list from 4,506 words to 13,997 words. Without the proper nouns and acronyms, there are now 12,507 words to consider.
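
For readers who want to reproduce this kind of comparison, here is a rough sketch of the idea in Python. It is not the script I actually used: the function names are hypothetical, both inputs are assumed to be plain lists of already expanded words, and treating "starts with an uppercase letter" as "proper noun or acronym" is a simplifying assumption.

```python
def new_words(ispell_words, mozilla_words):
    """Words present in the Ispell list but absent from the Mozilla list."""
    known = set(mozilla_words)
    return sorted(set(ispell_words) - known)


def drop_capitalized(words):
    """Drop likely proper nouns and acronyms.

    Assumption: anything starting with an uppercase letter is a
    proper noun or an acronym.
    """
    return [w for w in words if not w[:1].isupper()]


if __name__ == "__main__":
    candidates = new_words(["oxytocin", "oxygen", "NASA"], ["oxygen"])
    print(candidates)
    print(drop_capitalized(candidates))
```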

Second, FWIW I have figured out a way to at least semi-automate checking the larger ispell dictionaries so that the workload of validating them would be manageable, but I won't have time to do it until after I retire...which fortunately will be this summer.

Nice, I (and I am sure many others) will look forward to that... I know people have previously tried using the dict command to do this, but the open source dictionaries available for it seem to be very dated and low quality. The best that I have been able to find so far is the Wiktionary data (see bug 1808872 comment 11, 13, 17, 19), although it does not (yet) differentiate American/British/Australian spellings.


Francesco - Thank you again for all your work on the Mozilla dictionary over the last three weeks. You have already added a couple of hundred missing words, which will be a big improvement for Firefox/Thunderbird users.

(In reply to Francesco Lodolo [:flod] from comment #1)

There is a large list of words in the previous bug

For your and everyone's convenience, here is a brief summary of those word lists for consideration ordered from high to low quality:

  1. Ispell small and medium American English dictionaries - 13,997 words, see attached above.
  2. Wiktionary English dictionary - 7,894 words (not including forms) with at least one listed Wikipedia page, see bug 1808872 comment 19 (attachment 9312471 [details])
  3. LibreOffice Technical dictionary - 269 words, see bug 1808872 comment 30 (attachment 9313542 [details])
  4. Chromium/Chrome en-US dictionary - 412 words, see bug 1808872 comment 29 (attachment 9313541 [details])
  5. Google Ngram American English 1-grams data - top 100,000 words, see bug 1808872 comment 28 (attachment 9313271 [details])

Note that there is a lot of overlap between these lists and they of course still include some of the words just added in bug 1808872.

Oxytocin is in fact a word. I used the regex dictionary, and looking just at words prefixed by "oxy", I got all of these which Firefox considers misspellings, after removing prefixes and hyphenated words. Surprisingly, oxyacetylene is recognized.

carboxyhemoglobin
carboxyl
carboxylase
carboxylation
carboxymethylcellulose
carboxypeptidase
decarboxylase
decarboxylation
deoxycorticosterone
deoxygenate
deoxyribonuclease
deoxyribonucleotide
deoxyribose
dicarboxylic
dideoxyinosine
dihydroxyphenylalanine
doxycycline
ethoxyl
hematoxylin
hydroxy
hydroxyapatite
hydroxyl
hydroxylamine
hydroxylate
hydroxyurea
methoxychlor
oxyacid
oxycephaly
oxycodone
oxygenase
oxygenise
oxyhemoglobin
oxyhydrogen
oxymetazoline
oxysulfide
oxytetracycline
oxytocia
oxytocic
oxytocin
oxytone
oxyuriasis
paroxytone
phenoxybenzamine
propoxyphene
protoxylem
pyroxylin
tetrahydroxy
tricarboxylic

A lot of these are technical, of course, but they show up not just in the source of this site but in American Heritage as well. Ispell would be a great resource.

I don't know which version you tested that list in, but all the words of common use are already covered for me:

boxy
epoxy
foxy
heterodoxy
orthodoxy
oxygen
oxygenate
oxymoron
paroxysm
proxy

Ignore the first comment. I made it by mistake. Those are all valid, yes, which is why I removed them from the second list of only invalid words.

Decarboxylation in particular is valid on iOS but not FF.

Attachment #9317198 - Attachment is obsolete: true

From https://mastodon.online/@billyjoebowers/109895989109439040

Please add:

chorizo
habanero

Both are food items and can be found in the en_US-large SCOWL list: http://app.aspell.net/lookup?dict=en_US;words=chorizo%0D%0Ahabanero%0D%0A

Request to add words:

aggress
aggressed
aggressing
aggresses

Verb, source: https://www.merriam-webster.com/dictionary/aggress

intransitive verb
: to commit aggression : act aggressively

Cholla is being "corrected" to cholera.

https://www.merriam-webster.com/dictionary/cholla

(Which gives no hint of its true evil.)

[In response to Francesco from bug 1808872 comment 16]

Other issues I've noticed with the list from Wiktionary:

  • British spelling.

I took a second try at creating a wordlist from Wiktionary. While it is not (yet) possible to differentiate the American and British spellings directly, the senses (definitions) for most words in the Wiktionary data are marked with various tags. There are hundreds of these tags used, but I was able to update my script to skip words where all the senses include either the UK, Britain, British, Commonwealth or England tags and do not include the US tag. This of course is not a perfect solution, as some words are not properly/consistently tagged, but this seems to eliminate the vast majority of those British spellings. I also included the Australia/Australian and Canada/Canadian tags to eliminate most of the Australian and Canadian spellings respectively as well. In addition, I removed all words with the obsolete, archaic, misspelling or nonstandard tags, to hopefully improve the quality of the wordlist.
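
The tag filter can be illustrated roughly as follows. This is a simplified sketch, not the actual script: the data layout (one set of tags per sense) and the names `keep_word` and `is_regional_only` are assumptions made for illustration. The comment above also does not say whether a single obsolete/archaic sense is enough to drop a word, so the sketch assumes a word is dropped only when every sense carries such a tag.

```python
# Tag names are the ones listed in the comment above.
REGIONAL_TAGS = {
    "UK", "Britain", "British", "Commonwealth", "England",
    "Australia", "Australian", "Canada", "Canadian",
}
LOW_QUALITY_TAGS = {"obsolete", "archaic", "misspelling", "nonstandard"}


def is_regional_only(sense_tags):
    """True if a sense has a regional tag and no US tag."""
    return bool(sense_tags & REGIONAL_TAGS) and "US" not in sense_tags


def keep_word(senses):
    """Decide whether a word stays in the wordlist.

    `senses` is a list of tag sets, one set per sense (assumed layout).
    """
    if not senses:
        return True  # nothing to judge by, so keep the word
    if all(is_regional_only(tags) for tags in senses):
        return False
    # Assumption: drop only when every sense is low quality.
    if all(tags & LOW_QUALITY_TAGS for tags in senses):
        return False
    return True
```

For example, a word whose only sense is tagged UK is skipped, while a word with one UK sense and one untagged sense is kept.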

Instead of attaching the resulting wordlist to this bug as I did before, I decided to create a simple webpage for it, so that the data could be displayed in an HTML table. The words are linked to their respective Wiktionary pages and similarly, the Wikipedia page titles are linked to the respective Wikipedia pages, so users can now just click on the links to view the references. In addition, users can now select if they want to hide the acronyms or proper nouns and the table will automatically update accordingly. Users can also select if they want to show words without any Wikipedia pages. Hopefully this will make it easier for people to review and systematically find those words that should be included in the Mozilla dictionary. With the default options, there are now a total of 7,699 words for consideration, although the full list has 666,035 words (mostly words without any Wikipedia pages). To reduce memory usage and improve the load time, it will only show a maximum of 50K words but there is an option to change this for users who want to see the full list.

This page can be viewed at: https://tdulcet.github.io/Missing-Words/. The page itself has much more information about the data and how it was generated. Feedback is welcome! Note that the page may take a few seconds or more to fully load, as it has to download the full 15.5 MiB wordlist and process it to generate the table. For anyone who preferred the TSV file, it can still be downloaded here in the same format as before (see bug 1808872 comment 17): Wiktionary words.tsv. The code for this page and the scripts to generate the data are on GitHub: https://github.com/tdulcet/Missing-Words. I licensed everything MPL so that Mozilla could possibly absorb or incorporate it in the future. I also set up a CI service to automatically run my scripts to update this data every month, to reflect changes made to the Mozilla dictionary and to Wiktionary. I am considering setting this up to update those other wordlists from comment 2, but Wiktionary seemed like the priority, as it is of course updated far more frequently than those other dictionaries.

[In response to Francesco from bug 1808872 comment 16]

Other issues I've noticed with the list from Wiktionary:
[...]

  • Spelling as a single word when the accepted form is hyphenated.

I updated my scripts to attempt to resolve this issue. They now normalize both the Mozilla and Wiktionary wordlists before comparing them by removing any non-alphanumeric characters, including hyphens and other symbols, and converting the words to lowercase. This will resolve cases like you described where a hyphenated word is already in the Mozilla dictionary, but a non-hyphenated form is in Wiktionary. This will also resolve cases where a word is in Wiktionary with different capitalization or casing than the Mozilla dictionary. In all these cases, those words will no longer be listed on the page or in the resulting TSV file. I believe this should now resolve all of the issues you listed.
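
In rough terms, the normalization and comparison look like the sketch below. It is a simplified illustration rather than the actual script, and the function names `normalize` and `missing_words` are hypothetical.

```python
def normalize(word):
    """Drop non-alphanumeric characters and lowercase what is left."""
    return "".join(ch for ch in word if ch.isalnum()).lower()


def missing_words(candidates, dictionary):
    """Candidates whose normalized form is not in the dictionary."""
    known = {normalize(w) for w in dictionary}
    return [w for w in candidates if normalize(w) not in known]


if __name__ == "__main__":
    # "Wi-Fi" is hidden because "WiFi" normalizes to the same string.
    print(missing_words(["Wi-Fi", "laggy"], ["WiFi"]))
```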

Note that this will no longer include words like Wi-Fi from bug 1808872 comment 4 when the wrong form (WiFi) is already in the Mozilla dictionary. The script currently has no way to determine which form is the "accepted" one, so it now assumes that if one of the forms is already in the Mozilla dictionary, that must be the correct form. It also does not consider cases where multiple forms of the word are accepted and should be in the dictionary, but not all of those forms are currently in the dictionary. There are 2,563 such words with multiple forms already in the Mozilla dictionary, mostly ones with different capitalization, but also some with different hyphenation.

Any words that are identical after this normalization are now put on the same row. These are typically the same word, but with different hyphenation or capitalization, but in some cases they may have completely different meanings. You or someone from Mozilla would of course have to make an executive decision as to which form(s) should be included in the Mozilla dictionary. With the default options, there are now a total of 7,192 words for consideration, while the full list has 633,387 words. For reference, the links for the page and TSV file are in comment 11 above. The TSV file now has an additional column with this normalized form of the word, so it is slightly larger at 22 MiB even though it has fewer rows. Feedback is welcome!
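
The row grouping can be pictured with the following sketch. Again, this is a simplified illustration under the same assumptions, not the actual script, and `group_rows` is a hypothetical name.

```python
from collections import defaultdict


def group_rows(words):
    """Group words that share the same normalized form into one row."""
    rows = defaultdict(list)
    for word in words:
        # Same normalization as described above: keep only
        # alphanumeric characters, then lowercase.
        key = "".join(ch for ch in word if ch.isalnum()).lower()
        rows[key].append(word)
    return dict(rows)


if __name__ == "__main__":
    for key, forms in group_rows(["Wi-Fi", "WiFi", "laggy"]).items():
        print(key, forms)
```

Each key corresponds to the extra normalized column in the TSV file, and each list holds the forms that end up on the same row.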

(I was going to submit a new bug about improving the performance and load time of the page, but instead I found an issue with the Firefox Profiler. See Bug 1833147.)

(In reply to 石庭豐 (Seak, Teng-Fong) from comment #13)

Request to add
commentor
commentors

https://www.merriam-webster.com/dictionary/commenter
https://www.collinsdictionary.com/dictionary/english/commentor

As a general rule, it's not a good idea to add variant spellings to a dictionary, for two reasons. First, it means that a person who is inconsistent about how they spell a word will be able to produce documents that have both spellings without realizing that they are doing so. Second, IMHO one of the most important purposes of a spelling checker is to encourage a "standard" version of the language. Without standardization, peepl wood rite neerly unreedabul sentunsez. ;-)

In principle, I agree with you. But in practice, there ARE already variants in this dictionary. Like

  • OK / okay
  • acknowledgement / acknowledgment
  • cancelling / canceling
  • catalogue / catalog
  • dialogue / dialog
  • cypher / cipher
  • grey / gray
  • judgement / judgment
  • routeing / routing
  • whisky / whiskey

So, what are the criteria to accept those variants but not my suggestions?

kleptocracies - plural of kleptocracy, which is already in the dictionary.

Hairstyling. What a hairstylist does to produce a hairstyle. The dictionary has the latter two but considers the former only acceptable as two words.

Captchas, plural of Captcha, which is already in the dictionary.

I note the absence of evil incarnate, aka Cholla (cactus).

desalinator -- the machine that desalinates water. The latter is in the dictionary, the former is not.

From bug 1864046: "PolySpace" should now be spelled "Polyspace".

"Undebatable" and its adverbial form "undebatably". For one source, see https://www.merriam-webster.com/dictionary/undebatable.

"-natured" - Always used as a suffix, like good-natured, sweet-natured. Not sure if our dictionary handles suffixes directly. Microsoft, for example, simply allows the standalone word "natured" and its use as a suffix. See https://dictionary.cambridge.org/us/dictionary/english/natured.

hypotheticals - plural of hypothetical which is in the dictionary. Typically an adjective but it's also used as a noun, omitting the noun that it would otherwise be modifying. Dictionary.com (which shows a plural usage): https://www.dictionary.com/browse/hypothetical

laggy -- I find myself very surprised that Firefox doesn't realize things can be slow. https://www.merriam-webster.com/dictionary/laggy

Klansmen -- plural of Klansman which is in the dictionary.

speciation - the process of a species splitting into multiple species: https://en.wikipedia.org/wiki/Speciation

(In reply to drubino@mozilla.com from comment #24)

"-natured" - Always used as a suffix, like good-natured, sweet-natured. Not sure if our dictionary handles suffixes directly. Microsoft, for example, simply allows the standalone word "natured" and its use as a suffix. See https://dictionary.cambridge.org/us/dictionary/english/natured.

Not that I'm aware.

Note: I'll try to update the dictionary, since this bug has been around for over a year. But the main goal of revising scripts and updating the instructions was to make the process easy for other people to contribute ;-)
https://firefox-source-docs.mozilla.org/extensions/spellcheck/index.html

Assignee: nobody → francesco.lodolo
Status: NEW → ASSIGNED
Pushed by flodolo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4bebf7f19695 Update en-US dictionary with new words, r=bolsson
Alias: enus-dictionary

Removing dependency from bug 1864046, because that doesn't look like a bug in the en-US dictionary.

No longer depends on: 1864046
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → 128 Branch