Closed Bug 1806793 Opened 1 year ago Closed 1 year ago

Update en-US dictionary documentation and scripts

Categories

(Core :: Spelling Checker: en-US Dictionary, task)

RESOLVED FIXED
110 Branch
Tracking Status
firefox110 --- fixed

People

(Reporter: flod, Assigned: flod)

Attachments

(5 files)

There are a few issues to address before actually updating the dictionary to the most recent version of SCOWL:

  • The edit-dictionary script doesn't work for me, as it breaks the encoding. Converting to UTF-8 before sorting seems to work.
  • There's a documentation page which is outdated, and the main scripts should be documented.
  • The dictionaries use ISO-8859-1, which means Phabricator treats them as binary and won't show a diff. It makes sense to me to keep a copy of the UTF-8 version in the tree, so that at least there's a diff for those (they won't be packaged, and don't take up much space).
  • The 5-* files are also in ISO-8859-1 without good reason, since they're not used to generate the actual dictionaries. It makes sense to have them in UTF-8.

The 5-mozilla-* files are a by-product of the dictionary generation and are not used to generate the dictionaries, so we can safely use UTF-8 encoding.

Also, since they are wordlists, it makes sense to use the .txt extension.

This works around the limitation in Phabricator, where ISO-8859-1 files are seen as binary.
Files have been converted from the existing dictionary using iconv, and the SET directive in the affix file has been manually updated to UTF-8.
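
For illustration only, here is a minimal Python sketch of the same kind of conversion. The actual change used iconv plus a manual edit; the function name and the sample content below are hypothetical and not taken from the tree.

```python
import re


def convert_affix_to_utf8(latin1_bytes: bytes) -> bytes:
    """Re-encode an ISO-8859-1 affix file as UTF-8 and update its SET line.

    Hypothetical helper: the real conversion was done with iconv and a
    manual edit of the affix file.
    """
    text = latin1_bytes.decode("iso-8859-1")
    # Hunspell declares the file encoding with a "SET <encoding>" line.
    text = re.sub(r"^SET[ \t]+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    return text.encode("utf-8")


# Made-up sample content standing in for an affix file.
sample = "SET ISO8859-1\n# café\n".encode("iso-8859-1")
converted = convert_affix_to_utf8(sample)
```

The SET line has to change together with the bytes; otherwise Hunspell would read the UTF-8 file as ISO-8859-1.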

Depends on D165302

edit-dictionary:

  • Convert to UTF-8 before editing, and back to ISO-8859-1 before saving
  • Place a copy of the UTF-8 dictionary inside the utf8 folder, and store the ISO-8859-1 version in place
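
A rough sketch of that round trip in Python, assuming a dictionary file named en-US.dic and a sibling utf8 folder. The function names, the temporary paths and the sample words are hypothetical; the real edit-dictionary script may work differently.

```python
import tempfile
from pathlib import Path


def load_as_utf8(dic_path: Path) -> str:
    """Read the ISO-8859-1 dictionary and return it as a Python string."""
    return dic_path.read_bytes().decode("iso-8859-1")


def save_both(text: str, dic_path: Path, utf8_dir: Path) -> None:
    """Write a UTF-8 copy into utf8_dir and the ISO-8859-1 version in place."""
    utf8_dir.mkdir(parents=True, exist_ok=True)
    (utf8_dir / dic_path.name).write_bytes(text.encode("utf-8"))
    # Raises UnicodeEncodeError if a word cannot be represented in ISO-8859-1.
    dic_path.write_bytes(text.encode("iso-8859-1"))


with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "en-US.dic"
    # Made-up sample: a word count line followed by two entries.
    dic.write_bytes("2\ncafé\nnaïve\n".encode("iso-8859-1"))

    text = load_as_utf8(dic)
    # Simulate an edit: add one word and bump the count on the first line.
    text = text.replace("2\n", "3\n", 1) + "résumé\n"
    save_both(text, dic, Path(tmp) / "utf8")

    latin1_copy = dic.read_bytes()
    utf8_copy = (Path(tmp) / "utf8" / "en-US.dic").read_bytes()
```

Both files hold the same text and differ only in their byte encoding, which is what lets the UTF-8 copy produce a readable diff.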

make-new-dict:

  • Use .txt extension for support wordlists, and place them in a subfolder
  • Exclude words in mozilla-exclusions.txt from the generated dictionary
  • Save 5-mozilla-*.txt files as UTF-8
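
The exclusion step could be sketched in Python as below. This assumes one word per line in mozilla-exclusions.txt and exact, case-sensitive matching; both are assumptions, and the function name and sample words are hypothetical.

```python
def exclude_words(words, exclusions):
    """Return words with every entry listed in exclusions removed.

    Assumes exact, case-sensitive matches and one word per exclusion line;
    the real make-new-dict script may match differently.
    """
    excluded = {line.strip() for line in exclusions if line.strip()}
    return [word for word in words if word not in excluded]


# Made-up sample data standing in for the generated wordlist and the
# lines read from mozilla-exclusions.txt.
generated = ["apple", "banana", "cherry"]
exclusion_lines = ["banana\n", "\n"]
filtered = exclude_words(generated, exclusion_lines)
```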

Depends on D165304

Pushed by flodolo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ce107857320c
en-US dictionary: use .txt extensions for 5-mozilla-* files, convert to utf-8 encoding, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/84651f241b19
Keep a utf-8 encoded version of the dictionary files in the tree, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/df27be19a593
Update documentation on how to manage en-US dictionary, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/ff83cea010d8
Update scripts for en-US dictionary, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/8cf460979ce3
Run scripts using the same dataset (SCOWL scowl-2019.10.06), r=sylvestre
Blocks: 1686285
Component: Spelling checker → Spelling Checker: en-US Dictionary