Update en-US dictionary documentation and scripts
Categories
(Core :: Spelling Checker: en-US Dictionary, task)
Tracking | Status | |
---|---|---|
firefox110 | --- | fixed |
People
(Reporter: flod, Assigned: flod)
Attachments
(5 files)
Starting with a few issues, before trying to actually update the dictionary to the most recent version of SCOWL:
- The edit-dictionary script doesn't work for me, as it breaks encoding. Converting to utf-8 before sorting seems to work.
- There's a documentation page which is outdated, and the main scripts should be documented.
- The dictionaries are using ISO-8859-1, which means Phabricator thinks they're binary and won't show a diff. It makes sense to me to keep a copy of the utf-8 version in tree, at least there's a diff for those (they won't be packaged, and don't take away a ton of space).
- The 5-* files are, again, in ISO-8859-1 without good reason, since they're not used to generate the actual dictionaries. Makes sense to have them in UTF-8.
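To illustrate the first point, here is a minimal Python sketch of sorting a wordlist without breaking its encoding: decode the ISO-8859-1 bytes to text, sort, then encode again. The function name is hypothetical, and it assumes a plain one-word-per-line list; it is not the actual edit-dictionary script.

```python
def sort_wordlist_latin1(raw: bytes) -> bytes:
    """Sort a one-word-per-line ISO-8859-1 wordlist without corrupting it.

    Hypothetical helper: assumes a plain wordlist (no leading word count).
    """
    # Decode first, so accented characters are compared as characters
    # rather than as raw bytes in whatever locale the tool assumes.
    text = raw.decode("iso-8859-1")
    words = [line for line in text.split("\n") if line]
    words.sort()
    # Encode back to the original encoding for the packaged file.
    return ("\n".join(words) + "\n").encode("iso-8859-1")


def utf8_copy(raw: bytes) -> bytes:
    """Return a UTF-8 copy of ISO-8859-1 content, suitable for diffing."""
    return raw.decode("iso-8859-1").encode("utf-8")
```

The UTF-8 copy is only there so review tools can show a textual diff; the ISO-8859-1 bytes remain the ones that would ship.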
Comment 1 • 1 year ago (Assignee)
The 5-mozilla-* files are a by-product of the dictionary generation. They're not used to generate the dictionaries, so we can safely use utf-8 encoding.
Also, since they are wordlists, it makes sense to use the .txt extension.
Comment 2 • 1 year ago (Assignee)
This works around the limitation in Phabricator, where ISO-8859-1 files are seen as binary.
Files have been converted from the existing dictionary using iconv, and the SET line in the affix file has been manually updated to UTF-8.
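As a rough Python equivalent of that conversion (the patch itself used iconv, e.g. `iconv -f iso-8859-1 -t utf-8`, plus a manual edit), the sketch below re-encodes an affix file and rewrites its SET directive in one step. The function name is hypothetical, and it assumes the directive sits at the start of a line and that `SET UTF-8` is the desired spelling.

```python
import re


def affix_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-1 affix data as UTF-8 and update its SET line.

    Hypothetical helper: assumes a directive of the form `SET <name>`
    at the start of a line.
    """
    text = raw.decode("iso-8859-1")
    # Only the first SET directive is rewritten; other lines are untouched.
    text = re.sub(r"^SET\s+\S+", "SET UTF-8", text, count=1, flags=re.MULTILINE)
    return text.encode("utf-8")
```

Updating the SET line together with the bytes avoids ending up with a file whose declared encoding disagrees with its actual one.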
Depends on D165302
Comment 3 • 1 year ago (Assignee)
Depends on D165303
Comment 4 • 1 year ago (Assignee)
edit-dictionary:
- Convert to utf-8 before editing, and back to iso-8859-1 before saving
- Place a copy of the utf-8 dictionary inside the utf8 folder, and store the iso-8859-1 in place
make-new-dict:
- Use .txt extension for support wordlists, and place them in a subfolder
- Exclude words in mozilla-exclusions.txt from the generated dictionary
- Save 5-mozilla-*.txt files to utf-8
Depends on D165304
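The exclusion and save steps listed above could look roughly like the following Python sketch. The function names are hypothetical, and it assumes plain words with no affix flags; the real make-new-dict and edit-dictionary scripts may work differently.

```python
def apply_exclusions(words, exclusions):
    """Drop every word that appears in the exclusion list.

    Hypothetical helper: entries are compared as exact strings after
    stripping surrounding whitespace; blank exclusion lines are ignored.
    """
    excluded = {entry.strip() for entry in exclusions if entry.strip()}
    return [word for word in words if word not in excluded]


def encode_both(text):
    """Return (iso-8859-1 bytes, utf-8 bytes) for the same dictionary text.

    Raises UnicodeEncodeError if the text contains a character that
    ISO-8859-1 cannot represent, which is worth catching before saving.
    """
    return text.encode("iso-8859-1"), text.encode("utf-8")
```

In this sketch the ISO-8859-1 bytes would be stored in place and the UTF-8 bytes in the utf8 folder, matching the layout described in the comment.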
Comment 5 • 1 year ago (Assignee)
Depends on D165305
Pushed by flodolo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ce107857320c
en-US dictionary: use .txt extensions for 5-mozilla-* files, convert to utf-8 encoding, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/84651f241b19
Keep a utf-8 encoded version of the dictionary files in the tree, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/df27be19a593
Update documentation on how to manage en-US dictionary, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/ff83cea010d8
Update scripts for en-US dictionary, r=sylvestre
https://hg.mozilla.org/integration/autoland/rev/8cf460979ce3
Run scripts using the same dataset (SCOWL scowl-2019.10.06), r=sylvestre
Comment 7 • 1 year ago (bugherder)
https://hg.mozilla.org/mozilla-central/rev/ce107857320c
https://hg.mozilla.org/mozilla-central/rev/84651f241b19
https://hg.mozilla.org/mozilla-central/rev/df27be19a593
https://hg.mozilla.org/mozilla-central/rev/ff83cea010d8
https://hg.mozilla.org/mozilla-central/rev/8cf460979ce3