Closed
Bug 1238031
Opened 8 years ago
Closed 8 years ago
Refresh en-US dictionary from SCOWL, include common variants and accented words.
Categories
(Core :: Spelling checker, defect)
Core
Spelling checker
Tracking
()
RESOLVED
FIXED
mozilla46
People
(Reporter: jorgk-bmo, Assigned: kevin.bugzilla)
References
Details
Attachments
(1 file, 2 obsolete files)
217.67 KB,
patch
|
ehsan.akhgari
:
review+
jorgk-bmo
:
feedback+
Sylvestre
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
As per bug 1235506 comment #54 and bug 301712 comment #25 (and earlier) we're planning to refresh the en-US dictionary from upstream (SCOWL) and also include common variants ("advisor") and accented characters (which can be ISO 8859-1 encoded). Kevin Atkinson will provide an updated version of make-hunspell-dict to facilitate this. We will make the appropriate changes to: https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict
Reporter | ||
Updated•8 years ago
|
Reporter | ||
Comment 1•8 years ago
|
||
Quoting Kevin Atkinson from bug 1235506 comment #59: Attached is a patch (add-variants.patch) that does this. I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL. It adds common variants and accented words. Note that there is now a mixture of iso-8859-1 and utf-8 encoding. I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words. Also the SCOWL dictionary got renamed from en_US to en_US-custom. The final Mozilla dictionary is still named en-US.
Assignee | ||
Comment 2•8 years ago
|
||
Attached is a patch (add-variants.patch) that does this. I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL. It adds common variants and accented words. Note that there is now a mixture of iso-8859-1 and utf-8 encoding. I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words. Also the SCOWL dictionary got renamed from en_US to en_US-custom. The final Mozilla dictionary is still named en-US.
Reporter | ||
Comment 3•8 years ago
|
||
Thanks a lot, Kevin. I've briefly looked at your patch. There is indeed a mix of ISO-8859-1 and UTF-8 encoding. For example UTF-8 encoded are: +compère/DSG +confrère/SM +consommé/M Looking for Ã, I found 144 words in total. ISO-8859-1 encoded are: +débutante/SM +décolletage/SM +décolleté +démodé +dérailleur/MS +déshabillé/M +détente/M and many more. This would need to be cleaned up, everything needs to be ISO-8859-1 encoded for now.
Reporter | ||
Comment 4•8 years ago
|
||
Comment on attachment 8706093 [details] [diff] [review] add-variants.patch from bug 1235506 Wires crossed here and the same patch and comment got added twice.
Attachment #8706093 -
Attachment is obsolete: true
Assignee | ||
Comment 5•8 years ago
|
||
Jork K: There are two dictionaries that are being updated in this patch: 1) The dictionary used, en-US.dic should be all iso-8859-1 this is the dictionary Mozilla uses. 2) orig/en_US-custom.dic is in UTF-8. This can not be avoided. If you notice mixed encoding in the same dictionary please let me know, but I double checked and didn't notice any problems in en-US.dic.
Assignee | ||
Comment 6•8 years ago
|
||
I noticed a bug in my script. I will fix that tomorrow.
Reporter | ||
Comment 7•8 years ago
|
||
Where in the US are you located? Are you a night owl or an early riser? ;-) It's night there now in all parts. +Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm glad you could identify the problem.
Reporter | ||
Comment 8•8 years ago
|
||
I noticed that this patch will give us "advisor", but sadly not "infeasible" (http://app.aspell.net/lookup?dict=en_US;words=infeasible) See bug bug 1183512, bug 1198052. Should that not be automatically included as a variant? It's in en_US-large which has common variants (level 1).
Assignee | ||
Comment 9•8 years ago
|
||
Encoding bug fix.
Attachment #8706094 -
Attachment is obsolete: true
Assignee | ||
Comment 10•8 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #7) > +Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm > glad you could identify the problem. I downloaded the patch and double checked. The dictionary Mozilla uses, "en-US.dic", is in iso-8859-1. The dictionary generated by SCOWL, "dictionary-sources/orig/en_US-custom.dic" is in UTF-8. As I said before this can not be helped. This dictionary (as a hack) is converted to iso-8859-1 for Mozilla. I am sure everything is encoded as it should be. The "bug fix" I just uploaded did not affect the output as it fixed a corner case. Ehsan: Can you review this patch and apply it ASAP. This patch was generated from changeset 279295:9e172c20b8f9. Because it changed the dictionary it will not apply cleanly if any changes to the Mozilla dictionary are made after that changeset.
Flags: needinfo?(ehsan)
Assignee | ||
Comment 11•8 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #8) > I noticed that this patch will give us "advisor", but sadly not "infeasible" > (http://app.aspell.net/lookup?dict=en_US;words=infeasible) > See bug bug 1183512, bug 1198052. Should that not be automatically included > as a variant? It's in en_US-large which has common variants (level 1). By variant I only mean spelling variants. "infeasible" is not a spelling variant of "unfeasible" it is a separate word. My tool will tell you if it thinks it is a (spelling) variant. Try looking up both "advisor" and "infeasible" to see the difference: http://app.aspell.net/lookup?dict=en_US;words=advisor%0D%0Ainfeasible%0D%0A Fell free to file a bug upstream on this.
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Kevin Atkinson from comment #11) > (In reply to Jorg K (GMT+1) from comment #8) > > Fell free to file a bug upstream on this. I researched "infeasible" vs "unfeasible" and decided to go ahead and add "infeasible" so no need to file an bug report.
Reporter | ||
Comment 13•8 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review] add-variants.patch (In reply to Kevin Atkinson from comment #10) > The dictionary generated by SCOWL, > "dictionary-sources/orig/en_US-custom.dic" is in UTF-8. As I said before > this can not be helped. This dictionary (as a hack) is converted to > iso-8859-1 for Mozilla. > > I am sure everything is encoded as it should be. I am so sorry, I stand corrected, you are right. I got confused with the two encodings in the two versions of the file. In the patch, for example "+Asunción/M" gets added in two different encodings to two different files. > Ehsan: Can you review this patch and apply it ASAP. This patch was > generated from changeset 279295:9e172c20b8f9. Because it changed the > dictionary it will not apply cleanly if any changes to the Mozilla > dictionary are made after that changeset. After refreshing my local repository, I applied the patch locally and it applies just fine. I don't think anyone will get in the way of this patch. We all know who's working on what. Once this lands, I will get back to the proper name clean-up in bug 301712.
Attachment #8706145 -
Flags: review?(ehsan)
Attachment #8706145 -
Flags: feedback+
Reporter | ||
Comment 14•8 years ago
|
||
(In reply to Kevin Atkinson from comment #12) > I researched "infeasible" vs "unfeasible" and decided to go ahead and add > "infeasible" so no need to file an bug report. Our research in bug 1198052 comment #1 was found this: http://www.economist.com/blogs/johnson/2012/07/variation Kevin, thank you so much for your work. The Mozilla dictionary has become much better. I hope we can refresh with SCOWL 2016-01 data soon. As mentioned, I will also clean up the proper names, thanks for your help with getting a grip on those, too. After so much praise, a little nag ;-) How about https://github.com/kevina/wordlist/issues/138? Many "useful" words have disappeared from "size 60". Any chance to see them again soon?
Comment 15•8 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review] add-variants.patch Review of attachment 8706145 [details] [diff] [review]: ----------------------------------------------------------------- Very nice, thanks a lot Kevin and Jorg, and sorry for the lag in the review!
Attachment #8706145 -
Flags: review?(ehsan) → review+
Updated•8 years ago
|
Flags: needinfo?(ehsan)
Keywords: checkin-needed
Assignee | ||
Comment 16•8 years ago
|
||
The commit message is not in the right format. Do you want me to fix that or will that just slow things down?
Flags: needinfo?(ehsan)
Reporter | ||
Comment 17•8 years ago
|
||
(Kevin: The sheriff can do that when checking in.) Dear Sheriff, The commit message is missing some information. Can you please fix that when checking in: Change this Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words. to this: Bug 1238031 - Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words. r=ehsan. Thank you so much. Also, this is a patch that affects the spellcheck dictionary only. It shouldn't break any tests, so please combine with other patches when landing. Thanks again.
Flags: needinfo?(ehsan)
Comment 18•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/0494743eda30
Keywords: checkin-needed
Comment 19•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0494743eda30
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Reporter | ||
Comment 20•8 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review] add-variants.patch Approval Request Comment [Feature/regressing bug #]: No regression. [User impact if declined]: Dictionary with less words. [Describe test coverage new/current, TreeHerder]: N/A. [Risks and why]: No risk, en-US dictionary change only. [String/UUID change made/needed]: None. It would be good to include a richer dictionary in ESR 45 (also, dare I say it, for the benefit of Thunderbird users). See also bug 1235506 comment #63 and bug 1235506 comment #64. As I said, one more is coming in the next few days: bug 301712.
Attachment #8706145 -
Flags: approval-mozilla-aurora?
Updated•8 years ago
|
status-firefox45:
--- → affected
Updated•8 years ago
|
Assignee: nobody → mozilla
Comment 21•8 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review] add-variants.patch Sure, Happy to help thunderbird too :)
Attachment #8706145 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Reporter | ||
Comment 22•8 years ago
|
||
> Assignee: nobody@mozilla.org → mozilla@jorgk.com
Thanks, but Kevin did all the work here.
Assignee: mozilla → kevin.bugzilla
Comment 23•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/01eeea86dfa8
You need to log in
before you can comment on or make changes to this bug.
Description
•