Closed
Bug 1238031
Opened 10 years ago
Closed 10 years ago
Refresh en-US dictionary from SCOWL, include common variants and accented words.
Categories
(Core :: Spelling checker, defect)
Core
Spelling checker
Tracking
()
RESOLVED
FIXED
mozilla46
People
(Reporter: jorgk-bmo, Assigned: kevin.bugzilla)
References
Details
Attachments
(1 file, 2 obsolete files)
|
217.67 KB,
patch
|
ehsan.akhgari
:
review+
jorgk-bmo
:
feedback+
Sylvestre
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
As per bug 1235506 comment #54 and bug 301712 comment #25 (and earlier) we're planning to refresh the en-US dictionary from upstream (SCOWL) and also include common variants ("advisor") and accented characters (which can be ISO 8859-1 encoded).
Kevin Atkinson will provide an updated version of make-hunspell-dict to facilitate this. We will make the appropriate changes to:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict
| Reporter | ||
Updated•10 years ago
|
| Reporter | ||
Comment 1•10 years ago
|
||
Quoting Kevin Atkinson from bug 1235506 comment #59:
Attached is a patch (add-variants.patch) that does this. I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL.
It adds common variants and accented words.
Note that there is now a mixture of iso-8859-1 and utf-8 encoding. I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words.
Also the SCOWL dictionary got renamed from en_US to en_US-custom. The final Mozilla dictionary is still named en-US.
| Assignee | ||
Comment 2•10 years ago
|
||
Attached is a patch (add-variants.patch) that does this. I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL.
It adds common variants and accented words.
Note that there is now a mixture of iso-8859-1 and utf-8 encoding. I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words.
Also the SCOWL dictionary got renamed from en_US to en_US-custom. The final Mozilla dictionary is still named en-US.
| Reporter | ||
Comment 3•10 years ago
|
||
Thanks a lot, Kevin.
I've briefly looked at your patch.
There is indeed a mix of ISO-8859-1 and UTF-8 encoding. For example UTF-8 encoded are:
+compère/DSG
+confrère/SM
+consommé/M
Looking for Ã, I found 144 words in total.
ISO-8859-1 encoded are:
+débutante/SM
+décolletage/SM
+décolleté
+démodé
+dérailleur/MS
+déshabillé/M
+détente/M
and many more.
This would need to be cleaned up, everything needs to be ISO-8859-1 encoded for now.
| Reporter | ||
Comment 4•10 years ago
|
||
Comment on attachment 8706093 [details] [diff] [review]
add-variants.patch from bug 1235506
Wires crossed here and the same patch and comment got added twice.
Attachment #8706093 -
Attachment is obsolete: true
| Assignee | ||
Comment 5•10 years ago
|
||
Jork K: There are two dictionaries that are being updated in this patch:
1) The dictionary used, en-US.dic should be all iso-8859-1 this is the dictionary Mozilla uses.
2) orig/en_US-custom.dic is in UTF-8. This can not be avoided.
If you notice mixed encoding in the same dictionary please let me know, but I double checked and didn't notice any problems in en-US.dic.
| Assignee | ||
Comment 6•10 years ago
|
||
I noticed a bug in my script. I will fix that tomorrow.
| Reporter | ||
Comment 7•10 years ago
|
||
Where in the US are you located? Are you a night owl or an early riser? ;-)
It's night there now in all parts.
+Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm glad you could identify the problem.
| Reporter | ||
Comment 8•10 years ago
|
||
I noticed that this patch will give us "advisor", but sadly not "infeasible"
(http://app.aspell.net/lookup?dict=en_US;words=infeasible)
See bug bug 1183512, bug 1198052. Should that not be automatically included as a variant? It's in en_US-large which has common variants (level 1).
| Assignee | ||
Comment 9•10 years ago
|
||
Encoding bug fix.
Attachment #8706094 -
Attachment is obsolete: true
| Assignee | ||
Comment 10•10 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #7)
> +Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm
> glad you could identify the problem.
I downloaded the patch and double checked. The dictionary Mozilla uses, "en-US.dic", is in iso-8859-1.
The dictionary generated by SCOWL, "dictionary-sources/orig/en_US-custom.dic" is in UTF-8. As I said before this can not be helped. This dictionary (as a hack) is converted to iso-8859-1 for Mozilla.
I am sure everything is encoded as it should be.
The "bug fix" I just uploaded did not affect the output as it fixed a corner case.
Ehsan: Can you review this patch and apply it ASAP. This patch was generated from changeset 279295:9e172c20b8f9. Because it changed the dictionary it will not apply cleanly if any changes to the Mozilla dictionary are made after that changeset.
Flags: needinfo?(ehsan)
| Assignee | ||
Comment 11•10 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #8)
> I noticed that this patch will give us "advisor", but sadly not "infeasible"
> (http://app.aspell.net/lookup?dict=en_US;words=infeasible)
> See bug bug 1183512, bug 1198052. Should that not be automatically included
> as a variant? It's in en_US-large which has common variants (level 1).
By variant I only mean spelling variants. "infeasible" is not a spelling variant of "unfeasible" it is a separate word.
My tool will tell you if it thinks it is a (spelling) variant. Try looking up both "advisor" and "infeasible" to see the difference: http://app.aspell.net/lookup?dict=en_US;words=advisor%0D%0Ainfeasible%0D%0A
Fell free to file a bug upstream on this.
| Assignee | ||
Comment 12•10 years ago
|
||
(In reply to Kevin Atkinson from comment #11)
> (In reply to Jorg K (GMT+1) from comment #8)
>
> Fell free to file a bug upstream on this.
I researched "infeasible" vs "unfeasible" and decided to go ahead and add "infeasible" so no need to file an bug report.
| Reporter | ||
Comment 13•10 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch
(In reply to Kevin Atkinson from comment #10)
> The dictionary generated by SCOWL,
> "dictionary-sources/orig/en_US-custom.dic" is in UTF-8. As I said before
> this can not be helped. This dictionary (as a hack) is converted to
> iso-8859-1 for Mozilla.
>
> I am sure everything is encoded as it should be.
I am so sorry, I stand corrected, you are right.
I got confused with the two encodings in the two versions of the file.
In the patch, for example "+Asunción/M" gets added in two different encodings to two different files.
> Ehsan: Can you review this patch and apply it ASAP. This patch was
> generated from changeset 279295:9e172c20b8f9. Because it changed the
> dictionary it will not apply cleanly if any changes to the Mozilla
> dictionary are made after that changeset.
After refreshing my local repository, I applied the patch locally and it applies just fine.
I don't think anyone will get in the way of this patch. We all know who's working on what.
Once this lands, I will get back to the proper name clean-up in bug 301712.
Attachment #8706145 -
Flags: review?(ehsan)
Attachment #8706145 -
Flags: feedback+
| Reporter | ||
Comment 14•10 years ago
|
||
(In reply to Kevin Atkinson from comment #12)
> I researched "infeasible" vs "unfeasible" and decided to go ahead and add
> "infeasible" so no need to file an bug report.
Our research in bug 1198052 comment #1 was found this:
http://www.economist.com/blogs/johnson/2012/07/variation
Kevin, thank you so much for your work. The Mozilla dictionary has become much better. I hope we can refresh with SCOWL 2016-01 data soon. As mentioned, I will also clean up the proper names, thanks for your help with getting a grip on those, too.
After so much praise, a little nag ;-)
How about https://github.com/kevina/wordlist/issues/138?
Many "useful" words have disappeared from "size 60". Any chance to see them again soon?
Comment 15•10 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch
Review of attachment 8706145 [details] [diff] [review]:
-----------------------------------------------------------------
Very nice, thanks a lot Kevin and Jorg, and sorry for the lag in the review!
Attachment #8706145 -
Flags: review?(ehsan) → review+
Updated•10 years ago
|
Flags: needinfo?(ehsan)
Keywords: checkin-needed
| Assignee | ||
Comment 16•10 years ago
|
||
The commit message is not in the right format. Do you want me to fix that or will that just slow things down?
Flags: needinfo?(ehsan)
| Reporter | ||
Comment 17•10 years ago
|
||
(Kevin: The sheriff can do that when checking in.)
Dear Sheriff,
The commit message is missing some information.
Can you please fix that when checking in:
Change this
Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words.
to this:
Bug 1238031 - Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words. r=ehsan.
Thank you so much.
Also, this is a patch that affects the spellcheck dictionary only. It shouldn't break any tests, so please combine with other patches when landing. Thanks again.
Flags: needinfo?(ehsan)
Comment 18•10 years ago
|
||
Keywords: checkin-needed
Comment 19•10 years ago
|
||
| bugherder | ||
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
| Reporter | ||
Comment 20•10 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch
Approval Request Comment
[Feature/regressing bug #]: No regression.
[User impact if declined]: Dictionary with less words.
[Describe test coverage new/current, TreeHerder]: N/A.
[Risks and why]: No risk, en-US dictionary change only.
[String/UUID change made/needed]: None.
It would be good to include a richer dictionary in ESR 45 (also, dare I say it, for the benefit of Thunderbird users).
See also bug 1235506 comment #63 and bug 1235506 comment #64.
As I said, one more is coming in the next few days: bug 301712.
Attachment #8706145 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
status-firefox45:
--- → affected
Updated•10 years ago
|
Assignee: nobody → mozilla
Comment 21•10 years ago
|
||
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch
Sure, Happy to help thunderbird too :)
Attachment #8706145 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
| Reporter | ||
Comment 22•10 years ago
|
||
> Assignee: nobody@mozilla.org → mozilla@jorgk.com
Thanks, but Kevin did all the work here.
Assignee: mozilla → kevin.bugzilla
Comment 23•10 years ago
|
||
| bugherder uplift | ||
You need to log in
before you can comment on or make changes to this bug.
Description
•