Closed Bug 1238031 Opened 8 years ago Closed 8 years ago

Refresh en-US dictionary from SCOWL, include common variants and accented words.

Categories

(Core :: Spelling checker, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox45 --- fixed
firefox46 --- fixed

People

(Reporter: jorgk-bmo, Assigned: kevin.bugzilla)

References

Details

Attachments

(1 file, 2 obsolete files)

As per bug 1235506 comment #54 and bug 301712 comment #25 (and earlier) we're planning to refresh the en-US dictionary from upstream (SCOWL) and also include common variants ("advisor") and accented characters (which can be ISO 8859-1 encoded).

Kevin Atkinson will provide an updated version of make-hunspell-dict to facilitate this. We will make the appropriate changes to:
https://dxr.mozilla.org/mozilla-central/source/extensions/spellcheck/locales/en-US/hunspell/dictionary-sources/make-new-dict
Quoting Kevin Atkinson from bug 1235506 comment #59:

Attached is a patch (add-variants.patch) that does this.  I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL.

It adds common variants and accented words.

Note that there is now a mixture of iso-8859-1 and utf-8 encoding.  I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words.

Also the SCOWL dictionary got renamed from en_US to en_US-custom.  The final Mozilla dictionary is still named en-US.
Attached patch add-variants.patch (obsolete) — Splinter Review
Attached is a patch (add-variants.patch) that does this.  I ended up not needing to modify make-hunspell-dict so it will work with the 2015.08.24 version of SCOWL.

It adds common variants and accented words.

Note that there is now a mixture of iso-8859-1 and utf-8 encoding.  I attempted to keep it straight but be on the lookout for encoding errors when you add non-ASCII words.

Also the SCOWL dictionary got renamed from en_US to en_US-custom.  The final Mozilla dictionary is still named en-US.
Thanks a lot, Kevin.

I've briefly looked at your patch.

There is indeed a mix of ISO-8859-1 and UTF-8 encoding. For example UTF-8 encoded are:
+compère/DSG
+confrère/SM
+consommé/M
Looking for Ã, I found 144 words in total.

ISO-8859-1 encoded are:
+débutante/SM
+décolletage/SM
+décolleté
+démodé
+dérailleur/MS
+déshabillé/M
+détente/M
and many more.

This would need to be cleaned up, everything needs to be ISO-8859-1 encoded for now.
Comment on attachment 8706093 [details] [diff] [review]
add-variants.patch from bug 1235506

Wires crossed here and the same patch and comment got added twice.
Attachment #8706093 - Attachment is obsolete: true
Jork K: There are two dictionaries that are being updated in this patch:

1) The dictionary used, en-US.dic should be all iso-8859-1 this is the dictionary Mozilla uses.
2) orig/en_US-custom.dic is in UTF-8.  This can not be avoided.

If you notice mixed encoding in the same dictionary please let me know, but I double checked and didn't notice any problems in en-US.dic.
I noticed a bug in my script.  I will fix that tomorrow.
Where in the US are you located? Are you a night owl or an early riser? ;-)
It's night there now in all parts.

+Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm glad you could identify the problem.
I noticed that this patch will give us "advisor", but sadly not "infeasible"
(http://app.aspell.net/lookup?dict=en_US;words=infeasible)
See bug bug 1183512, bug 1198052. Should that not be automatically included as a variant? It's in en_US-large which has common variants (level 1).
Encoding bug fix.
Attachment #8706094 - Attachment is obsolete: true
(In reply to Jorg K (GMT+1) from comment #7)

> +Asunción/M (when viewed in ANSI) was added to "the dictionary", so I'm
> glad you could identify the problem.

I downloaded the patch and double checked.  The dictionary Mozilla uses, "en-US.dic", is in iso-8859-1.

The dictionary generated by SCOWL, "dictionary-sources/orig/en_US-custom.dic" is in UTF-8.  As I said before this can not be helped.  This dictionary (as a hack) is converted to iso-8859-1 for Mozilla.

I am sure everything is encoded as it should be.

The "bug fix" I just uploaded did not affect the output as it fixed a corner case.

Ehsan: Can you review this patch and apply it ASAP.  This patch was generated from changeset 279295:9e172c20b8f9.  Because it changed the dictionary it will not apply cleanly if any changes to the Mozilla dictionary are made after that changeset.
Flags: needinfo?(ehsan)
(In reply to Jorg K (GMT+1) from comment #8)

> I noticed that this patch will give us "advisor", but sadly not "infeasible"
> (http://app.aspell.net/lookup?dict=en_US;words=infeasible)
> See bug bug 1183512, bug 1198052. Should that not be automatically included
> as a variant? It's in en_US-large which has common variants (level 1).

By variant I only mean spelling variants.  "infeasible" is not a spelling variant of "unfeasible" it is a separate word.

My tool will tell you if it thinks it is a (spelling) variant.  Try looking up both "advisor" and "infeasible" to see the difference: http://app.aspell.net/lookup?dict=en_US;words=advisor%0D%0Ainfeasible%0D%0A

Fell free to file a bug upstream on this.
(In reply to Kevin Atkinson from comment #11)
> (In reply to Jorg K (GMT+1) from comment #8)
>
> Fell free to file a bug upstream on this.

I researched "infeasible" vs "unfeasible" and decided to go ahead and add "infeasible" so no need to file an bug report.
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch

(In reply to Kevin Atkinson from comment #10)
> The dictionary generated by SCOWL,
> "dictionary-sources/orig/en_US-custom.dic" is in UTF-8.  As I said before
> this can not be helped.  This dictionary (as a hack) is converted to
> iso-8859-1 for Mozilla.
> 
> I am sure everything is encoded as it should be.
I am so sorry, I stand corrected, you are right.
I got confused with the two encodings in the two versions of the file.
In the patch, for example "+Asunción/M" gets added in two different encodings to two different files.

> Ehsan: Can you review this patch and apply it ASAP.  This patch was
> generated from changeset 279295:9e172c20b8f9.  Because it changed the
> dictionary it will not apply cleanly if any changes to the Mozilla
> dictionary are made after that changeset.
After refreshing my local repository, I applied the patch locally and it applies just fine.
I don't think anyone will get in the way of this patch. We all know who's working on what.

Once this lands, I will get back to the proper name clean-up in bug 301712.
Attachment #8706145 - Flags: review?(ehsan)
Attachment #8706145 - Flags: feedback+
(In reply to Kevin Atkinson from comment #12)
> I researched "infeasible" vs "unfeasible" and decided to go ahead and add
> "infeasible" so no need to file an bug report.
Our research in bug 1198052 comment #1 was found this:
http://www.economist.com/blogs/johnson/2012/07/variation

Kevin, thank you so much for your work. The Mozilla dictionary has become much better. I hope we can refresh with SCOWL 2016-01 data soon. As mentioned, I will also clean up the proper names, thanks for your help with getting a grip on those, too.

After so much praise, a little nag ;-)
How about https://github.com/kevina/wordlist/issues/138?
Many "useful" words have disappeared from "size 60". Any chance to see them again soon?
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch

Review of attachment 8706145 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice, thanks a lot Kevin and Jorg, and sorry for the lag in the review!
Attachment #8706145 - Flags: review?(ehsan) → review+
Flags: needinfo?(ehsan)
Keywords: checkin-needed
The commit message is not in the right format.  Do you want me to fix that or will that just slow things down?
Flags: needinfo?(ehsan)
(Kevin: The sheriff can do that when checking in.)

Dear Sheriff,
The commit message is missing some information.

Can you please fix that when checking in:
Change this
Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words.
to this:
Bug 1238031 - Fix make-new-dict to use a custom en_US dictionary that adds common variants and accented words. r=ehsan.

Thank you so much.

Also, this is a patch that affects the spellcheck dictionary only. It shouldn't break any tests, so please combine with other patches when landing. Thanks again.
Flags: needinfo?(ehsan)
https://hg.mozilla.org/mozilla-central/rev/0494743eda30
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch

Approval Request Comment
[Feature/regressing bug #]: No regression.
[User impact if declined]: Dictionary with less words.
[Describe test coverage new/current, TreeHerder]: N/A.
[Risks and why]: No risk, en-US dictionary change only.
[String/UUID change made/needed]: None.

It would be good to include a richer dictionary in ESR 45 (also, dare I say it, for the benefit of Thunderbird users).

See also bug 1235506 comment #63 and bug 1235506 comment #64.

As I said, one more is coming in the next few days: bug 301712.
Attachment #8706145 - Flags: approval-mozilla-aurora?
Assignee: nobody → mozilla
Comment on attachment 8706145 [details] [diff] [review]
add-variants.patch

Sure, Happy to help thunderbird too :)
Attachment #8706145 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
> Assignee: nobody@mozilla.orgmozilla@jorgk.com
Thanks, but Kevin did all the work here.
Assignee: mozilla → kevin.bugzilla
No longer blocks: 1198052
You need to log in before you can comment on or make changes to this bug.