Open Bug 468779 Opened 16 years ago Updated 2 years ago

Consider integrating Chromium's changes to Hunspell

Categories

(Core :: Spelling checker, enhancement)

enhancement

Tracking

()

People

(Reporter: brettw, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: memory-footprint, parity-chrome, perf)

This is mostly a FYI bug so you know what Chromium did with Hunspell. Mozilla may want to consider similar changes.

Chromium rewrote the storage layer of Hunspell to use a binary data structure that can be loaded much more quickly. Parsing and mallocing (on the order of 100,000 mallocs if I recall) the data for the dictionary list takes a noticeable amount of time the first tine you type a word. Furthermore, there is a lot of duplication of the affix rules (it's uniquely copied into every word) which takes a lot of extra space.

Chromium uses am optimized binary trie for storing the word list. It doesn't require much work to initialize so eliminates the quick hang when you type the first word (mostly a memory map of the dictionary file so the OS pages it in as needed). It uses significantly less memory at runtime, between 5 and 20 MB less depending on the language, because the affix data isn't duplicated in each word and there are not thousands of individual mallocs.

It's not perfect as-is. I think Mozilla uses an newer version of Hunspell (but Chromium will presumably merge the changes to a future one at some point). The binary trie is also not currently big-endian safe. But this is an area where Mozilla can get a lot of additional performance.

Chromium's changed copy of Hunspell is here:
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/third_party/hunspell/

The custom binary storage layer and converter for the current hunspell format is in the "google" subdirectory. The rest of the changes are controlled by the HUNSPELL_CHROME_CLIENT ifdef added in the src directory in various places. Feel free to contact me if you have any questions.
Severity: minor → enhancement
Version: unspecified → Trunk
Thanks for the info, Brett!  Sounds like a worthwhile direction to look into...
Whiteboard: [parity-Chrome]
Yeah, thanks Brett --- appreciate it. We just need to find someone to do the actual work. It would probably be reasonably straightforward for someone new to the project.
how are the chromium changes licensed? Do they follow LGPL+MPL?
Flags: wanted1.9.2?
Keywords: footprint, perf
Is there a bug to upstream those changes into hunspell itself?
Assignee: nobody → ehsan.akhgari
Chromium's hunspell 1.2.8 port (which is the version we use) now lives here:

<http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell128/>

From what I could decipher from Chromium's build scripts, they don't use the 1.2.8 port yet (Brett, please correct me if I'm wrong.)  Also, we can't do a simple copy & paste port to our tree because the Chromium specific uses STL, which we can't use (but I don't expect porting the STL data structures to NSPR's equivalents would be too difficult.)

So, I think we should wait until Chromium moves to the 1.2.8 port, and then start our own port.  We would probably want to maintain the changes made by our port as a patch inside our tree, so that we can make future upgrades from Chromium code base easier.

Comments/suggestions welcome.
I looked it up and we are using the hunspell128 directory.
(In reply to comment #8)
> I looked it up and we are using the hunspell128 directory.

Oh, thanks for correcting me.

So, how are the dictionaries distributed?  Do binary versions ship with Chrome or do you distribute the .dic and .aff files and build the binary version on first run or something?
Flags: wanted1.9.2?
When a dictionary is needed, the binary dictionary is downloaded in the background. This allows us to keep the download smaller since it doesn't include the dictionary, and also magically supports changing languages: when the user changes to a new language, we'll just do the same thing and download the dictionary for that language.
What I really meant to ask was how you generate the binary dictionary for the first time.  Is there code for that in the Chromium tree?  Because we would probably want to stick with shipping the text based dictionary format, and generate binary formats on first run...
The converter code is here:
http://src.chromium.org/viewvc/chrome/trunk/src/chrome/tools/convert_dict/

It wasn't meant to run on enduser computers so may need a bit more error checking or recovery. You can see it just asserts in a number of places which is all we need when manually converting a dictionary.
Some Hunspell dictionaries haven't worked with the modified Hunspell, yet. For example, the Hungarian dictionary: http://code.google.com/p/chromium/issues/detail?id=15558. The solution is supporting the original dictionary format for the problematic dictionaries.
How can you detect that they are "not working"?  Is the problem which causes them not to work known?  Is there anything which can be done about it?
(In reply to comment #13)
> Some Hunspell dictionaries haven't worked with the modified Hunspell, yet. For
> example, the Hungarian dictionary:
> http://code.google.com/p/chromium/issues/detail?id=15558. The solution is
> supporting the original dictionary format for the problematic dictionaries.

was closed fixed Apr 2010
maybe I can work on this bug, if it's still needed.
(In reply to comment #16)
> maybe I can work on this bug, if it's still needed.

Yes, that would be very much appreciated!
There's a problem: we use version 1.3.2, they 1.2.12.

Here is the list of the changes they did to Hunspell: https://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell/README.chromium?revision=78071

It's not a huge list, we can do the changes in 1.3.2. But how will we do when Chromium devs make other changes? I don't see a way to automatically sync with them.
It would be possible to create a filtered RSS feed of commits to chromium that affect the hunspell module, and then at least one person could subscribe and ensure that we didn't miss anything. Alternatively, somebody makes it their job to regularly check in and see if any changes have occurred.
I've been the unofficial Hunspell upstream babysitter in recent history. I could keep an eye on Chromium as well. Of course, it would be ideal if Chromium's changes could be upstreamed to avoid the issue entirely.

Nemeth/Caolan, I'm assuming that these changes are worth bringing into Hunspell for all consumers to use, yes? Based on comment 15, it sounds like progress has been made with dictionary compatibility?
I don't know if Chromium ever submitted any of their changes ?
I strongly agree that Chromium's changes should be upstreamed into Hunspell.
Otherwise keeping track of changes would be chaotic.

Upstreaming would be better also for Chromium itself!
(In reply to comment #22)
> I strongly agree that Chromium's changes should be upstreamed into Hunspell.
> Otherwise keeping track of changes would be chaotic.
> 
> Upstreaming would be better also for Chromium itself!

So are you going to work on upstreaming them?  This way we can just take a new release from hunspell and we can alert the Chromium guys to do the same thing too.
(In reply to Ehsan Akhgari [:ehsan] from comment #23)
> (In reply to comment #22)
> So are you going to work on upstreaming them?  This way we can just take a
> new release from hunspell and we can alert the Chromium guys to do the same
> thing too.

Yes, I'll try. Should we ask to Chromium devs what's their opinion?

However we should wait for Caolan or Nemeth's reply (if they want or not this upstream change).
Probably more a question for Nemeth than me if there would be push-back against supporting basically an additional "binary cache" format, though it seems reasonable to me anyway, especially as its not a case of throwing away support for all the pre-existing text-based dicts.

If the Chromium guys have kept track of, or can regenerate, a series of feature-by-feature or fix-by-fix changesets I can have a go at merging any uncontroversial bits if Nemeth's flooded.
Caolan, could you contact Néméth to know if he wants these changes upstreamed?

I've talked with chromium developers, they didn't think hunspell was mantained, but they agree to upstream their changes.

I've asked if they think the upstreaming work will be difficult:
<+brettw> depends on what the Hunspell people want to do
<+brettw> if they're OK with an ifdef (basically what we've done) then it's probably easy, just copying it in
<+brettw> (maybe some updates for the new version of Hunspell)
<+brettw> but they may want changes which could make it harder
Caolan, Néméth, ping?
I think its a good idea, can't seem to contact Nemeth about it. Might be on e.g. vacation or something. If the chromium guys are happy with the hunspell dual license then they can fire their changes/latest work or whatever over to me and I'll have a look.
(In reply to Caolan McNamara from comment #28)
I only support a new binary format for loading time + memory optimization, if it keeps the flexiblity of the dictionary format of Hunspell (it contains arbitrary fields for different functions, eg. stemming and morphological generation). I believe, supporting web standards and new functions is more important task. Mozilla/Firefox needs improved spell checking (CSS3, moreover grammar checking, like in Webkit/Safari), hyphenation (CSS3, IE10: http://blogs.msdn.com/b/ie/archive/2011/10/10/building-rich-text-centric-pages-in-ie10.aspx, maybe next Opera). There are annoying problems in spell checking of Firefox, especially tokenization (abbrevations with dots [now words with hyphen are recognized well, many thanks for it], spell checking of text area/buffering [it seems, Firefox can't help in checking of large Wikimedia articles] or missing functions, like automatic language detection in text areas. I also plan to improve stemming and morphological generation and add hyphenation, so Firefox will be able to use these Hunspell functions for better indexing of web pages, (not only for grammar checking) and for hyphenation via SPELLML API of Hunspell.
We've added hyphenation support by importing lihyphen or whatever it's called:
http://mxr.mozilla.org/mozilla-central/source/intl/hyphenation/src/README
We have also implemented language detection in bug 338427.

The problem with big textareas is that currently Firefox has a hard limit of content to spell check before it gives up.  I suspect that was done because of efficiency concerns, but I'm sure we can improve the situation.

Is hunspell going to get grammar checking support?  If not, we'd need to fall back to whatever WebKit is using for that...
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #30)

Thanks for the good news and your work. I’m very glad of it, because it is a huge improvement for several languages. Unfortunately, your integration is not perfect yet, because Firefox doesn’t support special hyphenation, a long-standing request from the Web (from 1996: http://www.nada.kth.se/i18n/html/hyph.html). For example, you were be able to eliminate the bad Hungarian hyphenation in this area by the recent replacement of the Hungarian LibreOffice hyphenation dictionary with a simplified variant, but it is not the ideal solution. Interestingly, Adobe InDesign 5.5 has got similar problem with its Hyphen integration, this bug report asks similar support for Catalan hyphenation, like in OpenOffice.org/LibreOffice: http://forums.adobe.com/thread/896911. Other affected languages: Dutch, Greek, Norwegian, Swedish and German with its old orthography.
(In reply to Ehsan Akhgari [:ehsan] from comment #31)

Thanks for your information.

> Is hunspell going to get grammar checking support?  If not, we'd need to
> fall back to whatever WebKit is using for that...

I believe, this is only a non-free feature of Safari.

Hunspell dictionaries can contain POS and morphological data for grammar checkers, see English, French and Hungarian LibreOffice dictionaries. LibreOffice uses them for English and Hungarian morphological generation in its thesaurus, and (yet with extensions) for French and Hungarian grammar checking.
See Also: → 535623
Blocks: 470471
Mass bug change to replace various 'parity' whiteboard flags with the new canonical keywords. (See bug 1443764 comment 13.)
Keywords: parity-chrome
Whiteboard: [parity-Chrome]
Assignee: ehsan → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.