468779 - Consider integrating Chromium's changes to Hunspell

Reporter

Description

•

17 years ago

This is mostly a FYI bug so you know what Chromium did with Hunspell. Mozilla may want to consider similar changes. Chromium rewrote the storage layer of Hunspell to use a binary data structure that can be loaded much more quickly. Parsing and mallocing (on the order of 100,000 mallocs if I recall) the data for the dictionary list takes a noticeable amount of time the first tine you type a word. Furthermore, there is a lot of duplication of the affix rules (it's uniquely copied into every word) which takes a lot of extra space. Chromium uses am optimized binary trie for storing the word list. It doesn't require much work to initialize so eliminates the quick hang when you type the first word (mostly a memory map of the dictionary file so the OS pages it in as needed). It uses significantly less memory at runtime, between 5 and 20 MB less depending on the language, because the affix data isn't duplicated in each word and there are not thousands of individual mallocs. It's not perfect as-is. I think Mozilla uses an newer version of Hunspell (but Chromium will presumably merge the changes to a future one at some point). The binary trie is also not currently big-endian safe. But this is an area where Mozilla can get a lot of additional performance. Chromium's changed copy of Hunspell is here: http://src.chromium.org/viewvc/chrome/trunk/src/chrome/third_party/hunspell/ The custom binary storage layer and converter for the current hunspell format is in the "google" subdirectory. The rest of the changes are controlled by the HUNSPELL_CHROME_CLIENT ifdef added in the src directory in various places. Feel free to contact me if you have any questions.

Tyler Downer [He/Him]

Updated

•

17 years ago

Severity: minor → enhancement

Version: unspecified → Trunk

Dan Mosedale (:dmosedale, :dmose)

Comment 1

•

17 years ago

Thanks for the info, Brett! Sounds like a worthwhile direction to look into...

Tyler Downer [He/Him]

Updated

•

17 years ago

Whiteboard: [parity-Chrome]

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 2

•

17 years ago

Yeah, thanks Brett --- appreciate it. We just need to find someone to do the actual work. It would probably be reasonably straightforward for someone new to the project.

Zibi Braniecki [:zbraniecki][:gandalf]

Comment 3

•

17 years ago

how are the chromium changes licensed? Do they follow LGPL+MPL?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 4

•

17 years ago

Seems to be our tri-license.

u88484

Updated

•

16 years ago

Flags: wanted1.9.2?

Wayne Mery (:wsmwk)

Updated

•

16 years ago

Keywords: footprint, perf

Axel Hecht [:Pike]

Comment 6

•

16 years ago

Is there a bug to upstream those changes into hunspell itself?

(no longer active)

Updated

•

16 years ago

Assignee: nobody → ehsan.akhgari

(no longer active)

Comment 7

•

16 years ago

Chromium's hunspell 1.2.8 port (which is the version we use) now lives here: <http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell128/> From what I could decipher from Chromium's build scripts, they don't use the 1.2.8 port yet (Brett, please correct me if I'm wrong.) Also, we can't do a simple copy & paste port to our tree because the Chromium specific uses STL, which we can't use (but I don't expect porting the STL data structures to NSPR's equivalents would be too difficult.) So, I think we should wait until Chromium moves to the 1.2.8 port, and then start our own port. We would probably want to maintain the changes made by our port as a patch inside our tree, so that we can make future upgrades from Chromium code base easier. Comments/suggestions welcome.

Brett Wilson

Reporter

Comment 8

•

16 years ago

I looked it up and we are using the hunspell128 directory.

(no longer active)

Comment 9

•

16 years ago

(In reply to comment #8) > I looked it up and we are using the hunspell128 directory. Oh, thanks for correcting me. So, how are the dictionaries distributed? Do binary versions ship with Chrome or do you distribute the .dic and .aff files and build the binary version on first run or something?

Flags: wanted1.9.2?

Brett Wilson

Reporter

Comment 10

•

16 years ago

When a dictionary is needed, the binary dictionary is downloaded in the background. This allows us to keep the download smaller since it doesn't include the dictionary, and also magically supports changing languages: when the user changes to a new language, we'll just do the same thing and download the dictionary for that language.

(no longer active)

Comment 11

•

16 years ago

What I really meant to ask was how you generate the binary dictionary for the first time. Is there code for that in the Chromium tree? Because we would probably want to stick with shipping the text based dictionary format, and generate binary formats on first run...

Brett Wilson

Reporter

Comment 12

•

16 years ago

The converter code is here: http://src.chromium.org/viewvc/chrome/trunk/src/chrome/tools/convert_dict/ It wasn't meant to run on enduser computers so may need a bit more error checking or recovery. You can see it just asserts in a number of places which is all we need when manually converting a dictionary.

Németh László

Comment 13

•

16 years ago

Some Hunspell dictionaries haven't worked with the modified Hunspell, yet. For example, the Hungarian dictionary: http://code.google.com/p/chromium/issues/detail?id=15558. The solution is supporting the original dictionary format for the problematic dictionaries.

(no longer active)

Comment 14

•

16 years ago

How can you detect that they are "not working"? Is the problem which causes them not to work known? Is there anything which can be done about it?

Wayne Mery (:wsmwk)

Comment 15

•

15 years ago

(In reply to comment #13) > Some Hunspell dictionaries haven't worked with the modified Hunspell, yet. For > example, the Hungarian dictionary: > http://code.google.com/p/chromium/issues/detail?id=15558. The solution is > supporting the original dictionary format for the problematic dictionaries. was closed fixed Apr 2010

Marco Castelluccio [:marco]

Comment 16

•

14 years ago

maybe I can work on this bug, if it's still needed.

(no longer active)

Comment 17

•

14 years ago

(In reply to comment #16) > maybe I can work on this bug, if it's still needed. Yes, that would be very much appreciated!

Marco Castelluccio [:marco]

Comment 18

•

14 years ago

There's a problem: we use version 1.3.2, they 1.2.12. Here is the list of the changes they did to Hunspell: https://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell/README.chromium?revision=78071 It's not a huge list, we can do the changes in 1.3.2. But how will we do when Chromium devs make other changes? I don't see a way to automatically sync with them.

Josh Matthews [:jdm]

Comment 19

•

14 years ago

It would be possible to create a filtered RSS feed of commits to chromium that affect the hunspell module, and then at least one person could subscribe and ensure that we didn't miss anything. Alternatively, somebody makes it their job to regularly check in and see if any changes have occurred.

Ryan VanderMeulen [:RyanVM]

Comment 20

•

14 years ago

I've been the unofficial Hunspell upstream babysitter in recent history. I could keep an eye on Chromium as well. Of course, it would be ideal if Chromium's changes could be upstreamed to avoid the issue entirely. Nemeth/Caolan, I'm assuming that these changes are worth bringing into Hunspell for all consumers to use, yes? Based on comment 15, it sounds like progress has been made with dictionary compatibility?

Caolan McNamara

Comment 21

•

14 years ago

I don't know if Chromium ever submitted any of their changes ?

Marco Castelluccio [:marco]

Comment 22

•

14 years ago

I strongly agree that Chromium's changes should be upstreamed into Hunspell. Otherwise keeping track of changes would be chaotic. Upstreaming would be better also for Chromium itself!

(no longer active)

Comment 23

•

14 years ago

(In reply to comment #22) > I strongly agree that Chromium's changes should be upstreamed into Hunspell. > Otherwise keeping track of changes would be chaotic. > > Upstreaming would be better also for Chromium itself! So are you going to work on upstreaming them? This way we can just take a new release from hunspell and we can alert the Chromium guys to do the same thing too.

Marco Castelluccio [:marco]

Comment 24

•

14 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #23) > (In reply to comment #22) > So are you going to work on upstreaming them? This way we can just take a > new release from hunspell and we can alert the Chromium guys to do the same > thing too. Yes, I'll try. Should we ask to Chromium devs what's their opinion? However we should wait for Caolan or Nemeth's reply (if they want or not this upstream change).

Caolan McNamara

Comment 25

•

14 years ago

Probably more a question for Nemeth than me if there would be push-back against supporting basically an additional "binary cache" format, though it seems reasonable to me anyway, especially as its not a case of throwing away support for all the pre-existing text-based dicts. If the Chromium guys have kept track of, or can regenerate, a series of feature-by-feature or fix-by-fix changesets I can have a go at merging any uncontroversial bits if Nemeth's flooded.

Marco Castelluccio [:marco]

Comment 26

•

14 years ago

Caolan, could you contact Néméth to know if he wants these changes upstreamed? I've talked with chromium developers, they didn't think hunspell was mantained, but they agree to upstream their changes. I've asked if they think the upstreaming work will be difficult: <+brettw> depends on what the Hunspell people want to do <+brettw> if they're OK with an ifdef (basically what we've done) then it's probably easy, just copying it in <+brettw> (maybe some updates for the new version of Hunspell) <+brettw> but they may want changes which could make it harder

(no longer active)

Comment 27

•

14 years ago

Caolan, Néméth, ping?

Caolan McNamara

Comment 28

•

14 years ago

I think its a good idea, can't seem to contact Nemeth about it. Might be on e.g. vacation or something. If the chromium guys are happy with the hunspell dual license then they can fire their changes/latest work or whatever over to me and I'll have a look.

Németh László

Comment 29

•

14 years ago

(In reply to Caolan McNamara from comment #28) I only support a new binary format for loading time + memory optimization, if it keeps the flexiblity of the dictionary format of Hunspell (it contains arbitrary fields for different functions, eg. stemming and morphological generation). I believe, supporting web standards and new functions is more important task. Mozilla/Firefox needs improved spell checking (CSS3, moreover grammar checking, like in Webkit/Safari), hyphenation (CSS3, IE10: http://blogs.msdn.com/b/ie/archive/2011/10/10/building-rich-text-centric-pages-in-ie10.aspx, maybe next Opera). There are annoying problems in spell checking of Firefox, especially tokenization (abbrevations with dots [now words with hyphen are recognized well, many thanks for it], spell checking of text area/buffering [it seems, Firefox can't help in checking of large Wikimedia articles] or missing functions, like automatic language detection in text areas. I also plan to improve stemming and morphological generation and add hyphenation, so Firefox will be able to use these Hunspell functions for better indexing of web pages, (not only for grammar checking) and for hyphenation via SPELLML API of Hunspell.

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 30

•

14 years ago

We've added hyphenation support by importing lihyphen or whatever it's called: http://mxr.mozilla.org/mozilla-central/source/intl/hyphenation/src/README

(no longer active)

Comment 31

•

14 years ago

We have also implemented language detection in bug 338427. The problem with big textareas is that currently Firefox has a hard limit of content to spell check before it gives up. I suspect that was done because of efficiency concerns, but I'm sure we can improve the situation. Is hunspell going to get grammar checking support? If not, we'd need to fall back to whatever WebKit is using for that...

Németh László

Comment 32

•

14 years ago

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #30) Thanks for the good news and your work. I’m very glad of it, because it is a huge improvement for several languages. Unfortunately, your integration is not perfect yet, because Firefox doesn’t support special hyphenation, a long-standing request from the Web (from 1996: http://www.nada.kth.se/i18n/html/hyph.html). For example, you were be able to eliminate the bad Hungarian hyphenation in this area by the recent replacement of the Hungarian LibreOffice hyphenation dictionary with a simplified variant, but it is not the ideal solution. Interestingly, Adobe InDesign 5.5 has got similar problem with its Hyphen integration, this bug report asks similar support for Catalan hyphenation, like in OpenOffice.org/LibreOffice: http://forums.adobe.com/thread/896911. Other affected languages: Dutch, Greek, Norwegian, Swedish and German with its old orthography.

Németh László

Comment 33

•

14 years ago

(In reply to Ehsan Akhgari [:ehsan] from comment #31) Thanks for your information. > Is hunspell going to get grammar checking support? If not, we'd need to > fall back to whatever WebKit is using for that... I believe, this is only a non-free feature of Safari. Hunspell dictionaries can contain POS and morphological data for grammar checkers, see English, French and Hungarian LibreOffice dictionaries. LibreOffice uses them for English and Hungarian morphological generation in its thesaurus, and (yet with extensions) for French and Hungarian grammar checking.

Marco Castelluccio [:marco]

Updated

•

14 years ago

Blocks: 447581

Marco Castelluccio [:marco]

Updated

•

13 years ago

Updated

•

9 years ago

Blocks: 470471

Jonathan Watt [:jwatt]

Comment 34

•

7 years ago

Mass bug change to replace various 'parity' whiteboard flags with the new canonical keywords. (See bug 1443764 comment 13.)

Keywords: parity-chrome

Whiteboard: [parity-Chrome]

(no longer active)

Updated

•

6 years ago

Assignee: ehsan → nobody

BMO Automation

Updated

•

3 years ago

Severity: normal → S3