Closed Bug 481884 Opened 15 years ago Closed 6 years ago

Spellchecker should allow several language dictionaries at the same time

Categories

(Core :: Spelling checker, enhancement, P3)

RESOLVED DUPLICATE of bug 1402822

People

(Reporter: BenB, Unassigned)

Details

(Whiteboard: no comments apart from implementation, please)

Request:
Please allow the user not only to select exactly one dictionary for the spellchecker (plus the personal dict), but several. Any word from any of the selected dictionaries would be considered correctly spelled.

Use case:
I speak and write in English and German. I can't change the dictionary every time. Even if the Composer had a widget for it, it would be too much hassle to switch it all the time.

Rationale:
Merging dictionaries works fairly well, because the overlap between languages is relatively small. It's not terribly likely that I make a typing or spelling error and it creates a word that's legal in the other language.

OTOH, it is quite common that I mix languages in one sentence. E.g. technical German uses a lot of English terms for very specialized concepts. So accepting any word from either English or German is the correct behavior.
How is this different from bug 69687?
I didn't find that one (I searched for "dict"). That is also not clearly written and more a discussion than a specific feature request.
The patch there seems to do something similar to what I propose here, though.
That bug is not clear at all; that's why I filed https://bugzilla.mozilla.org/show_bug.cgi?id=476623, which has been marked as a duplicate.

I believe the summary of that bug should be changed, or other, more specific bugs should be allowed.
I forgot to mention that there are two extensions that implement automatic
language detection in Thunderbird and Firefox reasonably well:

http://en.design-noir.de/mozilla/dictionary-switcher-tb/

http://en.design-noir.de/mozilla/dictionary-switcher/

It'd be a good idea to contact the author and ask him if he wants to work on
integrating them in the code base.
Thanks for the hint, but this is not about language detection. See last paragraph of initial description. It's merely about using several dictionaries, and allowing the user to specify them (in prefs, permanently).
As a matter of fact these extensions do what you say: they check what the user types against the installed dictionaries and change dictionary on the fly based on the last few words typed.

I am in your same situation, probably even worse, as I type in English, Italian, French and Spanish, very often mixing them; the Thunderbird extension used to do a pretty good job in detecting the 4 different ones (now I use Mail.app, which is light years ahead of Thunderbird in terms of features and usability).
Andrea, that is not what I said. Please read the initial description carefully again. This bug is a very well defined implementation bug. I don't want this bug to turn into something like bug 69687.
I basically like the idea of auto-detection for most things (there are cases where even a single language per text field isn't ideal, but those are edge cases), but I'm not sure how easily language variants could be told apart there (e.g. I want to use the Austrian German dictionary for German text, as it allows some Austrian words in addition to all the usual German words).
(In reply to comment #4)
> I forgot to mention that there are two extensions that implement automatic
> language detection in Thunderbird and Firefox reasonably well:
> 
> http://en.design-noir.de/mozilla/dictionary-switcher-tb/
> 
> http://en.design-noir.de/mozilla/dictionary-switcher/
> 
> It'd be a good idea to contact the author and ask him if he wants to work on
> integrating them in the code base.

Dao, although now a very active Firefox developer, hasn't shown much interest in development of Dictionary Switcher (in what, almost a year?).

Dictionary Switcher leaves a lot to be desired, from my point of view. It detects the language too soon (after the first word?), leading to wrong selections. At the very least it should double-check the result after the first sentence or paragraph, and correct it if necessary. (Outlook 2007 does this, fwiw.) Also, its "remember dictionary selection for page" feature is misguided, e.g. for networks like flickr or twitter, where you pick the language of your contact, not of the site.

But the extensions don't address the main point here: it's not sufficient to select *one* dictionary, as we often mix words from two or more languages in one text.
Guys, this bug has a very clearly defined purpose, that is the implementation of what is described in the initial description. We already have one bug, bug 69687, that is entirely useless due to all the discussion. Therefore:

NO DISCUSSION here, please. Thank you.
Whiteboard: no comments apart from implementation, please
OS: Linux → All
Hardware: x86 → All
Check this for checking several languages simultaneously
https://bugzilla.mozilla.org/show_bug.cgi?id=660506
It looks like most problems that users report with repeatedly choosing ONE spell checking language would be satisfactorily solved if the user could, once and for all, set up SEVERAL dictionaries to be used SIMULTANEOUSLY for the languages he can write, among those installed on his system. That is especially needed for e-mail, since many people write e-mails in several languages (and that's the way Evolution does it).
It seems fairly simple to do, see  https://bugzilla.mozilla.org/show_bug.cgi?id=676500
So isn't this a duplicate of bug 676500? What is different?
Depends on: 660506
No longer depends on: 660506
It is a pity that bug 676500 was made a duplicate of Bug 481884 and not the opposite because bug 676500 is more fundamental and complete:
- it presents the problem as specifically having the hunspell multi-dictionary feature work
- it considers the problem in more general terms of dictionaries (e.g. computer science and medical) instead of only arguing about the appropriateness of multi-language
- hence pinpointing the weakness of having to fit any dictionary into a language-only structure
- it contains practical ideas and material to help
Unfortunately, bug 676500 was attacked by someone who later changed his mind and that made noise.
It looks like much of my work was in vain again.
I modified (locally) the right-click spellchecking menu to use checkboxes instead of radio buttons, and pass a comma-separated list of dictionaries ("en-US,es-ES") to hunspell. This is what I found.

Hunspell itself (the spellchecking engine, not Firefox's use of it) is to blame for the existence of this bug.
When used standalone (such as in a terminal), it can be a multi-language spell checker (English aff/dic, Spanish aff/dic, French aff/dic, etc.).
When used as a library (such as in Firefox), it is only a multi-dictionary spell checker (English aff/dic, English medical .dic, personal .dic(/aff)).

I quote hunspell manpage 1:

       -d en_US,en_geo,en_med,de_DE,de_med

       en_US and de_DE are base dictionaries, they consist of aff and dic file
       pairs: en_US.aff, en_US.dic and de_DE.aff, de_DE.dic.  En_geo,  en_med,
       de_med  are special dictionaries: dictionaries without affix file. Spe-
       cial dictionaries are optional extension[sic] of the base dictionaries  usu-
       ally  with  special (medical, law etc.)  terms. There is no naming con-
       vention for special dictionaries, only the ".dic" extension: dictionar-
       ies  without affix file will be an extension of the preceding base dic-
       tionary (right order of the parameter list needs for good suggestions).
       First item of -d parameter list must be a base dictionary.

Each language has a two-part dictionary. The .dic file contains exact words. The .aff file contains rules, such as suffixes.
An English medical .dic(tionary) contains extra exact words, and uses the English .aff file for any suffixes it might need.
  If it needs to define its own rules, such as for Latin, it has to come as a .dic/.aff pair.
A personal .dic(tionary) contains extra exact words. *see "Other functions"
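
The grouping rule from the manpage excerpt above (a name without an affix file extends the preceding base dictionary) can be sketched roughly as follows. This is only an illustration, not hunspell's actual code: `DictGroup`, `split_names`, `group_dictionaries` and the `has_aff` set (a stand-in for looking for .aff files in the search path) are all hypothetical names.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One base dictionary (.aff + .dic) plus the special dictionaries (.dic only)
// that extend it. Hypothetical type, for illustration only.
struct DictGroup {
    std::string base;
    std::vector<std::string> extras;
};

// Split a "-d"-style list such as "en_US,en_med,de_DE" on commas.
std::vector<std::string> split_names(const std::string& list) {
    std::vector<std::string> names;
    std::istringstream in(list);
    std::string name;
    while (std::getline(in, name, ',')) {
        if (!name.empty()) names.push_back(name);
    }
    return names;
}

// A name that has an affix file starts a new group; a name without one is
// attached to the preceding base dictionary. `has_aff` is an assumption that
// replaces the real file system lookup.
std::vector<DictGroup> group_dictionaries(const std::string& list,
                                          const std::set<std::string>& has_aff) {
    std::vector<DictGroup> groups;
    for (const std::string& name : split_names(list)) {
        if (has_aff.count(name)) {
            groups.push_back(DictGroup{name, {}});
        } else if (groups.empty()) {
            // Mirrors "First item of -d parameter list must be a base dictionary".
            throw std::invalid_argument("first item must be a base dictionary: " + name);
        } else {
            groups.back().extras.push_back(name);
        }
    }
    return groups;
}
```

With `has_aff = {"en_US", "de_DE"}`, the list from the manpage would yield two groups: en_US with en_geo and en_med, and de_DE with de_med.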

I shall use the command `echo "brothers, hachazo" | hunspell -d en_US,es_ES` as an example of hunspell's standalone behavior.
The above command searches through SEARCH PATH (hunspell -D), and loads /usr/share/myspell/en_US.aff, /usr/share/myspell/en_US.dic, /usr/share/myspell/es_ES.aff, and /usr/share/myspell/es_ES.dic.
"brothers" is matched using the word "brother" from en_US.dic, plus the suffix -s from en_US.aff.
"hachazo" is matched using the word "hacha" from es_ES.dic, plus the suffix -(a)zo from es_ES.aff.

When hunspell is used as a library, first it must go through a constructor:
  Hunspell(const char * affpath, const char * dpath, const char * key = NULL);
It cannot just be passed a language name, as the standalone can. It must be given the paths to a single aff/dic pair for a single language.
To copy my example, I would call it like this:
  Hunspell("/usr/share/myspell/en_US.aff", "/usr/share/myspell/en_US.dic")
  which leaves the problem of loading es_ES.

I quote hunspell manpage 3:
   Extra dictionaries
       The add_dic() function  load[sic erat scriptum]  an  extra  dictionary  file.   The  extra
       dictionaries  use  the  affix  file  of  the allocated Hunspell object.
       Maximal number of the extra dictionaries is limited in the source  code
       (20).

If I construct hunspell with en_US.aff and en_US.dic, then call add_dic() with es_ES.dic, the word "hachazo" will NOT be matched as "hacha" from es_ES.dic plus -(a)zo from es_ES.aff, because es_ES.aff was never loaded. Instead, it might search en_US.aff for a suffix with the same id. I actually renamed the second .aff file.
add_dic() works for a personal dictionary. It'll match words from the language's dictionary with the language's suffixes, and words from the personal dictionary with no suffixes. However,

   Other functions
       The add(), add_with_affix() and remove()  are  helper  functions  of  a
       personal  dictionary  implementation  to  add and remove words from the
       base dictionary in run-time. The add_with_affix() uses a second word as
       a model of the enabled affixation of the new word.

int add_with_affix(const char * word, const char * example); is the function I re-quoted manpage 3 for.
add_with_affix() is for a single word with an affixed word, such as add_with_affix("fakeword", "fakewordsuffixed"). This function is useful for a personal dictionary, but not for iterating through es_ES.dic and es_ES.aff, looking for matches to pass one word at a time.

The options are:
  * Make Firefox parse the output of running `hunspell -d en_US,es_ES` directly instead of using it as a library
  * Construct Hunspell() using the first language (en_US), then use add_dic() to load es_ES.dic and ignore es_ES.aff, breaking secondary languages.
    * Do hunspell's job, process es_ES.dic and es_ES.aff, then pass EACH affixed word that doesn't appear in es_ES.dic through add_with_affix()
  * Do hunspell's job, construct Hunspell() with en_US, get the English misspelled word list, reconstruct Hunspell() with es_ES, get the Spanish misspelled word list, then treat only the words that appear in both lists as really misspelled.
  * Fix hunspell to allow using .aff files for secondary dictionaries WHEN used as a library (standalone already works), such as with an add_dic_with_affix() function
    * Make Firefox use hunspell's new features
    * Make other programs that use hunspell do the same
  * Point out that there's already an alternate Hunspell() constructor or after-construction function that allows using secondary .aff files, that I have missed.

The last two are the sanest options.
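
For comparison, the "misspelled in every language" option from the list can be sketched as below. It is a rough illustration under assumptions: each language is represented by a hypothetical `Checker` callback rather than a real Hunspell object, and the intersection of the per-language misspelled lists is computed by filtering the word list through each checker in turn.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one language's checker: returns true when the
// word is spelled correctly in that language.
using Checker = std::function<bool(const std::string&)>;

// The words that a single checker rejects, in their original order.
std::vector<std::string> misspelled(const std::vector<std::string>& words,
                                    const Checker& accepts) {
    std::vector<std::string> out;
    for (const std::string& w : words) {
        if (!accepts(w)) out.push_back(w);
    }
    return out;
}

// A word is reported only if every language rejects it. Filtering the
// remaining words through each checker in turn leaves exactly the words that
// appear in all per-language misspelled lists.
std::vector<std::string> misspelled_in_all(const std::vector<std::string>& words,
                                           const std::vector<Checker>& checkers) {
    if (checkers.empty()) return {};  // no dictionary selected: flag nothing
    std::vector<std::string> remaining = words;
    for (const Checker& c : checkers) {
        remaining = misspelled(remaining, c);
    }
    return remaining;
}
```

The cost is one full pass per language, which is part of why this option counts as doing hunspell's job for it.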
Kyle, thanks for your investigation!

> When used as a standalone (such as in a terminal), it can be a multi-language spell checker.
> (english aff/dic, spanish aff/dic, french aff/dic, etc.)

Did you look at the source code of hunspell standalone, how it loads the 2 languages? If it's reasonably written, the standalone would use the Hunspell library API.
If not, it should at least be clear from the standalone example code how to add the necessary constructor to the API.
Thanks for working on this!  Please see bug 676500, same goal but a number of interesting remarks.

First, I think that the dictionary type stuff can be more simply stated: "if Hunspell finds (in the filesystem) no .aff file paired with some .dic file, it will use the .aff file of the same language, which must already be loaded (must precede it in the dictionary list)".
I only tested how the standalone works. I also browsed the code, but my understanding met its limits.
I seem to recall that, surprisingly indeed, the standalone code does not use the API and even duplicates it in some places.
Second (I already mentioned that somewhere), my tests seemed to show (beware/check) that Hunspell gets suggestions from only one dictionary. Which one, I didn't fully grasp. But I might have failed to understand which further keyboard actions give the suggestions from the other dictionaries one by one, the way Evolution presents them. This, obviously, only happens when you feed in spelling mistakes, and it may be that the (first?) dictionary producing suggestions is the one where the best match is found.
The API method simply adds the dictionaries one by one, and I am surprised that you say that a file path must be given instead of a language name. Indeed, the API user is not supposed to know about file locations. Are you sure that you're not looking at a lower-level call that could be used when the user provides his own dictionaries? A higher-level call would accept dictionary names, look up the path and call the lower level with the resulting file paths.
In any case, I suggest you look at how the DICTIONARY envar (which is exclusive with -d) is processed, and you'll see a list of dictionaries being added by name; but is that via the API or through internal calls?
This led me to an idea.  Instead of, or rather in addition to, modifying Mozilla you might modify Hunspell to process an additional MY_DICTIONARY envar. It would have the same syntax and code as DICTIONARY. But it would always be called either before or after DICTIONARY or -d or API calls.
That way, someone satisfied with this feature would permanently set MY_DICTIONARY in his session to always check spelling with his favorite languages/dictionaries in addition to the specified one(s) for all the applications using Hunspell.  I opened a hunspell ticket for that.
A question arose in my mind: should MY_DICTIONARY come before or after the others? That choice may depend on my second remark above. But is it possible to add it after all API calls if it is unpredictable which call is the last one?
Hoping this can help you. I would certainly test DEBs carefully on Ubuntu (12.04 at present).
Best of luck and thanks again!
The hunspell library is missing some of the features of the hunspell standalone.
Reference: hunspell/src/tools/hunspell.cxx:1781-1856 (if (! dicname) { to exit(1) }) http://hunspell.cvs.sourceforge.net/viewvc/hunspell/hunspell/src/tools/hunspell.cxx?view=markup I can't figure out how to jump to line 1781.
So the constructed Hunspell objects have to be kept in an array, the dictionaries loaded in a batch, and all relevant functions have to be aware that there are several of them.
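
A minimal sketch of that "array of checkers" idea, under assumptions: `MultiSpeller` is a hypothetical wrapper, and `StubChecker` stands in for Hunspell so the example compiles without the library. The template only assumes a type that is constructed from an aff path and a dic path (as in the constructor quoted earlier) and offers a `spell(const std::string&)` member, which recent Hunspell releases are believed to provide.

```cpp
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Keeps one checker object per language and accepts a word as soon as any of
// them accepts it. Hypothetical wrapper, not part of hunspell or Firefox.
template <class Checker>
class MultiSpeller {
public:
    void add_language(const std::string& aff_path, const std::string& dic_path) {
        checkers_.push_back(
            std::make_unique<Checker>(aff_path.c_str(), dic_path.c_str()));
    }

    bool spell(const std::string& word) const {
        for (const auto& checker : checkers_) {
            if (checker->spell(word)) return true;
        }
        return false;
    }

    std::size_t size() const { return checkers_.size(); }

private:
    std::vector<std::unique_ptr<Checker>> checkers_;
};

// Stand-in for Hunspell: picks a tiny fixed word list from the dic path
// instead of reading any file.
struct StubChecker {
    std::set<std::string> words;

    StubChecker(const char* /*aff_path*/, const char* dic_path) {
        const std::string path(dic_path);
        if (path.find("en_US") != std::string::npos) words = {"brother", "brothers"};
        if (path.find("es_ES") != std::string::npos) words = {"hacha", "hachazo"};
    }

    bool spell(const std::string& word) { return words.count(word) > 0; }
};
```

Suggestions, personal-dictionary additions and removals would each need the same kind of loop, which is what "all relevant functions" amounts to.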
I see a problem, though. Hunspell takes an argument "-d en_US,en_geo,en_med,de_DE,de_med", which can have both full-dictionaries (.aff and .dic) and half-dictionaries (.dic only).
If I copy this code "} else if (dic) pMS[dmax-1]->add_dic(dic);" exactly, a half-dictionary will be added to the full-dictionary that precedes it.
What if the dictionaries are out of order? (a medical term getting a suffix from the wrong language or some similar oddity) On the command line, I can make sure they're in the right order, by typing them in that way. How do I do the same with the right-click Languages context menu in Firefox? And how would a half-dictionary force a full-dictionary to be selected first?
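
One conceivable answer to the ordering question, sketched under a big assumption: the menu code knows which base dictionary each half-dictionary belongs to. The `base_of` map below is hypothetical metadata (the manpage says there is no naming convention for special dictionaries), and `order_selection` is an illustrative name, not an existing function.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Turn an unordered menu selection into a "-d"-style order: every base
// dictionary is followed directly by its selected half-dictionaries, and a
// base is added automatically when only its half-dictionary was ticked.
// `base_of` maps half-dictionary -> base dictionary (assumed to be known).
std::vector<std::string> order_selection(
    const std::vector<std::string>& selected,
    const std::map<std::string, std::string>& base_of) {
    std::vector<std::string> bases;  // base dictionaries in first-seen order
    std::map<std::string, std::vector<std::string>> extras;

    auto note_base = [&bases](const std::string& base) {
        if (std::find(bases.begin(), bases.end(), base) == bases.end()) {
            bases.push_back(base);
        }
    };

    for (const std::string& name : selected) {
        auto it = base_of.find(name);
        if (it == base_of.end()) {
            note_base(name);  // not listed as a half-dictionary: treat as base
        } else {
            note_base(it->second);  // force the full dictionary to be loaded
            extras[it->second].push_back(name);
        }
    }

    std::vector<std::string> ordered;
    for (const std::string& base : bases) {
        ordered.push_back(base);
        for (const std::string& extra : extras[base]) ordered.push_back(extra);
    }
    return ordered;
}
```

Ticking only de_med, en_med and en_US would then produce de_DE, de_med, en_US, en_med, so no half-dictionary ends up attached to the wrong language.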
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → DUPLICATE
If someone has reason to reverse the dependency that's fine, but this bug is the original report of the issue, with useful context and therefore should remain open.
Blocks: 1402822
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Priority: -- → P3
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → DUPLICATE
No longer blocks: 1402822