Closed Bug 1121730 Opened 9 years ago Closed 8 years ago

Add Lingala Wordlist/Dictionary

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: delphine, Unassigned)

Please add Lingala Wordlist and Dictionary to Fx OS
[Blocking Requested - why for this release]:
Meeting with Bus Dev this morning confirmed that this would be needed in 2.0 and onwards
thanks
blocking-b2g: --- → 2.0?
[Triage]

BD has confirmed the need, recorded in the Program Management mana page.
blocking-b2g: 2.0? → 2.0+
Do we know more about when exactly this is needed?
No one is assigned to or working on this. I just want to make that clear in case this is urgent.
Flags: needinfo?(wehuang)
Tony/Nisha: I see from the Mana page that the partner request for this language has moved to TBD date status. Please provide guidance regarding the urgency/timing of this deliverable so that the team can prioritize their work accordingly.
Flags: needinfo?(nmalhan)
Flags: needinfo?(aappleton)
Hi Delphine:

We need BD's input, as asked in comment 5. The partner's R&D team is currently working on software for other regions, so I assume this is early information from business discussions. You can check bug 1129845 (Mozilla only) for the regions the partner is currently working on.
Flags: needinfo?(wehuang)
[Blocking Requested - why for this release]:
Per comment 5 and comment 6, moving this back from 2.0+ to 2.0?
blocking-b2g: 2.0+ → 2.0?
[Triage]

For the partner working with FxOS 2.0, TAM has told them to handle "languages other than those already supported in FxOS 2.0" on their own, so this will not block 2.0.

De-nominate it from 2.0.
blocking-b2g: 2.0? → ---
Since we have no identified community yet for Lingala, I'm asking Peiying if she can plug in Rubric for help with input and review here, as well as Kevin Scannell.
Flags: needinfo?(pmo)
Flags: needinfo?(kscanne)
+ Devon and Ian 

We need Rubric's help in this as well.
I've done a good bit of work on Lingala.  I created a spell checker here, for example:
http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/

Two issues to overcome:
(1) Lingala is a Bantu language like Zulu, Xhosa, etc. so we'll run into the same issues that a simple word list doesn't offer good enough coverage (see Comment 3 on bug 1119426) 
(2) The proper orthography uses the open vowels ɛ and ɔ  (U+025B, U+0254) and tone marks. But these are rarely used on the web.  So it's easy to create an ASCII frequency list (in fact, you'll find one in the spell checker repo: http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/ASCIIFREQ) but virtually impossible to create one with the proper orthography.  

It might be possible to handle the latter issue with a small tweak to the autocorrect engine.  The user could enter words in ASCII and the engine could then suggest any correctly-spelled words that match the ASCII word after stripping tone marks and converting ɛ->e, ɔ->o.

Is something like that doable?
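
To make the proposal concrete, here is a minimal sketch in Python of the ASCII-matching idea: strip tone marks, fold ɛ->e and ɔ->o, and look up the correctly spelled words that share the resulting ASCII key. The function names and the tiny word list are assumptions for illustration only; this is not code from the FxOS autocorrect engine.

```python
import unicodedata
from collections import defaultdict

# Open vowels folded to their nearest ASCII letters, as proposed above.
OPEN_VOWELS = str.maketrans({"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"})

def ascii_key(word):
    """Strip tone marks (combining characters) and fold open vowels to ASCII."""
    decomposed = unicodedata.normalize("NFD", word)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.translate(OPEN_VOWELS)

def build_index(wordlist):
    """Map each ASCII key to the correctly spelled words that fold to it."""
    index = defaultdict(list)
    for word in wordlist:
        index[ascii_key(word)].append(word)
    return index

# Illustrative entries only (words that appear in this bug's discussion).
index = build_index(["kobɔngisa", "y\u0254\u0301ns\u0254", "sɔkɔtɔ", "sokoto"])
print(index["sokoto"])     # ['sɔkɔtɔ', 'sokoto']
print(index["kobongisa"])  # ['kobɔngisa']
```

With an index like this, the word list itself only needs to be spelled correctly; the user's ASCII input is used purely as a lookup key.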
Flags: needinfo?(kscanne)
(In reply to Kevin Scannell from comment #11)
> I've done a good bit of work on Lingala.  I created a spell checker here,
> for example:
> http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/
> 
> Two issues to overcome:
> (1) Lingala is a Bantu language like Zulu, Xhosa, etc. so we'll run into the
> same issues that a simple word list doesn't offer good enough coverage (see
> Comment 3 on bug 1119426) 
> (2) The proper orthography uses the open vowels ɛ and ɔ  (U+025B, U+0254)
> and tone marks. But these are rarely used on the web.  So it's easy to
> create an ASCII frequency list (in fact, you'll find one in the spell
> checker repo:
> http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/ASCIIFREQ) but
> virtually impossible to create one with the proper orthography.  
> 
> It might be possible to handle the latter issue with a small tweak to the
> autocorrect engine.  The user could enter words in ASCII and the engine
> could then suggest any correctly-spelled words that match the ASCII word
> after stripping tone marks and converting ɛ->e, ɔ->o.
> 
> Is something like that doable?

I suggest we break the Pootle corpus on spaces and run it through the spell checkers, as suggested in Comment 3 on bug 1119426. We hand the 20% de-duped list to the translator, who can weed out nonsense terminology.
What do you think?
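
A rough sketch of that pipeline in Python, under stated assumptions: `is_known` is a hypothetical callable standing in for whichever spell checker is used, and the punctuation stripping is my own addition. The details of Comment 3 on bug 1119426 are not reproduced here, so treat the filtering step as a guess at the intent.

```python
from collections import Counter

def candidate_words(corpus_text, is_known):
    """Split on whitespace, de-duplicate by counting, keep tokens the checker accepts."""
    counts = Counter(
        token.strip(".,;:!?()\"'").lower() for token in corpus_text.split()
    )
    counts.pop("", None)  # drop tokens that were only punctuation
    # most_common() orders by count, so reviewers see frequent words first.
    return [(word, n) for word, n in counts.most_common() if is_known(word)]

# Hypothetical checker: a plain set lookup instead of a real spell checker.
known = {"kobongisa", "sokoto"}
result = candidate_words("Kobongisa sokoto, kobongisa xyz.", known.__contains__)
print(result)  # [('kobongisa', 2), ('sokoto', 1)]
```

The resulting list is what would go to the translator for manual review.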
(In reply to ian.henderson from comment #12)
> (In reply to Kevin Scannell from comment #11)
> > I've done a good bit of work on Lingala.  I created a spell checker here,
> > for example:
> > http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/
> > 
> > Two issues to overcome:
> > (1) Lingala is a Bantu language like Zulu, Xhosa, etc. so we'll run into the
> > same issues that a simple word list doesn't offer good enough coverage (see
> > Comment 3 on bug 1119426) 
> > (2) The proper orthography uses the open vowels ɛ and ɔ  (U+025B, U+0254)
> > and tone marks. But these are rarely used on the web.  So it's easy to
> > create an ASCII frequency list (in fact, you'll find one in the spell
> > checker repo:
> > http://sourceforge.net/p/lingala/code/HEAD/tree/hunspell/ASCIIFREQ) but
> > virtually impossible to create one with the proper orthography.  
> > 
> > It might be possible to handle the latter issue with a small tweak to the
> > autocorrect engine.  The user could enter words in ASCII and the engine
> > could then suggest any correctly-spelled words that match the ASCII word
> > after stripping tone marks and converting ɛ->e, ɔ->o.
> > 
> > Is something like that doable?
> 
> I suggest we break the Pootle corpus on spaces and run it through the spell
> checkers as suggested in (Comment 3 on bug 1119426). We hand the 20%
> de-duped list to the translator who can weed out non-sense terminology.
> What do you think?

Sure, we can do that.  Every little bit helps!

But the focus needs to be on improving the engine if this is going to be at all usable/useful.  Sorry to be a broken record.
Kevin, if the open vowels are added as alternatives for the letters 'e' and 'o', they are handled similarly to letters with diacritics: if you type 'kobongisa', the engine will consider 'ɔ' as a possible alternative for the typed 'o' and should rank 'kobɔngisa' as a likely correction (there are some weights that determine the ranking of suggestions). This would work quite well if having the open vowels as alternatives of 'e' and 'o' is a reasonable user experience.

However, we might want to simply add them to the keyboard layout for Lingala as separate keys. That way the problem disappears, and we can limit the key alternatives to the diacritics. If we do it this way, we don't need, for example, both 'é' and 'ɛ́' as alternatives for 'e', but can keep them separate. The predictions and corrections should then be better.

For completeness' sake I should point you to bug 1128905, which I hit when trying to add the open vowels with diacritics. This only affects the cases with no precomposed form in Unicode. I also haven't tested how well the current "latin" input method works with decomposed sequences as alternatives for a character. So I'm not sure whether the current code would see 'yɔ́nsɔ' as a very obvious replacement for 'yɔnsɔ', but I think it will (in English it would suggest something like 'e-mail' when you type 'email'), so there is some concept of "cheap" insertions.
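
The behaviour described above can be sketched roughly as follows in Python. This is not the "latin" input method's actual code: the alternatives table, the two weights, and the function names are all assumptions chosen for illustration. The sketch treats an open vowel as a weighted alternative of the typed key and a tone mark as a "cheap" insertion.

```python
import unicodedata

# Hypothetical alternatives table: typed key -> characters it may stand for.
ALTERNATIVES = {"e": {"ɛ"}, "o": {"ɔ"}}
ALT_COST = 0.3   # assumed weight for picking an alternative character
MARK_COST = 0.1  # assumed weight for a "cheap" inserted tone mark

def correction_cost(typed, candidate):
    """Cost of reading `typed` (no tone marks) as `candidate`; None if no match."""
    cost, i = 0.0, 0
    for ch in unicodedata.normalize("NFD", candidate):
        if unicodedata.combining(ch):
            cost += MARK_COST  # tone mark the user did not type
        elif i < len(typed) and ch == typed[i]:
            i += 1
        elif i < len(typed) and ch in ALTERNATIVES.get(typed[i], ()):
            cost += ALT_COST
            i += 1
        else:
            return None
    return cost if i == len(typed) else None

def rank(typed, words):
    """Return matching words, cheapest correction first."""
    scored = [(correction_cost(typed, w), w) for w in words]
    return [w for c, w in sorted((c, w) for c, w in scored if c is not None)]

print(rank("kobongisa", ["kobɔngisa", "kobongisa", "sokoto"]))
# ['kobongisa', 'kobɔngisa']

# 'é' has a precomposed form in Unicode; 'ɛ́' does not and stays two code points.
print(len(unicodedata.normalize("NFC", "e\u0301")))       # 1
print(len(unicodedata.normalize("NFC", "\u025b\u0301")))  # 2
```

The last two lines show why the open vowels with tone marks are a special case: normalization cannot collapse them into a single character.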

From your previous work on Lingala, I understood that the morphology is at least as complex as that of Zulu. So of course we need a different engine. We decided for the related languages that we don't even need to try (for the time being). I have ideas of what we can do here based on some research we're doing for Zulu machine translation. When you have some time we should talk. I think it can also, to some extent, alleviate the lack of big datasets.
Thanks Friedel, this is all new and useful information to me.  Whether the open vowels get their own keys or are alternatives to 'e' and 'o' is, I suppose, a UX decision that we'd leave up to the community.  In either case, I'm still concerned about the ranking of suggestions, especially in cases where a word exists with both open vowels and as pure ASCII (something like sɔkɔtɔ vs. sokoto) since virtually all of the corpus data that's available is ASCII.  Maybe these cases are unusual enough that it doesn't impact usability.

Would be happy to talk about some of your ideas re: the morphology problem via email.
(clearing the ni on Peiying since she's already answered in comment 10)
Flags: needinfo?(pmo)
(In reply to Kevin Scannell from comment #13)
> [snip]
> But the focus needs to be on improving the engine if this is going to be at
> all usable/useful.  Sorry to be a broken record.

ni on Bruce so he can advise about improving the engine (also see comment 11). thanks!
Flags: needinfo?(bhuang)
(In reply to Kevin Scannell from comment #15):
Kevin, I believe there might be some existing art in terms of keyboard layout, although we shouldn't unnecessarily bind ourselves by it. If adding the open vowels can improve what we offer, I believe we should consider it regardless of existing designs. I doubt that anything has wide recognition or use.

If I understand correctly, you are concerned about the order of suggestions in cases where you don't trust frequency data built on your corpus evidence (did I understand that correctly?). We could simply not provide frequency data (for example, by giving equal weight to all words). That way suggestions are mostly influenced by length and distance from the keys pressed (again an advantage we get if we have separate keys for the open vowels). However, only a few suggestions are visible for anything longer than 5 or so characters, so we need to know that we are actually helping by having the relevant suggestion visible in most cases.
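
As a rough illustration of ranking without frequency data, here is a Python sketch that orders candidates by edit distance and then by length difference, and shows only the first few. Plain Levenshtein distance is used as a stand-in for the engine's distance from the keys pressed; the real engine's scoring is not shown in this bug, and `visible` is an assumed parameter.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def suggest(typed, words, visible=3):
    """Equal weight for every word: closest edit distance first, then closest length."""
    ranked = sorted(
        words,
        key=lambda w: (edit_distance(typed, w), abs(len(w) - len(typed)), w),
    )
    return ranked[:visible]

print(suggest("sokto", ["kobongisa", "sɔkɔtɔ", "sokoto"], visible=2))
# ['sokoto', 'sɔkɔtɔ']
```

With separate keys for the open vowels, the typed string would already contain ɔ where intended, so the distance to the correctly spelled word would shrink accordingly.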

I think the first step would be to create a good layout that makes it easy to type correctly. Maybe that already helps in getting higher quality contributions onto the web. While we don't have a different engine for these languages, we can't do much more anyway.
(In reply to Friedel Wolff from comment #18)
> 
> If I understand correctly, you are concerned about the order of suggestions
> in cases where you don't trust frequency data built on your corpus evidence
> (did I understand that correctly?). 

Yes.

> We could simply not provide frequency
> data (by for example giving equal weight to all words). That way suggestions
> are mostly influenced by length and distance from the keys pressed (again an
> advantage we get if we have separate keys for the open vowels). However,
> only a single few suggestions are visible for anything longer than a 5 or so
> characters, so we need to know that we are actually helping by having the
> relevant suggestion visible in most cases.
> 

I see, that could work. 

Maybe this suggests a quick solution to the morphology problem also; simply add hunspell support to FxOS for languages that need it, and for which there's a good open source dictionary/affix file, like Lingala.  Autocomplete wouldn't work, but hunspell could validate the spelling of words as they're entered, and offer suggestions if needed.
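
To show the general shape of "validate, then suggest", here is a toy Python sketch. It is not Hunspell's API or file format: the prefix and stem lists are illustrative placeholders rather than verified Lingala data, and `difflib` stands in for a real suggestion algorithm. The point is only that a small stem list plus affix rules can accept many word forms that a flat word list would have to enumerate.

```python
import difflib

# Toy stand-in for a dictionary-plus-affix check; NOT Hunspell itself.
PREFIXES = ("", "ko")        # illustrative placeholders only
STEMS = {"bongisa", "sala"}  # illustrative placeholders only

def is_valid(word):
    """Accept a word if some known prefix followed by a known stem spells it."""
    return any(
        word.startswith(p) and word[len(p):] in STEMS for p in PREFIXES
    )

def suggestions(word, limit=3):
    """Offer generated forms that are textually close to a rejected word."""
    all_forms = [p + s for p in PREFIXES for s in sorted(STEMS)]
    return difflib.get_close_matches(word, all_forms, n=limit)

print(is_valid("kobongisa"))    # True
print(is_valid("kobongsa"))     # False
print(suggestions("kobongsa"))  # includes 'kobongisa'
```

As noted above, this kind of check validates completed words but does not give autocomplete while a word is being typed.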
(In reply to Kevin Scannell from comment #11)
> 
> It might be possible to handle the latter issue with a small tweak to the
> autocorrect engine.  The user could enter words in ASCII and the engine
> could then suggest any correctly-spelled words that match the ASCII word
> after stripping tone marks and converting ɛ->e, ɔ->o.
> 
> Is something like that doable?

Does this become a Lingala-specific fix, or is it one of the general improvements we need to make for languages with morphology problems?
Flags: needinfo?(bhuang)
(In reply to Bruce Huang [:bhuang] <bhuang@mozilla.com> from comment #20)
> (In reply to Kevin Scannell from comment #11)
> > 
> > It might be possible to handle the latter issue with a small tweak to the
> > autocorrect engine.  The user could enter words in ASCII and the engine
> > could then suggest any correctly-spelled words that match the ASCII word
> > after stripping tone marks and converting ɛ->e, ɔ->o.
> > 
> > Is something like that doable?
> 
> Does this become a Lingala specific fix, or is it one of the general
> improvements we need to make for languages with morphology problems?

I think we need to address this from a generic perspective. There are also some non-Bantu African languages that would have this issue, e.g. Yoruba. As with the Bantu morphology, we need to address the high-level issue and then allow these languages to be supported in a less ad hoc way.
(In reply to Bruce Huang [:bhuang] <bhuang@mozilla.com> from comment #20)
> (In reply to Kevin Scannell from comment #11)
> > 
> > It might be possible to handle the latter issue with a small tweak to the
> > autocorrect engine.  The user could enter words in ASCII and the engine
> > could then suggest any correctly-spelled words that match the ASCII word
> > after stripping tone marks and converting ɛ->e, ɔ->o.
> > 
> > Is something like that doable?
> 
> Does this become a Lingala specific fix, or is it one of the general
> improvements we need to make for languages with morphology problems?

These are two independent problems needing independent solutions.  The "special character" issue is relevant for Lingala, Akan, Hausa, Yoruba, Igbo, etc. where it is very difficult to produce a good frequency list with the correct special characters.
Tony, Nisha, please provide an update on the priority of Lingala.
In answer to Comment 5 and 24, there is no firm requirement at this time for Lingala, i.e. no specific launch is yet scheduled where Lingala is a mandatory requirement. As and when this changes, the mana page will be updated.
Flags: needinfo?(aappleton)
Closing this bug because Lingala is no longer requested for FxOS.
Flags: needinfo?(nmalhan)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX