Closed Bug 1126076 Opened 9 years ago Closed 8 years ago

Add Hausa (ha) Wordlist/Dictionary

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Hardware: ARM
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: delphine, Unassigned, NeedInfo)

References

Details

Attachments

(3 files)

firefoxos_2.0-ha.zip (156.40 KB, application/x-zip-compressed)
spartacus-ha.zip (3.82 KB, application/x-zip-compressed)
fireplace-ha.zip (9.76 KB, application/x-zip-compressed)
Please add Hausa Wordlist and Dictionary to Firefox OS
Adding localizer to see if he can help with feedback here. Thanks!
Flags: needinfo?(mcsteann)
No update from the localizer, so asking:
* Peiying: can you get Rubric plugged in to help with this?
* Kevin: could you help out as well?
Thanks, all!
Flags: needinfo?(pmo)
Flags: needinfo?(kscanne)
+ Devon and Ian, we need Rubric's advice on this.
Flags: needinfo?(pmo)
(In reply to Peiying Mo [:CocoMo] from comment #3)
> + Devon and Ian, we need Rubric's advice on this.

Let's try the same approach as my Comment 12 on Bug 1121730.
There's a good amount of Hausa online, but like Lingala, only a small percentage (about 4% by my best estimate) uses the correct "special" characters (in this case ɓ, ƙ, ɗ).  
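
(For what it's worth, the 4% figure is just the share of pages that contain any of the hooked letters at all; roughly the following, where the corpus layout is made up:

import glob

HOOKED = set("ɓƙɗƁƘƊ")

def uses_hooked(path):
    # True if the page contains at least one hooked consonant.
    with open(path, encoding="utf-8") as f:
        return any(ch in HOOKED for ch in f.read())

# "corpus/*.txt" is a made-up layout: one file per downloaded page.
docs = glob.glob("corpus/*.txt")
good = [d for d in docs if uses_hooked(d)]
print("%.1f%% use the special characters" % (100.0 * len(good) / len(docs)))

so it's a crude measure, but it gives the order of magnitude.)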

The other issue is that there is no clean word list that uses the correct characters. The Firefox addon here:

https://addons.mozilla.org/en-us/firefox/addon/hausa-spelling-dictionary/

is virtually all ASCII.

From earlier work of mine on a spellchecker, I have what I think is a pretty comprehensive list of ~500 pairs of words that are correct both as ASCII and with special characters (ƙasa/kasa, saƙo/sako, etc.).

With that in mind, here's what I'd propose:

(1) I use the full web corpus to produce an ASCII-only frequency list, maybe validating against the Firefox spellchecker (I haven't checked its coverage, so I don't know if that's worthwhile).
(2) Use the 4% of properly-encoded web texts to produce a word list of words containing special characters, keeping everything that appears more than, say, 2 or 3 times.
(3) Use the frequency of the ASCII version from (1) as a proxy for the frequency of the presumed-correct words from step (2)...
(4) *except* if the word is in my list of special cases (ƙasa/kasa, saƙo/sako, etc.). Here it's not clear what to do. I suppose I could split the frequency of the ASCII version from (1) according to the relative proportions that I see in the good (4%) corpus. A rough sketch of this follows below.
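
To make the proposal concrete, here is roughly what I have in mind, in Python. Untested, and ascii_freq / special_counts / pairs are just placeholders for the outputs of steps (1) and (2) and for my pair list, not code I've run:

from collections import Counter

FOLD = str.maketrans("ɓƙɗƁƘƊ", "bkdBKD")

def fold(w):
    # Collapse the hooked consonants onto their ASCII look-alikes.
    return w.translate(FOLD)

def build_wordlist(ascii_freq, special_counts, pairs, min_count=3):
    # ascii_freq:     step (1), word -> count over the full web corpus
    # special_counts: word counts over the well-encoded 4% corpus
    # pairs:          [("kasa", "ƙasa"), ...] where *both* spellings are real
    ambiguous = {a for a, _ in pairs} | {s for _, s in pairs}
    wordlist = Counter()

    # Steps (2)+(3): keep special-character words seen often enough in the
    # good corpus, with the web frequency of their ASCII folding as a proxy.
    for w, c in special_counts.items():
        if c < min_count or w == fold(w) or w in ambiguous:
            continue
        wordlist[w] = ascii_freq.get(fold(w), c)

    # Step (4): split the web frequency of the ASCII form between the two
    # genuine spellings, in the proportions seen in the good corpus
    # (+1 smoothing so a pair missing from the small corpus doesn't vanish).
    for plain, special in pairs:
        total = ascii_freq.get(plain, 0)
        p = special_counts.get(plain, 0) + 1
        s = special_counts.get(special, 0) + 1
        wordlist[special] = round(total * s / (p + s))
        wordlist[plain] = total - wordlist[special]

    # Finally, step (1)'s plain ASCII words, except the ones we have just
    # decided are really misspellings of a special-character word.
    shadows = {fold(w) for w in wordlist}
    for w, c in ascii_freq.items():
        if w not in wordlist and w not in shadows:
            wordlist[w] = c
    return wordlist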

I'd be grateful for some feedback from the Hausa team on this.  If they don't care about preserving the special characters then I suppose I don't need to bother with any of this!
Flags: needinfo?(kscanne)
Kevin, you can also have a look at the PO files for Gaia itself:
https://github.com/translate/mozilla-gaia/commits/2.0/ha

Here are also a few from old GNOME translations:
https://l10n.gnome.org/POT/gnome-panel.master/gnome-panel.master.ha.po
https://l10n.gnome.org/POT/metacity.master/metacity.master.ha.po
https://l10n.gnome.org/POT/nautilus.master/nautilus.master.ha.po
(be sure to include obsolete messages if you can; there seems to be quite a lot of text in them).

Maybe this is already in your web corpus, but if not, hopefully it gives you a little more text (horribly biased, unfortunately). I saw at least some non-ASCII characters in these files, though I don't know how frequent they are supposed to be.
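
If it helps, something like this (with polib, untested) should pull the Hausa text out of those files, obsolete messages included:

import polib

def hausa_text(path):
    po = polib.pofile(path)
    # obsolete_entries() picks up the "#~" messages, which is where a
    # fair amount of the text in these files seems to live.
    entries = po.translated_entries() + po.obsolete_entries()
    return [e.msgstr for e in entries if e.msgstr]

for line in hausa_text("gnome-panel.master.ha.po"):
    print(line)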

About your plan: If your list of 500 is fairly complete, it mostly sounds good. I guess you can also augment the 500 with what you see in the 4% corpus. Is the 4% big enough, though? Any issues of balance in the 4%?
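
By augmenting I mean something like this: any word in the 4% corpus whose ASCII folding also occurs there as a word is a candidate pair to review (rough sketch, untested):

FOLD = str.maketrans("ɓƙɗƁƘƊ", "bkdBKD")

def candidate_pairs(words):
    # words: the set of word types seen in the 4% corpus.
    # Returns (ascii, special) candidates for the list of 500; some of
    # the ASCII hits will just be typos, so the output needs human review.
    return sorted((w.translate(FOLD), w) for w in words
                  if w != w.translate(FOLD) and w.translate(FOLD) in words)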
(In reply to Friedel Wolff from comment #6)
> Kevin, you can also have a look at the PO files for Gaia itself:
> https://github.com/translate/mozilla-gaia/commits/2.0/ha
> 
> Here are also a few from old GNOME translations:
> https://l10n.gnome.org/POT/gnome-panel.master/gnome-panel.master.ha.po
> https://l10n.gnome.org/POT/metacity.master/metacity.master.ha.po
> https://l10n.gnome.org/POT/nautilus.master/nautilus.master.ha.po
> (be sure to include obsolete messages if you can; there seems to be quite
> a lot of text in them).
> 
> Maybe this is already in your web corpus, but if not, hopefully it gives
> you a little more text (horribly biased, unfortunately). I saw at least
> some non-ASCII characters in these files, though I don't know how frequent
> they are supposed to be.

Thanks.

> 
> About your plan: If your list of 500 is fairly complete, it mostly sounds
> good. I guess you can also augment the 500 with what you see in the 4%
> corpus. Is the 4% big enough, though? Any issues of balance in the 4%?

It's about 250k words total, ~19k unique words. It's heavily biased towards religious texts, so I'd rather not use frequencies from it if possible.
Attached file firefoxos_2.0-ha.zip
Here are the FFOS 2.0 Hausa files.
Attached file spartacus-ha.zip
Attached file fireplace-ha.zip
Hello Kevin,

Another bit of corpus:

https://localize.mozilla.org/ha/masterfirefoxos/
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX