Closed Bug 896363 Opened 11 years ago Closed 11 years ago

[ca] Catalan wordlist of text prediction

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

Not set
normal

Tracking

(blocking-b2g:leo+, b2g18 verified, b2g-v1.1hd fixed)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: djf)

Attachments

(1 file)

Kevin, can you help with this?

We're looking to add Catalan to Firefox OS, and need the text prediction stuff, of course.
Just to note that text prediction is not mandatory for 1.1; we can include it in later releases.
Glad to help - I can probably have something by the end of the week.
Here's a draft of the file needed for predictive text in Catalan:

http://borel.slu.edu/obair/ca.zip

It's based on a corpus of ~45 million words of Catalan crawled from the web.  I only kept words that are accepted by v. 2.5.0 of the Firefox spellchecking addon:

https://addons.mozilla.org/en-us/firefox/addon/general-catalan-dictionary/

Let me know how this looks.
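
For readers following along, the approach described above (count word frequencies in a corpus, keep only the words a spellchecker accepts) can be sketched roughly as follows. This is an illustration only: `buildWordList` and its inputs are hypothetical names, and it is not the tooling that actually produced ca.zip.

```javascript
// Sketch only: buildWordList and its inputs are hypothetical names,
// not the tooling actually used to produce ca.zip.

// Count how often each word occurs in the corpus, keeping only words
// that the spellchecker's dictionary accepts.
function buildWordList(corpusText, acceptedWords) {
  const counts = new Map();
  // Split on anything that is not a letter, apostrophe, middle dot or hyphen,
  // so Catalan forms such as "goril·la" stay in one piece.
  const tokens = corpusText.toLowerCase().split(/[^\p{L}'·-]+/u);
  for (const token of tokens) {
    if (token === "" || !acceptedWords.has(token)) continue;
    counts.set(token, (counts.get(token) || 0) + 1);
  }
  // Most frequent words first; ties broken alphabetically for stable output.
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
}

// Tiny made-up example: "i" and the misspelling "csaa" are filtered out.
const accepted = new Set(["la", "goril·la", "casa"]);
console.log(buildWordList("La casa, la goril·la i la csaa.", accepted));
```

A real pipeline would presumably also normalize apostrophes and handle elided forms, which this sketch does not attempt.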
(In reply to Kevin P. Scannell from comment #3)
> Here's a draft of the file needed for predictive text in Catalan:
> 
> http://borel.slu.edu/obair/ca.zip
> 
> It's based on a corpus of ~45 million words of Catalan crawled from the web.
> I only kept words that are accepted by v. 2.5.0 of the Firefox spellchecking
> addon:
> 
> https://addons.mozilla.org/en-us/firefox/addon/general-catalan-dictionary/
> 
> Let me know how this looks.

That's awesome, Kevin. Joan, can you test it? I'll try it as well.
Hi David, as you mentioned on the mailing list that you'd take this, I'm assigning it to you.
Assignee: nobody → dflanagan
(In reply to Toni Hermoso Pulido from comment #4)
> 
> That's awesome, Kevin. Joan, can you test it? I'll try it as well.


Toni and Joan: note that there isn't yet anything to test, unless you want to just look at Kevin's word list.  I now need to take that wordlist, convert it to a binary dictionary and create a patch to add the dictionary to Gaia.

Nominating this bug for 1.1 because I've heard rumors that Leo will turn off auto-correction by default unless Catalan is supported.
blocking-b2g: --- → leo?
Kevin, thanks for this great word list!!! I will check it, but from reading it in plain text, I'm sure you've done a really good job. :) Can we use it in other open source projects like Android?

David, thanks for info. I will do some minor tests for its use in Catalan (l·l digraph, apostrophe and hyphen), but this frequency word list is the best available under an open-source licence and it's a very good starting point.
(In reply to Joan Montané from comment #7)
> Kevin, thanks for this great word list!!! I will check it, but from reading
> it in plain text, I'm sure you've done a really good job. :) Can we use it
> in other open source projects like Android?
> 

Yes, feel free to use under any open source license you like.
(In reply to David Flanagan [:djf] from comment #6)
> (In reply to Toni Hermoso Pulido from comment #4)
> > 
> > That's awesome, Kevin. Joan, can you test it? I'll try it as well.
> 
> 
> Toni and Joan: note that there isn't yet anything to test, unless you want
> to just look at Kevin's word list.  I now need to take that wordlist,
> convert it to a binary dictionary and create a patch to add the dictionary
> to Gaia.
> 
> Nominating this bug for 1.1 because I've heard rumors that Leo will turn off
> auto-correction by default unless Catalan is supported.

Hi David, 
just in case, before you start preparing the binary: Kevin is generating new versions based on feedback from Joan (and Jaume, who is not in Cc).
We will comment back, hopefully soon, when there is a new version.
Since we're attempting to ship Catalan as part of 1.1, this is leo+ for now at least.
blocking-b2g: leo? → leo+
Hi,

Kevin has built a new version of the Catalan predictive word list:

http://borel.slu.edu/obair/ca-v3.zip

It's much better than the first one, so if possible, please replace the first list with this one.
Rudy,

This patch adds a Catalan wordlist and dictionary, and includes a trivial change to layout.js to associate the dictionary with the already-existing Catalan keyboard layout.
Attachment #782714 - Flags: review?(rlu)
(In reply to Joan Montané from comment #11)
> Hi,
> 
> Kevin has built a new version of the Catalan predictive word list:
> 
> http://borel.slu.edu/obair/ca-v3.zip
> 
> It's much better than the 1st one. So, if possible, replace 1st list with
> this last one.

The patch above is based on this latest version of the wordlist.
blocking-b2g: leo+ → leo?
After some testing (I generated a ca.dic and uploaded it to a Unagi), I must say that the experience is really good, and I'd say it is suitable to be included.
The only issue is with words containing l·l (goril·la, tranquil·litat, paral·lel), which do not seem to be suggested if 'l·l' is entered from the alternates of 'l' (three characters in one key). There is no problem if it is entered as three separate characters (· is an alternate of .).
Comment on attachment 782714 [details]
link to patch on github

Looks good, r=me.

I have seen what Toni mentioned in Comment 14, but I think that could be handled by a follow-up bug.
Attachment #782714 - Flags: review?(rlu) → review+
Toni,

Thanks for reporting the issues with l·l.  It looks like there is an issue with all alternate keys that have more than one character: none of them get sent to the input method at all, so they do not interact with auto-correct, and they put the input method into an inconsistent state, breaking future auto-correct.
 
I'm going to fix it as part of this bug because it already has leo+, and it is a serious bug that needs to be fixed.
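
To illustrate the problem, here is a minimal sketch of one way a multi-character alternate could be routed through the input method. The names `sendKey` and `handleCharacter` are hypothetical and are not Gaia's API; the actual patch may solve this differently.

```javascript
// Sketch only: sendKey and handleCharacter are hypothetical, not Gaia's API.

// Deliver a key's value to the input method one character at a time, so a
// multi-character alternate such as "l·l" goes through the same path as
// three separate key presses.
function sendKey(value, inputMethod) {
  for (const ch of value) {
    inputMethod.handleCharacter(ch);
  }
}

// A toy input method that only records the word typed so far.
const toyInputMethod = {
  currentWord: "",
  handleCharacter(ch) {
    this.currentWord += ch;
  },
};

sendKey("gori", toyInputMethod);
sendKey("l·l", toyInputMethod); // alternate key carrying three characters
sendKey("a", toyInputMethod);
console.log(toyInputMethod.currentWord);
```

The point of the sketch is only that the input method sees every character, so its notion of the current word stays in sync with what is on screen.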
I notice that at the beginning of a sentence, l·l gets capitalized to L·L, but Wikipedia tells me that L·l is correct. I'll make sure this gets fixed, too.
(In reply to David Flanagan [:djf] from comment #17)
> I notice that at the beginning of a sentence, l·l gets capitalized to L·L,
> but Wikipedia tells me that L·l is correct. I'll make sure this gets fixed,
> too.

I'm not entirely sure about this; Joan could tell us more. Where does Wikipedia say that? Actually, in the Catalan Wikipedia, the main entry is 'L·L': http://ca.wikipedia.org/wiki/L%C2%B7L 
In any case, no word starts with 'l·l', so the L·l vs L·L dilemma would never arise.
blocking-b2g: leo? → leo+
Comment on attachment 782714 [details]
link to patch on github

Rudy,

I've added a new commit to the PR to correctly handle l.l (and other multi-character alternatives) and to correctly capitalize them.

l.l will capitalize to L.l normally, but to L.L if caps lock is on.  This seems like the right thing to me. I don't think any of our other keyboard layouts have similar cases.  Other multi-character alternatives are already in uppercase (like R$) or begin with a digit (like 3rd) or are in the alt layout without a shift key and can't be upper-cased.

You may notice that this patch does not affect the ".com" key on the URL keyboard. That one emits lowercase ".com" regardless of the uppercase or caps lock state of the keyboard.  That is because of line 888 in getUpperCaseValue(). Do you think I should change it so that if caps lock is on the .com key emits .COM?
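
The capitalization rules described in this comment can be sketched as below. This is an assumption-laden illustration: `upperCaseValue` and its `capsLock` / `keepLowerCase` options are hypothetical names, and the real logic lives in getUpperCaseValue() in the keyboard code and may differ.

```javascript
// Sketch only: upperCaseValue and its options are hypothetical; the real
// logic lives in getUpperCaseValue() in Gaia's keyboard code and may differ.

function upperCaseValue(value, { capsLock = false, keepLowerCase = false } = {}) {
  // Keys such as ".com" opt out of capitalization entirely.
  if (keepLowerCase) return value;
  // Caps lock upper-cases every character: "l·l" becomes "L·L".
  if (capsLock) return value.toUpperCase();
  // A single shift capitalizes only the first character: "l·l" becomes "L·l".
  const [first = "", ...rest] = value;
  return first.toUpperCase() + rest.join("");
}

console.log(upperCaseValue("l·l"));
console.log(upperCaseValue("l·l", { capsLock: true }));
console.log(upperCaseValue(".com", { capsLock: true, keepLowerCase: true }));
```

Values that start with a digit or are already upper case, such as "3rd" or "R$", pass through the shift branch unchanged.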
Attachment #782714 - Flags: review+ → review?(rlu)
(In reply to David Flanagan [:djf] from comment #19)
> Comment on attachment 782714 [details]
> link to patch on github
> 
> Rudy,
> 
> I've added a new commit to the PR to correctly handle l.l (and other
> multi-character alternatives) and to correctly capitalize them.
> 
> l.l will capitalize to L.l normally, but to L.L if caps lock is on.  This
> seems like the right thing to me. I don't think any of our other keyboard
> layouts have similar cases.  Other multi-character alternatives are already
> in uppercase (like R$) or begin with a digit (like 3rd) or are in the alt
> layout without a shift key and can't be upper-cased.
> 
> You may notice that this patch does not affect the ".com" key on the URL
> keyboard. That one emits lowercase ".com" regardless of the uppercase or
> caps lock state of the keyboard.  That is because of line 888 in
> getUpperCaseValue(). Do you think I should change it so that if caps lock is
> on the .com key emits .COM?

I think we don't have to.
I checked my iPhone and it won't output .COM even when uppercase or caps lock is on. 

Thanks for handling this.
Comment on attachment 782714 [details]
link to patch on github

This looks really great, r+.
Thanks again.
Attachment #782714 - Flags: review?(rlu) → review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
This patch does not apply cleanly to v1-train. It looks like we've got to at least uplift some previous fix that added the Catalan keyboard layout. I didn't realize that wasn't already in v1-train.

Setting needinfo on myself so I don't forget about uplifting this bug now that it has been closed.
Flags: needinfo?(dflanagan)
Depends on: 866746
I've uplifted bug 866746 to v1-train, adding the Catalan keyboard layout, so this patch should uplift much more cleanly now.
Checked on a Unagi build. This works nicely!
Depends on: 900355
Whiteboard: [LeoVB+]
Verified on Leo V1.1 MOZ RIL,
Catalan text prediction is working as expected

Environmental Variables:
Build ID: 20130806071254
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/a2a9b89ef5ee
Gaia: 4c1a20570e20f64782ba170c14604395c48f7381
Platform Version: 18.1
v1.1.0hd: d98a10641f1c6d87b5eb9914cde23e836d1d03c7
v1.1.0hd: 5c2bf86ec9fde0c52a92abf4afdc0575c01389a7