validChars() in predictions.js should consider different casing

RESOLVED FIXED in 2.2 S9 (3apr)

Status

Firefox OS
Gaia::Keyboard
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: mnjul, Assigned: mnjul)

Tracking

unspecified
2.2 S9 (3apr)
ARM
Gonk (Firefox OS)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [p=2])

Attachments

(2 attachments, 1 obsolete attachment)

This came as frustration during my work in bug 1102835: validChars() in predictions.js currently doesn't accept variant forms (different casing or accentedness) of characters in the input.

As input may be substituted with variant forms during process(), we should accept them.

This issue is especially prominent when dealing with user dictionaries: Suppose that there is only one single word in user dictionary: levicorpus[*], the character table will only contain the lowercase version of L. If user types Levic with the intention to type Levicorpus (such as at the beginning of a sentence), the predictions engine for user dictionary rejects the input right away with validChars() even though Levicorpus will be suggested should it get passed to processCandidates() -- and it's the only one that gets suggested and that matches the user's intention, at that.

Of course this change will modify the behavior of built-in dictionary too, but:
1. Built-in dictionary character table is much much less likely to have only a specific casing of a character [†].
2. I actually think it has benefits for built-in dictionary that we don't reject potentially viable candidates here at validChars().

[*] A spell from Harry Potter for that matter
[†] Since our substitution rule says that we can't substitute an accented version of a char with a plain version of a char without counting it as a correction, we'll actually only consider different casing here, as "ç" should not be valid for "c".
(In reply to John Lu [:mnjul] [MoCoTPE] from comment #0)
> This came as frustration during my work in bug 1102835: validChars() in
> predictions.js currently doesn't accept variant forms (different casing or
> accentedness) of characters in the input.
> 
> As input may be substituted with variant forms during process(), we should
> accept them.
> 
> This issue is especially prominent when dealing with user dictionaries:
> Suppose that there is only one single word in user dictionary:
> levicorpus[*], the character table will only contain the lowercase version
> of L. If user types Levic with the intention to type Levicorpus (such as at
> the beginning of a sentence), the predictions engine for user dictionary
> rejects the input right away with validChars() even though Levicorpus will
> be suggested should it get passed to processCandidates() -- and it's the
> only one that gets suggested and that matches the user's intention, at that.
> 
> Of course this change will modify the behavior of built-in dictionary too,
> but:
> 1. Built-in dictionary character table is much much less likely to have only
> a specific casing of a character [†].
> 2. I actually think it has benefits for built-in dictionary that we don't
> reject potentially viable candidates here at validChars().
> 
> [*] A spell from Harry Potter for that matter
> [†] Since our substitution rule says that we can't substitute an accented
> version of a char with a plain version of a char without counting it as a
> correction, we'll actually only consider different casing here, as "ç"
> should not be valid for "c".

Neh, that's wrong. If user dictionary has only "Çelik", we shouldn't reject "celik". This adds complexity :/
Created attachment 8571792 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master
Thanks for investigating, I noticed this behavior before but didn't know this was the cause. I thought it was low frequency or whatever.
Comment on attachment 8571792 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master

Jan, please check the patch out. Thanks!
Attachment #8571792 - Flags: review?(janjongboom)
Wow, I just started looking through the code today to see how I could fix this for Greek. I'm so happy that it's already being taken care of!
I'm trying to see this problem on action on gaia master but I cannot find any case where the suggestions go wrong. F.e. based on your description I'd figure that (en_us) <w f="83" flags="">émigré</w> would be problematic but it gets shown as suggestion for any input like 'emigre', 'Emigre', 'Emigré', etc.

Can you find a use case reproducible on master?
Flags: needinfo?(jlund)
I can:
1) enable the Greek keyboard
2) open the contacts app and click the + icon to add a contact
3) click on the first name and type: Κάθ
4) see how Κάθε (every) is suggested
5) now delete that and type Καθ instead
6) observe that Κάθε is no longer suggested. Even if you type Καθε, it still won't appear in the list

I admit I haven't tried this patch, but hopefully it takes care of this issue.

Comment 8

3 years ago
I'm guessing you mean to ni jlu@mozilla.com :)

/me 302's on that assumption
Flags: needinfo?(jlund) → needinfo?(jlu)
(In reply to Jan Jongboom [:janjongboom] (Telenor) from comment #6)
> I'm trying to see this problem on action on gaia master but I cannot find
> any case where the suggestions go wrong. F.e. based on your description I'd
> figure that (en_us) <w f="83" flags="">émigré</w> would be problematic but
> it gets shown as suggestion for any input like 'emigre', 'Emigre', 'Emigré',
> etc.
> 
> Can you find a use case reproducible on master?

Uhm...not really. since é, e, and E are all in the dictionary's character table, the inputs you said will be accepted normally. It can become problematic if dictionary does not have 'E' or 'e' at all, and only has 'é'.

Welllllll....this is a little bit embarrassing. I dug through all the latin dictionaries and layouts and I believe all of them are "well-behaved" and cannot exhibit this buggy behavior. ("Well-behaved" is that all words in the dict file can be input'ed by corresponding layout definition's keys.)

And this bug might only be relevant when we need to predict from user dictionary; with that said, though, applying the bug's patch doesn't affect the existing behavior with built-in dictionaries either (since all dicts+layouts are "well-behaved".))

So I guess this fix should be folded into bug 1102835. Jan, what do you think about this?
Flags: needinfo?(jlu) → needinfo?(janjongboom)
(In reply to Panos Astithas [:past] from comment #7)
> I can:
> 1) enable the Greek keyboard
> 2) open the contacts app and click the + icon to add a contact
> 3) click on the first name and type: Κάθ
> 4) see how Κάθε (every) is suggested
> 5) now delete that and type Καθ instead
> 6) observe that Κάθε is no longer suggested. Even if you type Καθε, it still
> won't appear in the list
> 
> I admit I haven't tried this patch, but hopefully it takes care of this
> issue.

Unfortunately that's a different issue. But rest assured, I've got the cause identified. Please see bug 1145955.
Nah, I don't mind that it's only relevant to user dictionaries. Just need a reproducible path. Let's keep it in this bug :-)
Created attachment 8581278 [details]
2015-03-20-18-31-28.png

Alright, you need to reproduce the issue with bug 1102835's patch at https://github.com/mozilla-b2g/gaia/pull/28648/commits . The patch has got four commits, one being from this bug.

A1) Please first flash your Gaia at the tip commit of that PR branch.
A2) With user dictionary settings (Settings -> Keyboards -> Built-in Keyboard -> User dictionary) add a user word, say, "émigra". 
A3) In SMS app's message field, type "Emig" or "Emigr", observe that Émigra is suggested (see attached screenshot).

B1) Now revert the bug 1138782's commit at that branch, which is currently c1fae95. I don't know why I got a revert conflict there but there is only one conflicting chunk, for which you can just paste what's there in master: https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/predictions.js#L367-L391.
B2) Flash your Gaia with |make install-gaia APP=keyboard| such that the user word doesn't get erased (as with reset-gaia).
B3) In SMS app's message field, type "Emig" or "Emigr", observe that Émigra is no longer suggested.
B4) ...and this is because e is not in the user dictionary's character table so it gets rejected by |validChars()|.
Comment on attachment 8571792 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master

See GitHub, idea is good.
Flags: needinfo?(janjongboom)
Attachment #8571792 - Flags: review?(janjongboom) → review-
Comment on attachment 8571792 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master

Hi Jan,

Thanks for the comments. I have drastically revised the patch and we now have a new Set() to contain all valid chars, inferred from the dictionary’s character table, the layout’s nearbykeys, and variants forms of them. During prediction we just check if all the characters in input string are present in the Set.

There is however something that’s not been done very nicely. As the inference of the validChars Set requires knowledge of both |nearbyKeys| and |variants|, we want to generate the Set after both setNearByKeys() and setDictionary() are called. This is currently ensured by placing the generation code at setDict(), because from the client codes we always call setNBK before setDict, and setNBK may be optional, from latin.js#activate() to latin.js#setLanguageSync() (and from unit testing codes too). Sadly, it’s quite hard to lessen this ordering constraint, as predictions.js doesn’t really have knowledge on whether it’s to be re-initialized, and whether it should wait for any potential incoming setNBK, when setDict is called (or any combination of them), since we don’t really have a clear-cut “reinitialization” command for it. IMO, any attempt to reason such reinitialization concept out of current codes wouldn’t be architecturally better than my current assumption about the order of setNBK and setDict. How do you think about this?

Anyway, please check out the patch again. Thanks!
Attachment #8571792 - Flags: review- → review?(janjongboom)
Assignee: nobody → jlu
Status: NEW → ASSIGNED
Whiteboard: [p=2]
Target Milestone: --- → 2.2 S9 (3apr)
Comment on attachment 8571792 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master

Good, way cleaner.
Attachment #8571792 - Flags: review?(janjongboom) → review+
Keywords: checkin-needed

Updated

3 years ago
Keywords: checkin-needed
http://docs.taskcluster.net/tools/task-graph-inspector/#DYgR-OagTSmJry0FNUORVg

The pull request failed to pass integration tests. It could not be landed, please try again.
Turns out all the errors are not related. Let me try again...
Keywords: checkin-needed

Updated

3 years ago
Keywords: checkin-needed
http://docs.taskcluster.net/tools/task-graph-inspector/#4CV9geYGQJmC-EhHrqojAA

The pull request failed to pass integration tests. It could not be landed, please try again.
Another try after rebase.
Keywords: checkin-needed

Updated

3 years ago
Keywords: checkin-needed

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Thanks Jan and Autolander(?) !
Flags: needinfo?(jlu)
Created attachment 8583607 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master
Attachment #8571792 - Attachment is obsolete: true
Flags: needinfo?(jlu)
Timing issue. Added a guard on that. Rudy, as offline discussed, could you carry over the review+ from attachment 8571792 [details] [review]? Thanks!
Flags: needinfo?(rlu)
Comment on attachment 8583607 [details] [review]
[gaia] mnjul:bug_1138782_validchar_consider_variants > mozilla-b2g:master

Rubber stamp to carry over the r+, this is for autolander to work.
Flags: needinfo?(rlu)
Attachment #8583607 - Flags: review+
All those Gip tests and retriggers look happy.
Keywords: checkin-needed

Updated

3 years ago
Keywords: checkin-needed

Updated

3 years ago
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.