Closed
Bug 1226981
Opened 10 years ago
Closed 10 years ago
Romanian keyboard is missing prediction for many words with '-'
Categories
(Firefox OS Graveyard :: Gaia::Keyboard, defect)
Firefox OS Graveyard
Gaia::Keyboard
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mihaibn, Assigned: jobaval10n, Mentored)
Details
(Keywords: foxfood, Whiteboard: [bzlite])
Attachments
(6 files, 16 obsolete files)
User-Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0
I installed the Romanian keyboard from Settings->Keyboard and while using it I noticed that the dictionary behind it is very poor and it's missing many words that contain the character '-'. For this reason using it is a very cumbersome process.
| Comment hidden (obsolete) |
| Reporter | ||
Comment 2•10 years ago
|
||
| Reporter | ||
Updated•10 years ago
|
Component: Gaia::Feedback → Gaia::Keyboard
| Reporter | ||
Comment 17•10 years ago
|
||
Sorry for the unnecessary attachments. I created the bug using Bugzilla Lite and I missed removing them.
| Reporter | ||
Updated•10 years ago
|
Flags: needinfo?(cristian.silaghi)
Updated•10 years ago
|
Flags: needinfo?(cristian.silaghi)
Comment 18•10 years ago
|
||
I have no idea about this, nor what we can do.
Maybe the dictionary is outdated? Maybe we need another dictionary?
I don't even use T9, to be honest.
Best Regards,
Cristian Silaghi
| Reporter | ||
Comment 19•10 years ago
|
||
Well, I use it a lot when I write. The prediction is very primitive and able to guess the Romanian words that contain '-' (such as mi-au, si-au, ti-au, v-am...); this makes it very cumbersome to type.
Maybe the issue is that these words are missing from the dictionary. Can anybody check this, please?
Comment 20•10 years ago
|
||
I can confirm this. Asking for NI to Flod, Delphine as they might let us know how we can proceed.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(lebedel.delphine)
Flags: needinfo?(francesco.lodolo)
| Reporter | ||
Comment 21•10 years ago
|
||
Small correction to what I said in comment 19: the prediction is very primitive and NOT able to guess the Romanian words that contain '-'... I missed the negation.
Comment 22•10 years ago
|
||
Dictionaries are hosted here
https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries
For the hyphen question, Tim is probably a better pick than me or Delphine.
@reporter
You can hide attachments by going into the details (Details->Edit details) and marking them as obsolete (if they are).
Flags: needinfo?(timdream)
Flags: needinfo?(lebedel.delphine)
Flags: needinfo?(francesco.lodolo)
| Reporter | ||
Updated•10 years ago
|
Attachments #8690548 through #8690563 (16 attachments, one update each)
Attachment is obsolete: true
Comment 23•10 years ago
|
||
@flod, do you know if we can use another dictionary for the Romanian locale?
E.g. the Google/Android dictionary?
Because their dictionary is superior to ours.
Or do they use some non-permissive license?
Best Regards,
Cristian Silaghi
| Reporter | ||
Comment 24•10 years ago
|
||
FYI, I looked a bit at how Android does it and found this: https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz
There are 1,125,287 Romanian words there, compared to only 115,832 in the Mozilla dictionary!
Comment 25•10 years ago
|
||
Note that it's not a standard dictionary
https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/README.md
I don't know about licenses for dictionaries; Gaia is released under the Apache License
https://github.com/mozilla-b2g/gaia/blob/master/LICENSE
| Assignee | ||
Comment 26•10 years ago
|
||
Anyone happen to know what format the .dict files are in?
Listed here:
https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries
We could convert the wordlist in Android if the license makes that possible: https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz
What is the relationship between the .xml and .dict files?
Basically, in Romanian a few words contain a dash, like "într-o/într-un" (en: "in a [something]"), but most uses of the dash indicate contractions with prepositions or pronouns: "să îți -> să-ți", so typing prediction from a list of single words will have this problem.
Comment 27•10 years ago
|
||
Clearing needinfo for Tim for now.
I found that the Hungarian wordlist does contain words with hyphens, like "szeretet-teli".
Could you verify that those work? (Don't have a current build).
The wordlist itself seems to date back to a mass-add by dflanagan from bug 908286, hard to tell what went into the actual word lists back then.
Also CCing Kevin, in case he has a good idea.
Flags: needinfo?(timdream)
| Reporter | ||
Comment 29•10 years ago
|
||
(In reply to Axel Hecht [:Pike] from comment #27)
> Clearing needinfo for Tim for now.
>
> I found that the hungarian wordlist does contain words with hyphens, like
> "szeretet-teli".
>
> Could you verify that those work? (Don't have a current build).
I just installed the Hungarian keyboard and typed "szeretetteli", and I get the prediction.
See the screenshot here: http://i.imgur.com/SGxkUvh.png
Comment 30•10 years ago
|
||
I created the autocorrect dictionary, and I think I see the problem. The frequencies came from a big web-crawled corpus of Romanian. But since there are many misspellings in web texts, I usually only keep the words validated by some open source spell checker. I used the Mozilla one for Romanian, here:
https://addons.mozilla.org/en-US/firefox/addon/romanian-spellchecking-diction/
But the affix file in this addon doesn't declare the hyphen as a word character at all, presumably choosing instead to just break on hyphens and spell check the pieces.
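To illustrate the effect described above, here is a minimal Python sketch. It is not the pipeline that produced the Gaia list; the two regular expressions and the sample sentence are assumptions made up for the illustration. It shows that a tokenizer which breaks on hyphens never emits a form such as "într-o", so that form can never receive a frequency.

```python
import re
from collections import Counter

# Two illustrative tokenizers (assumptions, not the real tooling):
# one breaks words at hyphens, the other keeps hyphens inside words.
SPLIT_ON_HYPHEN = re.compile(r"[^\W\d_]+")
KEEP_HYPHEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def count_words(text, pattern):
    """Count lowercase tokens found in `text` by the given tokenizer pattern."""
    return Counter(m.group(0).lower() for m in pattern.finditer(text))


sample = "S-o fi dus într-o zi; într-o seară s-o întoarce."
split_counts = count_words(sample, SPLIT_ON_HYPHEN)
whole_counts = count_words(sample, KEEP_HYPHEN)

# With hyphen splitting, only the pieces ("într", "o", "s") are counted.
print(split_counts["într-o"], whole_counts["într-o"])
```

With the hyphen-splitting pattern the contracted forms are counted only as fragments, which matches the symptom reported in this bug.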
| Reporter | ||
Comment 31•10 years ago
|
||
I see. So the result is a file which is 10x smaller than the one from https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz
Can we reuse this list, or any other large Android Romanian dictionary, to update the Mozilla Romanian dictionary?
| Assignee | ||
Comment 32•10 years ago
|
||
So after better reading comprehension of this:
https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/README.md
It looks like the xx.dict file is generated from the xx_wordlist.xml file using xml2dict.js.
However, my run of xml2dict.js produces different output, although the file size is the same.
193 KB of the 961 KB are different:
cmp -l ro.dict ro_test.dict | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}' | wc -l
Attached are the two different files for comparison.
ro.dict is the original file from the repo
ro_test.dict is the file generated by running xml2dict.js against ro_wordlist.xml with the command:
node --harmony xml2dict.js -o lang.dict lang_wordlist.xml
My question: is ro.dict generated automatically in the build process of gaia, or does it have to be added manually after each change to the corresponding xml file? If it's automatic, then we don't have to deal with that at all and just propose changes to ro_wordlist.xml
Jobava
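For readers without gawk, the byte comparison above can be approximated in Python. This is a minimal sketch that only counts differing byte positions, which is what `cmp -l ... | wc -l` reports; the function names are hypothetical, and the file names are the ones used in this comment.

```python
def differing_positions(data_a, data_b):
    """Count positions at which two byte strings differ.

    zip() stops at the shorter input, mirroring how `cmp -l` lists
    differences only up to the end of the shorter file.
    """
    return sum(1 for x, y in zip(data_a, data_b) if x != y)


def count_differing_bytes(path_a, path_b):
    """Read two files in binary mode and count their differing bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return differing_positions(fa.read(), fb.read())


# Example (assumes both files exist in the current directory):
# print(count_differing_bytes("ro.dict", "ro_test.dict"))
```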
| Assignee | ||
Comment 33•10 years ago
|
||
| Assignee | ||
Comment 34•10 years ago
|
||
| Reporter | ||
Comment 35•10 years ago
|
||
so what's the plan now that we have all this information? Can anyone with more knowledge reply on this?
Comment 36•10 years ago
|
||
(In reply to Mihai Barbat from comment #35)
> so what's the plan now that we have all this information? Can anyone with
> more knowledge reply on this?
What's the question to answer here? I am seeing the question to me was answered in comment 27.
(In reply to Jobava from comment #32)
> My question: is ro.dict generated automatically in the build process of
> gaia, or does it have to be added manually after each change to the
> corresponding xml file? If it's automatic, then we don't have to deal with
> that at all and just propose changes to ro_wordlist.xml
Assuming this is the question to answer: no, dict files are not generated in the build process (although they should be). They are checked in. So if you propose a change, please also rebuild the dict file.
Also, per the finding documented in [1], the sorting in the dict build process is not stable, so it is possible to generate a different dict even if the list is unchanged.
[1] https://wiki.mozilla.org/Gaia/System/Keyboard/IME/Latin/Prediction_%26_Auto_Correction
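As a rough illustration of why an unstable ordering can change the output while the word list stays the same, the Python sketch below sorts (word, frequency) pairs with an explicit tie-break. This is an assumption about the general mechanism, not a description of what xml2dict.js does internally; `ordered_entries` is a hypothetical helper.

```python
def ordered_entries(entries):
    """Sort (word, freq) pairs by descending frequency, then by word.

    The secondary key makes the result independent of the input order,
    so words sharing a frequency always come out in the same sequence.
    """
    return sorted(entries, key=lambda entry: (-entry[1], entry[0]))


# The same three entries in two different input orders.
first = [("asta", 30), ("aia", 30), ("de", 255)]
second = [("aia", 30), ("de", 255), ("asta", 30)]

print(ordered_entries(first) == ordered_entries(second))
```

Without such a tie-break, entries with equal frequency may be emitted in any order, which is enough to produce a byte-wise different file of identical size.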
| Reporter | ||
Comment 37•10 years ago
|
||
ok, so do we have a green light to create a new dictionary then, starting maybe from the android one?
| Reporter | ||
Comment 38•10 years ago
|
||
ping!
Comment 39•10 years ago
|
||
Hey Mihai - I think :timdream already answered in comment 36, see: "Assuming this is the question to answer: no, dict files are not generated in the build process (although it should). They are checked-in. So if you propose a change please also rebuild the dict file.
Also, per finding documented in [1] the sorting of dict build process is not stable, so it is possible to generate a different dict even if the list is unchanged."
Just go ahead and make the suggested changes.
| Assignee | ||
Comment 40•10 years ago
|
||
First, I don't understand the purpose for the field "flags", it's not present in either the Chrome or the Gaia dictionary. I will just ignore it for now.
Discoveries so far:
In the Chrome dictionary, there are only a few words where "freq" and "originalFreq" are different:
"aia" freq: 30, origFreq: 72
"alea" freq: 30, origFreq: 72
"asta" freq: 30, origFreq: 108
"astea" freq: 30, origFreq: 72
"de-a" freq: 30, origFreq: 72
"s-o" freq: 30, origFreq: 106
"v-a" freq: 30, origFreq: 72
"v-ar" freq: 30, origFreq: 72
"ăia" freq: 30, origFreq: 72
"ăla" freq: 30, origFreq: 72
"ăsta" freq: 30, origFreq: 72
"ăștia" freq: 30, origFreq: 72
In the Chrome dictionary:
the minimum freq is 30 and maximum is 210.
of the words:
77.44% are at frequency 72
12.57% are at frequency 80
5.1% are at frequency 85
1.4% are at frequency 89
the rest are below 1%
In the Gaia dictionary:
minFreq is 2, maxFreq is 255
(frequency, proportion of words)
2 30,16%
14 13,50%
22 7,97%
28 5,52%
32 4,11%
36 3,34%
39 2,70%
42 2,17%
44 1,87%
46 1,60%
48 1,44%
50 1,26%
51 1,05%
53 1,05%
54 0,94%
So we note that Gaia has a different distribution of frequency.
Should I normalize the chrome dictionary's freqs to have the same distribution as Gaia? Should I just keep Chrome's freqs?
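For reference, percentages like the ones above can be computed with a few lines of Python. This is a minimal sketch; it assumes the frequencies have already been extracted into a plain list of integers, and `frequency_distribution` is a hypothetical name.

```python
from collections import Counter


def frequency_distribution(freqs):
    """Return (frequency, percent of words) pairs, most common level first."""
    freqs = list(freqs)
    if not freqs:
        return []
    counts = Counter(freqs)
    total = len(freqs)
    return [(level, 100.0 * n / total) for level, n in counts.most_common()]


# Tiny made-up input: three words at frequency 72 and one at 80.
for level, percent in frequency_distribution([72, 72, 72, 80]):
    print(level, "%.2f%%" % percent)
```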
| Assignee | ||
Comment 41•10 years ago
|
||
Another finding: there are 6177 words in the Gaia dictionary that are not present in Chrome. Some of those words lack diacritics or contain typos. These words tend to be names of persons, politicians, place names etc.
It'll take a while to go through the list and correct the typos, the Gaia list looks lower quality and could even be scrapped (IMHO). Attaching that file to the bug report so you can look at it as well.
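One way to speed up part of that review is to flag Gaia-only words that match a Chrome word once diacritics are removed. The Python sketch below is only a heuristic under that assumption; the function names and sample words are made up for the illustration, and every candidate still needs a human check.

```python
import unicodedata


def strip_diacritics(word):
    """Remove combining marks after canonical decomposition (ă -> a, ș -> s)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def likely_missing_diacritics(gaia_words, chrome_words):
    """Map Gaia-only words to Chrome words that differ only by diacritics."""
    chrome = set(chrome_words)
    by_plain = {}
    for word in chrome:
        plain = strip_diacritics(word)
        if plain != word:
            by_plain.setdefault(plain, set()).add(word)
    gaia_only = set(gaia_words) - chrome
    return {w: sorted(by_plain[w]) for w in gaia_only if w in by_plain}


candidates = likely_missing_diacritics(
    ["tara", "casa", "masina"], ["țară", "casa", "mașină"]
)
print(candidates)
```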
| Assignee | ||
Comment 42•10 years ago
|
||
| Assignee | ||
Comment 43•10 years ago
|
||
So apparently it takes a very long time (hours) for the .dicts to be generated and now I see why that process may not be automatic in the build of gaia.
Anyway, I have for now a patch containing just a dumb transfer of the Chrome dictionary, discarding the Gaia one: https://github.com/Jobava/gaia/commit/124420602a63235aec4dca10e9a0e0d71f95cadb
The other choice for now is adding just the dash-containing words from Chrome to Gaia, to get a smaller word list.
If the current Gaia wordlist is to be preserved, it needs some corrections and manual curation (for example, names of politicians don't really belong there).
@Kevin do you know how other languages do it?
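The second option mentioned above (adding only the dash-containing words) could look roughly like this in Python. It is a minimal sketch with a hypothetical helper name, and it assumes both lists are already loaded as word-to-frequency dictionaries. It copies the donor frequency unchanged, which may not be appropriate if the two lists use different frequency scales.

```python
def add_hyphenated(base, donor):
    """Return a copy of `base` plus hyphenated words found only in `donor`."""
    merged = dict(base)
    for word, freq in donor.items():
        if "-" in word and word not in merged:
            # Frequency is taken as-is from the donor list (an assumption).
            merged[word] = freq
    return merged


gaia = {"asta": 120}
chrome = {"s-o": 106, "asta": 108, "casa": 90}
print(add_hyphenated(gaia, chrome))
```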
Flags: needinfo?(kscanne)
Comment 44•10 years ago
|
||
I haven't worked on the .dict files - I've just created the XML word frequency lists.
Whatever you end up deciding in terms of the autocorrect list, it would be worth submitting your notes + corrections to the upstream maintainer of the spellchecker I referenced in Comment 30.
Flags: needinfo?(kscanne)
Comment 45•10 years ago
|
||
This is an interesting bug! To start I'll note that things would have been easier if anyone had asked for input from keyboard peers early on in the process. You wouldn't have had to figure everything out on your own.
Jobava: nice work on figuring out the wordlist format and how to create dictionary files. If you've got more questions about this, I can probably answer them. Note that the wordlist xml format we use was copied from Google's android code. But then they changed to their new .combined format, which is why you had to convert the wordlist format.
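For anyone repeating that conversion, here is a hedged Python sketch. Both formats are assumptions based on how the field names are quoted in this bug: a `.combined` entry line of the form ` word=...,f=...,flags=...,originalFreq=...` and an XML entry of the form `<w f="..." flags="">word</w>`. Check dictionaries/README.md for the authoritative description before relying on it; the function names are hypothetical.

```python
from xml.sax.saxutils import escape


def parse_combined_line(line):
    """Parse one assumed '.combined' word line into a dict, or return None."""
    line = line.strip()
    if not line.startswith("word="):
        return None  # header or other non-word lines are skipped
    fields = dict(
        part.split("=", 1) for part in line.split(",") if "=" in part
    )
    return {
        "word": fields["word"],
        "f": int(fields["f"]),
        "flags": fields.get("flags", ""),
    }


def to_xml_entry(entry):
    """Render a parsed entry in the assumed XML wordlist form."""
    flags = escape(entry["flags"], {'"': "&quot;"})
    return '<w f="%d" flags="%s">%s</w>' % (entry["f"], flags, escape(entry["word"]))


parsed = parse_combined_line(" word=s-o,f=30,flags=,originalFreq=106")
print(to_xml_entry(parsed))
```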
In order to get a formal review of your patch, you'll need to create an attachment that links to github, and then request review by setting r?. Rudy might be the best reviewer for this, though I'm not certain about that.
So here are some specific comments:
1) The frequency distribution for this wordlist is odd. As the author of the algorithm, I'd expect to get best results when the distribution is spread out to use the full 8-bit range. Also, the .dict format compresses the frequencies down into a 5-bit range, so the words with frequency 72 and the words with frequency 80 might get the same frequency in the .dict file which would make them indistinguishable. I don't know if that is happening, but you might check on it. In general, it would be nice to be able to use the google wordlists unmodified, but in this case, it might be safer to adjust some of those frequency levels to spread them out a bit more.
2) The wordlist and .dict file are both about twice as big as the next-largest, for Hungarian. And we could never have shipped files this large in the earlier versions of FirefoxOS designed for relatively low-end devices. Just listing all possible hyphenated combinations of words seems like a brute-force approach to autocorrect for Romanian. I guess google is doing it, so it must work okay. And we do have more powerful devices now, so maybe we can get away with this. But as part of the review process, I think we need some data about how well it works. Bringing up a Romanian keyboard will load the 8mb .dict file. Does this introduce a noticeable lag? Do word suggestions show up quickly enough as you type? Also, with a dictionary this large, it could be that some of the search parameters (how deeply to search, how many candidates to retain, etc.) might need to be tweaked to get best results.
3) How obscure are the lowest-frequency words in the Google wordlist? If you cut all of the words at f=72 and below, you'd have a wordlist that was 1/4 the size and still contained 250K words, which is twice as many as we have now. How good is the autocorrect performance of that reduced wordlist? I'm guessing it is still better than what we have now, and should support hyphens. If you think that is good enough, it gets us down to the size level of the German dictionary, I think.
4) I've only just now realized that the size of the dictionary is not due to hyphenated words. This huge 1M word wordlist only has 7K hyphenated words in it. I think that the issue is that Romanian has lots of possible suffixes for words and that our autocorrect technology doesn't have any alternative except to just list them all. (I think I remember having a discussion with Kevin about the problem with Romanian and the need for different algorithms for autocorrect and spellcheck in languages like it.)
Anyway, I see three ways to proceed here:
1) Just use the giant dictionary from Google. We'd want to check that the performance is okay first. And you'd want to consider adjusting the word frequencies to spread them out more.
2) Use a truncated version of the Google wordlist to get more words without having to worry so much about performance. (Still probably worth rescaling the frequencies, though) If you do this, you'll want to update dictionaries/README.md to describe the origin of the wordlist and how it was modified. I think that this might be the best approach, though it would depend on whether actual Romanian speakers think that it is good enough.
3) Just add the hyphenated words from this new wordlist to the existing wordlist as mentioned in comment #43. It would probably be tricky to add the new words at the appropriate place in the old list, but you could probably figure something out. I think that option 2 is probably better than this one, though.
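The concern in point 1 about 8-bit frequencies collapsing in a 5-bit field can be checked with a small Python sketch. The mapping used here (dropping the three low bits) is purely an assumption for illustration; the real mapping used by xml2dict.js may differ, so substitute the actual function before drawing conclusions. The frequency levels are the most common ones reported in comment 40.

```python
def quantize_5bit(freq):
    """ASSUMED mapping from 0..255 to 0..31: drop the three low bits."""
    return freq >> 3


def collisions(levels, quantize=quantize_5bit):
    """Group frequency levels that share a quantized value."""
    buckets = {}
    for level in sorted(set(levels)):
        buckets.setdefault(quantize(level), []).append(level)
    return {q: group for q, group in buckets.items() if len(group) > 1}


# Most common frequency levels in the Chrome list, per comment 40.
print(collisions([72, 80, 85, 89]))
```

Under this assumed mapping, 72 and 80 stay distinct but 80 and 85 end up in the same bucket, which is the kind of loss that spreading the frequencies over the full range would reduce.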
Flags: needinfo?(jobaval10n)
Comment 46•10 years ago
|
||
I said above that Rudy might be the best reviewer, but I forgot that he is not at Mozilla anymore. So when you're ready, the best thing is probably to ask :timdream for a review. He can transfer the review request to the right person on his team or to me.
| Assignee | ||
Comment 47•10 years ago
|
||
Thank you for the review, David. I will look into rescaling/normalization, also at pruning the list a little.
For Romanian, there are about 1500 inflection rules and it may be possible to auto-generate inflections based on just "base forms" (like the people at Dexonline: https://dexonline.ro/modele-flexiune?modelType=V&locVersion=5.0 -- just the verb forms, for example) , but at added complexity and it's language-specific.
Will get back to you.
Flags: needinfo?(jobaval10n)
Comment 48•10 years ago
|
||
If you produce the clean wordlist you want to use, I can then add the frequencies from a biggish (20 million word) web corpus, scaled appropriately. This is essentially David's approach 2) in comment 45, truncating according to actual usage on the web. It should produce something of reasonable size.
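As a sketch of what "scaled appropriately" could mean, the Python below maps raw corpus counts to the 1..255 range on a logarithmic scale. The choice of a log scale is an assumption made for this example, not necessarily the method used for the Gaia lists; `scale_counts` and the sample counts are hypothetical.

```python
import math


def scale_counts(counts, lo=1, hi=255):
    """Map positive raw counts to integers in [lo, hi] on a log scale."""
    if not counts:
        return {}
    logs = {word: math.log(count) for word, count in counts.items()}
    min_log, max_log = min(logs.values()), max(logs.values())
    span = max_log - min_log
    if span == 0:
        # All words are equally frequent; give them the top value.
        return {word: hi for word in counts}
    return {
        word: lo + round((value - min_log) / span * (hi - lo))
        for word, value in logs.items()
    }


scaled = scale_counts({"de": 1000, "s-o": 10, "rar": 1})
print(scaled)
```

Truncating the list then amounts to dropping words whose raw count falls below a chosen threshold before scaling.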
| Assignee | ||
Comment 49•10 years ago
|
||
Kevin, your word list does not contain hyphenated words, does it? Can you make that public? We may need to have frequencies from one list compatible with the other, I guess (if it's indeed an issue).
Flags: needinfo?(kscanne)
| Assignee | ||
Comment 50•10 years ago
|
||
David, can you give us a few details on the actual binary format of the .dict file, to get an idea of what the frequencies actually look like when used by Gaia?
Flags: needinfo?(dflanagan)
| Assignee | ||
Comment 51•10 years ago
|
||
Ah sorry, David, just found the resource I need. Will get back to you after more progress.
https://wiki.mozilla.org/Gaia/System/Keyboard/IME/Latin/Dictionary_Blob
Flags: needinfo?(dflanagan)
Comment 52•10 years ago
|
||
I can tokenize the texts and include hyphens, so yes it will have the hyphens. The previous iteration did not only because the Mozilla spellchecker did not.
Top 50000 words from the web corpus are downloadable here:
http://crubadan.org/languages/ro
but I'd do a fresh crawl to add more texts probably.
Flags: needinfo?(kscanne)
| Assignee | ||
Comment 53•10 years ago
|
||
Kevin, I am adding a list of words which is a combination of the Chrome wordlist and that from dexonline.ro. The Dexonline list is comprehensive and reasonably complete with inflection forms of Romanian and the Chrome one includes a few extra and also the hyphenated words.
Can you add frequencies to this list?
We'll then truncate the list at some acceptable frequency limit so that it can work in Gaia.
Flags: needinfo?(kscanne)
| Assignee | ||
Comment 54•10 years ago
|
||
I hosted the file here, since it can't be added as an attachment:
https://raw.githubusercontent.com/Jobava/mozilla-localization/master/chrome_plus_dexonline_sorted_uniq.txt
(right-click, save as, about 13MB, one word per line)
Comment 55•10 years ago
|
||
Flags: needinfo?(kscanne)
Comment 56•10 years ago
|
||
Thanks :djf for helping out! I am going to put myself in the mentor field so I don't forget to reply here, but I might not be as helpful as :djf on the prediction part of the Keyboard app. David, maybe you should consider adding yourself there too, assuming your mail filter works the same way as mine?
(In reply to David Flanagan [:djf] from comment #45)
> 2) The wordlist and .dict file are both about twice as big as the
> next-largest, for Hungarian. And we could never have shipped files this
> large in the earlier versions of FirefoxOS designed for relatively low-end
> devices. Just listing all possible hyphenated combinations of words seems
> like a brute-force approach to autocorrect for Romanian. I guess google is
> doing it, so it must work okay. And we do have more powerful devices now, so
> maybe we can get away with this. But as part of the review process, I think
> we need some data about how well it works. Bringing up a Romanian keyboard
> will load the 8mb .dict file. Does this introduce a noticeable lag? Do word
> suggestions show up quickly enough as you type? Also, with a dictionary
> this large, it could be that some of the search parameters (how deeply to
> search, how many candidates to retain, etc.) might need to be tweaked to get
> best results.
One small correction here: bug 1128396 has since been implemented and enables users to download a dictionary from Amazon S3, instead of always requiring a dictionary to be included in the build. Size is still a concern, but not a major one now.
Once you update the .dict file in the Gaia tree, please file another bug to update the copy on Amazon S3 as well.
Thanks!
Assignee: nobody → jobaval10n
Mentor: timdream
Status: NEW → ASSIGNED
| Assignee | ||
Comment 57•10 years ago
|
||
Kevin, the list looks good. You just arbitrarily chopped off words at frequency 1? (least frequent, most numerous)
@Mihai I am unable to build the .dict since that requires cloning or downloading the repo here https://github.com/mozilla-b2g/gaia/ and my internet connection to github is too slow to be workable in console mode
basically, you download the zip, unpack it
cd gaia/apps/keyboard/js/imes/latin/dictionaries/
npm install
rm -f ro.dict
rm -f ro_wordlist.xml
wget 'https://bugzilla.mozilla.org/attachment.cgi?id=8701611' -O ro_wordlist.xml
node --harmony xml2dict.js -o ro.dict ro_wordlist.xml
That should generate a new .dict from the attachment Kevin made
Flags: needinfo?(kscanne)
Comment 58•10 years ago
|
||
(In reply to Jobava from comment #57)
> Kevin, the list looks good. You just arbitrarily chopped off words at
> frequency 1? (least frequent, most numerous)
>
That's right - I only kept words that appeared at least once in a ~15 million word corpus crawled from the web.
Flags: needinfo?(kscanne)
Comment 60•10 years ago
|
||
Jobava, could you please explain how to run that script?
Which environment to use?
| Assignee | ||
Comment 61•10 years ago
|
||
OK, I updated the old pull request with the updated files.
https://github.com/mozilla-b2g/gaia/pull/33639
David, is this good for merging?
Flags: needinfo?(jobaval10n) → needinfo?(dflanagan)
Updated•10 years ago
|
Flags: needinfo?(timdream)
Comment 62•10 years ago
|
||
I am just going to LGTM (looks good to me) this patch, because no process exists to verify a dictionary change other than feedback from actual users in the community, except making sure it's legal (it is, since it comes from AOSP).
Thanks for contributing. The only thing preventing this pull request from being merged is the lack of a bug number in the commit message -- please amend the commit message so we can get this merged.
| Assignee | ||
Comment 63•10 years ago
|
||
Thanks, Tim, I amended the commit message and updated the PR.
Flags: needinfo?(timdream)
Comment 64•10 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(timdream)
Resolution: --- → FIXED