Bug 1137970: [b2g] Generate Ukrainian (uk) wordlist/dictionary
Status: RESOLVED FIXED (opened 10 years ago, closed 9 years ago)
Categories: Firefox OS Graveyard :: Gaia::Keyboard, defect
Tracking: Not tracked
People: Reporter: delphine, Assigned: a.polivanchuk
Attachments (8 files)
71.91 KB, text/plain
71.98 KB, text/plain
4.10 MB, text/xml
4.21 MB, text/xml
4.21 MB, text/xml
1.46 MB, application/octet-stream
46 bytes, text/x-github-pull-request (timdream: feedback+)
46 bytes, text/x-github-pull-request
We are working on adding the Ukrainian locale to FxOS builds right now (see Bug 1137822).
Once that's done, we will need to get autocorrection/word suggestion up and running for this locale.
Artem: if you have any feedback or can give input on this, it would be helpful. Thanks!
Comment 1 • 10 years ago (Assignee)
Is it possible to get the dictionary from the uk locale of Firefox?
Comment 2 • 10 years ago
I don't think they contain frequency information, which we need for FxOS.
Comment 3 • 10 years ago (Assignee)
Exactly which information is needed?
Comment 4 • 10 years ago (Reporter)
Hey Artem, for this bug someone will need to generate a wordlist so that autocorrection and word suggestion work in Firefox OS when you are typing with the keyboard.
Someone on CC here will probably be able to advise on how to move forward better than I can ^^ thanks!
Comment 5 • 10 years ago (Assignee)
As I see in the requests for other locales, the list for autocorrection/word suggestion should contain about 4-5 thousand frequently used words.
Are there any other requirements?
Comment 6 • 9 years ago (Reporter)
Hey Artem - sorry, this fell off my radar as I was not needinfo'ed here :)
Asking Kevin if he can help you with this when he gets back. Thanks!
Flags: needinfo?(kscanne)
Comment 7 • 9 years ago (Assignee)
I already have a list of common Ukrainian words.
I am checking them now and will post the list here soon :)
Comment 8 • 9 years ago (Assignee)
Comment 9 • 9 years ago (Reporter)
Thanks Artem!
@Kevin: could you please have a look at the attached list? Do you think we need a larger corpus, or are we good to go? Thanks!
Comment 10 • 9 years ago
Hi Artem, this list looks very good. How did you create it?
Is it worth including frequent proper names like "України", "Україні"? I see a few other common words on the web like "можна" and "цього" that could be added too. Would it be worth working from a big spell-checking word list and paring it down? This one is tri-licensed:
http://extensions.libreoffice.org/extension-center/ukrainian-spelling-dictionary-and-thesaurus
Or maybe that's overkill if you have most of the words you want in testing...
Flags: needinfo?(kscanne)
Comment 11 • 9 years ago (Assignee)
Hi Kevin, I just took a big list of common words, processed it in an Excel spreadsheet, and then checked it.
Of course it is worth including the words you proposed. I have already added them to the updated list.
How can I open this big spell-checking word list in a readable form? How many words are there?
I guess that 4-5 thousand common words should be enough for autocorrection and word suggestion.
Comment 12 • 9 years ago
Hi Artem,
It's a bit tricky to view the spelling list as a text file - the .oxt file can be renamed as .zip, and then unzipped. Inside the zip you'll see a uk_UA folder with .dic and .aff files that have the dictionary data. Fully expanded, there are 1.7 *million* words... I agree with you that this is overkill! :) I'd say it's OK to go ahead with the list you have, and if you feel like you want to add more common words based on testing, I can help with that.
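Editor's note: the unpacking steps in Comment 12 can be sketched in Python. This is only an illustration: the entry names inside the archive (`uk_UA/uk_UA.dic`, `uk_UA/uk_UA.aff`), the UTF-8 encoding, and the helper names are assumptions, and the sample archive is built in memory rather than downloaded. Note that a Hunspell `.dic` file lists stems (its first line holds the entry count), so this count is much smaller than the fully expanded figure quoted above.

```python
import io
import zipfile


def list_dictionary_files(oxt_bytes):
    """Return the .dic/.aff entries found in an .oxt file (which is a zip archive)."""
    with zipfile.ZipFile(io.BytesIO(oxt_bytes)) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith((".dic", ".aff"))
        )


def count_dic_stems(oxt_bytes, dic_name, encoding="utf-8"):
    """Count non-empty stem lines in a .dic file, skipping the first (count) line."""
    with zipfile.ZipFile(io.BytesIO(oxt_bytes)) as archive:
        text = archive.read(dic_name).decode(encoding)
    return sum(1 for line in text.splitlines()[1:] if line.strip())


def make_sample_oxt():
    """Build a tiny stand-in archive; the layout is assumed, not copied from the real extension."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("uk_UA/uk_UA.dic", "3\nможна\nцього\nУкраїна/ab\n")
        archive.writestr("uk_UA/uk_UA.aff", "SET UTF-8\n")
        archive.writestr("description.xml", "<description/>")
    return buffer.getvalue()


sample = make_sample_oxt()
print(list_dictionary_files(sample))
print(count_dic_stems(sample, "uk_UA/uk_UA.dic"))
```

With a real download, the bytes would come from reading the `.oxt` file from disk instead of `make_sample_oxt()`.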
Comment 13 • 9 years ago (Assignee)
OK, I see. So, let's go ahead :)
Thanks!
Comment 14 • 9 years ago (Assignee)
Is there any progress with landing the wordlist in 2.2 and 2.5?
Maybe some additional action is needed to move forward?
Comment 15 • 9 years ago (Assignee)
OK, here is the right list, which contains 98931 words.
Information about this list (http://u-mova.blogspot.com/2013/09/blog-post.html):
Frequency list of Ukrainian language.
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
Copyright © 2013 Volodymyr Vlad
I converted the original file into an .xml file.
One thing I'm unsure about is the date="1377342507" in the header.
Please take a look and check it.
Thanks!
Updated • 9 years ago (Assignee)
Flags: needinfo?(kscanne)
Comment 16 • 9 years ago
Looks good to me. The date in the header is just a timestamp: seconds since the Unix epoch, as printed by "date +%s" in your favorite shell.
Flags: needinfo?(kscanne)
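Editor's note: for anyone checking the header value from Comment 15, a short Python sketch converts the timestamp to a readable UTC date and produces a fresh one. The helper names are made up for illustration.

```python
import time
from datetime import datetime, timezone


def header_date_to_utc(seconds):
    """Interpret a wordlist header date as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def current_header_date():
    """Return the current time as whole epoch seconds, like `date +%s`."""
    return int(time.time())


stamp = header_date_to_utc(1377342507)
print(stamp.isoformat())
print(current_header_date())
```

The value in the header decodes to a day in August 2013, which is consistent with the source list's publication date.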
Comment 17 • 9 years ago
Artem,
Thanks for finding this wordlist! This looks great. Our dictionary creation script used to require frequencies to be in the range 1 to 255, but I don't think it does anymore. However, note that the frequencies you give will be linearly scaled to fit in 5 bits, so that there are really just 32 frequency levels. This means that something like 95% or more of your words will be at the lowest frequency level, so autocorrect can't make good decisions about which words are more frequent than others. I'd therefore recommend that you take these raw frequency numbers and rescale them on a logarithmic scale so that they are more evenly distributed between 1 and 255, or 1 and 1000, or something similar.
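Editor's note: the effect described above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the data is synthetic and Zipf-like rather than the real list (only the word count and maximum are borrowed from figures quoted elsewhere in this bug), and the two bucketing functions are hypothetical stand-ins, not the logic of the actual dictionary script.

```python
import math

LEVELS = 32  # 5 bits of frequency resolution


def linear_level(freq, max_freq):
    """Hypothetical linear bucketing of a raw count into LEVELS levels."""
    return min(LEVELS - 1, int(freq * LEVELS / max_freq))


def log_level(freq, max_freq):
    """Hypothetical logarithmic bucketing of a raw count into LEVELS levels."""
    return min(LEVELS - 1, int(math.log(freq) * LEVELS / math.log(max_freq)))


def largest_bucket_share(levels):
    """Fraction of words that fall into the single most crowded level."""
    counts = {}
    for level in levels:
        counts[level] = counts.get(level, 0) + 1
    return max(counts.values()) / len(levels)


def synthetic_frequencies(word_count, max_freq):
    """Zipf-like stand-in data: the word at rank r occurs about max_freq / r times."""
    return [max(1, max_freq // rank) for rank in range(1, word_count + 1)]


freqs = synthetic_frequencies(98931, 1082616)
top = freqs[0]
linear_share = largest_bucket_share([linear_level(f, top) for f in freqs])
log_share = largest_bucket_share([log_level(f, top) for f in freqs])
print(f"linear: {linear_share:.1%} of words share one level")
print(f"log:    {log_share:.1%} of words share one level")
```

On this synthetic data, linear bucketing crowds nearly all words into one level, while the logarithmic variant spreads them over many levels.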
Also, note that you can use f="0" to mark obscene words that you do not want to be suggested by autocorrect, but that should be considered valid words and not be autocorrected. You can look at the English wordlist if you want to see the long list of English words that Google thinks are profane.
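Editor's note: a rough Python sketch of how such entries might look in a wordlist file. The `f` attribute and the `date` header attribute are mentioned in this bug; the element names `wordlist` and `w`, the `locale` attribute, and both helper functions are assumptions for illustration and should be checked against an existing wordlist in the repository.

```python
import xml.etree.ElementTree as ET


def build_wordlist(entries, locale="uk", date="1377342507"):
    """Serialize (word, frequency) pairs into an assumed wordlist XML layout."""
    root = ET.Element("wordlist", {"locale": locale, "date": date})
    for word, freq in entries:
        item = ET.SubElement(root, "w", {"f": str(freq)})
        item.text = word
    return ET.tostring(root, encoding="unicode")


def suggestible_words(xml_text):
    """Return words whose frequency is above 0, i.e. those not marked with f="0"."""
    root = ET.fromstring(xml_text)
    return [w.text for w in root.iter("w") if int(w.get("f")) > 0]


# "приклад" is a harmless placeholder standing in for a word marked f="0".
xml_text = build_wordlist([("можна", 200), ("цього", 180), ("приклад", 0)])
print(xml_text)
print(suggestible_words(xml_text))
```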
Once you've adjusted the frequencies, you need to run the xml2dict.js script to create a uk.dict file. See https://bugzilla.mozilla.org/show_bug.cgi?id=1226981#c32 for a description of how this was done for Romanian.
Test out the .dict file to make sure it works well. If it does, then prepare a pull request that includes the wordlist and the .dict file, and also modifies the README file to include notes about how the Ukrainian wordlist was generated. Explain the original source and its CC license and briefly explain how you modified the frequencies from that source.
Once you've done that, you can ask me or :timdream to review and land the patch.
After that, the final step is to get the uk.dict file hosted on the Amazon S3 server so that it can be downloaded by FirefoxOS devices. I don't know how to do that part, but :timdream can help you with that if you set the needinfo flag for him.
Comment 18 • 9 years ago
OK, at Artem's request I took a look at his word list. I added a list of relative frequencies scaled from the list found here http://u-mova.blogspot.com/2013/09/blog-post.html .
The formula for relative frequencies (in the xml, simply "freq") is:
relative_freq = ceil (log(maxFreq/freq) / log(1.05622) + 0.5)
The constant 1.05622 is found by: exp (log(1082616) / 254), where 1082616 is the maximum pre-normalized frequency.
This gives values between 1 and 255, with 255 for the least frequent words and 1 for the most frequent.
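Editor's note: the formula in Comment 18 can be checked with a few lines of Python. The constants come from the comment; the function names are made up for illustration.

```python
import math

MAX_FREQ = 1082616  # maximum pre-normalized frequency, from Comment 18
BASE = 1.05622      # rounded constant, from Comment 18


def derive_base(max_freq=MAX_FREQ, steps=254):
    """Recompute the logarithm base: exp(log(max_freq) / steps)."""
    return math.exp(math.log(max_freq) / steps)


def relative_freq(freq, max_freq=MAX_FREQ, base=BASE):
    """relative_freq = ceil(log(maxFreq / freq) / log(base) + 0.5)."""
    return math.ceil(math.log(max_freq / freq) / math.log(base) + 0.5)


print(round(derive_base(), 5))
print(relative_freq(MAX_FREQ))  # most frequent word
print(relative_freq(1))         # a word seen only once
```

The most frequent word maps to 1 and a word seen once maps to 255, matching the range stated in the comment.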
Comment 19 • 9 years ago
If this looks alright I can also generate a .dict blob file for a potential patch.
Comment 20 • 9 years ago (Assignee)
Hi Jobava,
Thank you very much for your help with adjusting the frequencies.
Is it normal that the higher frequency numbers became the lower ones after the adjustment?
I mean that the original list has descending frequencies, but the new list has ascending ones, while the word order is unchanged.
Looking at the lists for other locales, I guess it should be adjusted the opposite way (255 for the most frequent words and 1 for the least frequent).
Flags: needinfo?(jobaval10n)
Comment 21 • 9 years ago
It should be that way: the initial frequencies are the raw number of occurrences in the corpus of text that Volodymyr compiled. Volodymyr's list also had "relative frequencies" on a base-2 logarithm, but that ran out after just rank 21, since log2(1 million) ≈ 20. I stretched that by picking a smaller base, as described above.
The values designated "freq" here are relative frequencies.
Flags: needinfo?(jobaval10n)
Comment 22 • 9 years ago (Assignee)
David, could you please review the new list (Comment 19)...
Flags: needinfo?(dflanagan)
Comment 23 • 9 years ago (Assignee)
I recalculated the frequency order adjusted by Jobava, using the formula: 256-<frequency>
I also contacted Volodymyr, the author of this list, and he recommended recalculating it this way.
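Editor's note: the inversion from Comment 23 is a one-line transformation; a minimal Python sketch follows. The function name and the range check are illustrative additions, not part of the original workflow.

```python
def invert_frequency(value):
    """Apply 256 - <frequency> so that 255 marks the most frequent words."""
    if not 1 <= value <= 255:
        raise ValueError("expected a value between 1 and 255")
    return 256 - value


print(invert_frequency(1))    # most frequent word before inversion
print(invert_frequency(255))  # least frequent word before inversion
```

Because the mapping is its own inverse, applying it twice returns the original value, which makes accidental double conversion easy to detect.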
Comment 24 • 9 years ago
Hello, I am attaching the .dict file as requested by Sergiy and Artem. I used the xml2dict.js script in apps/keyboard/js/imes/latin/dictionaries/
It looks like the .dict compilation instructions need a little attention, as other people are having issues using them.
Comment 25 • 9 years ago (Assignee)
Thanks a lot for your help, Jobava!
Next I'm going to create a PR using the latest uk_wordlist.xml and uk.dict
Flags: needinfo?(dflanagan)
Comment 26 • 9 years ago
Comment 27 • 9 years ago (Assignee)
Comment on attachment 8714892
[gaia] stenox:master > mozilla-b2g:master
Please review the PR.
Attachment #8714892 - Flags: review?(timdream)
Comment 28 • 9 years ago
Comment on attachment 8714892
[gaia] stenox:master > mozilla-b2g:master
Looks good, thanks for the contribution. However, you will need to fix the build tests.
https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=dd7d25d44aa5946d6f54fea1ab42f507cbb41885&selectedJob=3491105
Attachment #8714892 - Flags: review?(timdream) → feedback+
Updated • 9 years ago
Assignee: nobody → a.polivanchuk
Comment 29 • 9 years ago
Is this a matter of broken tests, or did the uploaded files break some things they shouldn't have?
I don't understand the errors, where in that web interface I should look, or how to parse them.
Comment 31 • 9 years ago
Artem,
Please refer to this commit I added for bug 1033185. The keyboard build tests assert which layouts are included in the build. You need to add the newly added layout to the expected lists here so that the test code knows the addition is legitimate.
https://github.com/timdream/gaia/commit/8be51e8e5020c8855ad2722819c1942657b078b6
Thanks for helping out!
Flags: needinfo?(timdream) → needinfo?(a.polivanchuk)
Comment 32 • 9 years ago (Assignee)
Tim, I modified the files and updated the PR, but there are 5 failures again.
Something might be wrong with my changes.
Could you please take a look?
https://github.com/mozilla-b2g/gaia/pull/34020
Flags: needinfo?(a.polivanchuk)
Comment 33 • 9 years ago
The build test still fails:
https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=ea301aa16b3e1eebe1555e88ada9c748bcbc67a2&selectedJob=3523447
Flags: needinfo?(a.polivanchuk)
Comment 34 • 9 years ago (Assignee)
Yes, I see. But, unfortunately, I have no idea what the problem is or how to fix it.
Flags: needinfo?(a.polivanchuk)
Comment 35 • 9 years ago
I am not sure how much time you can commit, but if you have more time to deal with this issue, you could try running the tests locally to see their output. The command to run is:
> make build-test-integration TEST_FILES=apps/keyboard/test/build/integration/keyboard_test.js
Sorry about all the troubles.
Flags: needinfo?(a.polivanchuk)
Comment 36 • 9 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=3530106&repo=gaia#L472
This is
https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/test/build/integration/keyboard_test.js#L120-L121
so you should add the newly added layout in the layoutIds array here
https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/test/build/integration/keyboard_test.js#L43
Comment 37 • 9 years ago
Hey Tim, thanks for your continued guidance.
The layoutIds array already has a 'uk' element.
Flags: needinfo?(timdream)
Comment 38 • 9 years ago (Assignee)
Actually, I already added the new layout to the layoutIds array in the last PR update.
Flags: needinfo?(a.polivanchuk)
Comment 39 • 9 years ago
Hi Artem,
Thanks for your patch. The reason for the build failure is that we added a dict for uk, so our build system no longer preloads the layout by default.
We need to delete this line to match the manifest; then the test should pass.
https://github.com/stenox/gaia/blob/370511577d3fccf66310c84d049a8c7b12ffb3bb/apps/keyboard/test/build/integration/keyboard_test.js#L72
Hope it helps.
Comment 41 • 9 years ago
Comment 42 • 9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED