[b2g] Generate Ukrainian (uk) wordlist/dictionary

RESOLVED FIXED

Status

Firefox OS
Gaia::Keyboard
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: delphine, Assigned: Artem Polivanchuk)

Attachments

(8 attachments)

We are currently working on adding the Ukrainian locale to FxOS builds (see Bug 1137822).
Once that's done, we will need to get autocorrection and word suggestion up and running for this locale.
Artem: if you have any feedback or can give input on this, it would be helpful. Thanks!
(Assignee)

Comment 1

3 years ago
Is it possible to get the dictionary from uk locale of Firefox?
I don't think they contain frequency information, which we need for FxOS.
(Assignee)

Comment 3

3 years ago
What information exactly is needed?
Hey Artem, for this bug someone will need to generate a wordlist so that autocorrection and word suggestion work in Firefox OS when you are typing with the keyboard.
Someone on CC here will probably be able to advise on how to move forward better than I can ^^ Thanks!
(Assignee)

Comment 5

3 years ago
As I see in the requests for other locales, the list for autocorrection/word suggestion should contain about 4-5 thousand frequently used words.
Are there any other requirements?
Hey Artem - sorry, this fell off my radar as I was not needinfo'ed here :)
Asking Kevin if he can help you with this when he gets back. Thanks!
Flags: needinfo?(kscanne)
(Assignee)

Comment 7

3 years ago
I already have a list of regular Ukrainian words.
I am checking them now and will post them here soon :)
(Assignee)

Comment 8

3 years ago
Created attachment 8638371 [details]
The list of regular Ukrainian words for Firefox OS!
Thanks Artem! 
@Kevin: could you please have a look at the list attached? Do you think we need more corpus, or are we good to go? thanks!

Comment 10

3 years ago
Hi Artem, this list looks very good. How did you create it? 

Is it worth including frequent proper names like "України", "Україні"? I see a few other common words on the web like "можна" and "цього" that could be added too. Would it be worth working from a big spell checking word list and paring it down? This one is tri-licensed:

http://extensions.libreoffice.org/extension-center/ukrainian-spelling-dictionary-and-thesaurus

Or maybe that's overkill if you have most of the words you want in testing...
Flags: needinfo?(kscanne)
(Assignee)

Comment 11

3 years ago
Created attachment 8646181 [details]
Words for Firefox OS.txt

Hi Kevin, I just took a big list of common words, processed it in an Excel table, and then checked it manually.
Of course it is worth including the words you proposed. I have already added them to the updated list.

How can I open this big spell checking word list in a readable form? How many words are there?
I guess that 4-5 thousand common words should be enough for autocorrection and word suggestion.

Comment 12

3 years ago
Hi Artem,

  It's a bit tricky to view the spelling list as a text file - the .oxt file can be renamed as .zip, and then unzipped.  Inside the zip you'll see a uk_UA folder with .dic and .aff files that have the dictionary data.   Fully expanded, there are 1.7 *million* words... I agree with you that this is overkill! :)  I'd say it's OK to go ahead with the list you have, and if you feel like you want to add more common words based on testing, I can help with that.
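
A minimal sketch of the unpacking step described above, in Python. The function names are hypothetical and not part of any LibreOffice or Gaia tooling. It relies on the .oxt being a plain zip archive and on the Hunspell convention that the first line of a .dic file declares the number of entries. That number counts stems before affix expansion, so it will be much smaller than the fully expanded total mentioned above.

```python
import zipfile


def list_dictionary_files(oxt_path):
    """Return the names of .dic and .aff members inside an .oxt (zip) archive."""
    with zipfile.ZipFile(oxt_path) as archive:
        return [name for name in archive.namelist()
                if name.endswith((".dic", ".aff"))]


def count_stems(oxt_path, dic_name):
    """Read the entry count declared on the first line of a Hunspell .dic file."""
    with zipfile.ZipFile(oxt_path) as archive:
        with archive.open(dic_name) as handle:
            # The first line holds only digits, so a lenient decode is enough
            # even if the rest of the file uses another encoding.
            first_line = handle.readline().decode("utf-8-sig", errors="replace")
    return int(first_line.split()[0])
```

Both functions accept either a path or an open binary file object, so there is no need to rename the .oxt file first.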
(Assignee)

Comment 13

3 years ago
OK, I see. So, let's go ahead :)
Thanks!
(Assignee)

Comment 14

3 years ago
Is there any progress with landing the wordlist on 2.2 and 2.5?
Maybe some additional action is needed to move forward?
(Assignee)

Comment 15

3 years ago
Created attachment 8693551 [details]
uk_wordlist.xml

OK, here is the right list, which contains 98931 words.

Information about this list (http://u-mova.blogspot.com/2013/09/blog-post.html):

Frequency list of Ukrainian language.
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/.
Copyright © 2013 Volodymyr Vlad

I converted the original file into an .xml file.
One thing I'm unsure about is the date="1377342507" in the header.
Please take a look and check it.

Thanks!
(Assignee)

Updated

3 years ago
Flags: needinfo?(kscanne)

Comment 16

3 years ago
Looks good to me.  The date in the header is just a timestamp, seconds since the Unix epoch, "date +%s" in your favorite shell.
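
For anyone who prefers Python to the shell, a small sketch of the same idea (the helper names are hypothetical): it converts a header value such as the one quoted in comment 15 into a readable UTC date and produces a fresh timestamp for a regenerated list.

```python
import time
from datetime import datetime, timezone


def to_utc_iso(epoch_seconds):
    """Render seconds since the Unix epoch as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def current_timestamp():
    """Return the current time as whole seconds since the Unix epoch."""
    return int(time.time())


if __name__ == "__main__":
    print(to_utc_iso(1377342507))   # the value from the wordlist header
    print(current_timestamp())      # equivalent of `date +%s`
```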
Flags: needinfo?(kscanne)
Artem,

Thanks for finding this wordlist! This looks great. Our dictionary creation script used to require frequencies to be in the range 1 to 255, but I don't think it does anymore. However, note that the frequencies you give will be linearly scaled to fit in 5 bits, so there are really just 32 frequency levels. This means that something like 95% or more of your words will be at the lowest frequency level, so autocorrect can't make good decisions about which words are more frequent than others. So I'd recommend that you take these raw frequency numbers and rescale them on a logarithmic scale so that they are more evenly distributed between 1 and 255 or 1 and 1000 or something.
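
To illustrate the point, here is a rough Python sketch. It is not the logic of the dictionary creation script: the two mapping functions are hypothetical, and the counts are synthetic, following an assumed Zipf-like shape (the word at rank r occurs about max/r times). Only the list size and the maximum count are taken from this bug.

```python
import math


def linear_level(freq, max_freq, levels=32):
    """Map a raw count onto 0..levels-1 in proportion to its size."""
    return min(levels - 1, freq * levels // max_freq)


def log_level(freq, max_freq, levels=32):
    """Map a raw count onto 0..levels-1 in proportion to its logarithm."""
    if freq <= 1:
        return 0
    return min(levels - 1, int(math.log(freq) / math.log(max_freq) * levels))


def zipf_like(max_freq, word_count):
    """Synthetic counts (an assumption): rank r occurs about max_freq / r times."""
    return [max(1, max_freq // rank) for rank in range(1, word_count + 1)]


def share_at_lowest(freqs, level_fn):
    """Fraction of words that end up in level 0 under the given mapping."""
    top = max(freqs)
    lowest = sum(1 for f in freqs if level_fn(f, top) == 0)
    return lowest / len(freqs)


if __name__ == "__main__":
    counts = zipf_like(1082616, 98931)
    print("linear:", share_at_lowest(counts, linear_level))
    print("log:   ", share_at_lowest(counts, log_level))
```

On this synthetic list the large majority of words land in the lowest level under linear scaling, while the logarithmic mapping spreads them across the available levels.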

Also, note that you can use f="0" to mark obscene words that you do not want to be suggested by autocorrect, but that should be considered valid words and not be autocorrected. You can look at the English wordlist if you want to see the long list of English words that Google thinks are profane.

Once you've adjusted the frequencies, you need to run the xml2dict.js script to create a uk.dict file. See https://bugzilla.mozilla.org/show_bug.cgi?id=1226981#c32 for a description of how this was done for Romanian.

Test out the .dict file to make sure it works well. If it does, then prepare a pull request that includes the wordlist and the .dict file, and also modifies the README file to include notes about how the Ukrainian wordlist was generated. Explain the original source and its CC license and briefly explain how you modified the frequencies from that source.

Once you've done that, you can ask me or :timdream to review and land the patch.

After that, the final step is to get the uk.dict file hosted on the Amazon S3 server so that it can be downloaded by FirefoxOS devices.  I don't know how to do that part, but :timdream can help you with that if you set the needinfo flag for him.

Comment 18

3 years ago
OK, at Artem's request I took a look at his word list. I added a list of relative frequencies scaled from the list found here http://u-mova.blogspot.com/2013/09/blog-post.html .

The formula for relative frequencies (in the xml, simply "freq") is: 
relative_freq = ceil (log(maxFreq/freq) / log(1.05622) + 0.5)

The constant 1.05622 is found by: exp (log(1082616) / 254), where 1082616 is the maximum pre-normalized frequency.
This gives values between 1 and 255, with 255 the least frequent words and 1 most frequent.
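
The formula above can be sketched in Python as follows. The function names are hypothetical; the base is recomputed from the maximum count rather than hard-coded, which yields approximately 1.05622.

```python
import math

MAX_FREQ = 1082616  # maximum pre-normalized frequency quoted above


def log_base(max_freq=MAX_FREQ, steps=254):
    """Base chosen so that log_base(max_freq) equals `steps`."""
    return math.exp(math.log(max_freq) / steps)


def relative_freq(freq, max_freq=MAX_FREQ):
    """Rank-like value in 1..255: 1 for the most frequent word, 255 for a count of 1."""
    base = log_base(max_freq)
    return math.ceil(math.log(max_freq / freq) / math.log(base) + 0.5)


if __name__ == "__main__":
    for raw in (MAX_FREQ, 100000, 1000, 1):
        print(raw, relative_freq(raw))
```

The added 0.5 before taking the ceiling keeps the most frequent word at 1 instead of 0, and a raw count of 1 maps to 255.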

Comment 19

3 years ago
Created attachment 8705885 [details]
new ukrainian wordlist with adjusted frequencies

If this looks alright I can also generate a .dict blob file for a potential patch.
(Assignee)

Comment 20

3 years ago
Hi Jobava,
Thank you very much for help with adjusting the frequencies.

Is it normal that the higher frequency numbers became the lower ones after adjusting?
I mean that the original list has descending frequencies, but the new list has ascending ones, while the word order was kept unchanged.
Looking at the lists for other locales, I guess it should be adjusted the opposite way (255 for the most frequent words and 1 for the least frequent).
Flags: needinfo?(jobaval10n)

Comment 21

3 years ago
It should be that way: the initial frequencies are the raw number of occurrences in the corpus of text that Volodymyr compiled. Volodymyr's list also had "relative frequencies" on a base-2 logarithm, but that ran out after just rank 21, since log2(1 million) ≈ 20. I stretched that by picking a smaller base as described above.

The designation "freq" here refers to relative frequencies.
Flags: needinfo?(jobaval10n)
(Assignee)

Comment 22

3 years ago
David, could you please review new list (Comment 19)...
Flags: needinfo?(dflanagan)
(Assignee)

Comment 23

3 years ago
Created attachment 8706290 [details]
New wordlist with recalculated frequency order

I recalculated the frequency order adjusted by Jobava, using the formula: 256-<frequency>

I also contacted Volodymyr, the author of this list, and he recommended recalculating it this way.
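
A small Python sketch of that inversion applied to wordlist lines. It assumes entries of the form <w f="N">word</w>; the element name is an assumption here, while the f attribute is the one discussed earlier in this bug. The function names are hypothetical. The value f="0" is left untouched, since it is the marker for words that should never be suggested.

```python
import re

# Matches an f="N" attribute; the \b avoids attributes that merely end in "f".
F_ATTR = re.compile(r'\bf="(\d+)"')


def invert_level(level):
    """Turn 1 (most frequent) .. 255 (least) into 255 (most) .. 1 (least)."""
    if level == 0:
        return 0  # keep the "do not suggest" marker as it is
    if not 1 <= level <= 255:
        raise ValueError("frequency level out of range: %d" % level)
    return 256 - level


def invert_line(line):
    """Rewrite every f="N" attribute on one line of the wordlist."""
    return F_ATTR.sub(
        lambda match: 'f="%d"' % invert_level(int(match.group(1))), line)


if __name__ == "__main__":
    print(invert_line('<w f="1">і</w>'))
    print(invert_line('<w f="255">приклад</w>'))
```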

Comment 24

2 years ago
Created attachment 8713090 [details]
.dict generated from the uk_wordlist.xml file

Hello, I am attaching the .dict file as requested by Sergiy and Artem. I used the xml2dict.js script in apps/keyboard/js/imes/latin/dictionaries/

It looks like the .dict compilation instructions need a little attention, as other people are having issues with using them.
(Assignee)

Comment 25

2 years ago
Thanks a lot for your help, Jobava!
Next I'm going to create a PR using the latest uk_wordlist.xml and uk.dict
Flags: needinfo?(dflanagan)
Created attachment 8714892 [details] [review]
[gaia] stenox:master > mozilla-b2g:master
(Assignee)

Comment 27

2 years ago
Comment on attachment 8714892 [details] [review]
[gaia] stenox:master > mozilla-b2g:master

Please review PR.
Attachment #8714892 - Flags: review?(timdream)
Comment on attachment 8714892 [details] [review]
[gaia] stenox:master > mozilla-b2g:master

Looks good, thanks for the contribution. However you would need to fix the build tests.

https://treeherder.mozilla.org/#/jobs?repo=gaia&revision=dd7d25d44aa5946d6f54fea1ab42f507cbb41885&selectedJob=3491105
Attachment #8714892 - Flags: review?(timdream) → feedback+
Assignee: nobody → a.polivanchuk

Comment 29

2 years ago
Is this a matter of broken tests, or the uploaded files broke some things they shouldn't have?

I don't understand the errors, or where in that web interface I should look, or how to mentally parse them.
Will be back after the holiday here...
Flags: needinfo?(timdream)
Artem,

Please refer to this commit I added for bug 1033185. The keyboard build tests assert which layouts are included in the build. You would need to include the newly added layout in the expected lists here so the test code knows the addition is legitimate.

https://github.com/timdream/gaia/commit/8be51e8e5020c8855ad2722819c1942657b078b6

Thanks for helping out!
Flags: needinfo?(timdream) → needinfo?(a.polivanchuk)
(Assignee)

Comment 32

2 years ago
Tim, I modified the files and updated the PR, but there are 5 failures again.
There might be something wrong with my changes.
Could you please take a look?

https://github.com/mozilla-b2g/gaia/pull/34020
Flags: needinfo?(a.polivanchuk)
(Assignee)

Comment 34

2 years ago
Yes, I see. But, unfortunately, I have no idea what's the problem and how to fix it.
Flags: needinfo?(a.polivanchuk)
I am not sure about your commitment, but if you have more time to deal with this issue, you could try to run the tests locally to see their output. The command to run is:

> make build-test-integration TEST_FILES=apps/keyboard/test/build/integration/keyboard_test.js

Sorry about all the troubles.
Flags: needinfo?(a.polivanchuk)

Comment 37

2 years ago
Hey Tim, thanks for your continued guidance.

The layoutIds array already has a 'uk' element.
Flags: needinfo?(timdream)
(Assignee)

Comment 38

2 years ago
Actually, I already added the new layout to the layoutIds array in the last PR update.
Flags: needinfo?(a.polivanchuk)

Comment 39

2 years ago
Hi, Artem

Thanks for your patch. The reason for the build failure is that we added a dict for uk, so our build system no longer preloads the layout by default.

We need to delete this line to match the manifest; then the test should pass.
https://github.com/stenox/gaia/blob/370511577d3fccf66310c84d049a8c7b12ffb3bb/apps/keyboard/test/build/integration/keyboard_test.js#L72

Hope it helps.
Thanks Ray for helping out!
Flags: needinfo?(timdream)
Created attachment 8721318 [details] [review]
[gaia] raylin:uk-layout > mozilla-b2g:master
merged, master: https://github.com/mozilla-b2g/gaia/commit/a6ecae635719115aa72465efe522fdced3dd1d70
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED