Closed Bug 1226981 Opened 10 years ago Closed 10 years ago

Romanian keyboard is missing prediction for many words with '-'

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: mihaibn, Assigned: jobaval10n, Mentored)

Details

(Keywords: foxfood, Whiteboard: [bzlite])

Attachments

(6 files, 16 obsolete files)

logshake-about_memory-205.json.gz 10 years ago Mihai Barbat 134.51 KB, application/gzip		Details
cache-recovery-last_log.log 10 years ago Mihai Barbat 39.54 KB, application/octet-stream		Details
cache-recovery-last_install.log 10 years ago Mihai Barbat 43 bytes, application/octet-stream		Details
system-b2g-application.ini 10 years ago Mihai Barbat 602 bytes, application/octet-stream		Details
proc-vmstat.log 10 years ago Mihai Barbat 1.63 KB, application/octet-stream		Details
proc-vmallocinfo.log 10 years ago Mihai Barbat 14.22 KB, application/octet-stream		Details
proc-version.log 10 years ago Mihai Barbat 164 bytes, application/octet-stream		Details
proc-uptime.log 10 years ago Mihai Barbat 13 bytes, application/octet-stream		Details
proc-meminfo.log 10 years ago Mihai Barbat 1.01 KB, application/octet-stream		Details
proc-kmsg.log 10 years ago Mihai Barbat 64.42 KB, application/octet-stream		Details
proc-cmdline.log 10 years ago Mihai Barbat 312 bytes, application/octet-stream		Details
dev-log-radio.log 10 years ago Mihai Barbat 5.25 KB, application/octet-stream		Details
dev-log-system.log 10 years ago Mihai Barbat 6.07 KB, application/octet-stream		Details
dev-log-main.log 10 years ago Mihai Barbat 70.42 KB, application/octet-stream		Details
properties.log 10 years ago Mihai Barbat 9.50 KB, application/octet-stream		Details
screenshot.png 10 years ago Mihai Barbat 494.14 KB, image/png		Details
ro.dict -- the original binary dictionary for Romanian 10 years ago Jobava 961.76 KB, application/octet-stream		Details
ro_test.dict -- file generated by running xml2dict.js 10 years ago Jobava 961.76 KB, application/octet-stream		Details
Words in the Gaia dictionary not present in the Chrome dictionary 10 years ago Jobava 86.55 KB, text/plain		Details
ro_wordlist.xml 10 years ago Kevin Scannell 3.85 MB, application/xml		Details
ro.dict 10 years ago Mihai Barbat 998.50 KB, application/octet-stream		Details
https://github.com/mozilla-b2g/gaia/pull/33639 10 years ago Tim Guan-tin Chien [:timdream] (please needinfo) 46 bytes, text/x-github-pull-request	timdream : review+	Details | Review

Mihai Barbat

Reporter

Description

•

10 years ago

User-Agent: Mozilla/5.0 (Mobile; rv:45.0) Gecko/45.0 Firefox/45.0

I installed the Romanian keyboard from Settings->Keyboard and, while using it, I noticed that the dictionary behind it is very poor and is missing many words that contain the character '-'. This makes typing with it very cumbersome.

Comment hidden (obsolete)

Mihai Barbat

Reporter

Comment 2

•

10 years ago

Attached file cache-recovery-last_log.log (obsolete) — Details

Comment hidden (obsolete)

Mihai Barbat

Reporter

Updated

•

10 years ago

Component: Gaia::Feedback → Gaia::Keyboard

Mihai Barbat

Reporter

Comment 17

•

10 years ago

Sorry for the unnecessary attachments. I created the bug using Bugzilla Lite and I missed removing them.

Mihai Barbat

Reporter

Updated

•

10 years ago

Flags: needinfo?(cristian.silaghi)

Cristian Silaghi

Updated

•

10 years ago

Flags: needinfo?(cristian.silaghi)

Cristian Silaghi

Comment 18

•

10 years ago

I have no idea about this, nor what we can do. Maybe the dictionary is outdated? Maybe we need another dictionary? I don't even use T9, to be honest. Best Regards, Cristian Silaghi

Mihai Barbat

Reporter

Comment 19

•

10 years ago

Well, I use it a lot when I write. The prediction is very primitive and able to guess the Romanian words that contain '-' (mi-au, si-au, ti-au, v-am...), which makes it very cumbersome to type. Maybe the issue is that these words are missing from the dictionary. Can anybody check this, please?

Ioana Chiorean

Comment 20

•

10 years ago

I can confirm this. Asking for NI to Flod, Delphine as they might let us know how we can proceed.

Status: UNCONFIRMED → NEW

Ever confirmed: true

Flags: needinfo?(lebedel.delphine)

Flags: needinfo?(francesco.lodolo)

Mihai Barbat

Reporter

Comment 21

•

10 years ago

Small correction to what I said in Comment 19: the prediction is very primitive and NOT able to guess the Romanian words that contain '-'. I missed the negation.

Francesco Lodolo [:flod]

Comment 22

•

10 years ago

Dictionaries are hosted here: https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries

For the hyphen question, Tim is probably a better pick than me or Delphine.

@reporter: You can hide attachments by going into the details (Details->Edit details) and marking them as obsolete (if they are).

Flags: needinfo?(timdream)

Flags: needinfo?(lebedel.delphine)

Flags: needinfo?(francesco.lodolo)

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690548 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690549 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690550 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690551 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690552 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690553 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690554 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690555 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690556 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690557 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690558 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690559 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690560 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690561 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690562 - Attachment is obsolete: true

Mihai Barbat

Reporter

Updated

•

10 years ago

Attachment #8690563 - Attachment is obsolete: true

Cristian Silaghi

Comment 23

•

10 years ago

@flod, do you know if we can use another dictionary for the Romanian locale? E.g. the Google/Android dictionary? Their dictionary is superior to ours. Or do they use some non-permissive license? Best Regards, Cristian Silaghi

Mihai Barbat

Reporter

Comment 24

•

10 years ago

FYI, I looked a bit at how Android is doing it and found this: https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz

There are 1,125,287 Romanian words there, compared to only 115,832 in the Mozilla dictionary!

Francesco Lodolo [:flod]

Comment 25

•

10 years ago

Note that it's not a standard dictionary: https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/README.md

I don't know about licenses for dictionaries. Gaia is released under the Apache License: https://github.com/mozilla-b2g/gaia/blob/master/LICENSE

Jobava

Assignee

Comment 26

•

10 years ago

Does anyone happen to know what format the .dict files are in? They are listed here: https://github.com/mozilla-b2g/gaia/tree/master/apps/keyboard/js/imes/latin/dictionaries

We could convert the wordlist in Android if the license makes that possible: https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz

What is the relationship between the .xml and .dict files?

Basically, for Romanian, a few words contain a dash, like "într-o/într-un" (en: "in a [something]"), but most uses of the dash indicate contractions with prepositions or pronouns: "să îți -> să-ți", so typing prediction from a list of single words will have this problem.

Axel Hecht [:Pike]

Comment 27

•

10 years ago

Clearing needinfo for Tim for now. I found that the hungarian wordlist does contain words with hyphens, like "szeretet-teli". Could you verify that those work? (Don't have a current build). The wordlist itself seems to date back to a mass-add by dflanagan from bug 908286, hard to tell what went into the actual word lists back then. Also CCing Kevin, in case he has a good idea.

Flags: needinfo?(timdream)

Comment hidden (obsolete)

Mihai Barbat

Reporter

Comment 29

•

10 years ago

(In reply to Axel Hecht [:Pike] from comment #27)
> Clearing needinfo for Tim for now.
>
> I found that the hungarian wordlist does contain words with hyphens, like
> "szeretet-teli".
>
> Could you verify that those work? (Don't have a current build).

I just installed the Hungarian keyboard and typed: szeretetteli and I get the prediction. See the screenshot here: http://i.imgur.com/SGxkUvh.png

Kevin Scannell

Comment 30

•

10 years ago

I created the autocorrect dictionary, and I think I see the problem. The frequencies came from a big web-crawled corpus of Romanian. But since there are many misspellings in web texts, I usually only keep the words validated by some open source spell checker. I used the Mozilla one for Romanian, here: https://addons.mozilla.org/en-US/firefox/addon/romanian-spellchecking-diction/ But the affix file in this addon doesn't declare the hyphen as a word character at all, presumably choosing instead to just break on hyphens and spell check the pieces.
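
To illustrate the effect described above, here is a minimal sketch of how the choice of word characters changes what ends up in a word list. It is not the spell checker's or the crawler's actual code; the regular expressions and the `tokenize` helper are assumptions made for illustration only.

```python
import re

# Letters only: a hyphen ends the token, which is how a checker that
# breaks on hyphens would see the text.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
# Letters with internal hyphens kept, so "mi-au" stays one token.
WITH_HYPHENS = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(text, keep_hyphens):
    """Return lowercase tokens, optionally treating '-' as a word character."""
    pattern = WITH_HYPHENS if keep_hyphens else LETTERS_ONLY
    return pattern.findall(text.lower())


sample = "Mi-au spus că într-o zi v-am văzut."
split_tokens = tokenize(sample, keep_hyphens=False)
whole_tokens = tokenize(sample, keep_hyphens=True)
```

With hyphens excluded, forms such as "mi-au" only ever appear as the pieces "mi" and "au", so the hyphenated form itself can never be validated or counted.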

Mihai Barbat

Reporter

Comment 31

•

10 years ago

I see. So the result is a file which is 10x smaller than the one from https://github.com/omnirom/android_packages_inputmethods_LatinIME/blob/android-5.1/dictionaries/ro_wordlist.combined.gz

Can we re-use this list, or any other large Android Romanian dictionary, to update the Mozilla Romanian dictionary?

Jobava

Assignee

Comment 32

•

10 years ago

So after better reading comprehension of this: https://github.com/mozilla-b2g/gaia/blob/master/apps/keyboard/js/imes/latin/dictionaries/README.md

It looks like the xx.dict file is generated from the xx_wordlist.xml file using xml2dict.js. However, my run of xml2dict.js produces a different output, although with the same file size. 193 kB of the 961 kB are different:

cmp -l ro.dict ro_test.dict | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}' | wc -l

Attached are the two different files for comparison.
ro.dict is the original file from the repo.
ro_test.dict is the file generated by running xml2dict.js against ro_wordlist.xml with the command:

node --harmony xml2dict.js -o lang.dict lang_wordlist.xml

My question: is ro.dict generated automatically in the build process of gaia, or does it have to be added manually after each change to the corresponding xml file? If it's automatic, then we don't have to deal with that at all and just propose changes to ro_wordlist.xml

Jobava
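
For anyone without cmp and gawk at hand, the same byte-level comparison can be sketched in Python. This is only an illustrative equivalent of `cmp -l ... | wc -l`; the helper names `count_differing_bytes` and `compare_files` are made up for this sketch.

```python
from pathlib import Path


def count_differing_bytes(a, b):
    """Count positions where two byte strings differ over their common length."""
    return sum(x != y for x, y in zip(a, b))


def compare_files(path_a, path_b):
    """Report both file sizes and the number of differing byte positions."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    return {
        "size_a": len(a),
        "size_b": len(b),
        "differing": count_differing_bytes(a, b),
    }
```

Calling `compare_files("ro.dict", "ro_test.dict")` would then give the two sizes and the count of differing positions in one step.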

Jobava

Assignee

Comment 33

•

10 years ago

Attached file ro.dict -- the original binary dictionary for Romanian — Details

Jobava

Assignee

Comment 34

•

10 years ago

Attached file ro_test.dict -- file generated by running xml2dict.js — Details

Mihai Barbat

Reporter

Comment 35

•

10 years ago

so what's the plan now that we have all this information? Can anyone with more knowledge reply on this?

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 36

•

10 years ago

(In reply to Mihai Barbat from comment #35)
> so what's the plan now that we have all this information? Can anyone with
> more knowledge reply on this?

What's the question to answer here? As far as I can see, the question to me was answered in comment 27.

(In reply to Jobava from comment #32)
> My question: is ro.dict generated automatically in the build process of
> gaia, or does it have to be added manually after each change to the
> corresponding xml file? If it's automatic, then we don't have to deal with
> that at all and just propose changes to ro_wordlist.xml

Assuming this is the question to answer: no, dict files are not generated in the build process (although it should). They are checked-in. So if you propose a change please also rebuild the dict file.

Also, per finding documented in [1] the sorting of dict build process is not stable, so it is possible to generate a different dict even if the list is unchanged.

[1] https://wiki.mozilla.org/Gaia/System/Keyboard/IME/Latin/Prediction_%26_Auto_Correction

Mihai Barbat

Reporter

Comment 37

•

10 years ago

ok, so do we have a green light to create a new dictionary then, starting maybe from the android one?

Mihai Barbat

Reporter

Comment 38

•

10 years ago

ping!

Delphine Lebédel

Comment 39

•

10 years ago

Hey Mihai - I think :timdream already answered in comment 36, see: "Assuming this is the question to answer: no, dict files are not generated in the build process (although it should). They are checked-in. So if you propose a change please also rebuild the dict file. Also, per finding documented in [1] the sorting of dict build process is not stable, so it is possible to generate a different dict even if the list is unchanged." Just go ahead and make the suggested changes.

Jobava

Assignee

Comment 40

•

10 years ago

First, I don't understand the purpose for the field "flags", it's not present in either the Chrome or the Gaia dictionary. I will just ignore it for now.

Discoveries so far:

In the Chrome dictionary, there are only a few words where "freq" and "originalFreq" are different:

"aia" freq: 30, origFreq: 72
"alea" freq: 30, origFreq: 72
"asta" freq: 30, origFreq: 108
"astea" freq: 30, origFreq: 72
"de-a" freq: 30, origFreq: 72
"s-o" freq: 30, origFreq: 106
"v-a" freq: 30, origFreq: 72
"v-ar" freq: 30, origFreq: 72
"ăia" freq: 30, origFreq: 72
"ăla" freq: 30, origFreq: 72
"ăsta" freq: 30, origFreq: 72
"ăștia" freq: 30, origFreq: 72

In the Chrome dictionary: the minimum freq is 30 and maximum is 210. Of the words:

77.44% are at frequency 72
12.57% are at frequency 80
5.1% are at frequency 85
1.4% are at frequency 89
the rest are below 1%

In the Gaia dictionary: minFreq is 2, maxFreq is 255

(frequency, proportion of words)
2 30.16%
14 13.50%
22 7.97%
28 5.52%
32 4.11%
36 3.34%
39 2.70%
42 2.17%
44 1.87%
46 1.60%
48 1.44%
50 1.26%
51 1.05%
53 1.05%
54 0.94%

So we note that Gaia has a different distribution of frequency. Should I normalize the chrome dictionary's freqs to have the same distribution as Gaia? Should I just keep Chrome's freqs?
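
A distribution like the ones above can be computed with a short script. The sketch below assumes the wordlist XML stores each entry as a `<w f="...">word</w>` element, which is how the Android-derived format is commonly laid out; that layout, the helper name, and the sample data (made-up words and frequencies) are assumptions, not taken from the attached files.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def frequency_distribution(xml_text):
    """Return {frequency: percentage of words at that frequency}."""
    root = ET.fromstring(xml_text)
    # Assumed layout: each word is a <w> element with an integer "f" attribute.
    counts = Counter(int(w.get("f")) for w in root.iter("w"))
    total = sum(counts.values())
    return {freq: 100.0 * n / total for freq, n in sorted(counts.items())}


# Made-up sample data, only to show the shape of the input and output.
sample = """<wordlist locale="ro">
  <w f="72">unu</w>
  <w f="72">doi</w>
  <w f="80">trei</w>
  <w f="210">patru</w>
</wordlist>"""
dist = frequency_distribution(sample)
```

Running the same function over both wordlists would make the two distributions directly comparable before deciding whether to normalize.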

Jobava

Assignee

Comment 41

•

10 years ago

Another finding: there are 6177 words in the Gaia dictionary that are not present in Chrome. Some of those words lack diacritics or contain typos. These words tend to be names of persons, politicians, place names etc. It'll take a while to go through the list and correct the typos; the Gaia list looks lower quality and could even be scrapped (IMHO). Attaching that file to the bug report so you can look at it as well.
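
The comparison itself is a plain set difference. A minimal sketch, assuming both dictionaries have already been reduced to lists of words (the helper name and the example words are hypothetical):

```python
def only_in_first(first_words, second_words):
    """Return the sorted words that occur in the first list but not the second."""
    second = set(second_words)
    return sorted(w for w in set(first_words) if w not in second)


# Hypothetical example: a name missing its diacritic shows up as "Gaia only".
gaia_words = ["casa", "Bucuresti", "mi-au"]
chrome_words = ["casa", "mi-au", "București"]
missing = only_in_first(gaia_words, chrome_words)
```

The comparison is exact and case-sensitive, so entries that differ only by a diacritic are reported as missing, which matches the kind of entries seen in the attached list.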

Jobava

Assignee

Comment 42

•

10 years ago

Attached file Words in the Gaia dictionary not present in the Chrome dictionary — Details

Jobava

Assignee

Comment 43

•

10 years ago

So apparently it takes a very long time (hours) for the .dicts to be generated and now I see why that process may not be automatic in the build of gaia. Anyway, I have for now a patch containing just a dumb transfer of the Chrome dictionary, discarding the Gaia one: https://github.com/Jobava/gaia/commit/124420602a63235aec4dca10e9a0e0d71f95cadb The other choice for now is adding just the dash-containing words from Chrome to Gaia, to get a smaller word list. If the current Gaia wordlist is to be preserved, it needs some corrections and manual curation (for example, names of politicians don't really belong there). @Kevin do you know how other languages do it?

Flags: needinfo?(kscanne)

Kevin Scannell

Comment 44

•

10 years ago

I haven't worked on the .dict files - I've just created the XML word frequency lists. Whatever you end up deciding in terms of the autocorrect list, it would be worth submitting your notes + corrections to the upstream maintainer of the spellchecker I referenced in Comment 30.

Flags: needinfo?(kscanne)

David Flanagan [:djf]

Comment 45

•

10 years ago

This is an interesting bug! To start I'll note that things would have been easier if anyone had asked for input from keyboard peers early on in the process. You wouldn't have had to figure everything out on your own.

Jobava: nice work on figuring out the wordlist format and how to create dictionary files. If you've got more questions about this, I can probably answer them. Note that the wordlist xml format we use was copied from Google's android code. But then they changed to their new .combined format, which is why you had to convert the wordlist format.

In order to get a formal review of your patch, you'll need to create an attachment that links to github, and then request review by setting r?. Rudy might be the best reviewer for this, though I'm not certain about that.

So here are some specific comments:

1) The frequency distribution for this wordlist is odd. As the author of the algorithm, I'd expect to get best results when the distribution is spread out to use the full 8-bit range. Also, the .dict format compresses the frequencies down into a 5-bit range, so the words with frequency 72 and the words with frequency 80 might get the same frequency in the .dict file which would make them indistinguishable. I don't know if that is happening, but you might check on it. In general, it would be nice to be able to use the google wordlists unmodified, but in this case, it might be safer to adjust some of those frequency levels to spread them out a bit more.

2) The wordlist and .dict file are both about twice as big as the next-largest, for Hungarian. And we could never have shipped files this large in the earlier versions of FirefoxOS designed for relatively low-end devices. Just listing all possible hyphenated combinations of words seems like a brute-force approach to autocorrect for Romanian. I guess google is doing it, so it must work okay. And we do have more powerful devices now, so maybe we can get away with this. But as part of the review process, I think we need some data about how well it works. Bringing up a Romanian keyboard will load the 8mb .dict file. Does this introduce a noticeable lag? Do word suggestions show up quickly enough as you type? Also, with a dictionary this large, it could be that some of the search parameters (how deeply to search, how many candidates to retain, etc.) might need to be tweaked to get best results.

3) How obscure are the lowest-frequency words in the google wordlist? If you cut all of the words at f=72 and below, you'd have a wordlist that was 1/4 the size and still contained 250K words, which is twice as many as we have now. How good is the autocorrect performance of that reduced wordlist? I'm guessing it is still better than what we have now, and should support hyphens. If you think that is good enough, it gets us down to the size level of the german dictionary, I think.

4) I've only just now realized that the size of the dictionary is not due to hyphenated words. This huge 1M word wordlist only has 7K hyphenated words in it. I think that the issue is that Romanian has lots of possible suffixes for words and that our autocorrect technology doesn't have any alternative except to just list them all. (I think I remember having a discussion with Kevin about the problem with Romanian and the need for different algorithms for autocorrect and spellcheck in languages like it.)

Anyway, I see three ways to proceed here:

1) Just use the giant dictionary from Google. We'd want to check that the performance is okay first. And you'd want to consider adjusting the word frequencies to spread them out more.

2) Use a truncated version of the Google wordlist to get more words without having to worry so much about performance. (Still probably worth rescaling the frequencies, though) If you do this, you'll want to update dictionaries/README.md to describe the origin of the wordlist and how it was modified. I think that this might be the best approach, though it would depend on whether actual Romanian speakers think that it is good enough.

3) Just add the hyphenated words from this new wordlist to the existing wordlist as mentioned in comment #43. It would probably be tricky to add the new words at the appropriate place in the old list, but you could probably figure something out. I think that option 2 is probably better than this one, though.
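
One way to check the concern in point 1) is to run the frequency levels through a candidate 8-bit to 5-bit mapping and see which ones merge. The sketch below assumes the mapping simply drops the low three bits (`freq >> 3`); that is a hypothetical stand-in, not the mapping xml2dict.js actually uses, so the real collisions may differ.

```python
from collections import defaultdict


def to_5bit(freq):
    """Hypothetical mapping of an 8-bit frequency (0..255) onto 0..31."""
    return freq >> 3


def collisions(levels, quantize=to_5bit):
    """Group distinct input levels that land on the same quantized value."""
    buckets = defaultdict(list)
    for level in sorted(set(levels)):
        buckets[quantize(level)].append(level)
    # Keep only the quantized values shared by more than one input level.
    return {q: group for q, group in buckets.items() if len(group) > 1}


# The four most common levels reported for the Chrome list in comment 40.
chrome_levels = [72, 80, 85, 89]
merged = collisions(chrome_levels)
```

Under this assumed mapping, 72 and 80 stay distinct while 80 and 85 share a value; swapping in the converter's real mapping as `quantize` would give the authoritative answer.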

Flags: needinfo?(jobaval10n)

David Flanagan [:djf]

Comment 46

•

10 years ago

I said above that Rudy might be the best reviewer, but I forgot that he is not at Mozilla anymore. So when you're ready, the best thing is probably to ask :timdream for a review. He can transfer the review request to the right person on his team or to me.

Jobava

Assignee

Comment 47

•

10 years ago

Thank you for the review, David. I will look into rescaling/normalization, also at pruning the list a little. For Romanian, there are about 1500 inflection rules and it may be possible to auto-generate inflections based on just "base forms" (like the people at Dexonline: https://dexonline.ro/modele-flexiune?modelType=V&locVersion=5.0 -- just the verb forms, for example) , but at added complexity and it's language-specific. Will get back to you.

Flags: needinfo?(jobaval10n)

Kevin Scannell

Comment 48

•

10 years ago

If you produce the clean wordlist you want to use, I can then add the frequencies from a biggish (20 million word) web corpus, scaled appropriately. This is essentially David's approach 2) in comment 45, truncating according to actual usage on the web. It should produce something of reasonable size.
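
As a rough illustration of what "scaled appropriately" could look like, the sketch below maps raw corpus counts onto the 1..255 range using a logarithmic scale. The log scale, the range, and the helper name are all assumptions for illustration; the scaling actually used for the attached wordlist is not described in this bug.

```python
import math


def scale_counts(counts, lo=1, hi=255):
    """Map raw counts to integers in [lo, hi] on a log scale (an assumed choice)."""
    # Words that never occur in the corpus are dropped.
    logs = {w: math.log(c) for w, c in counts.items() if c > 0}
    if not logs:
        return {}
    min_log, max_log = min(logs.values()), max(logs.values())
    span = max_log - min_log
    if span == 0:
        # All words equally frequent: give them all the top value.
        return {w: hi for w in logs}
    return {w: lo + round((v - min_log) / span * (hi - lo)) for w, v in logs.items()}


# Made-up counts, only to show the behaviour.
scaled = scale_counts({"de": 1000, "mi-au": 10, "rar": 1, "absent": 0})
```

A log scale spreads the long tail of rare words over more of the available range than a linear one would, which is in the spirit of the advice in comment 45 to use the full 8-bit range.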

Jobava

Assignee

Comment 49

•

10 years ago

Kevin, your word list does not contain hyphenated words, does it? Can you make that public? We may need to have frequencies from one list compatible with the other, I guess (if it's indeed an issue).

Flags: needinfo?(kscanne)

Jobava

Assignee

Comment 50

•

10 years ago

David, can you give us a few details on the actual binary format of the .dict file, to get an idea of what the frequencies actually look like when used by Gaia?

Flags: needinfo?(dflanagan)

Jobava

Assignee

Comment 51

•

10 years ago

Ah sorry, David, just found the resource I need. Will get back to you after more progress. https://wiki.mozilla.org/Gaia/System/Keyboard/IME/Latin/Dictionary_Blob

Flags: needinfo?(dflanagan)

Kevin Scannell

Comment 52

•

10 years ago

I can tokenize the texts and include hyphens, so yes it will have the hyphens. The previous iteration did not only because the Mozilla spellchecker did not. Top 50000 words from the web corpus are downloadable here: http://crubadan.org/languages/ro but I'd do a fresh crawl to add more texts probably.

Flags: needinfo?(kscanne)

Jobava

Assignee

Comment 53

•

10 years ago

Kevin, I am adding a list of words which is a combination of the Chrome wordlist and that from dexonline.ro. The Dexonline list is comprehensive and reasonably complete with inflection forms of Romanian and the Chrome one includes a few extra and also the hyphenated words. Can you add frequencies to this list? We'll then truncate the list at some acceptable frequency limit so that it can work in Gaia.

Flags: needinfo?(kscanne)

Jobava

Assignee

Comment 54

•

10 years ago

I hosted the file here, since it can't be added as an attachment: https://raw.githubusercontent.com/Jobava/mozilla-localization/master/chrome_plus_dexonline_sorted_uniq.txt (right-click, save as, about 13MB, one word per line)

Kevin Scannell

Comment 55

•

10 years ago

Attached file ro_wordlist.xml — Details

Flags: needinfo?(kscanne)

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 56

•

10 years ago

Thanks :djf for helping out! I am going to put myself in the mentor field so I don't forget to reply here, but I might not be as helpful as :djf on the prediction part of the Keyboard app. David, maybe you should consider adding yourself there too, assuming your mail filter works the same way as mine?

(In reply to David Flanagan [:djf] from comment #45)
> 2) The wordlist and .dict file are both about twice as big as the
> next-largest, for Hungarian. And we could never have shipped files this
> large in the earlier versions of FirefoxOS designed for relatively low-end
> devices. Just listing all possible hyphenated combinations of words seems
> like a brute-force approach to autocorrect for Romanian. I guess google is
> doing it, so it must work okay. And we do have more powerful devices now, so
> maybe we can get away with this. But as part of the review process, I think
> we need some data about how well it works. Bringing up a Romanian keyboard
> will load the 8mb .dict file. Does this introduce a noticeable lag? Do word
> suggestions show up quickly enough as you type? Also, with a dictionary
> this large, it could be that some of the search parameters (how deeply to
> search, how many candidates to retain, etc.) might need to be tweaked to get
> best results.

I have only one small correction here: bug 1128396 has since been implemented and enables users to download a dictionary from Amazon S3, instead of always requiring a dictionary to be included in the build. Size is still a concern, but it's not a major one now.

Once you update the .dict file in the Gaia tree, please file another bug to update the copy on Amazon S3 as well. Thanks!

Assignee: nobody → jobaval10n

Mentor: timdream

Status: NEW → ASSIGNED

Jobava

Assignee

Comment 57

•

10 years ago

Kevin, the list looks good. You just arbitrarily chopped off words at frequency 1? (least frequent, most numerous)

@Mihai I am unable to build the .dict since that requires cloning or downloading the repo here https://github.com/mozilla-b2g/gaia/ and my internet connection to github is too slow to be workable.

In console mode, basically, you download the zip, unpack it, and run:

cd gaia/apps/keyboard/js/imes/latin/dictionaries/
npm install
rm -f ro.dict
rm -f ro_wordlist.xml
wget 'https://bugzilla.mozilla.org/attachment.cgi?id=8701611' -O ro_wordlist.xml
node --harmony xml2dict.js -o ro.dict ro_wordlist.xml

That should generate a new .dict from the attachment Kevin made.

Flags: needinfo?(kscanne)

Jobava

Assignee

Updated

•

10 years ago

Flags: needinfo?(mihaibn)

Kevin Scannell

Comment 58

•

10 years ago

(In reply to Jobava from comment #57)
> Kevin, the list looks good. You just arbitrarily chopped off words at
> frequency 1? (least frequent, most numerous)

That's right - I only kept words that appeared at least once in a ~15 million word corpus crawled from the web.

Flags: needinfo?(kscanne)

Mihai Barbat

Reporter

Comment 59

•

10 years ago

Attached file ro.dict — Details

Here is the generated file

Flags: needinfo?(mihaibn)

Bogdancev

Comment 60

•

10 years ago

Jobava, could you please explain how to run that script? Which environment to use?

Jobava

Assignee

Updated

•

10 years ago

Flags: needinfo?(jobaval10n)

Jobava

Assignee

Comment 61

•

10 years ago

OK, I updated the old pull request with the updated files. https://github.com/mozilla-b2g/gaia/pull/33639 David, is this good for merging?

Flags: needinfo?(jobaval10n) → needinfo?(dflanagan)

Tim Guan-tin Chien [:timdream] (please needinfo)

Updated

•

10 years ago

Flags: needinfo?(timdream)

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 62

•

10 years ago

Attached file https://github.com/mozilla-b2g/gaia/pull/33639 — Details

I am just going to LGTM (looks good to me) this patch, because no process exists to verify a dictionary change other than feedback from actual users in the community, apart from making sure it's legal (it is, since it comes from AOSP). Thanks for contributing. The only thing preventing this pull request from being merged is the lack of a bug number in the commit message. Please amend the commit message so we can get this merged.

Flags: needinfo?(timdream)

Flags: needinfo?(dflanagan)

Attachment #8719369 - Flags: review+

Jobava

Assignee

Comment 63

•

10 years ago

Thanks, Tim, I amended the commit message and updated the PR.

Flags: needinfo?(timdream)

Tim Guan-tin Chien [:timdream] (please needinfo)

Comment 64

•

10 years ago

master: https://github.com/mozilla-b2g/gaia/commit/681b7736460116d120f0070c8b406fc6cbcb512e

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Flags: needinfo?(timdream)

Resolution: --- → FIXED
