Closed Bug 1401575 Opened 2 years ago Closed 4 months ago
Update Bulgarian hyphenation dictionary
72.48 KB, patch
47 bytes, text/x-phabricator-request
The current dictionary was found to be inadequate, so I propose an update.
The corresponding license.
Are the files usable by themselves or should I prepare a patch? If the files are OK should I convert them to UTF-8? Should I rename the dictionary file to match the current one?
(In reply to Stoyan Dimitrov [:stoyan] from comment #2)
> Are the files usable by themselves or should I prepare a patch? If the files
> are OK should I convert them to UTF-8? Should I rename the dictionary file
> to match the current one?

Yes, converting to UTF-8 would be helpful as a first step. Attaching an actual patch (https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch) that replaces the current file with the new one would make it easier to review and test the change. We'll want it to be named consistently with the existing practice.

Is there any information regarding the expected changes in behavior compared to the old dictionary?

You might also want to submit an update to the tex-hyphen project (see http://tug.org/tex-hyphen/), where an extensive collection of patterns is maintained. It's helpful for projects like Mozilla to be able to rely on a single "upstream" source for these resources.
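For illustration only, the UTF-8 conversion step could look roughly like the sketch below. It assumes the source file is encoded in Windows-1251 and that its first line names the charset, as is common for libhyphen-style dictionaries; neither is confirmed in this bug. The function name and the list of recognized charset labels are hypothetical.

```python
# Assumption: the dictionary is Windows-1251 text whose first line declares
# the charset. Both points are guesses, not facts stated in this bug.
KNOWN_CP1251_LABELS = {"microsoft-cp1251", "cp1251", "windows-1251"}


def convert_hyph_dic_to_utf8(data: bytes, source_encoding: str = "cp1251") -> bytes:
    """Re-encode a hyphenation dictionary as UTF-8 and update its charset line."""
    lines = data.decode(source_encoding).splitlines()
    # Only rewrite the first line if it really looks like a charset declaration.
    if lines and lines[0].strip().lower() in KNOWN_CP1251_LABELS:
        lines[0] = "UTF-8"
    return ("\n".join(lines) + "\n").encode("utf-8")
```

A real conversion would read the bytes from the attached file and write the result under the name used by the existing dictionary in the tree.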
Except that it is more than 3 times longer than the original, I cannot give you more technical details about this dictionary. As a side note, this is the hyphenation dictionary used in OpenOffice/LibreOffice. I've sent a request to tug.org to add this dictionary to the list.
The information I have gathered in the meantime indicates that the dictionary implements the official hyphenation algorithm published by the Institute for Bulgarian Language in the Official Spelling Dictionary, which is the normative reference for Bulgarian spelling.
This makes a pretty major change to the behavior of Bulgarian content with "hyphens:auto"; compare the "test" and "reference" renderings of the testcase at [1] for a very brief example.

The new patterns are much less "liberal" in allowing break points, but I'm not able to judge whether this is an overall improvement or not. For example, the old patterns allow "раждат" and "равни" to be hyphenated as "раж-дат" and "рав-ни", while the new ones don't; I have no idea whether those are examples of good or bad hyphenations. (Though judging by [2], at least, it looks like the hyphenation of "раж-дат", as done by the old patterns, would be correct.)

I'd like to see some more extensive feedback on the changed behavior -- e.g. an examination by Bulgarian readers of the hyphenations found in a reasonably large corpus of words -- before we decide on a change here. Ideally, we'll make this decision in coordination with the tex-hyphen project (which is our "upstream" for a lot of languages). (I will be keeping an eye on the discussion with Mojca and Arthur on the tex-hyphen mailing list; it would be great to see a consensus emerge there as to what should be supported.)

[1] https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/X82QoeFySOub8ANjDs1bAA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[2] http://rechnik.chitanka.info/w/%D1%80%D0%B0%D0%B6%D0%B4%D0%B0%D0%BC
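To make the kind of corpus comparison described above concrete, here is a minimal sketch of TeX-style (Liang) pattern matching applied to two pattern sets. The pattern lists are hypothetical toy examples chosen only to reproduce the two words mentioned in this comment; they are not taken from either the old or the new dictionary, and the function names are illustrative.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'ж1д' into its letters and per-gap weights."""
    letters = ""
    weights = [0]  # weights[k] is the gap before letters[k]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters += ch
            weights.append(0)
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Insert '-' at gaps whose highest matching pattern weight is odd."""
    padded = "." + word.lower() + "."
    scores = [0] * (len(padded) + 1)  # scores[i] is the gap before padded[i]
    for pattern in patterns:
        letters, weights = parse_pattern(pattern)
        start = padded.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                scores[start + offset] = max(scores[start + offset], weight)
            start = padded.find(letters, start + 1)
    pieces, last = [], 0
    for pos in range(left_min, len(word) - right_min + 1):
        if scores[pos + 1] % 2 == 1:  # odd weight allows a break
            pieces.append(word[last:pos])
            last = pos
    pieces.append(word[last:])
    return "-".join(pieces)


def changed_words(words, old_patterns, new_patterns):
    """Return (word, old, new) for every word whose hyphenation differs."""
    result = []
    for word in words:
        old, new = hyphenate(word, old_patterns), hyphenate(word, new_patterns)
        if old != new:
            result.append((word, old, new))
    return result


# Hypothetical toy pattern sets, NOT the real dictionaries.
OLD_LIKE = ["ж1д", "в1н"]
NEW_LIKE = ["ж1д", "ж2дат", "в1н", "в2ни."]

for row in changed_words(["раждат", "равни"], OLD_LIKE, NEW_LIKE):
    print(row)
```

Running the same comparison over a large Bulgarian word list with the real pattern files would give a list of changed break points that native readers could review.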
That's pretty awesome feedback. Probably not the best way to get it but still very nice to have.
Since the official hyphenation rules have been updated on tex-hyphen, can we continue?
OK, I think we could try switching to the new Bulgarian patterns. Could you prepare a fresh patch based on the most up-to-date rules? Then we'll need to check what adjustments have to be made to the existing testcase, as some words will no doubt change their breaks.
Attachment #8911748 - Attachment is obsolete: false
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/fbcd6b84b254 Update Bulgarian hyphenation dictionary. r=dholbert
Assignee: nobody → stoyan.moz