Closed Bug 1401575 Opened 2 years ago Closed 4 months ago

Update Bulgarian hyphenation dictionary

Categories

(Core :: Layout: Text and Fonts, defect, P3)

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: stoyan, Assigned: stoyan)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

Attached file hyph_bg_BG.dic (obsolete)
It was identified that the current dictionary is inadequate so I propose an update.
Attached file The corresponding license (obsolete)
The corresponding license.
Are the files usable by themselves or should I prepare a patch? If the files are OK should I convert them to UTF-8? Should I rename the dictionary file to match the current one?
Blocks: 656750
(In reply to Stoyan Dimitrov [:stoyan] from comment #2)
> Are the files usable by themselves or should I prepare a patch? If the files
> are OK should I convert them to UTF-8? Should I rename the dictionary file
> to match the current one?

Yes, converting to UTF-8 would be helpful as a first step.
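
For reference, the conversion could look roughly like the sketch below. It assumes the attached file is encoded as cp1251 and that its first line names the character set, as is usual for libhyphen-style .dic files; neither is confirmed in this bug. The function name and parameters are illustrative only.

```python
from pathlib import Path


def convert_dic_to_utf8(src, dst, source_encoding="cp1251", has_charset_header=True):
    """Re-encode a hyphenation dictionary file as UTF-8.

    Assumptions (not confirmed for the attached file): the source uses
    `source_encoding`, and the first line declares the character set.
    """
    lines = Path(src).read_text(encoding=source_encoding).splitlines()
    if has_charset_header and lines:
        # Keep the declared charset in step with the new encoding.
        lines[0] = "UTF-8"
    with open(dst, "w", encoding="utf-8", newline="\n") as out:
        out.write("\n".join(lines) + "\n")
```

Pass a different source_encoding if the file turns out to use another legacy code page.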

Attaching an actual patch (https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch) that replaces the current file with the new one would make it easier to review and test the change. We'll want it to be named consistently with the existing practice.

Is there any information regarding the expected changes in behavior compared to the old dictionary?

You might also want to submit an update to the tex-hyphen project (see http://tug.org/tex-hyphen/), where an extensive collection of patterns is maintained. It's helpful for projects like Mozilla to be able to rely on a single "upstream" source for these resources.
Priority: -- → P3
Attached patch Prepared the patch. (obsolete)
Apart from the fact that it is more than three times longer than the original, I cannot give you more technical details about this dictionary. As a side note, this is the hyphenation dictionary used in OpenOffice/LibreOffice.

I've sent a request to tug.org to add this dictionary to the list.
Attachment #8910308 - Attachment is obsolete: true
Attachment #8910309 - Attachment is obsolete: true
Attachment #8911748 - Flags: review?(jfkthame)
The information I have gathered in the meantime indicates that the dictionary implements the official hyphenation algorithm published by the Institute for Bulgarian Language in the Official Spelling Dictionary, which is the normative reference for Bulgarian spelling.
This makes a pretty major change to the behavior of Bulgarian content with "hyphens:auto"; compare the "test" and "reference" renderings of the testcase at [1] for a very brief example.

The new patterns are much less "liberal" in allowing break points, but I'm not able to judge whether this is an overall improvement or not. For example, the old patterns allow "раждат" and "равни" to be hyphenated as "раж-дат" and "рав-ни", while the new ones don't; I have no idea whether those are examples of good or bad hyphenations. (Though judging by [2], at least, it looks like the hyphenation of "раж-дат", as done by the old patterns, would be correct.)
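
To illustrate how one pattern set can permit a break that another suppresses, here is a minimal sketch of Liang-style pattern matching, the general technique behind TeX and libhyphen patterns. The patterns in the usage note are invented for illustration and are not taken from either the old or the new Bulgarian dictionary. The function names and the left_min/right_min defaults are likewise assumptions, and libhyphen extensions such as non-standard hyphenation are ignored.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'ab1c' into its letters and per-gap weights."""
    letters = "".join(ch for ch in pattern if not ch.isdigit())
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight for the gap before letter `pos`
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return `word` with '-' at every gap whose highest weight is odd."""
    padded = "." + word.lower() + "."  # '.' marks the word boundaries
    table = dict(parse_pattern(p) for p in patterns)
    scores = [0] * (len(padded) + 1)
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    scores[start + offset] = max(scores[start + offset], weight)
    pieces, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        # The gap before word[k] is the gap before padded[k + 1].
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the made-up pattern list ["ж1д"], hyphenate("раждат", ...) gives "раж-дат". Adding the made-up inhibiting pattern "аж2д" raises the weight at that gap to an even value, so the word comes back unbroken. A more conservative pattern set ends up with fewer break points in this way.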

I'd like to see some more extensive feedback on the changed behavior -- e.g. an examination by Bulgarian readers of the hyphenations found in a reasonably large corpus of words -- before we decide on a change here. Ideally, we'll make this decision in coordination with the tex-hyphen project (which is our "upstream" for a lot of languages).
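
A review of that kind could start from a list of the words whose breaks differ between the two dictionaries. The sketch below is hypothetical: diff_hyphenations and its two callables are stand-ins for whatever hyphenation engine is actually used to load the old and new patterns.

```python
def diff_hyphenations(words, old_hyphenate, new_hyphenate):
    """List (word, old_result, new_result) for words whose breaks changed.

    `old_hyphenate` and `new_hyphenate` are hypothetical callables that map
    a word to its hyphenated form, one per dictionary being compared.
    """
    changed = []
    for word in sorted(set(words)):
        old, new = old_hyphenate(word), new_hyphenate(word)
        if old != new:
            changed.append((word, old, new))
    return changed
```

Bulgarian readers could then judge only the rows in the resulting list rather than the whole corpus.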

(I will be keeping an eye on the discussion with Mojca and Arthur on the tex-hyphen mailing list; it would be great to see a consensus emerge there as to what should be supported.)


[1] https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/X82QoeFySOub8ANjDs1bAA/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[2] http://rechnik.chitanka.info/w/%D1%80%D0%B0%D0%B6%D0%B4%D0%B0%D0%BC
That's pretty awesome feedback. Probably not the best way to get it but still very nice to have.
Since the official hyphenation rules have been updated on tex-hyphen, can we continue?
Flags: needinfo?(jfkthame)
OK, I think we could try switching to the new Bulgarian patterns. Could you prepare a fresh patch based on the most up-to-date rules? Then we'll need to check what adjustments have to be made to the existing testcase, as some words will no doubt change their breaks.
Flags: needinfo?(jfkthame)

Updated to the latest patterns

Attachment #8911748 - Attachment is obsolete: true
Attachment #8911748 - Flags: review?(jfkthame)
Attachment #8911748 - Attachment is obsolete: false
Attachment #8911748 - Attachment is obsolete: true

I guess I should request review, but I can't seem to find a way to do so.

Flags: needinfo?(jfkthame)

Huh, I'm not sure if Bugzilla allows you to set an r? request on an attached patch any more; the new workflow is all through Phabricator.

Don't worry about it, though; I'll consider the needinfo here as a review request and take it from there. :) Leaving the ni? flag in place for now to remind myself.

I've pushed a try run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=92be92027992e375d142dfa480a500635da22ee7 to see how things look. I expect there'll be a failure on the relevant reftest, so we'll need to update the expected result there.

Nice - the reftest doesn't change after all. (It's only a very minimal testcase, admittedly.) So I think we can go ahead with this. I'll move the patch over to phabricator so we can push it through autoland. Thanks, Stoyan!

Flags: needinfo?(jfkthame)
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fbcd6b84b254
Update Bulgarian hyphenation dictionary. r=dholbert
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71
Assignee: nobody → stoyan.moz