Closed Bug 1105644 Opened 11 years ago Closed 10 years ago

Incorrect German hyphenation w/ CSS -moz-hyphens (is there a more up-to-date hyphenation dictionary we could use?)

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla38
Tracking Status
firefox38 --- fixed

People

(Reporter: bugzilla, Assigned: jfkthame)

Details

Attachments

(4 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0
Build ID: 20141113143407

Steps to reproduce:

The German word "Grundausstattung" is wrongly hyphenated "Gr-undausstattung" when using -moz-hyphens:auto. The part "Grund" is not to be hyphenated at all. I tested in FF33 (regular) and FF35.0a2 (dev edition).

Actual results:

Minimal example to reproduce:
data:text/html,<p style="width:40px;-moz-hyphens:auto;width:40px" lang="de">Grundausstattung</p>

Expected results:

The correct hyphenation is "Grund-aus-stat-tung". That should be shown.
Archaeopteryx, I don't suppose you know if we can update the hyphenation dictionary in use? (see also e.g. bug 966818)
Component: Untriaged → Internationalization
Flags: needinfo?(archaeopteryx)
Product: Firefox → Core
Summary: Incorrect German hyphenation w/ CSS -moz-hyphens → Incorrect German hyphenation w/ CSS -moz-hyphens (is there a more up-to-date hyphenation dictionary we could use?)
KaiRo maintains the most used German dictionaries, he is likely to hold more information on that.
Flags: needinfo?(archaeopteryx) → needinfo?(kairo)
I only maintain the AMO packaging for spelling dictionaries; the hyphenation stuff is independent of that and is shipped in the product.
Flags: needinfo?(kairo)
So, any news here? I would be happy, if the one in charge would speak up with an opinion. Unfortunately, one of our customers, a linguist as a matter of fact, spotted so many hyphenation errors, that we needed to disable -moz-hyphens on his website. Of course, this is an edge case, but I mention it to demonstrate, that the bug happens in more than just a niche of German.
(In reply to Manuel Strehl from comment #4)
> So, any news here? I would be happy, if the one in charge would speak up
> with an opinion.

There's not really a "one in charge" at Mozilla. In any case, my knowledge of German is sufficient to agree that this is a bug (and I imagine the other people commenting so far, being native speakers, would agree too); the question is just how to fix it... software like Firefox uses hyphenation dictionaries to know where to hyphenate text in different languages, and so we basically need a better one than what is in use now. I don't know if such a better dictionary exists, nor where to find one, nor how many of the issues the linguist identified would be fixed by such a new dictionary.

Archaeopteryx, or Jonathan, ideas?
Flags: needinfo?(jfkthame)
Flags: needinfo?(archaeopteryx)
This looks to me like it's probably a Gecko bug (rather than a shortcoming of the German hyphenation patterns we're using). Note that the bad hyphenation does *not* occur if you lowercase the word:

data:text/html;charset=utf-8,<div lang="de" style="-moz-hyphens:auto;width:1em">grundausstattung

gives the expected "grund-aus-stat-tung", whereas the capitalized version gives "Gr-und-aus-stat-tung".

AFAIK, the result here should not have been dependent on the capitalization. So I suspect this is a code bug, perhaps in the dom/layout line-breaking code or in the integration of the hyphenation library.
Flags: needinfo?(jfkthame)
Hm, as far as I know, there are no major problems with the TeX ngerman patterns when used in LaTeX. (Correct me, if I'm wrong, but those are used, aren't they?) I also use them in an extension to Apache FOP for hyphenation in XSL-FO. Can it be, that the conversion step or the application in the code introduce inaccuracies? I attach a minimal XSL-FO testcase complete with rendered PDF. It shows, that Apache FOP + TeX patterns does not show this wrong hyphenation behaviour. (Reproducing: Unzip, then `fop -fo test.fo -pdf test.pdf`. Make sure, that the "offo" extension for Fop is installed.)
Comment on attachment 8564929 [details]
XSL-FO + rendered PDF: No bug in Fop using TeX patterns

(The test case wrongly had these attributes set: hyphenation-push-character-count="4" hyphenation-remain-character-count="4". Removing them doesn't change the result, though.)
The issue here is that we need to explicitly lowercase the characters in each word before passing to libhyphen for pattern matching; it does not do that internally. We're already having to do a UTF16-to-UTF8 copy, so we can apply lowercasing at the same time.
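
To illustrate why the missing lowercasing step matters, here is a minimal sketch of Liang-style pattern matching in Python. It is not the libhyphen code or the Gecko patch. The names `parse_pattern`, `hyphenate` and `TOY_PATTERNS` are hypothetical, and the five patterns are invented for this one word; they are not taken from the real German pattern file. The sketch also assumes that lowercasing does not change the length of the word.

```python
import re

def parse_pattern(pat):
    """Split a Liang pattern such as 's1st' into letters and inter-letter levels."""
    letters = re.sub(r"\d", "", pat)
    levels = [0] * (len(letters) + 1)
    i = 0
    for ch in pat:
        if ch.isdigit():
            levels[i] = int(ch)  # level of the gap before letters[i]
        else:
            i += 1
    return letters, levels

# Hypothetical toy patterns, all lowercase like real pattern files.
# ".gr2u" (even level) forbids the break that "r1u" (odd level) would allow.
TOY_PATTERNS = [parse_pattern(p) for p in ["r1u", ".gr2u", "d1a", "s1st", "t1tu"]]

def hyphenate(word, patterns=TOY_PATTERNS, lowercase=True, left_min=2, right_min=2):
    """Return word with '-' at every gap whose highest pattern level is odd."""
    # Assumption: lowercasing keeps the length, so gap indices still line up.
    key = word.lower() if lowercase else word
    text = "." + key + "."              # dots mark the word boundaries
    scores = [0] * (len(text) + 1)      # scores[p] = gap before text[p]
    for letters, levels in patterns:
        start = text.find(letters)
        while start != -1:
            for offset, level in enumerate(levels):
                scores[start + offset] = max(scores[start + offset], level)
            start = text.find(letters, start + 1)
    pieces, last = [], 0
    for i in range(left_min, len(word) - right_min + 1):
        # The gap before word[i] is scores[i + 1] because of the leading dot.
        if scores[i + 1] % 2 == 1:
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)

print(hyphenate("Grundausstattung", lowercase=False))  # Gr-und-aus-stat-tung
print(hyphenate("Grundausstattung", lowercase=True))   # Grund-aus-stat-tung
```

Without lowercasing, the word-initial pattern never matches the capital "G", so the inhibiting level is lost and the break after "Gr" survives. With lowercasing, the matching is done on the lowercase copy while the break positions are applied to the original spelling.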
Attachment #8564959 - Flags: review?(smontagu)
Assignee: nobody → jfkthame
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
This affects a couple of our existing reftests, but I believe the changed behavior is an improvement in these cases, so just fixing the reference files to match. These cases just aren't as glaring as the German example, so we didn't pay sufficient attention to them.
Attachment #8564960 - Flags: review?(smontagu)
OS: Linux → All
Hardware: x86_64 → All
Version: 35 Branch → Trunk
Attachment #8564960 - Flags: review?(smontagu) → review+
Comment on attachment 8564959 [details] [diff] [review]
Lowercase words before passing them to libhyphen, so as to match patterns fully

Review of attachment 8564959 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/hyphenation/nsHyphenator.cpp
@@ +99,5 @@
> +  }
> +
> +  // XXX What about language-specific casing? Consider Turkish I/i...
> +  // In practice, it looks like the current patterns will not be
> +  // affected by this, as they treat dotted and undotted i similarly.

I was going to say "what about German SS/ß? That can impact hyphenation" -- but on second thoughts if we're lowercasing and not uppercasing, ẞ will just become ß and SS will become ss, so there shouldn't be a problem, right?
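
The Turkish I/i concern in the patch comment can be sketched as follows. This uses Python's built-in, locale-independent case mapping purely for illustration; it says nothing about which lowercasing routine the patch calls. `turkish_lower` is a hypothetical helper, not an existing API.

```python
# Default Unicode lowercasing, with no language-specific tailoring.
default_lower_i = "I".lower()            # ASCII "i"
default_lower_dotted = "\u0130".lower()  # "i" followed by U+0307 COMBINING DOT ABOVE

def turkish_lower(text):
    """Hypothetical helper: apply the two Turkish exceptions, then default lowercasing."""
    # In Turkish, "I" lowercases to dotless "ı" and "İ" (U+0130) to plain "i".
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

print("IŞIK".lower())         # işik  (default mapping)
print(turkish_lower("IŞIK"))  # ışık  (Turkish mapping)
```

Whether the difference matters for hyphenation depends on the patterns; as the patch comment notes, the current ones appear to treat dotted and undotted i similarly.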
Attachment #8564959 - Flags: review?(smontagu) → review+
(In reply to Simon Montagu :smontagu from comment #11)
> I was going to say "what about German SS/ß? That can impact hyphenation" --
> but on second thoughts if we're lowercasing and not uppercasing, ẞ will just
> become ß and SS will become ss, so there shouldn't be a problem, right?

If ß got uppercased to SS before, that would be an issue for hyphenation, e.g.

Maße > MASSE > Masse
Masse has a different meaning and a different hyphenation:
Ma|ße vs. Mas|se
(In reply to Archaeopteryx [:aryx] from comment #12)
> (In reply to Simon Montagu :smontagu from comment #11)
> > I was going to say "what about German SS/ß? That can impact hyphenation" --
> > but on second thoughts if we're lowercasing and not uppercasing, ẞ will just
> > become ß and SS will become ss, so there shouldn't be a problem, right?

Right.

> If ß got uppercased to SS before, that would be an issue for hyphenation,
> e.g.
>
> Maße > MASSE > Masse
> Masse has a different meaning and a different hyphenation:
> Ma|ße vs. Mas|se

If the text contains "SS" for an uppercase ß, that's the spelling we'll hyphenate; if the author writes "MASSE" and wants it hyphenated as uppercased "maße", i.e. "MA-SSE", they'd have to provide a manual soft hyphen, I think. The browser can't tell which of those words "MASSE" was meant to be.

OTOH, if you're referring to a case where the author writes "maße", and then it is uppercased via text-transform:uppercase ... well, we don't seem to apply auto-hyphenation at all in that case, so the issue doesn't arise. (Though I'm not sure why that is -- perhaps another bug?)
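
The ß/SS point can be checked with a short sketch using Python's built-in Unicode case mapping (an illustration only, not the Gecko code path): uppercasing is lossy, lowercasing is not.

```python
word = "Maße"

upper = word.upper()              # full case mapping expands ß to SS
round_trip = upper.lower()        # lowercasing cannot bring the ß back
capital_eszett = "\u1e9e".lower() # capital ẞ lowercases to a single ß

print(upper)           # MASSE
print(round_trip)      # masse
print(word.lower())    # maße
print(capital_eszett)  # ß
```

So lowercasing "Maße" before pattern matching keeps the ß, and only text that already contains "SS" gets matched as "ss".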
My bug report on Launchpad could have the same cause → https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1209176

There I compared the hyphenation between Firefox and LibreOffice. Firefox still doesn’t correctly hyphenate the test words in my example HTML page (which also contains lowercase words, see attachment).
Comment on attachment 8565930 [details]
Hyphenation test page (German words)

><html lang="de">
><head><meta charset="utf-8">
><title>Hypenation Test</title>
><style type="text/css">
>p {-moz-hyphens:auto; hyphens:auto;
>width:3em; border-right: 1px solid red;
>font: 1em/1.32 FreeSerif,'Times New Roman',Times,serif;}
></style></head>
><body>
><p>mmmiii Türklinke Übungen wörtlich künftige öffentlich Überschriften überempfindlich</p>
></body>
></html>
sorry had to back this out for test failures like https://treeherder.mozilla.org/logviewer.html#?job_id=6720242&repo=mozilla-inbound on asan builds
Flags: needinfo?(jfkthame)
Yeah, that's a bug -- I didn't notice that the |begin| variable gets reused later in the method. I've fixed the patch locally, and pushed an asan try run to double-check before relanding.
Flags: needinfo?(jfkthame)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Verified fixed with Nightly 38.0a1 20150219030204 on Windows 8.1
Status: RESOLVED → VERIFIED