Closed Bug 1105644 Opened 11 years ago Closed 10 years ago

Incorrect German hyphenation w/ CSS -moz-hyphens (is there a more up-to-date hyphenation dictionary we could use?)

Tracking

Status: VERIFIED FIXED

Milestone: mozilla38

Tracking Flags:

firefox38: tracking ---, status fixed

People

(Reporter: bugzilla, Assigned: jfkthame)

Details

Attachments

(4 files)

XSL-FO + rendered PDF: No bug in Fop using TeX patterns (10 years ago, Manuel Strehl, 4.49 KB, application/x-zip)
Lowercase words before passing them to libhyphen, so as to match patterns fully (10 years ago, Jonathan Kew [:jfkthame], 2.82 KB, patch, smontagu: review+)
Update reftests for improved handling of capitalized words (10 years ago, Jonathan Kew [:jfkthame], 2.53 KB, patch, smontagu: review+)
Hyphenation test page (German words) (10 years ago, Gerhard Großmann, 340 bytes, text/html)

Manuel Strehl

Reporter

Description

•

11 years ago

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0

Build ID: 20141113143407

Steps to reproduce:

The German word "Grundausstattung" is wrongly hyphenated "Gr-undausstattung" when using -moz-hyphens:auto. The part "Grund" is not to be hyphenated at all. I tested in FF33 (regular) and FF35.0a2 (dev edition).

Actual results:

Minimal example to reproduce:

data:text/html,<p style="width:40px;-moz-hyphens:auto;width:40px" lang="de">Grundausstattung</p>

Expected results:

The correct hyphenation is "Grund-aus-stat-tung". That should be shown.

:Gijs (he/him)

Comment 1

•

11 years ago

Archaeopteryx, I don't suppose you know if we can update the hyphenation dictionary in use? (see also e.g. bug 966818)

Component: Untriaged → Internationalization

Flags: needinfo?(archaeopteryx)

Product: Firefox → Core

Summary: Incorrect German hyphenation w/ CSS -moz-hyphens → Incorrect German hyphenation w/ CSS -moz-hyphens (is there a more up-to-date hyphenation dictionary we could use?)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 2

•

11 years ago

KaiRo maintains the most widely used German dictionaries; he is likely to have more information on that.

Flags: needinfo?(archaeopteryx) → needinfo?(kairo)

Robert Kaiser

Comment 3

•

11 years ago

I only maintain the AMO packaging for spelling dictionaries; the hyphenation stuff is independent of that and ships in the product.

Flags: needinfo?(kairo)

Manuel Strehl

Reporter

Comment 4

•

10 years ago

So, any news here? I would be happy if the one in charge would speak up with an opinion. Unfortunately, one of our customers, a linguist as a matter of fact, spotted so many hyphenation errors that we had to disable -moz-hyphens on his website. Of course, this is an edge case, but I mention it to demonstrate that the bug affects more than just a niche of German.

:Gijs (he/him)

Comment 5

•

10 years ago

(In reply to Manuel Strehl from comment #4)
> So, any news here? I would be happy, if the one in charge would speak up
> with an opinion.

There's not really a "one in charge" at Mozilla. In any case, my knowledge of German is sufficient to agree that this is a bug (and I imagine the other people commenting so far, being native speakers, would agree too); the question is just how to fix it... software like Firefox uses hyphenation dictionaries to know where to hyphenate text in different languages, and so we basically need a better one than what is in use now. I don't know if such a better dictionary exists, nor where to find one, nor how many of the issues the linguist identified would be fixed by such a new dictionary.

Archaeopteryx, or Jonathan, ideas?

Flags: needinfo?(jfkthame)

Flags: needinfo?(archaeopteryx)

Jonathan Kew [:jfkthame]

Assignee

Comment 6

•

10 years ago

This looks to me like it's probably a Gecko bug (rather than a shortcoming of the German hyphenation patterns we're using). Note that the bad hyphenation does *not* occur if you lowercase the word:

data:text/html;charset=utf-8,<div lang="de" style="-moz-hyphens:auto;width:1em">grundausstattung

gives the expected "grund-aus-stat-tung", whereas the capitalized version gives "Gr-und-aus-stat-tung". AFAIK, the result here should not have been dependent on the capitalization. So I suspect this is a code bug, perhaps in the dom/layout line-breaking code or in the integration of the hyphenation library.
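One way capitalization could change the outcome, sketched under assumptions: hyphenation libraries of this kind match Liang/TeX-style patterns, which are normally written in lowercase, so a pattern anchored at the start of a word may fail to match a capitalized word. The toy patterns below are hypothetical and are not taken from the German pattern file; they are chosen only to reproduce the two outputs quoted in this comment. The sketch also leaves out the minimum prefix and suffix lengths that real implementations enforce.

```python
import re

# Hypothetical toy patterns in Liang/TeX notation (NOT the real German set).
# Odd digits permit a break at that position, even digits forbid it.
TOY_PATTERNS = ["r1u", ".gr2u", "d1a", "s1s", "t1t"]


def parse_pattern(pattern):
    """Split a pattern such as '.gr2u' into its letters and per-gap weights."""
    letters = re.sub(r"\d", "", pattern)
    weights = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            weights[pos] = int(ch)  # weight of the gap before letters[pos]
        else:
            pos += 1
    return letters, weights


def hyphenate(word, patterns=TOY_PATTERNS, lowercase=True):
    """Return word with '-' at every gap whose highest pattern weight is odd.

    Assumes lowercasing does not change the length of the word.
    """
    key = "." + (word.lower() if lowercase else word) + "."
    scores = [0] * (len(key) + 1)  # scores[i] is the gap before key[i]
    for letters, weights in map(parse_pattern, patterns):
        start = key.find(letters)
        while start != -1:
            for offset, weight in enumerate(weights):
                scores[start + offset] = max(scores[start + offset], weight)
            start = key.find(letters, start + 1)
    pieces, last = [], 0
    for j in range(1, len(word)):
        if scores[j + 1] % 2 == 1:  # gap before word[j]
            pieces.append(word[last:j])
            last = j
    pieces.append(word[last:])
    return "-".join(pieces)


print(hyphenate("Grundausstattung", lowercase=False))  # Gr-und-aus-stat-tung
print(hyphenate("Grundausstattung", lowercase=True))   # Grund-aus-stat-tung
```

With matching done on the raw word, the inhibiting pattern ".gr2u" never sees the capital "G", so the generic "r1u" break wins; matching on a lowercased copy restores the intended result while the breaks are still applied to the original spelling.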

Flags: needinfo?(jfkthame)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

10 years ago

Flags: needinfo?(archaeopteryx)

Manuel Strehl

Reporter

Comment 7

•

10 years ago

Attached file XSL-FO + rendered PDF: No bug in Fop using TeX patterns

Hm, as far as I know, there are no major problems with the TeX ngerman patterns when used in LaTeX. (Correct me if I'm wrong, but those are the ones used, aren't they?) I also use them in an extension to Apache FOP for hyphenation in XSL-FO. Could the conversion step or the way the patterns are applied in the code introduce inaccuracies? I attach a minimal XSL-FO test case, complete with rendered PDF. It shows that Apache FOP + TeX patterns does not exhibit this wrong hyphenation behaviour. (To reproduce: unzip, then run `fop -fo test.fo -pdf test.pdf`. Make sure that the "offo" extension for FOP is installed.)

Manuel Strehl

Reporter

Comment 8

•

10 years ago

Comment on attachment 8564929: XSL-FO + rendered PDF: No bug in Fop using TeX patterns

(The test case wrongly had these attributes set: hyphenation-push-character-count="4" hyphenation-remain-character-count="4". Removing them doesn't change the result, though.)

Jonathan Kew [:jfkthame]

Assignee

Comment 9

•

10 years ago

Attached patch Lowercase words before passing them to libhyphen, so as to match patterns fully

The issue here is that we need to explicitly lowercase the characters in each word before passing to libhyphen for pattern matching; it does not do that internally. We're already having to do a UTF16-to-UTF8 copy, so we can apply lowercasing at the same time.
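The idea of lowercasing during the UTF-16 to UTF-8 copy can be sketched roughly as follows. This is an illustration in Python, not the actual C++ patch: the function names are hypothetical, and substituting U+FFFD for unpaired surrogates and skipping lowercase mappings that expand to several characters are assumptions made here so that each input character yields exactly one output character.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def lowercase_to_utf8(units):
    """Convert UTF-16 code units to lowercased UTF-8 in a single pass."""
    out = bytearray()
    i = 0
    while i < len(units):
        unit = units[i]
        i += 1
        if 0xD800 <= unit <= 0xDBFF and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            # Combine a surrogate pair into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        elif 0xD800 <= unit <= 0xDFFF:
            cp = 0xFFFD  # unpaired surrogate: substitute U+FFFD (assumption)
        else:
            cp = unit
        ch = chr(cp)
        lowered = ch.lower()
        # Keep one output character per input character so that positions in
        # the lowercased copy still line up with the original word.
        out += (lowered if len(lowered) == 1 else ch).encode("utf-8")
    return bytes(out)


print(lowercase_to_utf8(utf16_units("Grundausstattung")))  # b'grundausstattung'
```

Doing the case mapping inside the conversion loop avoids a second pass over the word and a second temporary buffer, which is the saving the comment points at.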

Attachment #8564959 - Flags: review?(smontagu)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

10 years ago

Assignee: nobody → jfkthame

Status: UNCONFIRMED → ASSIGNED

Ever confirmed: true

Jonathan Kew [:jfkthame]

Assignee

Comment 10

•

10 years ago

Attached patch Update reftests for improved handling of capitalized words

This affects a couple of our existing reftests, but I believe the changed behavior is an improvement in these cases, so just fixing the reference files to match. These cases just aren't as glaring as the German example, so we didn't pay sufficient attention to them.

Attachment #8564960 - Flags: review?(smontagu)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

10 years ago

OS: Linux → All

Hardware: x86_64 → All

Version: 35 Branch → Trunk

Simon Montagu :smontagu

Updated

•

10 years ago

Attachment #8564960 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Comment 11

•

10 years ago

Comment on attachment 8564959: Lowercase words before passing them to libhyphen, so as to match patterns fully

Review of attachment 8564959:

::: intl/hyphenation/nsHyphenator.cpp
@@ +99,5 @@
> + }
> +
> + // XXX What about language-specific casing? Consider Turkish I/i...
> + // In practice, it looks like the current patterns will not be
> + // affected by this, as they treat dotted and undotted i similarly.

I was going to say "what about German SS/ß? That can impact hyphenation" -- but on second thoughts if we're lowercasing and not uppercasing, ẞ will just become ß and SS will become ss, so there shouldn't be a problem, right?

Attachment #8564959 - Flags: review?(smontagu) → review+

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 12

•

10 years ago

(In reply to Simon Montagu :smontagu from comment #11)
> I was going to say "what about German SS/ß? That can impact hyphenation" --
> but on second thoughts if we're lowercasing and not uppercasing, ẞ will just
> become ß and SS will become ss, so there shouldn't be a problem, right?

If ß got uppercased to SS before, that would be an issue for hyphenation, e.g.

Maße > MASSE > Masse

Masse has a different meaning and a different hyphenation: Ma|ße vs. Mas|se

Jonathan Kew [:jfkthame]

Assignee

Comment 13

•

10 years ago

(In reply to Archaeopteryx [:aryx] from comment #12)
> (In reply to Simon Montagu :smontagu from comment #11)
> > I was going to say "what about German SS/ß? That can impact hyphenation" --
> > but on second thoughts if we're lowercasing and not uppercasing, ẞ will just
> > become ß and SS will become ss, so there shouldn't be a problem, right?

Right.

> If ß got uppercased to SS before, that would be an issue for hyphenation,
> e.g.
>
> Maße > MASSE > Masse
> Masse has a different meaning and a different hyphenation:
> Ma|ße vs. Mas|se

If the text contains "SS" for an uppercase ß, that's the spelling we'll hyphenate; if the author writes "MASSE" and wants it hyphenated as uppercased "maße", i.e. "MA-SSE", they'd have to provide a manual soft hyphen, I think. The browser can't tell which of those words "MASSE" was meant to be.

OTOH, if you're referring to a case where the author writes "maße", and then it is uppercased via text-transform:uppercase ... well, we don't seem to apply auto-hyphenation at all in that case, so the issue doesn't arise. (Though I'm not sure why that is -- perhaps another bug?)
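The casing asymmetry discussed in the last three comments can be checked with a short sketch. It relies on Python's built-in `str.lower()` and `str.upper()`, which apply Unicode's default case mappings; whether Gecko's own lowercasing routine behaves identically in every case is an assumption, and the helper name is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"


def survives_uppercasing(word):
    """True if uppercasing and then lowercasing gives back the lowercased word."""
    return word.upper().lower() == word.lower()


# Lowercasing never merges ß and ss, so it is safe before pattern matching.
print("ẞ".lower())   # ß
print("SS".lower())  # ss

# Uppercasing does merge them, so the original word cannot be recovered.
print("Maße".upper())                # MASSE
print("MASSE".lower())               # masse
print(survives_uppercasing("Maße"))  # False
print(survives_uppercasing("Masse")) # True

# Default lowercasing maps "I" to dotted "i", which is the Turkish concern
# raised in the review comment.
print("I".lower())  # i

# An author who means "Maße" in capitals can mark the break explicitly.
manual = "MA" + SOFT_HYPHEN + "SSE"
print(manual.replace(SOFT_HYPHEN, "-"))  # MA-SSE
```

Since only the lowercase direction is applied, the ambiguity exists solely in text that already arrives as "MASSE", which is why a manual soft hyphen is the suggested remedy there.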

Jonathan Kew [:jfkthame]

Assignee

Comment 14

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/e46f80935409 https://hg.mozilla.org/integration/mozilla-inbound/rev/a36c441817d8

Target Milestone: --- → mozilla38

Gerhard Großmann

Comment 15

•

10 years ago

My bug report on Launchpad could have the same cause → https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/1209176 There I compared the hyphenation between Firefox and LibreOffice. Firefox still doesn’t correctly hyphenate the test words in my example HTML page (which also contains lowercase words, see attachment).

Gerhard Großmann

Comment 16

•

10 years ago

Attached file Hyphenation test page (German words)

Gerhard Großmann

Comment 17

•

10 years ago

Comment on attachment 8565930: Hyphenation test page (German words)

><html lang="de">
><head><meta charset="utf-8">
><title>Hypenation Test</title>
><style type="text/css">
>p {-moz-hyphens:auto; hyphens:auto;
>width:3em; border-right: 1px solid red;
>font: 1em/1.32 FreeSerif,'Times New Roman',Times,serif;}
></style></head>
><body>
><p>mmmiii Türklinke Übungen wörtlich künftige öffentlich Überschriften überempfindlich</p>
></body>
></html>

Carsten Book [:Tomcat]

Comment 18

•

10 years ago

Sorry, had to back this out for test failures on ASan builds, like https://treeherder.mozilla.org/logviewer.html#?job_id=6720242&repo=mozilla-inbound

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Assignee

Comment 19

•

10 years ago

Yeah, that's a bug -- I didn't notice that the |begin| variable gets reused later in the method. I've fixed the patch locally, and pushed an asan try run to double-check before relanding.

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Assignee

Comment 20

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/133f5ec3de01 https://hg.mozilla.org/integration/mozilla-inbound/rev/7c3e53198e5b

Ryan VanderMeulen [:RyanVM]

Comment 21

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/133f5ec3de01 https://hg.mozilla.org/mozilla-central/rev/7c3e53198e5b

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

status-firefox38: --- → fixed

Flags: in-testsuite+

Resolution: --- → FIXED

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 22

•

10 years ago

Verified fixed with Nightly 38.0a1 20150219030204 on Windows 8.1

Status: RESOLVED → VERIFIED
