Open Bug 630181 (opened 13 years ago, updated 22 days ago)

Implementing the Knuth and Plass line breaking algorithm

Categories: Core :: Layout: Text and Fonts (enhancement)
Hardware / OS: x86 / All
People: Reporter: Jeremie; Unassigned
References: Blocks 1 open bug
In order to get cleaner text rendering, it could be a good idea to implement the Knuth and Plass line breaking algorithm.

Bram Stein has made a first attempt in JavaScript: http://www.bramstein.com/projects/typeset/

It's amazing to see how clean and simple its code is.

Unfortunately, I do not have the skills to port it to Gecko, but it does not seem impossible to do.
Quoting from my comments at http://news.ycombinator.com/item?id=1974963 :

  This algorithm, unless I'm missing something, doesn't handle situations in
  which the available widths are different for different lines in the
  paragraph, and in particular in which the available width for a line depends
  on the precise break positions and vertical alignment results of all the
  earlier lines in the paragraph. Handling this is required to correctly
  handle CSS floats.

Not to mention that this algorithm is quadratic in paragraph length, which is a huge issue for web browsers, which sometimes have to deal with giant (think tens of megabytes) paragraphs.

Of course if you control the content and know that it has no floats and is reasonable length you can use this algorithm (which is why the JS implementation actually makes some sense).  But we don't control the content.

I'm also pretty sure this has been extensively discussed before, since all the people working on the layout engine for the last 10 years are familiar with TeX, and the limitations of this algorithm in the context of the web.  I'm not sure whether there's a bug that discussion is hanging off of.
(In reply to comment #1)
> Quoting from my comments at http://news.ycombinator.com/item?id=1974963 :
> 
>   This algorithm, unless I'm missing something, doesn't handle situations in
>   which the available widths are different for different lines in the
>   paragraph, and in particular in which the available width for a line depends
>   on the precise break positions and vertical alignment results of all the
>   earlier lines in the paragraph. Handling this is required to correctly
>   handle CSS floats.

I'm not sure I fully understand what that means. In his implementation attempt, Bram shows off some cases where line widths are different, but I'm not sure whether that's related to the algorithm or to his own implementation.

> Not to mention that this algorithm is quadratic in paragraph length, which is a
> huge issue for web browsers, which sometimes have to deal with giant (think
> tens of megabytes) paragraphs.

OK, I'm not good with algorithms (I have a design background rather than a computer science one), but I can understand that. What if the algorithm is applied to text chunks rather than to the whole text? Would that make it easier to control performance and memory usage? Well, maybe this implies a brand new algorithm :-/

>  Of course if you control the content and know that it has no floats and is
> reasonable length you can use this algorithm (which is why the JS
> implementation actually makes some sense).  But we don't control the content.

Indeed, and authors don't control it as well as they should.

> I'm also pretty sure this has been extensively discussed before, since all the
> people working on the layout engine for the last 10 years are familiar with
> TeX, and the limitations of this algorithm in the context of the web.  I'm not
> sure whether there's a bug that discussion is hanging off of.

I performed a search before posting this bug and was not able to find anything on Bugzilla. Now it's done ;) However, I would be interested if someone has further information.
> Bram shows off some cases where line widths are different.

Ah, indeed.   So I must have been missing something.  Good.

> What if the algorithm is applied to text chunks rather than to the whole text?

Then you get weirdness at chunk boundaries.  But yes, that would prevent it from getting too out of hand, at least.

Another interesting question is how or whether this algorithm would handle things like CSS3 Text (which has things like "don't allow breaks here unless there are no breaks elsewhere, in which case you can break here" currently proposed).
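
For what it's worth, in the Knuth/Plass model break desirability is expressed with "penalty" items that sit between the boxes and glue, so a merely discouraged break can at least be approximated with a large finite penalty; a strict "only if there are no breaks elsewhere" rule is closer to TeX's multi-pass behaviour and doesn't map onto a single penalty value. A purely illustrative sketch of that item model (field names made up here, not Gecko's or TeX's internals), just to make the terminology concrete:

from dataclasses import dataclass

@dataclass
class Box:          # unbreakable content: a word, an inline image, ...
    width: float

@dataclass
class Glue:         # stretchable/shrinkable space; a legal break point
    width: float
    stretch: float
    shrink: float

@dataclass
class Penalty:      # an explicit break opportunity with an associated cost
    width: float    # extra width if the break is taken (e.g. a hyphen)
    cost: float     # 0 = neutral, large = discouraged, minus infinity = forced
    flagged: bool   # used to discourage consecutive hyphenated lines

# A "break here only reluctantly" opportunity might be approximated as:
reluctant_break = Penalty(width=0.0, cost=5000.0, flagged=False)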
(In reply to comment #3)
> > Bram shows off some cases where line widths are different.
> 
> Ah, indeed.   So I must have been missing something.  Good.

Lines of different widths are not a problem in themselves. They become a problem when those widths are not known in advance, as when (for example) a float takes a "chunk" out of the side of the paragraph for text to wrap around it. The length required for any given line may depend on its exact vertical position; but that in turn might depend on which breaks end up getting chosen on earlier lines - and that choice may not be determined until _later_ in the paragraph, if there are several "active" possibilities under consideration.

This can be difficult even with a fixed-position float, if line heights vary (due to font changes, inline images, or all sorts of other factors); it gets worse if the float itself is anchored to a position within the text of the paragraph, and so the position of the float is not known in advance of line-breaking the text that contains it.

(This is really just expanding on your original comment about:
> and in particular in which the available width for a line depends
>  on the precise break positions and vertical alignment results of all the
>  earlier lines in the paragraph.)

I do think there might be potential to make use of a Knuth/Plass-style line-breaking algorithm, but the complexities of CSS and HTML layout make it very hard to do in a completely general way. A possibility might be to use an algorithm like this for "plain" areas of text, falling back to a simpler line-by-line approach when factors such as encroaching floats come into play.
(In reply to comment #3) 
> > What if the algorithm is applied to text chunks rather than to the whole text?
> 
> Then you get weirdness at chunk boundaries.  But yes, that would prevent it
> from getting too out of hand, at least.

That's what I expected, but such weirdness is not necessarily a problem. In a lot of cases, a single paragraph does not contain much text, and for all those cases this algorithm could improve the quality of the rendering. But I agree that some edge cases with huge amounts of text could hit the limits of the browser or the system. IMO, if the size limit for the chunk is chosen wisely, it should be possible to cover almost 99% of the common use cases without side effects on performance. But once again, I'm not very good at low-level computer stuff, and maybe that's too naive an approach.

> Another interesting question is how or whether this algorithm would handle
> things like CSS3 Text (which has things like "don't allow breaks here unless
> there are no breaks elsewhere, in which case you can break here" currently
> proposed).

This is definitely an issue, but I haven't dug into the CSS3 Text module yet.
With such requirements in the spec, though, it might be wise for the CSS WG to provide a full algorithm for text line breaking.

(In reply to comment #4)
> (In reply to comment #3)
> > > Bram shows off some cases where line widths are different.
> > 
> > Ah, indeed.   So I must have been missing something.  Good.
> 
> Lines of different widths are not a problem in themselves. They become a
> problem when those widths are not known in advance, as when (for example) a
> float takes a "chunk" out of the side of the paragraph for text to wrap around
> it. The length required for any given line may depend on its exact vertical
> position; but that in turn might depend on which breaks end up getting chosen
> on earlier lines - and that choice may not be determined until _later_ in the
> paragraph, if there are several "active" possibilities under consideration.
> 
> This can be difficult even with a fixed-position float, if line heights vary
> (due to font changes, inline images, or all sorts of other factors); it gets
> worse if the float itself is anchored to a position within the text of the
> paragraph, and so the position of the float is not known in advance of
> line-breaking the text that contains it.

OK, if I understand correctly, the problem is that even if it's possible to know the size of a floating element early, it's almost impossible to predict its final position and therefore impossible to anticipate the width of each line.

> I do think there might be potential to make use of a Knuth/Plass-style
> line-breaking algorithm, but the complexities of CSS and HTML layout make it
> very hard to do in a completely general way. A possibility might be to use an
> algorithm like this for "plain" areas of text, falling back to a simpler
> line-by-line approach when factors such as encroaching floats come into play.

If I understand you correctly, using the Knuth/Plass algorithm as-is is not possible for complex HTML layouts. This makes sense, because HTML layouts are "alive" and can change a lot due to user interaction, scripts or whatever else. And of course the original Knuth/Plass algorithm is designed for static text that is rendered once in a while, with only minimal care needed about the resources required to perform the rendering. Too many people (and I'm one of them) don't really understand that requirement, and don't understand that a browser cannot be compared to software such as InDesign, or even just MS Word or OpenOffice... but frankly, that's what they expect.

At this point I have to say that the original purpose of this feature request is less about implementing this specific algorithm than about improving text rendering. I regularly see a lot of complaints from authors about the way line breaking is done by browsers (as an example, see: http://astheria.com/design/choosing-type-alignments-fortheweb). The Knuth/Plass algorithm is one opportunity to improve that, but not the only one. This is especially true when authors wish to use the CSS property text-align: justify, which needs a lot of care.

Your idea to use that algorithm where possible is really appealing, but I'm not sure it's reasonable to have some kind of dual algorithm for so few cases (once again, I'm not necessarily aware of what really happens under the hood). Maybe there are some other quick wins to imagine (such as dealing with kerning in addition to word spacing... even with text-align: left or right).

Thanks anyway for those enlightening answers :)
There appears to be a relevant patent from Adobe: http://www.freepatentsonline.com/6510441.html

To quote from the mailing list discussion where I found it: "Adobe's algorithm differs from the Knuth-Plass algorithm in that the former looks at the whole paragraph while the latter merely looks at portions of the paragraph (these "portions" being defined in the patent text). Adobe's argument in their patent is that an algorithm that looks at the whole paragraph requires much more memory space and processing time."
(In reply to Mathnerd314 from comment #6)
> There appears to be a relevant patent from Adobe:
> http://www.freepatentsonline.com/6510441.html
> To quote from the mailing list discussion where I found it: "Adobe's
> algorithm differs from the Knuth-Plass algorithm in that the
> former looks at the whole paragraph while the latter merely looks at
> portions of the paragraph

I suspect the author has mixed up "former" and "latter" here, otherwise this seems to make no sense.

> (these "portions" being defined in the patent
> text). Adobe's argument in their patent is that an algorithm that looks
> at the whole paragraph requires much more memory space and processing
> time."

While I haven't examined the claims in Adobe's patent (and in any case wouldn't be qualified to comment on their merits), I doubt it'd be an obstacle to anything we might try to implement in this area, short of an attempt, perhaps, to directly "clone" their behavior by treating the details of their patent as a design spec. I don't see how they could claim any rights over what's in TeX, or the ideas and algorithms described in the classic Knuth & Plass paper (1981) - which, btw, even refers to a suggestion in a paper by Cooper (1967) that long paragraphs could be handled in overlapping portions to avoid excessive memory requirements.

Prior art, I believe it's called. (But I am not a patent lawyer, so you should probably assume I have no idea what I'm talking about.)
The algorithm can be implemented to run in O(n) time[2,3]. In fact, even the dynamic programming solution will run in O(min(w * n, n^2)) time[5], where w is the maximum number of words on a line. Since w is fixed, the algorithm is linear for large n.

For a true linear time algorithm you have to make some assumptions on the cost function (namely, concavity). This is usually not a problem, but I think that it may be incompatible with varying line widths. There is yet another way to implement the algorithm which runs in O(n log n) time[1], but apparently with smaller constants than the O(n) algorithm[2].

Now, while I don't know the complexities of the CSS float model, I would really appreciate having some way to render aesthetically pleasing text in a browser. It does not have to support all the bells and whistles of normal HTML text; even a simple text field would be nice.

Paragraph formatting is pretty much a solved problem in academia and I'm posting this to point you or potential implementors at the relevant literature. However, in my opinion, implementing the dynamic programming algorithm with max line cutoff is both easy and fast enough.
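
To make "the dynamic programming algorithm with max line cutoff" concrete, here is a rough sketch (illustrative only: made-up names, characters as the width unit, squared leftover space as the cost; real code would use actual glyph metrics). Bounding the inner loop by the number of words that can fit on a line is what gives the O(w * n) behaviour:

def badness(words, i, j, width):
    """Cost of setting words[i:j] as one line of the given width."""
    line_len = sum(len(w) for w in words[i:j]) + (j - i - 1)  # words + spaces
    if line_len > width:
        return float("inf")
    return (width - line_len) ** 2

def break_lines(words, width, max_words_per_line=30):
    """Optimal breaks via dynamic programming; O(n * w) with the cutoff.
    Assumes every word fits on a line by itself."""
    n = len(words)
    best = [float("inf")] * (n + 1)   # best[j] = min cost to set words[:j]
    prev = [0] * (n + 1)              # prev[j] = start of the line ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_words_per_line), j):
            cost = badness(words, i, j, width)
            if j == n and cost != float("inf"):
                cost = 0.0            # don't penalise a short last line
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                prev[j] = i
    lines, j = [], n                  # recover the chosen lines
    while j > 0:
        lines.append(words[prev[j]:j])
        j = prev[j]
    return list(reversed(lines))

Note that the width is a single constant here, which is exactly the simplification (no floats, no per-line width changes) discussed earlier in this bug.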

References:

The first line of research is into the "least weight subsequence problem", of which Knuth-Plass paragraph formation is a special case. All papers assume that the cost function is concave.

[1] D. S. Hirschberg and L. L. Larmore. 1987. The Least Weight Subsequence Problem.
(This contains both an O(n log n) time algorithm and an O(n) algorithm with additional assumptions.)

[2] Robert Wilber. 1988. The concave least-weight subsequence problem revisited.
(The first O(n) algorithm which works with any concave cost function.)

[3] Z. Galil and K. Park. 1990. A linear-time algorithm for concave one-dimensional dynamic programming.
(A simplified O(n) algorithm.)

Both [2] and [3] depend on an algorithm for "monotone matrix search", which is described in:

[4] A. Aggarwal, M. Klawe, S. Moran, P. Shor, and R. Wilber. 1986. Geometric applications of a matrix searching algorithm.

Finally, for the cost function used by TeX you will have to consult the relevant chapter in Knuth's "Digital Typography" book. Additionally, there is a paper/literate program which implements an O(n) time algorithm:

[5] Oege de Moor and Jeremy Gibbons. 1997. Bridging the Algorithm Gap: a Linear-Time Functional Program for Paragraph Formatting
> Now, while I don't know the complexities of the CSS float model

I'll summarize the complexity bit.  In CSS layout, the available width for a line depends on the exact line break positions of all previous lines.  It also depends on the height of the line, which depends on the exact items that end on the line and their vertical alignment.

I'm almost certain (though I haven't proved it) that any reasonable cost function here is not concave....
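
To illustrate with a toy sketch (made-up metrics and helper names, not Gecko code): as soon as the width passed to the cost function depends on the line's vertical position, say because a float intrudes for the first 120px, two break sequences ending at the same word can leave the paragraph at different y offsets and therefore see different widths for the next line, so neither can be discarded. The natural memo key becomes the pair (word index, y offset) rather than the word index alone, and the nice O(w * n) bound and the concavity assumptions from the papers above stop applying:

def available_width(y):    # toy float model: narrower near the top of the paragraph
    return 300 if y < 120 else 500

def line_height(line):     # toy model: a long "word" stands in for a tall inline box
    return 40 if any(len(w) > 8 for w in line) else 20

def line_cost(line, width):  # fake metrics: 8px per character, 4px per space
    used = 8 * sum(len(w) for w in line) + 4 * (len(line) - 1)
    return float("inf") if used > width else (width - used) ** 2

def best_cost(words, i=0, y=0.0):
    """Minimum cost to lay out words[i:], starting at vertical offset y.
    Exponential as written; memoising needs the (i, y) pair, not just i."""
    if i == len(words):
        return 0.0
    best = float("inf")
    for j in range(i + 1, len(words) + 1):
        line = words[i:j]
        cost = line_cost(line, available_width(y))
        if cost == float("inf"):
            break              # adding more words only makes the line longer
        best = min(best, cost + best_cost(words, j, y + line_height(line)))
    return best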
Shouldn't this block the meta bug 206152?
(In reply to Boris Zbarsky [:bz] from comment #9)
I think TeX also has to deal with these issues. Is there any possibility of using the TeX code as a guideline for an implementation in Firefox?

The original TeX code, written in WEB, is available here. It looks to be very cleanly written and well documented (following literate programming). Search for: [38] Breaking paragraphs into lines.

https://www.tug.org/texlive/devsrc/Build/source/texk/web2c/tex.web

Sorry, I don't know JavaScript or WEB, so I can't help directly.


> > Now, while I don't know the complexities of the CSS float model
> 
> I'll summarize the complexity bit.  In CSS layout, the available width for a
> line depends on the exact line break positions of all previous lines.  It
> also depends on the height of the line, which depends on the exact items
> that end on the line and their vertical alignment.
> 
> I'm almost certain (though I haven't proved it) that any reasonable cost
> function here is not concave....
> I think TeX also has to deal with these issues.

Please do read the whole bug?  That exact issue was discussed at the very beginning.  The summary is that as far as I can tell you think incorrectly...
(In reply to Boris Zbarsky [:bz] from comment #12)
Let me clarify my statement: I believe TeX also has to deal with the same issues. Lines formed from symbols of different heights are common in TeX documents, and paragraphs whose line widths are not known until previous lines have been laid out also occur: the LaTeX wrapfig package allows you to place a float at the side of a paragraph and wrap the text around it. Since the JS implementation on Bram Stein's website is based on the Knuth and Plass paper, it will not account for these complexities, but the actual TeX/LaTeX source code should account for them and may be a good starting point for an implementation. [Disclaimer: I am a TeX user, not a TeX programmer.]

Another issue (discussed earlier in these comments) is speed. In my experience, the TeX algorithm is fast enough to justify a 20-page report in a fraction of a second (with normal-sized paragraphs). For abnormally long paragraphs, we could either fall back to the current fast method or implement a faster method (from callcc's excellent survey).

> > I think TeX also has to deal with these issues.
> 
> Please do read the whole bug?  That exact issue was discussed at the very
> beginning.  The summary is that as far as I can tell you think incorrectly...
OS: Windows XP → All

It's possible that, if implemented, this could be activated by text-wrap: balance.

... actually, no, I was misreading the spec. balance is an entirely different feature that includes balancing the last line.

There’s a lot of conceptual overlap with balance, but it is indeed definitely a different type of wrapping. But the text-wrap property is relevant here.

It’d be text-wrap: pretty that you’d use to definitely opt into Knuth-Plass.

And text-wrap: stable would definitely opt out of it, retaining the current greedy wrapping algorithm.

The key would be what text-wrap: wrap (the default value) means. I suspect some experimentation would be warranted. The safe thing to do would be for it to be equivalent to stable; anything beyond that and you’re just about certain to make some things behave strangely, however rare. I suspect a decent initial heuristic would be: stable if within a textarea or contenteditable, otherwise pretty.

I plan to attempt to partially implement this over the next few weeks. I'll try to document what I learn in the process, even if I can't get anything to work.

Severity: normal → S3

Chromium is working on text-wrap: pretty: https://bugs.chromium.org/p/chromium/issues/detail?id=1432798

(In reply to Aaron Janse from comment #17)
> I plan to attempt to partially implement this over the next few weeks. I'll
> try to document what I learn in the process, even if I can't get anything to
> work.

How did it go?


To update a little while bumping: that Adobe patent should be expired by now, I think. It was granted in January 2003, and patent terms run 20 years (from the filing date, in fact, so it would have run out even earlier than 20 years after grant), so it had expired by January 2023 at the latest and is now irrelevant.

Chromium has now shipped text-wrap: pretty, and CanIUse reports >60% penetration. However, it's not as simple as "just implementing Knuth-Plass": it does that, but apparently only for the last 4 lines of a paragraph? It also does some handling of orphan words at the ends of paragraphs, to try to make the last line at least 2 words long. The Chromium implementation also seems to emphasize eliminating hyphens. There is some discussion about whether pretty should do all that or be disaggregated: https://github.com/w3c/csswg-drafts/issues/3473
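
If anyone wants to experiment, the "only re-optimise the tail" idea is easy to prototype. A hedged sketch (hypothetical helpers; the real Chromium heuristics are more involved): run the existing greedy pass, then hand just the words of the last few greedy lines to a Knuth/Plass-style optimiser such as the earlier DP sketch in this bug.

def greedy_break(words, width):
    """Stand-in for the current greedy wrapping: fill each line as far as it fits."""
    lines, line, used = [], [], 0
    for w in words:
        extra = len(w) + (1 if line else 0)
        if line and used + extra > width:
            lines.append(line)
            line, used = [w], len(w)
        else:
            line.append(w)
            used += extra
    if line:
        lines.append(line)
    return lines

def pretty_break(words, width, tail_lines=4):
    """Greedy breaking for the bulk of the paragraph, optimised breaking for the
    tail only. break_lines() is assumed to be a paragraph optimiser such as the
    DP sketch earlier in this bug."""
    lines = greedy_break(words, width)
    if len(lines) <= tail_lines:
        head, tail = [], lines               # short paragraph: optimise all of it
    else:
        head, tail = lines[:-tail_lines], lines[-tail_lines:]
    tail_words = [w for line in tail for w in line]
    return head + break_lines(tail_words, width)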
