Bug 95067 (uax14)

UAX14: line-break should be allowed after hyphens (unless followed by number)

RESOLVED WORKSFORME

Status


defect
RESOLVED WORKSFORME
18 years ago
2 years ago

People

(Reporter: alanmwood, Unassigned)

Tracking

(Blocks 2 bugs, {intl, testcase})

Trunk
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.8b3 -
blocking-aviary1.5 -
wanted-next +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [webcompat], URL)

Attachments

(9 attachments, 11 obsolete attachments)

4.48 KB, text/html
485 bytes, text/html
1.95 KB, text/html
3.04 KB, text/html
319 bytes, text/html
41.63 KB, image/png
22.80 KB, patch
19.27 KB, image/gif
90.85 KB, image/gif
(Reporter)

Description

18 years ago
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 95)
BuildID:    2001080110

Very long words, such as systematic chemical names, do not wrap to stay within 
the browser window, on-screen and when printed.

Reproducible: Always
Steps to Reproduce:
1. Display page

Actual Results:  Very long words are cut off at the right-hand edge of the 
window.

Expected Results:  Very long words should be wrapped to stay within the 
browser window. It is conventional to break such words after a hyphen, right 
parenthesis or right bracket.
OS: Windows 95 → All
Hardware: PC → All
Status: UNCONFIRMED → NEW
Ever confirmed: true

Comment 1

18 years ago
If I read http://www.w3.org/TR/REC-html40/struct/text.html#h-9.3.3 correctly, I
don't think it's required to break a line at a hard hyphen, only at a soft
hyphen (&shy;). But it would be nice, I agree.

But splitting at a hyphen wouldn't be enough; you would also have to implement
real word-wrapping. And the rules for that are very complicated, and are
different for every language. I know a guy who helped to implement an advanced
word-wrapping algorithm in Dutch for a newspaper, and it took 2 man-years to do it!

See also bug 47483 and bug 9101 for soft hyphen support. I think this should be
supported at the least. But not in this case - hyphens are mandatory in chemical
formulas.
(Reporter)

Comment 2

18 years ago
Re Johan Hermans' comments, I don't consider this bug to be anything to do with 
HTML 4 recommendations.  HTML pages do not have a defined page width (unlike 
word processor documents), so I think they should wrap to stay within the 
browser window when it is re-sized. Internet Explorer has done this with very 
long words since at least version 4.  The word-wrapping algorithm does not need 
to be perfect, and I would have thought that there was already something 
available in the public domain.
Alan Wood (alan.wood@context.co.uk)
*** Bug 101519 has been marked as a duplicate of this bug. ***
Old summary:
"Very long words in table cells do not wrap"

New summary:
"Very long words in table cells do not wrap (such as hyphens)"

(to make dupe-finding easier)
Summary: Very long words in table cells do not wrap → Very long words in table cells do not wrap (such as hyphens)
*** Bug 106179 has been marked as a duplicate of this bug. ***

Comment 6

18 years ago
-->attinasi
Assignee: karnaze → attinasi
*** Bug 108073 has been marked as a duplicate of this bug. ***
Note: IE 6.0 breaks at the hyphens. Today's trunk CVS build on WINNT does not.
Target Milestone: --- → mozilla1.1
*** Bug 111104 has been marked as a duplicate of this bug. ***
*** Bug 112545 has been marked as a duplicate of this bug. ***

Comment 12

17 years ago
This patch wraps long words at key characters.

The key characters are divided into three groups:
  suffixed-key    /  :  ;  &
  suffixed-key    )  ]  }  >  !  ?
  prefixed-key    (  [  {  <
  prefixed-key    $  \
  separated-key   %  -

I tested with mozilla-0.9.8.

Comment 14

17 years ago
My patch "patch to wrap long word by key-characters" 
also includes a fix for another bug,
which has probably not yet been reported to Bugzilla.
A word that contains Japanese Kanji cannot be wrapped after the Kanji,
i.e. the word overflows the width of the TABLE.

The testcase is listed below.
The word in the 1st TABLE is wrapped correctly,
but the word in the 2nd TABLE is not wrapped.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Testcase for a not wrapped word</TITLE>
<META http-equiv=Content-Type content="text/html; charset=shift_jis">
</HEAD><BODY>
<TABLE cellSpacing=0 cellPadding=0 width=80 align=left border=1>
  <TBODY><TR><TD>
<!-- this word is wrapped correctly -->
012あabxy
  </TD></TR></TBODY>
</TABLE>
<TABLE cellSpacing=0 cellPadding=0 width=80 align=left border=1>
  <TBODY><TR><TD>
<!-- this word is not wrapped -->
012あab<A href="">xy</A>
  </TD></TR></TBODY>
</TABLE></BODY></HTML>

Comment 16

17 years ago
I made a mistake.
The attachment "testcase of a word not wrapped after Kanji" is not a patch.

Updated

17 years ago
Attachment #76916 - Attachment is patch: false

Updated

17 years ago
Attachment #76916 - Attachment mime type: text/plain → text/html

Comment 17

17 years ago
P.S.
This testcase depends on the font size.
Please select an appropriate font size or change the width of TABLE tag.
(change width="80" to width="60" for example)
*** Bug 142446 has been marked as a duplicate of this bug. ***

Comment 19

17 years ago
just some comments for the patch author, and hopefully some more visibility for 
the patch...
is_alpha
is '_' really an alpha?

is_number reimplements what i think is a standard function call...

IS_BREAK_CHAR by case looks like a macro but is not
afaik return x; is preferred over return (x); and else is shunned after return
the function also appears to be asking for a switch statement.

#define isalnum_(c) (isalnum(c) || c=='_')
/*yes i just broke the case naming law for macros, whoops*/
inline PRBool
is_break_char(PRUnichar c, PRUnichar cc)
{
  if (isalnum_(c)) {
    switch (cc) {
      case '/': case ':': case ';': case '&':
      case '!': case '?':
      case ')': case ']': case '}': case '>':
        return PR_TRUE;
    }
  } else {
    switch (c) {
      case '%': case '-':
        return PR_TRUE;
      case '(': case '[': case '{': case '<':
        if (is_alpha(cc) || is_number(cc))
          return PR_TRUE;
      case '$': case'\\':
        if (is_number(cc))
          return PR_TRUE;
    }
  }
  return PR_FALSE;
}

comments anyone?
Keywords: patch

Comment 20

17 years ago
Posted patch: patch v2 (obsolete)
Mr. timeless, thank you for your advice.
The patch was updated.

Comment 21

17 years ago
As mentioned in an earlier comment, the patch also contains a fix for the
following fault: if an element such as <xx>...</xx> follows a run of several
Kanji characters that contains no white space, the "..." portion is not
wrapped correctly.

The following is a portion of the patch.

layout/html/base/src/nsTextFrame.cpp:
     } else {
       firstChar = *bp2;
+      if (IS_CJK_CHAR(firstChar))
+        aTextData.mIsBreakable = PR_TRUE;
     }

Refer to the patch in Bug 135323 for the IS_CJK_CHAR macro used in this patch.
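
The definition of IS_CJK_CHAR is not shown in this bug, so the following is only a rough sketch of what such a check might look like. It assumes a plain range test over three well-known Unicode blocks; the real macro in Bug 135323 may cover different ranges. The names `isCjkChar` and `firstCharMakesBreakable` are made up for this illustration, and `char16_t` stands in for `PRUnichar`.

```cpp
#include <string>

// Hypothetical stand-in for IS_CJK_CHAR. The real macro is defined in the
// patch for Bug 135323 and may cover different ranges.
inline bool isCjkChar(char16_t c) {
  return (c >= 0x3040 && c <= 0x309F) ||  // Hiragana
         (c >= 0x30A0 && c <= 0x30FF) ||  // Katakana
         (c >= 0x4E00 && c <= 0x9FFF);    // CJK Unified Ideographs
}

// Returns true when the inspected character (passed here as the first
// character of a run) is CJK, i.e. when the quoted branch would mark the
// text as breakable.
inline bool firstCharMakesBreakable(const std::u16string& text) {
  return !text.empty() && isCjkChar(text[0]);
}
```

Note that the character used in the testcase from the earlier comment (U+3042) is Hiragana rather than Kanji, which is why the sketch includes the kana blocks as well.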

Comment 22

17 years ago
:)
fwiw i made one trivial omission:
      return PR_TRUE;
break; //<-- this
  case '$': case'\\':

it doesn't actually matter, ideally the compiler will realize that the next 
case can't be satisfied any better than the previous case but for correctness 
we should have the break.

good luck.. the rest of this is out of my area

Comment 23

17 years ago
Posted patch: patch v3 (obsolete)
Thank you for pointing that out.
It was careless of me.

The patch was updated.

Comment 24

17 years ago
Posted patch: patch v4 (obsolete)
The IS_CJK_CHAR macro is now defined in nsTextFrame.cpp.

Comment 25

17 years ago
Saito san,
This is again a very difficult problem to handle. We have a problem with breaking
certain symbol sequences, like :-). Such commonly used sequences should be kept
together, and that's why we only break ASCII words at spaces. From this point of
view, your current approach is unacceptable.

My proposal is (logically) to do this in 2 steps. First, we try all the existing
logic to wrap words, as we do now. Second, if and only if a single
word is too long to fit in the current cell, we try to break the word. In
Linebreaker, we need to implement a new API to break a word into word segments.
That should be rather easy to do; basically you can move your new code to this
function. In the layout code, we want to call this API and break a word only when
such a situation arises. That is difficult because the layout code is very
complicated, but it is doable. Let me know if you agree with this approach and
if you have time to do it.
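
The two-step proposal above can be sketched roughly as follows. This is a standalone illustration, not Mozilla code: widths are counted in characters rather than measured in pixels, and `splitWordSegments` and `wrapTwoStep` are hypothetical names standing in for the proposed Linebreaker API and its layout call site. The segmenter simply cuts after each '-', which is an assumption made for the example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical segmenter: cut a word after each '-'.
std::vector<std::string> splitWordSegments(const std::string& word) {
  std::vector<std::string> segs;
  std::string cur;
  for (char ch : word) {
    cur += ch;
    if (ch == '-') { segs.push_back(cur); cur.clear(); }
  }
  if (!cur.empty()) segs.push_back(cur);
  return segs;
}

std::vector<std::string> wrapTwoStep(const std::string& text, std::size_t width) {
  std::vector<std::string> lines;
  std::string line;
  std::istringstream in(text);
  std::string word;
  while (in >> word) {
    // Step 1: ordinary space-based wrapping treats the word as one piece.
    std::vector<std::string> pieces{word};
    // Step 2: only a word that cannot fit on a line by itself is segmented.
    if (word.size() > width) pieces = splitWordSegments(word);
    bool first = true;
    for (const std::string& p : pieces) {
      // A space is needed only before the first piece of a word.
      const bool needSpace = first && !line.empty();
      if (!line.empty() && line.size() + p.size() + (needSpace ? 1 : 0) > width) {
        lines.push_back(line);
        line.clear();
      }
      if (first && !line.empty()) line += ' ';
      line += p;
      first = false;
    }
  }
  if (!line.empty()) lines.push_back(line);
  return lines;
}
```

With a width of 10, the text `:-) ok` stays on one line because `:-)` fits and is never segmented, while a 23-character hyphenated word is split only because it cannot fit on any line by itself.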

Comment 26

17 years ago
I understand that my patch has many problems.
I will investigate whether something can be done along the lines of your advice.

My patch also points out another problem related to line breaking.
If an element such as <a>...</a> is included in a line that consists only of
CJK characters and contains no spaces, the line break goes wrong.
A line can be broken after a CJK character, so the variable
aTextData.mIsBreakable should be set to TRUE.

layout/html/base/src/nsTextFrame.cpp:
     } else {
       firstChar = *bp2;
+      if (IS_CJK_CHAR(firstChar))
+        aTextData.mIsBreakable = PR_TRUE;
     }
If there are no particular comments, I will file a new bug report for this
problem.

Comment 27

17 years ago
Posted patch: patch v5 (obsolete)
This is a test patch.
It changes the behavior so that a string longer than the width of a table can be broken.

Comment 28

17 years ago
Posted patch: patch v6 (obsolete)
This may also fix Bug 153504, which involves the CSS nowrap property.
Because corrections were added each time testing turned up a problem, the
patch has regrettably become very complicated.

Although my patch may be hard to accept, 
I think that any processing that breaks a string where a breakable string and an
unbreakable string are joined will turn out to be similarly complicated.

Comment 29

17 years ago

Comment 30

17 years ago
Explanation of the patch:
The string is first scanned to find the longest WORD length.
The actual string is then laid out so that wrapping does not occur if possible.
This is because, when the window is narrowed, the window width cannot be made 
narrower than the WORD length found first.

Comment 31

17 years ago
I am sorry, here is a correction.

When the table width is variable, this is because the table width cannot be made 
narrower than the WORD length found first when the window is narrowed.

Comment 32

17 years ago
Posted patch: patch v7 (obsolete)
I tested the <td nowrap> and <nobr></nobr> elements and corrected the patch
accordingly.

Comment 33

17 years ago

Comment 34

17 years ago
As you work on hyphenation, could you please test that it breaks at a soft
hyphen?

As a result I expect the text:
ab-
bc
c d

And not:
abbc
c d

as the 2002072221 nightly does.
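
The soft-hyphen behavior requested above (a break is allowed at U+00AD, and a visible hyphen appears only where the line is actually broken) can be illustrated with a small standalone sketch. It is not taken from any patch in this bug: widths are counted in characters, only a single word is handled, and `breakAtSoftHyphens` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Greedy sketch: split one word at soft hyphens (U+00AD) so that each line
// fits in `width` characters. A visible '-' is emitted only at real breaks.
std::vector<std::u16string> breakAtSoftHyphens(const std::u16string& word,
                                               std::size_t width) {
  const char16_t kSoftHyphen = 0x00AD;
  // Split into segments; the soft hyphens themselves are dropped.
  std::vector<std::u16string> segs(1);
  for (char16_t c : word) {
    if (c == kSoftHyphen) segs.emplace_back(); else segs.back() += c;
  }
  std::vector<std::u16string> lines;
  std::u16string line;
  for (std::size_t i = 0; i < segs.size(); ++i) {
    // Reserve one column for a hyphen if another segment follows this one.
    const std::size_t hyphenRoom = (i + 1 < segs.size()) ? 1 : 0;
    if (!line.empty() && line.size() + segs[i].size() + hyphenRoom > width) {
      lines.push_back(line + u'-');  // break here and show the hyphen
      line.clear();
    }
    line += segs[i];
  }
  lines.push_back(line);
  return lines;
}
```

For the word `ab`, soft hyphen, `bc`, a width of 3 gives the two lines `ab-` and `bc`, while a width of 4 leaves the word unbroken as `abbc` with no hyphen shown.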

Comment 35

17 years ago
*** Bug 159340 has been marked as a duplicate of this bug. ***

Comment 36

17 years ago
*** Bug 154541 has been marked as a duplicate of this bug. ***

Comment 37

17 years ago
Posted patch: patch v8 (obsolete)
I tried to fix the soft-hyphen problem; the change affects only the display part.
This patch also includes the word-wrapping fix, a fix for the CSS nowrap
property problem, and a fix for the <nobr></nobr> tag problem.

Comment 38

17 years ago
*** Bug 160852 has been marked as a duplicate of this bug. ***
*** Bug 149137 has been marked as a duplicate of this bug. ***
*** Bug 175799 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 41

17 years ago
Does anyone know what is happening with this bug?  After a burst of activity in 
July, it seems to have gone quiet, and the target (1.1alpha) has not been 
updated.

It is assigned to attinasi@netscape.com (Marc Attinasi), but email to that 
address is rejected (User unknown).

Alan Wood
alan.wood@context.co.uk

Comment 42

17 years ago
Posted patch: patch v9 for mozilla-1.2.1 (obsolete)

Updated

17 years ago
Attachment #93007 - Attachment is obsolete: true

Updated

17 years ago
Attachment #91950 - Attachment is obsolete: true

Updated

17 years ago
Attachment #91096 - Attachment is obsolete: true

Updated

17 years ago
Attachment #87485 - Attachment is obsolete: true

Updated

17 years ago
Attachment #85732 - Attachment is obsolete: true

Updated

17 years ago
Attachment #83846 - Attachment is obsolete: true

Updated

17 years ago
Attachment #83837 - Attachment is obsolete: true

Updated

17 years ago
Attachment #75233 - Attachment is obsolete: true

Updated

17 years ago
Attachment #108709 - Flags: review?(shanjian)

Comment 43

17 years ago
I've tried to interpret this bug and patch. Can someone clarify what will happen
with the strings:

"2002-12-31" (ISO date)

"-2147483648-2147483647" (the interval for a 32 bit 2-complement integer)

"x=(-3)*y-5" (some maths)
"x=-3*y-5" (some maths)

Comment 44

17 years ago
If the page width is narrowed, they will be displayed as follows.

"2002-12-31" result:
2002-
12-31

"-2147483648-2147483647" result:
-
2147483648-
2147483647

"x=(-3)*y-5" result:
x=(-
3)
*y-5

"x=-3*y-5" result:
x=-
3*y-
5

Comment 45

17 years ago
Not all of those are especially good. I could accept them if they were the last
resort, but I know from MSIE that it unnecessarily compresses table cells and
forces breaks where none is wanted. I would be unhappy to see the same in Mozilla.

Comment 46

17 years ago
Compared with MSIE, the patch has an advantage.
The effect can be seen when "x=(-3)*y-5 x=-3*y-5" is displayed in a table cell
that has some spare width.

MSIE result:
<------------->
"x=(-3)*y-5 x=-
3*y-5"

mozilla with patch result:
<------------->
x=(-3)*y-5
x=-3*y-5
Attinasi is gone. Reassigning to patch author.
Assignee: attinasi → saito

Comment 48

16 years ago
*** Bug 192757 has been marked as a duplicate of this bug. ***

Comment 49

16 years ago
Comment on attachment 108709
patch v9 for mozilla-1.2.1

Saito-san, please post your new patch.
Attachment #108709 - Attachment is obsolete: true

Comment 50

16 years ago
Posted patch: patch for mozilla-1.3b (obsolete)
I don't think we should break on hyphens.  So why should we add this much
additional code complexity to fix something that isn't even a bug?

(Adding support for soft hyphen, etc., is definitely a good thing, but I imagine
the changes to do that would be much simpler.)
OK, I'll partially retract that statement (in response to email sent to me that
should have been a comment on this bug).

I don't think we should break on hyphens due to the complexity of the current
linebreaking code.

(I recall a discussion of this issue in much more detail in another bug, though,
that led me to think we shouldn't break on hyphens at all.)

Comment 53

16 years ago
I was going to disagree with Comment #51, even after looking at the patch, but I
didn't as I've worked on line breaking code before.

Comment #52 gets my agreement, but maybe for differing reasons. Line breaking
code is horribly complex as it has to deal with all possible scenarios. Now, as
Mozilla is a "multi-lingual" browser, line breaking becomes exponentially worse.
I'd hate to try and determine linebreaks for Kanji (which I don't think has
hyphens at all) or Arabic (which I recall is read right to left).

I believe the W3C standard is to break on &shy;, which is a deliberate choice
made by the web page designer. Maybe that is all that is needed, and other cases
can go to Evang.
(Reporter)

Comment 54

16 years ago
Please don't give up on this bug.  Check the URL from my original bug report in 
Mozilla, and then compare with I.E. 4+ and Opera 6+, both of which wrap 
reasonably well.

Printed material has always broken lines after hyphens. Word processors break 
lines after hyphens. Why should Web browsers be any different?

There are some hyphens where breaking is not appropriate, and the rules need to 
take this into account, by for example not allowing a break if it would produce 
a string of 3 or fewer characters at the start of the next line.

Algorithms like this have existed for many years for computerised typesetting, 
and there is no good reason why Web browsers cannot have this facility.

I appreciate that multi-script line breaking introduces even more complexity, 
but isn't Mozilla intended to be the best Web browser?

Alan Wood

Comment 55

16 years ago
We shouldn't drop this. If necessary, push it into the future. For many
languages, such as English, good style requires line breaking on hyphens. We
should do this, eventually. I'm sure languages have different hyphenation rules,
but maybe that could be addressed by relying on statements like xml:lang="en" in
the page source.
The point is, it would be nice to make the current line-breaking and
text-measurement code something approaching readable.  nsTextFrame::Reflow is
already one of the most regression-prone (if ever touched) and undecipherable
pieces of code in Mozilla. I think breaking it up into multiple functions (ones
that are not 300 lines long), renaming the variables to have names that have
something to do with what they're doing (some fail this test last I checked),
and writing some frigging comments should take priority over adding new stuff to
it.... (my 2cents).

At that point, changes would become much more reasonable both to code and to
review...

Comment 57

16 years ago
dbaron, do you oppose the simple code which fixes this bug?

Comment 58

16 years ago
Please see the screenshot: some strings overflow the table frame. That is a
Mozilla bug. Because the patch contains the code to correct that bug, it became
complicated. It will probably be necessary to split the patch.
I guess I really shouldn't talk, since I don't work on horizontal aspects of
inline layout...
I think we all understand the reasons for this patch.... I agree that
they are good reasons.  All I'm asking is that people consider the
maintainability of the code in addition to its functionality.  The current
code is unmaintainable, so it would be nice if someone who understands
the code (as Saito-san clearly does) could make it more readable and
maintainable... (either before or after landing this patch, as long as
it _happens_).

Comment 61

16 years ago
*** Bug 193360 has been marked as a duplicate of this bug. ***

Comment 62

16 years ago
> Bug 193360
Bugzilla recognizes some strings in a <textarea> as URLs. If a string is
wrapped at '/', '-', etc., the URL can no longer be recognized because it then
contains a newline. This patch does not wrap any long words in a <textarea> that
has the -moz-pre-wrap property.

Comment 63

16 years ago
*** Bug 195491 has been marked as a duplicate of this bug. ***
Why is this bug applied only to table cells?  Mozilla doesn't break long strings
in any context, to my knowledge.

This may be late in the game, but line-breaking on punctuation probably should
strive to follow the rules laid out in:
  http://www.unicode.org/unicode/reports/tr14/

To wit, see this comment from the &shy; bug:
  http://bugzilla.mozilla.org/show_bug.cgi?id=9101#c30

Comment 65

16 years ago
Posted patch: patch (obsolete)
I checked only ASCII characters.
Please refer to the function nsTextTransformer::GetNextDividedWord.
This patch also includes the bug fix shown below, 
because the text style should be updated when fragmentary text is joined together.

nsTextFrame::ComputeWordFragmentDimensions
+  nsIStyleContext* aStyleContext;
+  aTextFrame->GetStyleContext(&aStyleContext);
+  const nsStyleText* textStyle = (const nsStyleText*)
+    aStyleContext->GetStyleData(eStyleStruct_Text);
+  aCanBreakBefore = (NS_STYLE_WHITESPACE_NORMAL == textStyle->mWhiteSpace) ||
+    (NS_STYLE_WHITESPACE_MOZ_PRE_WRAP == textStyle->mWhiteSpace);
Attachment #114308 - Attachment is obsolete: true

Comment 66

16 years ago
Care to file a separate bug for that? You would do well to also include a testcase
to show the problem. It is not clear why |aCanBreakBefore| is out-of-sync when it is
passed to ComputeWordFragmentDimensions().

Comment 67

16 years ago
*** Bug 204233 has been marked as a duplicate of this bug. ***

Comment 68

16 years ago
Posted patch: patch
This patch includes some changes for hyphenation at soft hyphens.

The frame's white-space mode should be updated whenever the text of the next
text frame is read, but I was not able to demonstrate any effect of the
following change, so I removed it.

nsTextFrame::ComputeWordFragmentDimensions
+  aCanBreakBefore = (NS_STYLE_WHITESPACE_NORMAL == textStyle->mWhiteSpace) ||
+    (NS_STYLE_WHITESPACE_MOZ_PRE_WRAP == textStyle->mWhiteSpace);
Attachment #120999 - Attachment is obsolete: true
*** Bug 214618 has been marked as a duplicate of this bug. ***
*** Bug 215166 has been marked as a duplicate of this bug. ***

Updated

16 years ago
Target Milestone: mozilla1.1alpha → ---

Comment 71

16 years ago
Posted image Example Screenshot

Comment 72

16 years ago
*** Bug 217520 has been marked as a duplicate of this bug. ***

Comment 73

16 years ago
*** Bug 217705 has been marked as a duplicate of this bug. ***

Updated

16 years ago
Keywords: intl

Updated

16 years ago
Retitling bug from "Very long words in table cells do not wrap (such as
hyphens)" to "line-break should be allowed after hyphens (unless followed by
number)", which better reflects what this bug is about.

We're not going to change the way table cells compute their size.  Unbreakable
things inside table cells will increase the size of the table.  The web depends
on this, and it's described in (admittedly, an informative part of) the CSS2
table model.  What we might change is what's considered breakable and what isn't.

UAX #14 suggests that line breaks should be allowed after hyphens unless they're
followed by a numeric character.  I agree that this makes sense and we should do
this.  It should be possible to do without increasing the complexity of the code.
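
As a rough illustration of the rule as stated in this comment (a break is allowed after a hyphen unless the next character is numeric), here is a minimal standalone sketch. It handles only ASCII HYPHEN-MINUS and ASCII digits, and `hyphenBreakOffsets` is a hypothetical name; it is not the line breaker code and does not implement the rest of UAX #14.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Offsets at which a line break would be allowed: right after each '-'
// that is followed by a character other than an ASCII digit.
std::vector<std::size_t> hyphenBreakOffsets(const std::string& text) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i + 1 < text.size(); ++i) {
    const bool nextIsDigit =
        std::isdigit(static_cast<unsigned char>(text[i + 1])) != 0;
    if (text[i] == '-' && !nextIsDigit) offsets.push_back(i + 1);
  }
  return offsets;
}
```

Under this rule "well-known" may break after "well-", while "2002-12-31" gets no break opportunity at all because both hyphens are followed by digits.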

One serious problem with our current line breaking code is that we follow three
separate codepaths -- one for text that's entirely ASCII, one for text that's
not ASCII but has no CJK characters, and one for text with CJK characters.  In
some cases, these codepaths have different behavior.  Since HYPHEN-MINUS is in
ASCII, all three codepaths need to be modified to fix this bug -- or we need to
combine the codepaths.  I think combination of the codepaths is probably a good
approach, and I may be able to look into it in the near future.  It might also
allow us to implement additional improvements to line breaking based on UAX #14
(also see discussion on bug 56652 and bug 206152), such as support for soft hyphens.

Any duplicates of this bug that aren't about hyphens should (I think) be
reopened (and perhaps marked as duplicates of other bugs or marked invalid).  If
someone disagrees, please say so.
Summary: Very long words in table cells do not wrap (such as hyphens) → line-break should be allowed after hyphens (unless followed by number)
Component: Layout: Tables → Layout: Fonts and Text
Comment on attachment 123515
patch

I think one reason this patch introduces so much additional complexity is that
it's trying to modify line breaking from a level of the code other than the one
at which line breaking happens.  I would expect the fix to this bug to be
closer to our current line breaking code, i.e., nsJISX4501LineBreaker (sic) and
nsTextTransformer::Scan*.

I don't see why a fix for this bug would need to modify other code.

Comment 76

16 years ago
My problem (originally 217520, but transferred to 95067) concerned
very_long_words (long string of text without white space) that exceeded cell
width, and often window width.
This has little to do with splitting a word on hyphens, etc.
What is needed at a minimum is to ensure that a very_long_word will NEVER
cause the cell width to exceed the window width.  This is in itself a relatively
simple problem, unlike splitting the very_long_word in an esthetically pleasing
manner.
It also solves a serious problem.  When a very_long_word causes a text cell to
exceed the window width, in a long text, the text is rendered unreadable, due to
requiring scrolling right/left for each line, without visual cues to maintain
one's place in the text.  As this is often important information (e.g. re
security patches) this poses a SERIOUS PROBLEM.
Agreed, it would be NICE to have a more esthetically pleasing presentation, but at A
MINIMUM, THIS PARTICULAR PROBLEM should be solved as A PRIORITY.
No.  As I said in comment 74, we're not changing the basic table algorithm.  The
web depends on it, and your proposal really won't help for all but the simplest
case.  (Why does breaking at the *window width* help for a table that has
multiple columns?)  But I reopened your bug and marked it a duplicate of a
different bug.

Please do not discuss the issue further on *this* bug.  It's off-topic.
(Reporter)

Comment 78

16 years ago
I am not completely happy with the end of the new title "unless followed by number".

I can see the validity of this for dates, as in 2002-12-31 (comment 43), but I
feel that some breaking of dates can be avoided by a widow/orphan setting of 3
characters, i.e. don't break if it would result in 1, 2 or 3 characters at the
start or end of a line.  Perhaps a widow/orphan setting could be included in
Preferences?
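
The widow/orphan idea above could be sketched as a filter over candidate break offsets. This is only an illustration under assumptions: it looks at a single unbroken string rather than a whole line, the candidate offsets are supplied by the caller, and `dropShortFragments` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical filter: drop break offsets that would leave `maxShort` or
// fewer characters on either side of the break within one unbroken string.
std::vector<std::size_t> dropShortFragments(
    const std::string& word, const std::vector<std::size_t>& candidates,
    std::size_t maxShort = 3) {
  std::vector<std::size_t> kept;
  for (std::size_t offset : candidates) {
    if (offset > word.size()) continue;  // ignore out-of-range offsets
    const std::size_t before = offset;
    const std::size_t after = word.size() - offset;
    if (before > maxShort && after > maxShort) kept.push_back(offset);
  }
  return kept;
}
```

For "2002-12-31" with candidates after each hyphen (offsets 5 and 8), the filter rejects offset 8 because only "31" would follow, but it keeps offset 5, so a setting of 3 avoids some date breaks rather than all of them.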

Not breaking after a hyphen that is followed by a number does not work so well
for chemical names, which started this bug.  For example:

2-bromo-4,4-dichlorophenol

could happily be broken as

2-bromo-
4,4-dichlorophenol

but breaking as

2-bromo-4,4-
dichlorophenol

is much less satisfactory.

However, breaking after some hyphens would be MUCH better than never breaking
after hyphens.  We cannot have manual checking of each break, and so we will
have to accept some imperfections.

Alan Wood
 
Regarding the venerated and abused ASCII hyphen (Unicode HYPHEN-MINUS): "unless 
followed by number" is not explicitly called for by UAX14:
  http://www.unicode.org/reports/tr14/#HY
Instead, that spec rather vaguely states: "Some additional context analysis is 
required to distinguish usage of this character as a hyphen from the use as 
minus sign (or indicator of numerical range). If used as hyphen, it acts like 
HYPHEN."  In this instance, HYPHEN is a subsection of:
  http://www.unicode.org/reports/tr14/#BA

The context analysis mentioned is primarily to determine whether the hyphen is 
being used as a MINUS:
  http://www.unicode.org/reports/tr14/#PR
(I'm a little surprised about the "indicator of numerical range" clause, as that 
implies functional equivalence to the EN DASH, which breaks like HYPHEN.)

I'm not sure that context analysis could be relied on to distinguish between 
breaking and non-breaking requirements for the chemical names.  Perhaps the best 
solution to that problem is to use explicit Unicode hyphen (2010) and 
non-breaking hyphen (2011) characters in the text.
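
The distinction between the three hyphen characters discussed here can be summarized in a small sketch. The enum and the function name `breakAfter` are made up for this illustration; the code only encodes the behavior described in this comment and does not cover any other UAX #14 classes.

```cpp
// What a line breaker may do directly after a given hyphen-like character.
enum class HyphenBreak { Allowed, Forbidden, NeedsContext, NotAHyphen };

HyphenBreak breakAfter(char32_t c) {
  switch (c) {
    case 0x2010: return HyphenBreak::Allowed;       // HYPHEN
    case 0x2011: return HyphenBreak::Forbidden;     // NON-BREAKING HYPHEN
    case 0x002D: return HyphenBreak::NeedsContext;  // HYPHEN-MINUS
    default:     return HyphenBreak::NotAHyphen;
  }
}
```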
*** Bug 217520 has been marked as a duplicate of this bug. ***
(Reporter)

Comment 82

16 years ago
With reference to the last paragraph of comment 79, the idea of using U+2010
(hyphen) and U+2011 (non-breaking hyphen) in chemical names is a non-starter.

These characters are not on any keyboard, and so would need to be entered as
numeric character references.  These characters are not included in the core
fonts for Windows (Arial, Courier New, Times New Roman), and therefore are not
likely to be displayed in most people's Web browsers.

I can live with sub-optimal breaking at hyphens in chemical names.  Any sort of
breaking at hyphens would make it possible to view and print long chemical names
in Mozilla, something which is sadly not possible at the moment.

Alan Wood
I suspect the most common place where the sequence hyphen,number occurs is in
negative numbers.  I think that's why UAX #14 recommends that a break should not
be allowed between a hyphen and a numeric character.  We don't want to format
the sentence:

  The sum of 7000 and -5000 is 2000.

as:

  The sum of 7000 and -
  5000 is 2000.

Distinguishing the negative number case from others requires more than
pair-based analysis, and the algorithm presented in UAX #14 is based on
pair-based analysis, which is simpler to implement than more complex analysis
for line breaking.
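
To show what pair-based analysis means in practice, here is a deliberately tiny sketch with four character classes and a lookup table indexed by the classes before and after a position. The class names follow UAX #14 abbreviations, but the table is a simplified assumption for this example, not the full pair table from the report, and `classify` and `breakOffsets` are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum BreakClass { AL, NU, HY, SP, kNumClasses };

BreakClass classify(char c) {
  if (c == ' ') return SP;
  if (c == '-') return HY;
  if (c >= '0' && c <= '9') return NU;
  return AL;  // everything else is lumped together in this sketch
}

// kBreakAllowed[before][after]: may a line break fall between the two?
const bool kBreakAllowed[kNumClasses][kNumClasses] = {
    /* before AL */ {false, false, false, false},
    /* before NU */ {false, false, false, false},
    /* before HY */ {true, false, false, false},  // HY then NU: no break
    /* before SP */ {true, true, true, false},
};

// Offsets at which a break is allowed, looking only at adjacent pairs.
std::vector<std::size_t> breakOffsets(const std::string& s) {
  std::vector<std::size_t> out;
  for (std::size_t i = 0; i + 1 < s.size(); ++i) {
    if (kBreakAllowed[classify(s[i])][classify(s[i + 1])]) out.push_back(i + 1);
  }
  return out;
}
```

For the sentence above, every space yields a break opportunity but none falls between "-" and "5000". The same table gives no break inside 2002-12-31, since every hyphen there is followed by a digit; telling those cases apart would need more context than a pair.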
(Reporter)

Comment 84

16 years ago
With reference to comment 83, I had not thought about negative numbers, but
there must be lots of them on the Web, mostly using the hyphen-minus character
from the keyboard instead of the proper Unicode minus sign (U+2212) which is
present in WGL4 (and is therefore present in many TrueType fonts for Windows).

I agree that the case for not breaking after the 'minus' of a negative number is
stronger than my case for allowing it because of chemical names (comment 78).

Alan Wood
Regarding the issue of rendering for Hyphen and Non-Breaking Hyphen: as it turns 
out, Mozilla realizes that if these characters are not part of the font, it can 
render them using the Hyphen-Minus glyph, and does so.  Of course, it doesn't 
break after either.

IE6 also substitutes, but only for the Non-Breaking Hyphen; Opera, whose Unicode 
handling is currently not so great anyway, doesn't substitute for either.  
However, both of these browsers correctly handle the breaking properties of both 
characters.

I hadn't realized the precise nature of the rendering problem because my default 
web font is Palatino Linotype, which does define the two hyphen characters, so 
my test page looked fine in all three browsers.
xref to bug 56552.

Comment 87

16 years ago
xref to bug 56652 =)
Attachment #108709 - Flags: review?(sli0262)
*** Bug 207549 has been marked as a duplicate of this bug. ***
*** Bug 228243 has been marked as a duplicate of this bug. ***

Comment 90

16 years ago
*** Bug 230100 has been marked as a duplicate of this bug. ***
*** Bug 230716 has been marked as a duplicate of this bug. ***

Comment 92

15 years ago
This bug is the apparent cause of a large block of white-space in
http://www.cnn.com/2004/TECH/ptech/02/05/bus2.feat.dumbest.moments/index.html
Mozilla won't line-break the text:
  April-Fool's-joke-29-days-after-April-Fool's-Day
in the article.
Does this qualify the bug for the TOP100 keyword?

Comment 93

15 years ago
Is the intent here to add opportunistic line-breaking on all hyphens, or is this
merely last-resort line-breaking where problems are encountered, like Comment #92?

If you want to break at every hyphen except those that match pre-programmed
rules, the complexity to get it right is going to reach ridiculous levels. So if
the argument to do this is just to make layout nicer, I think all that's going
to happen is extra code, and different (not nicer) layout. Some cases will be
better, and some worse.

Breaking short compound words, e.g. delayed phase-
ins, is very jarring. We're not a newspaper; column
length doesn't cost trees.

We should, however, be breaking before or after em dashes, as they are always
punctuation, and never part of words, names, mathematical equations, negative
numbers, etc.

Comment 94

15 years ago
*** Bug 234071 has been marked as a duplicate of this bug. ***

Comment 95

15 years ago
*** Bug 239157 has been marked as a duplicate of this bug. ***
*** Bug 242615 has been marked as a duplicate of this bug. ***

Comment 97

15 years ago
About the complexity of the program code you may be right. But I think you
mostly see websites written in english. The use of the hyphen as wordbreak is
often used in german language with sometimes very long words. 
I would prefer a simple breaking rule by matching [A-Za-z]{3}-[A-Za-z]{3} or a
-moz- CSS rule which I can add to the body.
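
The pattern suggested above could be tried out with a sketch like the following. One adaptation is an assumption of this example: the trailing three letters are matched as a lookahead, so that in a word with several hyphens each hyphen can be tested against the letters on both sides. The function name `letterHyphenLetterBreaks` is hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Offsets right after each '-' that has at least three ASCII letters
// immediately before it and at least three immediately after it.
std::vector<std::size_t> letterHyphenLetterBreaks(const std::string& text) {
  // The trailing letters are a lookahead so they are not consumed and can
  // serve as the leading letters for the next hyphen.
  static const std::regex kPattern("[A-Za-z]{3}-(?=[A-Za-z]{3})");
  std::vector<std::size_t> offsets;
  for (std::sregex_iterator it(text.begin(), text.end(), kPattern), end;
       it != end; ++it) {
    offsets.push_back(static_cast<std::size_t>(it->position() + it->length()));
  }
  return offsets;
}
```

Such a rule would leave ":-)" and "2002-12-31" unbroken while allowing a break after "Donau-" in "Donau-Dampfschiff".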

Comment 98

15 years ago
*** Bug 248452 has been marked as a duplicate of this bug. ***

Comment 99

15 years ago
*** Bug 249083 has been marked as a duplicate of this bug. ***

Comment 100

15 years ago
(In reply to comment #97)
> About the complexity of the program code you may be right. But I think you
> mostly see websites written in English. The hyphen is often used as a word
> break in German, which sometimes has very long words.

An "in English" example is how Firefox does not wrap the rather long
trackback_urls when permalinks is on (on WordPress sites).

French also uses hyphens as word breaks, but then you are talking about soft
hyphens, which are not the subject of this page. Refer to Bug 9101 for soft hyphens.

Anyway, compound words containing normal hyphens should break after the hyphen
if needed (at the end of a line).

Comment 101

15 years ago
(In reply to comment #93)
> Is the intent here to add opportunistic line-breaking on all hyphens, or is this
> merely last-resort line-breaking where problems are encountered, like Comment #92?
> 
> If you want to break at every hyphen except those that match pre-programmed
> rules, the complexity to get it right is going to reach ridiculous levels. So if
> the argument to do this is just to make layout nicer, I think all that's going
> to happen is extra code, and different (not nicer) layout. Some cases will be
> better, and some worse.


Looking nicer is not the only argument. Typography is an art that has existed
for centuries to make reading easier. We're used to it. Reading well-typeset
text is far from reading every word: we read only part of each word and mentally
complete it depending on the general purpose of the text.
In justified text, large spaces between words create white columns that
attract the eye and interrupt reading. You could say that this is not the case in
left-aligned text. In fact it is also a problem there, since the narrower your
text is, the more the line lengths will differ, so
that reading becomes more and more difficult.


> Breaking short compound words, e.g. delayed phase-
> ins, is very jarring. We're not a newspaper; column
> length doesn't cost trees.
> 
> We should, however, be breaking before or after em dashes, as they are always
> punctuation, and never part of words, names, mathematical equations, negative
> numbers, etc.


It is better to use nowrap SPANs according to your own country's rules of
typography than to prevent others from doing anything about this problem. Opera
and IE do respect these conventions, so why not Moz? Column length does not
cost trees, but how many sites use a layout with two or more columns? Nearly all.
So short columns are common. And even if a one-column layout is used, imagine
reading the text on a small PDA screen if compound words are not broken: the
problem still remains.
*** Bug 252327 has been marked as a duplicate of this bug. ***
*** Bug 257673 has been marked as a duplicate of this bug. ***
*** Bug 173534 has been marked as a duplicate of this bug. ***

Comment 105

15 years ago
*** Bug 263803 has been marked as a duplicate of this bug. ***

Comment 106

15 years ago
*** Bug 174302 has been marked as a duplicate of this bug. ***
*** Bug 271878 has been marked as a duplicate of this bug. ***

Comment 108

15 years ago
Please don't forget that this fix can change the height of a block element. If
you use overflow: hidden or read the height with the JavaScript property
scrollHeight, the size won't be correct after the whole page has rendered (the
same effect occurs with images).

Comment 109

14 years ago
re : comment #74, I totally agree. bug 255990 deals with it although I haven't
made any patch in that direction. The only patch uploaded there is just a
kludge/stop-gap measure.

Depends on: 255990

Comment 110

14 years ago
(In reply to R.K.Aa., comment #61)
> *** Bug 193360 has been marked as a duplicate of this bug. ***

If this bug is indeed related to textareas, please change the summary to make it
easier to find.

Thanks,

Prog.
The iFrame on the right side of the page loads incorrectly, the text inside of
the  iFrame should wrap to the iFrame's width.

Comment 112

14 years ago
Consideration should be given to treatment of strings w/o hyphens, soft or
otherwise -- e.g. URLs, which may be quite long. Ref page:
http://freepages.genealogy.rootsweb.com/~dav4is/Sources/BRAINERD.shtml

1. URLs are certainly necessary content to be able to handle, e.g. to make
printed copy more useful (<A> links don't print underlying URL.)

2. Author must not be required to do UA job of rendering, e.g. by inserting breaks.

3. By way of example, Opera and MSIE both will break URLs at certain special
characters, e.g. dash, slash, question mark.

-R.
dav4is@yahoo.com

Comment 113

14 years ago
re : comment #112
That's what's being dealt with in bug 255990. 

Comment 114

14 years ago
*** Bug 290045 has been marked as a duplicate of this bug. ***
*** Bug 291405 has been marked as a duplicate of this bug. ***
It would be nice if words would break on - (&#45;) sooner rather than later.
Both Opera and IE do it and that is what everyone expects.

Because of the ever increasing popularity of FF I think this should be fixed in 1.1

->?1.1
Flags: blocking-aviary1.1?

Updated

14 years ago
Flags: blocking1.8b3?
*** Bug 294059 has been marked as a duplicate of this bug. ***
*** Bug 289462 has been marked as a duplicate of this bug. ***

Updated

14 years ago
Blocks: majorbugs

Updated

14 years ago
No longer blocks: majorbugs

Updated

14 years ago
Flags: blocking1.8b3? → blocking1.8b3-

Updated

14 years ago
Flags: blocking-aviary1.1? → blocking-aviary1.1-

Updated

14 years ago
Assignee: saito → nobody
QA Contact: amar → layout.fonts-and-text

Comment 119

13 years ago
Proposing keyword "helpwanted".

Comment 120

13 years ago
RE: Comment #93
"If you want to break at every hyphen except those that match pre-programmed 
rules, the complexity to get it right is going to reach ridiculous levels."

That's probably quite true. However, instead of simply allowing line breaks after any hyphen, at least some kind of basic limitations would be desirable. Here's one interesting proposal (quoting an ISO/IEC expert contribution at http://anubis.dkuug.dk/JTC1/SC2/WG3/docs/n506.pdf):

"HYPHEN-MINUS (002D): HYPHEN-MINUS allows an automatic line break to be established just after it only if it is both immediately preceded by a letter and immediately followed by a letter. HYPHEN-MINUS should be imaged by a graphic symbol identical with that representing HYPHEN when immediately preceded or immediately followed by a letter. HYPHEN-MINUS should be imaged by a graphic symbol identical with that representing MINUS otherwise."

So according to this approach, HYPHEN-MINUS (i.e., the regular ASCII hyphen character) would allow a line break only within a (compound) word. Line breaks would not be allowed in connection with numerals nor other punctuation or special characters, so e.g. most smileys would remain intact. Also elliptical hyphens, which occur regularly in the beginnings of words in certain contexts in some languages, would stick with the string of letters they are attached to (for example, in Finnish you could write "videokasetti ja -levy", meaning video cassette and video disk, where the hyphen in the beginning of the last word makes it unnecessary to repeat the word "video").

For decent typography, it might also be good to set an additional rule that line break is not allowed unless there is at least two or three letters on both sides of the hyphen. Thus, words such as "T-shirt" would not be broken.

On the other hand, the idea that HYPHEN-MINUS should be imaged identically with MINUS if not connected to a letter, seems a little problematic with regard to, e.g., smileys or data formats such as 2006-06-09. It's probably better to dismiss that part of the proposal and encourage Web authors to use the HTML character entity reference or Unicode MINUS character instead, when necessary.

As HYPHEN-MINUS is the default hyphen character available on keyboards today and most people probably don't know (or even care of) how to produce other kinds of hyphens, it is important that it is treated in a way that is supposed to be adequate on most situations.

The quoted proposal also speaks of the Unicode characters HYPHEN and NON-BREAKING HYPHEN (as well as two kinds of SOFT HYPHENS, but that's a separate issue):

"HYPHEN (2010): HYPHEN allows an automatic line break to be established just after it. HYPHEN is imaged by a graphic symbol."

"NON-BREAKING HYPHEN (2011): NON-BREAKING HYPHEN is a graphic character, the visual representation of which is identical to that of HYPHEN. NON-BREAKING HYPHEN is for use as hyphen when an automatic line break just before or just after it is to be prevented in the text as presented."

Thus, if necessary, Web authors might use HYPHEN to get around the special restrictions appended to HYPHEN-MINUS. Respectively, NON-BREAKING HYPHEN (as well as CSS rule "white-space: nowrap") could be used to restrict the hyphenation further. 

Hyphenation is a tricky question, and this approach certainly wouldn't resolve all or even the principal problems. However, (in combination with support for soft hyphens) it might be a reasonable, not too complicated compromise at least until more sophisticated, language specific hyphenation solutions may emerge.
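
To make the three quoted rules concrete, here is a small JavaScript sketch of how they might be checked. It is only an illustration of the proposal as quoted above, not of anything Gecko does, and the function name `breakAllowedAfter` is invented for the example.

```javascript
const LETTER = /\p{L}/u; // any Unicode letter

// Hypothetical check: may a line break follow the character at index i?
function breakAllowedAfter(text, i) {
  const ch = text[i];
  if (ch === "\u2010") return true;  // HYPHEN: break always allowed after it
  if (ch === "\u2011") return false; // NON-BREAKING HYPHEN: never
  if (ch === "-") {
    // HYPHEN-MINUS: only with a letter immediately before and after.
    const prev = text[i - 1] || "";
    const next = text[i + 1] || "";
    return LETTER.test(prev) && LETTER.test(next);
  }
  return false; // other characters are outside the scope of this sketch
}
```

With this check, "videokasetti ja -levy" keeps its elliptical hyphen attached to "levy", and a smiley or a date such as 2006-06-09 stays intact. The additional "two or three letters on both sides" rule is deliberately left out here.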

Comment 121

13 years ago
RE: Comment #120
"For decent typography, it might also be good to set an additional rule that
line break is not allowed unless there is at least two or three letters on both
sides of the hyphen. Thus, words such as 'T-shirt' would not be broken."

On reflection, one could simply treat numerals the same way, as this "additional" rule in itself also guaranteed that HYPHEN-MINUS marking a negative number would not be separated from the numeral string. Basically, line break within a compound word will be allowed only if there is a minimum amount of characters both before and after HYPHEN-MINUS, and no additional rules will be necessary (at this point). This would allow line break after HYPHEN-MINUS within long numeral strings as well as letter strings. However, this approach may not suffice to resolve the problem with long chemical names, such as "2-bromo-4,4-dichlorophenol" (comment #78).

The same (or at least a similar) rule should apply to line breaks after slashes. It is not desirable to allow line break within, e.g., an abbreviation such as "c/o", but especially within long URLs it would indeed often result in better typography.

Obviously, punctuation characters should not be counted in the character strings preceding and following a hyphen or slash. For example, in string 

"...(Latin-1)." 

there is in effect only one character after the hyphen, and the ending parenthesis, full stop and quotation mark shouldn't make line break possible in that kind of situation.
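
The counting rule described in this comment could be sketched roughly as follows in JavaScript. The function names and the default minimum of two characters are assumptions made for the example; letters and digits are counted alike, and punctuation ends the count.

```javascript
const WORD_CHAR = /[\p{L}\p{N}]/u; // letters and numerals; punctuation is excluded

// Count consecutive letters/digits from `start`, moving by `step` (+1 or -1).
function runLength(text, start, step) {
  let n = 0;
  for (let i = start; i >= 0 && i < text.length && WORD_CHAR.test(text[i]); i += step) n++;
  return n;
}

// Hypothetical helper: offsets just after each hyphen-minus where a break
// would be allowed, requiring at least `min` counted characters on each side.
function hyphenBreaks(text, min = 2) {
  const breaks = [];
  for (let i = 0; i < text.length; i++) {
    if (text[i] !== "-") continue;
    if (runLength(text, i - 1, -1) >= min && runLength(text, i + 1, 1) >= min) {
      breaks.push(i + 1);
    }
  }
  return breaks;
}
```

With a minimum of two, "(Latin-1)." and "T-shirt" get no break, "2006-06-09" may break after either hyphen, and "2-bromo-4,4-dichlorophenol" gets no break at all, which matches the limitation mentioned above for chemical names.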

Comment 122

12 years ago
A workaround would be to add &#x200B; (zero-width space) after each "-" that should break. Greasemonkey or a bookmarklet can do this.

Pasted &#x200B; work in firefox but may turn into "​" in non unicode views.
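
A minimal sketch of this workaround on a plain string, in JavaScript. The helper name `addBreaksAfterHyphens` is hypothetical, and skipping hyphens that are followed by a digit is an assumption taken from the bug summary ("unless followed by number").

```javascript
const ZWSP = "\u200B"; // ZERO WIDTH SPACE, written &#x200B; in HTML

// Hypothetical helper: add a break opportunity after each hyphen-minus
// that is not followed by a digit.
function addBreaksAfterHyphens(text) {
  return text.replace(/-(?!\d)/g, "-" + ZWSP);
}
```

In a user script this would presumably be applied to text nodes only, so that hyphens inside attribute values and URLs are left alone. It does not know about cases like "suffix -ed", so it is a stop-gap at best.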

Comment 123

12 years ago
I present to y'all, the Great Wrapinator!

javascript:var bob = document.body.innerHTML; bob = bob.replace(/([^<> ]{80})[^&]/g, "$1&#x200B;"); document.body.innerHTML = bob;

Comment 124

12 years ago
Sorry, i should've tested on clearer data.

javascript:var bob = document.body.innerHTML; bob = bob.replace(/([^<> ]{80})([^&])/g, "$1&#x200B;$2"); document.body.innerHTML = bob;

Comment 125

12 years ago
The CSS declaration "word-break: break-all" works in IE but not in Firefox. Can anything be done to fix it?

Comment 126

12 years ago
(In reply to comment #125)
> The CSS declaration "word-break: break-all" works in IE but not in
> Firefox. Can anything be done to fix it?

That's bug 249159.
Duplicate of this bug: 369437
Alias: uax14
Summary: line-break should be allowed after hyphens (unless followed by number) → UAX14: line-break should be allowed after hyphens (unless followed by number)
Duplicate of this bug: 379826

Comment 129

12 years ago
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Any news?

Comment 130

12 years ago
See my point? That is what I need a solution for!

Comment 131

12 years ago
You can fix this using JavaScript, ASP, PHP or other things like that.
I don't believe this should be a concern of the browser.

Anyway, Portuguese has a lot of words with - that in most cases shouldn't be broken, like quinta-feira (Thursday).

Comment 132

12 years ago
Iuri:

1. Correct word-wrapping IS a concern of the browser, because it depends highly on how the browser is configured (window size, font size etc.) -- and can change dynamically (if you resize the window, for instance). It does not make sense for the webserver to have to anticipate this and serve pre-wrapped pages.

2. Yes, Portuguese has a lot of compound, hyphenated words. And you know what? It IS acceptable to break the line after the hyphen. It is even PREFERRED, as a matter of style -- so you don't end up with an extra break inside the same word, which looks ugly. I mean, what do you like better:

Vou ao cinema na próxima quinta-fei-
ra à noite.

or

Vou ao cinema na próxima quinta-
feira à noite.
I fixed bug 255990; URLs now break at certain characters.
See the spec table:
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html

So, the issues for Portuguese of comment 132 should be fixed now.

I tested some specs in bug 255990. From that experience, I don't believe that UAX#14 is the best solution for us, because we need to handle non-natural languages, e.g., many date formats, many time formats, fragments of programming-language code, ASCII art, URLs and file paths (UNIX, Win32/DOS)... So I believe that there is no single best spec that fits every context. Therefore, I used a (customized) WinIE7-based spec for us. This gives better compatibility with WinIE (and also with web pages that are designed for WinIE). Line-breaking compatibility is especially important when a table has a lot of text but the table width is too narrow (e.g., tinderbox and the checked-in list of bonsai).

I can agree to use UAX#14 for the characters of each language, but I think that we should keep compatibility with WinIE in the ASCII range for the layout of table cells.
We can mark this fixed now, right? Because we do break after a hyphen that's not part of a number.

Comment 135

12 years ago
Does Mozilla break at U+2010 Hyphen now or just U+002D Hyphen‐Minus?
OK, I'm marking this FIXED. We don't use UAX#14, but we have fixed the actual bug.

-> FIXED

(In reply to comment #135)
> Does Mozilla break at U+2010 Hyphen now or just U+002D Hyphen‐Minus?

Currently, U+2010 does not break the line, but we can fix that easily. Please file a new bug and CC me.
Status: NEW → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → FIXED

Comment 137

12 years ago
(In reply to comment #133)
> I fixed bug 255990; URLs now break at certain characters.
> See the spec table:
> http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/spec_table.html

Thanks for the informative table. However, when defining the context (surroundings) of the possible line-breaks, the term "character"  feels rather ambiguous. Basically, all letters, numbers and punctuation marks are characters. Perhaps instead of "characters", here it would be better to speak of "letters"?

How about SPACE? I hope SPACE does not count as a letter-character because if it did, some undesirable line-breaks could occur:

after a hyphen: "suffix -ed"
after a slash: "in /home directory"
after a degree sign: "the temperature is 20 °C" 

Even if SPACE does not count as a letter, some problems seem to remain:

after a hyphen: "T-shirt"
after a slash: "c/o"
before and after parentheses: "colo(u)ring"

In "T-shirt", a line-break after the hyphen would leave "T" orphaned at the end of a line, which would look ugly -- although it would be unlikely to cause any real confusion. In "c/o", a line-break after the slash would be a little more distracting. 

The worst case, however, would be if the string "colo(u)ring" could break both before and after the parentheses. Even if it is a rather exceptional example, I think this kind of behavior would be a much more severe bug than not breaking after hyphens. I wonder whether it is at all reasonable to allow line-break in connection with parentheses, square-brackets etc.
We should treat punctuation next to a space as non-punctuation so we don't break before or after it. That would be fairly easy.

The other problems are kind of hard. It's impossible to please everyone. Authors may have to learn to use ZWJ or white-space:nowrap in rare cases.
(In reply to comment #137)
> after a hyphen: "suffix -ed"

This case can be fixed easily.

> after a slash: "in /home directory"

You should test it. This case should already render the way you hope.

The other cases should be marked as INVALID or WONTFIX.

Please file a separate bug for each issue you find.

Comment 140

12 years ago
Authors should certainly *not* be using ZWJ to suppress breaking within a word. ZWJ has its own semantics, it's not there to suppress breaking between punctuation. Also, doing weird things to text content to avoid breaking has undesired consequences for other uses of text, like copy/paste and text comparison.

I strongly agree that punctuation adjacent to spaces should not be providing any breaking opportunities. That will solve the worst problems with the recent changes. The next-worst ones can only be solved by prioritizing break points, which I highly recommend we do if we want to keep these new punctuation-derived break opportunities. Granted it's too late for 1.9... I'm quite disappointed that these new breaks were not more carefully considered. As Simo says in bug 56652, copying IE may be simple, but it's not necessarily good. If the critical market for these new breaks is East Asia, couldn't we avoid such breaks unless the block has encountered a CJK character before the first break? At least until we have a more intelligent line-breaking algorithm in place?

Comment 141

12 years ago
I appreciate the efforts to solve the layout problems caused by URLs and long
words. However, I disagree with the general idea that compatibility with IE or
handling of non-natural languages should be more important than respecting the
writing conventions of natural languages. Basically, technology is a tool, and
tools should adapt to people's needs, not vice versa.

I just realized that this bug seems to be a duplicate of bug 56652, which
considers linebreaking algorithms from a little more general perspective.
Perhaps the discussion should be moved there?
> If the critical market for these new breaks is East Asia, couldn't we avoid such 
> breaks unless the block has encountered a CJK character before the first break?

That's similar to what we were doing before, and it's decidedly odd. Inserting a CJK character arbitrarily far away in a word shouldn't change breaking behaviour.

I'm sorry this comes as such a surprise. This was worked on over a long period of time. I'd sort of assumed you were CCed on the bug(s).
Backing out bug 255990 (again) is an option, but we need to be sure we're not making the best the enemy of the good. I'm quite pleased with the changes in my browsing. I don't want to back out a patch that's an overall improvement just because a hypothetical better approach may exist which no-one has managed to specify precisely yet. I'd rather see an argument that 255990 actually made things worse overall.
Especially because I suspect that there is no algorithm --- certainly no reasonably simple algorithm --- that can interpret the intent of Web text accurately enough to always choose good breaks. So whatever our algorithm is, people will be able to produce examples where it falls down badly.
(In reply to comment #140)
> I strongly agree that punctuation adjacent to spaces should not be providing
> any breaking opportunities.

This is an interesting idea; would you file a new bug?

> The next-worst ones can only be solved by prioritizing break points,
> which I highly recommend we do if we want to keep these new punctuation-derived
> break opportunities. Granted it's too late for 1.9...

> I'm quite disappointed that these new breaks were not more carefully 
> considered. As Simo says in bug 56652, copying IE may be simple, but it's
> not necessarily good.

I think that we need a long testing period for this, because there are a great many patterns. I think that we should fix each problem on the current trunk.

> If the critical market for these new breaks is East Asia, couldn't we
> avoid such breaks unless the block has encountered a CJK character before
> the first break?

No, the URL doesn't have CJK characters in most cases.

Comment 146

12 years ago
Well, how about the approach suggested by Jukka Korpela in his criticism on UAX #14 (I already posted this link for bug 56652, but here it is again):
http://www.cs.tut.fi/~jkorpela/unicode/linebr.html

The basic idea is that the generic (language-independent) linebreaking rules should be very minimalistic. A break would be allowed only after spaces, hyphens and dashes (and even then not always), and any further exceptions should be language-dependent. However, for extremely long strings, such as URLs, a special "emergency break" rule could be applied, allowing breaks even after slashes and ampersands.

Comment 147

12 years ago
No comments? Well, let's elaborate the suggested approach a bit:

At the language-independent level, a line-break is allowed after a space, hyphen or dash, unless
- the space or hyphen is of the no-break type (this should be obvious)
- the space is followed by another space (so no break can occur between two space characters)
- there is a space or any punctuation character either immediately before or immediately after the hyphen or dash (break is allowed after the space, of course, unless it is a no-break space; note that this rule should be sufficient to prevent breaks even inside smileys)
- there is only one alphabetical character on either side of the hyphen or dash (this is merely a wishlist feature that would improve the typographical appearance a little; one might even consider tightening the rule into "there are only _two_or_fewer_ alphabetical characters on either side of the hyphen or dash")
- there is a numerical character on either side of the hyphen or dash (but it would be nice if the previous rules, probably after some tweaking, were sufficient to cover even the numerical contexts; for example, it might be better if long chemical names, such as "2-bromo-4,4-dichlorophenol" as described in comment 78, were allowed to break after a hyphen, even if the break point wasn't always optimal).

Otherwise, line-breaks may occur only if allowed at the language-specific level, or by the "emergency break" rules for exceptionally long strings. 

These minimal linebreaking rules should cover the most important cases at least for Latin scripts (although I'm sure I have overlooked something, please feel free to add to the list).
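
For illustration, the hyphen and dash part of the rule list above might be expressed as a chain of vetoes, sketched here in JavaScript. The names are invented for the example, the caller is assumed to pass the index of a hyphen or dash, and the space rules and the language-specific level are not covered.

```javascript
const SPACE_OR_PUNCT = /[\p{Z}\p{P}]/u; // separators and punctuation
const LETTER = /\p{L}/u;
const DIGIT = /\p{Nd}/u;
const NO_BREAK_HYPHEN = "\u2011";

// Count consecutive letters from `start`, moving by `step` (+1 or -1).
function letterRun(text, start, step) {
  let n = 0;
  for (let i = start; i >= 0 && i < text.length && LETTER.test(text[i]); i += step) n++;
  return n;
}

// Returns null if a break is allowed after the hyphen or dash at index i,
// otherwise a short reason why the break is vetoed.
function vetoBreakAfter(text, i) {
  const prev = text[i - 1] || "";
  const next = text[i + 1] || "";
  if (text[i] === NO_BREAK_HYPHEN) return "no-break hyphen";
  if (SPACE_OR_PUNCT.test(prev) || SPACE_OR_PUNCT.test(next)) return "space or punctuation next to it";
  if (DIGIT.test(prev) || DIGIT.test(next)) return "digit next to it";
  if (letterRun(text, i - 1, -1) < 2 || letterRun(text, i + 1, 1) < 2) {
    return "fewer than two letters on one side";
  }
  return null;
}
```

Returning the reason rather than a plain boolean is just a design choice for the sketch; it makes it easier to see which rule blocked a break when people start reporting examples.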

The language-specific additional rules may be specified as needed, for example:
- in English, a line-break is allowed both before and after an em-dash, and irrespective of how many alphabetical or numerical characters there are on either side
- in French, a line-break is not allowed after a space if it is followed by an exclamation mark, a question mark, a colon, a semicolon or a closing guillemet, nor if it is preceded by an opening guillemet.

Of course one can come by many more language-specific rules, but they can be added little by little, as native speakers start to point out the deficiencies. And perhaps one day, the rules may be appended to include even language-specific hyphenation algorithms. But for now, I suppose that's something we can only dream of.
One problem is that we often don't know what the language is.

Defining some break opportunities as "emergency breaks" that are only used if there are no regular break opportunities on the line would require some significant changes, but could definitely be done.
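
A toy version of such a two-level scheme, in JavaScript, just to show the idea. It measures width in characters instead of pixels, and the set of emergency characters ("/", "&" and "-") is an assumption picked from the examples in this bug; none of this reflects how the layout code is actually structured.

```javascript
// Largest offset <= width whose preceding character matches `re`; 0 if none.
function lastBreak(s, width, re) {
  for (let cut = Math.min(width, s.length); cut > 0; cut--) {
    if (re.test(s[cut - 1])) return cut;
  }
  return 0;
}

// Greedy wrapping: prefer a regular break (after white space); fall back to
// an emergency break only if the line has no regular opportunity; as a last
// resort, cut at the width.
function wrap(text, width) {
  const lines = [];
  let rest = text;
  while (rest.length > width) {
    const cut = lastBreak(rest, width, /\s/) || lastBreak(rest, width, /[\/&-]/) || width;
    lines.push(rest.slice(0, cut).trimEnd());
    rest = rest.slice(cut);
  }
  lines.push(rest);
  return lines;
}

console.log(wrap("see http://example.com/a/b/c now", 12));
```

With a width of 12, the URL is only split at slashes on lines that contain no space, while "bbbb-cccc" in ordinary text is kept together as long as a space is available on the same line.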

Comment 149

12 years ago
The generic rules should allow even non-Latin characters to behave in the way that was most likely expected of them in their natural context. Thus, a line-break would be allowed after any CJK character. 

However, if put into a Latin context, a non-Latin character should rather be treated as a symbol character inherent in Latin scripts (and thus, linebreaking would not be allowed); this can be specified at the language-dependent level. I suppose the same principle for embedding exotic characters should be valid even for other alphabetic scripts (e.g., Cyrillic and Greek), while a reversed approach may sometimes be suitable for alphabetic characters put into a CJK context.

For better typography, a couple of clarifications might be added to the generic linebreaking rules:

- symbol characters (such as @, $ and %) should generally be treated similarly to alphabetic characters

- two adjacent hyphens could be considered equivalent to a single em-dash.
(Reporter)

Comment 150

12 years ago
I have updated the URL, because the original one no longer exists.  The page is exactly the same.
Really?
I can still access the original one.
Oops.
The original one is redirected to the new one.

Comment 153

12 years ago
(In reply to comment #147)
> The language-specific additional rules may be specified as needed, for example:
> - in French, a line-break is not allowed after a space if it is followed by an
> exclamation mark, a question mark, a colon, a semicolon or a closing guillemet,
> nor if it is preceded by an opening guillemet.
> 
In this case there should be a non-breaking thin space (U+202F) instead of a standard space (U+0020): thin because it looks better, and non-breaking because it suppresses the line break at the white-space character. I.e., you don't need any special rule.

If web page authors wrote typographically clean text, web browsers wouldn't need a crystal ball.

> Of course one can come by many more language-specific rules, but they can be
> added little by little, as native speakers start to point out the deficiencies.
>
For example in Czech, hyphen (U+002D) can be used between two words which are tightly connected (e.g. black-white). If you want to break this compound word on the hyphen, you will need to repeat the hyphen on the next line:

… the black-
-white pattern …

So this is an example of how your proposed rules break typography in a non-English language.
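
To show what such a language-specific rule would involve, here is a small JavaScript sketch. The function name, the use of a language tag, and the set containing only "cs" are assumptions for the example, based solely on the Czech convention described above.

```javascript
// Languages assumed (for this sketch only) to repeat the hyphen at the
// start of the next line when a compound is broken at its hyphen.
const REPEAT_HYPHEN_LANGS = new Set(["cs"]);

// Hypothetical helper: split a compound at its first hyphen-minus and
// return the two line fragments.
function splitAtHyphen(word, lang) {
  const i = word.indexOf("-");
  if (i < 0) return [word];
  const head = word.slice(0, i + 1);
  const tail = word.slice(i + 1);
  return [head, REPEAT_HYPHEN_LANGS.has(lang) ? "-" + tail : tail];
}

console.log(splitAtHyphen("black-white", "cs")); // fragments "black-" and "-white"
console.log(splitAtHyphen("black-white", "en")); // fragments "black-" and "white"
```

The point is that the rendered text differs from the source text at the break, which a purely generic "break after hyphen" rule cannot express.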

If I could, I would allow word wrapping only before breakable white-space characters (and in extraordinarily long strings). The other rules (like breaking at a hyphen or dash) are language-specific and therefore should be implemented independently as a language add-on.
(Reporter)

Comment 154

11 years ago
Is there anywhere that the "FIXED" version can be downloaded and tested?

The bug is definitely not fixed in Firefox 3.0b4 under Windows Vista Business, when viewing the example page:
http://www.alanwood.net/pesticides/abamectin.html
We're breaking after some hyphens on that page, but not others where we should. I'm not sure why, we should be able to break after the hyphen in "arabino-hexopyranoside" for example. Need a reduced testcase.
And it should probably go in a new bug.
(Reporter)

Comment 157

11 years ago
(In reply to comment #156)

The originally-reported problem has not been fixed, so there is no justification for opening a new bug.

There are 162 hyphens in the IUPAC cell of the example page, but the "FIXED" version only seems able to break after 5 of them.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I only just realized that none of the patches here ever got checked in, so reusing this bug does make sense, sorry about that.
Status: REOPENED → RESOLVED
Last Resolved: 12 years ago → 11 years ago
Resolution: --- → FIXED
I'm afraid that I don't have the cycles to work on this for Gecko 1.9. A reduced testcase would help us get it fixed in the next release.
Flags: wanted-next+
oops
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 161

11 years ago
(In reply to comment #157)

> There are 162 hyphens in the IUPAC cell of the example page, but the "FIXED"
> version only seems able to break after 5 of them.

Please accept my apologies for the incorrect information in #157.

Robert O'Callahan is absolutely correct, we do need a simpler test case.  Try this file:
http://www.alanwood.net/demos/bug-95067-systematic.html
and the long systematic names DO wrap.  My thanks to everyone who has worked on this bug.

However, the data sheets in my pesticide website still don't display properly in Firefox 3.04b.  This is now because of problems with wrapping InChIs, which did not exist when I first filed the bug.

Here is a test file for InChIs:
http://www.alanwood.net/demos/bug-95067-inchi.html

Lines are not being broken after hyphens (ASCII decimal 45 or hexadecimal 0x2D) that separate two numbers:
52-37-25-16-26-38-52,47-59(9,10)53-39-27-17-28-40-53
resulting in some very long lines.

Internet Explorer, Safari and Opera do break after these hyphens.  Would allowing breaks after these hyphens in Firefox cause problems for any other data?

If not, would it be simple to amend the new wrapping code?

Firefox is also not breaking after hyphens like these:
33+,34-,35-,36-,37-,

Internet Explorer and Safari do break after these hyphens, but although this allows wrapping, it puts a comma as the first character in a line, which does not look good.  Would it cause any problems to allow breaks after these commas?

Comment 162

11 years ago
> Firefox is also not breaking after hyphens like these:
> 33+,34-,35-,36-,37-,

I have no idea what that string might be about, but I recognize that this can be a genuine issue for chemists. Nevertheless, I don't like the idea of allowing breaks after commas. If I see a break after a comma, I generally assume that there is a whitespace after the comma -- but I suppose it would be a (big?) mistake in this case. Seeing a line beginning with a comma could be distracting, but at least it would give me a hint that there is something exceptional going on.

Unfortunately, allowing breaks between a hyphen and a comma may cause other problems. For example, in Finnish it is possible for a (compound) word to end with an elliptical hyphen (indicating that the last part of the compound has been omitted), and sometimes the hyphen may be followed by a comma. Now, as the comma would normally be followed by a whitespace, I grant that usually the odd comma is likely to fit in the same line as the preceding word with hyphen. But occasionally there would not be enough space and the comma would have to be moved to the next line. I think this would be rather unfortunate in a natural language context, where the reader expects the text to flow according to the general orthographic conventions.

There may be other problematic cases too. In many languages, comma is used as the decimal marker (instead of the decimal point, as in English). If, in addition, a leading zero is replaced with a hyphen, you may see strings such as "-,50" (e.g., in a price tag). Here it would clearly be undesirable to break after the hyphen (and even more so after the comma).

Perhaps some kind of an emergency break rule could be composed, though. The rule could allow exceptional breaks between a hyphen and a comma, but only in very long strings and if there was no better break opportunity within 10 (or even more?) characters.

Comment 163

11 years ago
(In reply to comment #162)
> I have no idea what that string might be about, but I recognize that this can
> be a genuine issue for chemists.

I don’t really think that this is a “genuine issue”, as you put it. If chemists or others really desire automatic line‐breaking behavior, they should be using the more specific character, U+2010 Hyphen, which should always allow line‐breaks afterward, instead of the more generic U+002D Hyphen‐Minus character, after which line‐breaking behavior is ambiguous. This should give authors what they want 100% of the time, provided that the browser supports the behavior.

The described behavior for U+2010 Hyphen was supposed to have been addressed in Bug 388096 (which I created to address Comment #136); unfortunately, I don’t have a Moz. Firefox 3 beta build installed with which to verify that it was indeed fixed. Opera 9.26, Safari 3.1 (525.7) (beta), and Win. Internet Explorer 7.0.5730.13 all seem to support automatic line‐breaks after U+2010 Hyphen, or, at least, they pass the test case that I wrote for that bug, so cross‐browser compatibility shouldn’t be an issue here.
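For authors who are free to use non-ASCII characters, the substitution could be automated. The following Python sketch is only an illustration: the function name is made up, and the rule of replacing a hyphen-minus only when it sits between two letters is an assumption chosen to leave numbers and minus signs alone.

```python
HYPHEN_MINUS = "\u002D"  # the generic ASCII character
HYPHEN = "\u2010"        # the more specific Unicode hyphen

def use_unambiguous_hyphens(text):
    """Replace each hyphen-minus that sits between two letters with U+2010."""
    chars = list(text)
    for i in range(1, len(chars) - 1):
        if (chars[i] == HYPHEN_MINUS
                and chars[i - 1].isalpha()
                and chars[i + 1].isalpha()):
            chars[i] = HYPHEN
    return "".join(chars)
```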
(Reporter)

Comment 164

11 years ago
(In reply to comment #163)
> If chemists or others really desire automatic line‐breaking behavior,
> they should be using the more specific character, U+2010 Hyphen,
> which should always allow line‐breaks afterward, instead of the more
> generic U+002D Hyphen‐Minus character, after which line‐breaking
> behavior is ambiguous.

Sorry, but this is not a solution for InChIs.  They are specified by IUPAC as containing only ASCII characters, and so the horizontal line has to be the hyphen-minus.

Comment 165

11 years ago
(In reply to comment #164)
> (In reply to comment #163)
> > If chemists or others really desire automatic line‐breaking behavior,
> > they should be using the more specific character, U+2010 Hyphen,
> 
> Sorry, but this is not a solution for InChIs.  They are specified by IUPAC as
> containing only ASCII characters, and so the horizontal line has to be the
> hyphen-minus.
> 
And we are back to the soft hyphen, which Firefox still does not support.

Comment 166

11 years ago
(In reply to comment #165)
> And we are back to the soft hyphen, which Firefox still does not support.
>
Reverting. Bug #9101 seems to be implementing the soft hyphen.

Actually, you have two choices now:

(1) Use proper Unicode characters which signal where a word break is acceptable (e.g. by adding a soft hyphen) or

(2) Use ASCII only and be disappointed with a suboptimal automatic emergency break algorithm.

IMHO, (2) will never be good enough unless you provide some mark-up saying that the string is an IUPAC-compliant chemical name and Firefox implements some IUPAC-specific algorithms. This doesn't apply only to chemical names; it's applicable to natural languages too.

You can start by introducing a new language code, <span xml:lang="iupac">1-methyl-propan</span>, and then we can talk about (3) adding language-specific hyphenation algorithms. (I believe this is the right way besides (1).)

Comment 167

11 years ago
> (In reply to comment #157)
> Internet Explorer, Safari and Opera do break after these hyphens.  Would
> allowing breaks after these hyphens in Firefox cause problems for any other
> data?

Allowing line‐breaks after a Hyphen‐Minus character separating two double‐digit numbers could result in line‐breaks within ranges/comparisons (e.g., 20-30), dates (e.g., 03-17-08), and numeric IDs (e.g., Figure 14-20). If you allow them between numbers of arbitrary size, then you may also get breaks within things like phone/fax numbers (e.g., 123-456-7890), zip codes (e.g., 12345-6789), S.S. numbers (e.g., 123-45-0345), serial numbers, etc. If you allow breaks after hyphens immediately followed by a number, you might get breaks within terms such as Final Fantasy X-2, Carbon-14, ISO-8859-1, etc.

Of course, most of the above could be addressed with use of more specific characters too…
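To make the trade-off concrete, here is a small Python sketch of the conservative rule named in the bug title (break after a hyphen unless a number follows). The function name and the exact conditions are assumptions for illustration; real UAX #14 line breaking involves many more character classes.

```python
def hyphen_break_opportunities(text):
    """Return indices i where a break would be allowed after text[i] == '-'.

    Simplified rule (an assumption): the hyphen-minus must follow a letter
    or digit, and must not be immediately followed by a digit.
    """
    return [
        i for i in range(1, len(text) - 1)
        if text[i] == "-"
        and text[i - 1].isalnum()
        and not text[i + 1].isdigit()
    ]
```

Under this rule none of the examples listed above (20-30, 123-456-7890, Carbon-14, ISO-8859-1, X-2) gets a break opportunity, while a name like 1-methyl-propan does.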

(In reply to comment #164)
> Sorry, but this is not a solution for InChIs.  They are specified by IUPAC as
> containing only ASCII characters, and so the horizontal line has to be the
> hyphen-minus.

Perhaps this issue should be addressed by IUPAC instead? Could you get around this issue by use of the shorter InChIKey format? Maybe implementation of the CSS3 Text text-wrap: unrestricted or word-wrap: break-all declarations would address the issue?

(In reply to comment #166)
> (1) Use proper Unicode characters which signal where a word break is
> acceptable (e.g. by adding a soft hyphen) or

I don’t think that use of U+00AD Soft Hyphen is what Alan is looking for. He wants line‐breaks after hyphens that are intended to always be visible. Soft hyphens are visible only when they occur just before a line‐break.

Comment 168

11 years ago
Firefox 3 seems to break lines on hard hyphens, while Firefox 2 does not.

Firefox 2 has the correct behavior.  The HTML 4.01 specification at http://www.w3.org/TR/html401/struct/text.html#h-9.3.3 states:

"In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur."

Firefox should not break lines on hard hyphens -- is it possible to fix this bug in Firefox 3?

Comment 169

11 years ago
HTML 4 does not specify how line breaking should be performed. We're quite within our rights to break lines after hard hyphens.

Comment 170

11 years ago
I'm sorry, I thought that "The plain hyphen should be interpreted by a user agent as just another character" was crystal-clear.

Comment 171

11 years ago
Yes, it must be interpreted as just another character -- meaning that it should always be displayed. The soft hyphen is placed inside a word to mark a place where a line break CAN occur. If the break DOES occur, then it is displayed. Otherwise, it is hidden. So, the soft hyphen is NOT a normal character, because it sometimes is displayed, sometimes isn't.

Also, the plain hyphen (or hyphen-minus) is no different from other common characters in regard to line breaks: the user-agent may decide whether to break a line inside a word, using whatever algorithm is appropriate. It just happens that an algorithm to break words on the hyphen is a rather simple one, and works in most Western languages -- while more sophisticated algorithms have to be language-specific, and even then they will make weird mistakes. The HTML spec does not forbid breaking lines inside words, so it does not forbid breaking lines after hyphens.

Comment 172

11 years ago
You said that "the user-agent may decide whether to break a line inside a word, using whatever algorithm is appropriate."

I am not sure that this is the case.  Section 9.1 of the HTML 4.01 specification refers to "inter-word space," which allows a user-agent to place line breaks between words, but makes no such reference to "intra-word space" at all.

1. Is there a specification which indicates that intra-word space is permitted?  I must be reading the wrong document.

2. If what you say is true, then how can a site designer tell a remote user-agent to not split on a hyphen under any circumstances?  Using &nbsp;, we can address this problem for "inter-word space."  Is there a corresponding solution for "intra-word space"?  (I think that the answer is no because intra-word space is not to be inserted arbitrarily.)  I suspect that the PRE environment is not really the right answer, either, since it carries other semantic baggage as well.

Comment 173

11 years ago
Many languages, including Thai and Chinese, do not use spaces to indicate word breaks. They have paragraphs of text containing no spaces at all.

You can use white-space:nowrap to inhibit line breaking. It can be applied to inline elements.
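As an illustration, a page generator could wrap any term that must stay on one line in an inline element carrying that declaration. This Python sketch assumes a hypothetical helper name and uses an inline style purely for brevity; a class and a stylesheet rule would work equally well.

```python
from html import escape

def nowrap_span(text):
    """Wrap text in an inline span that forbids line breaks inside it."""
    return '<span style="white-space: nowrap">' + escape(text) + "</span>"
```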

I'm not sure what this bug is about anymore. We should probably close it and have people file new bugs about issues which are clearly actual bugs.

Comment 174

11 years ago
I think the reason the bug remains open is that the rules from UAX14 aren’t implemented yet? What is implemented now is based off what IE does (as far as I understand), but I don’t think it is complete/applies to all languages...

Anyway, a clear, fresh bug about that is probably better than this one with its 173 comments. The basic behaviour that the bug reporter describes has been implemented.

Comment 175

11 years ago
(In reply to comment #172)
 
> I am not sure that this is the case.  Section 9.1 of the HTML 4.01
> specification refers to "inter-word space," which allows a user-agent to place
> line breaks between words, but makes no such reference to "intra-word space" at
> all.
> 
> 1. Is there a specification which indicates that intra-word space is permitted?
>  I must be reading the wrong document.

Yes there is. Right on the same page you pointed, in http://www.w3.org/TR/html401/struct/text.html#h-9.1 , you can see:

"When formatting text, user agents should identify these words and lay them out according to the conventions of the particular written language (script) and target medium."

So: as long as the conventions of a language allow it, and the HTML spec does not forbid it, then the user-agent may do it. In fact, if the programmers bother to do it, the user-agent may even break words that DON'T have hyphens -- as long as they follow the rules of the language (meaning, identifying syllables correctly and such).  But it's hard doing it right, in particular for the Web, where you can't trust the language to be identified correctly, so AFAIK nobody went to the trouble.

Breaking after hyphens, however, is easy in comparison, and for most people it is an acceptable compromise -- hyphenated words tend to be long, so allowing them to be broken lessens the problem of ugly lines with too much white space between words.

Comment 176

11 years ago
(In reply to comment #172)
> 1. Is there a specification which indicates that intra-word space is permitted?
>  I must be reading the wrong document.

You might want to note that the title of this bug references Unicode Standard Annex #14 (UAX #14). You can find a link to it in comment #64 or, more specifically, you can find the information that you’re looking for at <http://www.unicode.org/unicode/reports/tr14/#DescriptionOfProperties>; search for the string “HY: Hyphen”. It specifically says that, in situations where the U+002D HYPHEN-MINUS character is used as a hyphen, there’s a line break opportunity after the character.

> 2. If what you say is true, then how can a site designer tell a remote
> user-agent to not split on a hyphen under any circumstances?

You should be able to use either a U+2011 NON-BREAKING HYPHEN character in place of the HYPHEN-MINUS character or the U+2060 WORD JOINER character immediately subsequent to the HYPHEN-MINUS character to get this behavior. (Firefox 3 doesn’t seem to support the latter, though; I don’t see a bug report for it either.)
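Both options can be sketched as simple string substitutions in Python. The function names are made up for illustration, and whether a given browser honours either character is a separate question.

```python
NON_BREAKING_HYPHEN = "\u2011"
WORD_JOINER = "\u2060"

def with_non_breaking_hyphens(text):
    """Option 1: replace every hyphen-minus with U+2011."""
    return text.replace("-", NON_BREAKING_HYPHEN)

def with_word_joiners(text):
    """Option 2: keep the hyphen-minus and put U+2060 right after it."""
    return text.replace("-", "-" + WORD_JOINER)
```

The second option keeps the ASCII hyphen-minus in the text, which matters if the string is later copied or searched, at the cost of adding an invisible character.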
(Reporter)

Comment 177

11 years ago
(In reply to comment #164)
> Sorry, but this is not a solution for InChIs.  They are specified by IUPAC as
> containing only ASCII characters, and so the horizontal line has to be the
> hyphen-minus.

Firefox 3.1a2 has introduced support for the CSS3 property word-wrap: break-word.  This now makes it possible to break InChIs nicely in Firefox, with an appropriate style applied to them.  See my updated test file:

http://www.alanwood.net/demos/bug-95067-inchi.html

As far as I am concerned, this bug can now be closed.

My thanks to the Firefox developers.
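For readers who cannot open the test file, the general idea can be sketched in Python as a helper that emits a block with the declaration applied. The helper name, the choice of a div with an inline style, and the ethanol InChI used in the usage note are assumptions for illustration and are not taken from the test file.

```python
from html import escape

def breakable_block(text):
    """Wrap text in a block that may be broken at arbitrary points if it
    would otherwise overflow its container."""
    return '<div style="word-wrap: break-word">' + escape(text) + "</div>"
```

For example, breakable_block("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3") leaves the InChI itself as plain ASCII and only changes how the browser is allowed to wrap it.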

Comment 178

11 years ago
OK, thanks Alan!
Status: REOPENED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → WORKSFORME
Duplicate of this bug: 809020

Updated

2 years ago
Whiteboard: [webcompat]