Closed Bug 389056 Opened 14 years ago Closed 14 years ago

Don't break line between periods and quote

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jruderman, Assigned: masayuki)

References

()

Details

(Keywords: regression, testcase)

Attachments

(4 files, 21 obsolete files)

4.36 KB, text/html
Details
75.84 KB, patch
roc
: review+
Details | Diff | Splinter Review
10.43 KB, patch
Details | Diff | Splinter Review
42.64 KB, patch
roc
: review+
Details | Diff | Splinter Review
Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.9a7pre) Gecko/2007072004 Minefield/3.0a7pre

  <div style="width: 1px">"foo..."</div>

Renders as

  "foo...
  "

This looks silly.  Apparently it only happens if there are two or more periods.
This regressed some time in July.  Guessing it's due to the change in bug 255990.
Blocks: 255990
Flags: blocking1.9?
Keywords: regression
confirmed on win XP.
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a7pre) Gecko/2007072005 Minefield/3.0a7pre

(btw   <div style="width:1px"> foo=bar </div> renders as
 foo=
 bar
)
(In reply to comment #2)
>  foo=
>  bar

This is intended, of course.
Sorry for the spam.

j.j.

This can be fixed easy, I think.

# The equal issue is designed. We should break in the point for the URL params.
# And please file new bugs for other issues!

I think that always we should not break around |"|. We defined it as "starting parenthesis" (e.g., like as |(|). But it is often used as "finished parenthesis". It should be character class, I think. (|'| is character class.)

I'll attach a patch.
Assignee: nobody → masayuki
OS: Mac OS X → All
Hardware: PC → All
Version: unspecified → Trunk
(In reply to comment #4)
> I think that always we should not break around |"|. We defined it as "starting
> parenthesis" (e.g., like as |(|). But it is often used as "finished
> parenthesis". It should be character class, I think. (|'| is character class.)

Oops, this comment is wrong...

U+0022(") is character
U+0027(') is character
Attached patch Patch rv1.0 (obsolete) — Splinter Review
I think that this patch can fix this and similar issues.

1. If current and next character is U_PERIOD, current character class should be 8 (character).
2. ContextualAnalysis should be able to know the previous character class. Because the previous character may be "contextual analysis"ed.
3. "," should be character class as default. e.g., we have same problem in |i.e.,"|
Attachment #273311 - Flags: superreview?(roc)
Attachment #273311 - Flags: review?(roc)
(In reply to comment #6)
> 1. If current and next character is U_PERIOD, current character class should be
> 8 (character).

Or, should we define the default class is 8? (U_PERIOD default class is 1, it means "end parenthesis".)
Perhaps we could fix this (and some other issues) in a simple way by requiring that there be at least one white-space or non-punctuation character between any two breaks (including treating the end of the string as a break)? This means that we would never break inside any string of punctuation at the end of a word.

However currently we don't distinguish punctuation from normal characters in the character classes. How hard would that be?
I think that we have two ways:

1. we should have punctuation tables.
2. we should add 'a' class for punctuations.

I think that 2 is better. It makes simple code.
The punctuations are several classes. There are connector, dash, open, close, initial quote, final quote and others, so, we may not be able to fix this by "2" easy. "1" may be better...
We don't need to distinguish all those, right? We just need one new class for all the punctuation that currently is classified as "character".

BTW can you add comments somewhere explaining what all the (compressed) classes actually mean?
This is a draft for new approach, but I have following issues:

1. The class name translating of JIS X 4051 is so hard for me. I need help of native speakers.
2. Need to change over U+0080 too.
3. Now, we are using class 7 (Alphabet) *only* for U+2126 (OHM SIGN), I think that it should be class 8 (Character), therefore, we can remove class 7. (And we can use class 7 for new class.)
Attachment #273311 - Attachment is obsolete: true
Attachment #273311 - Flags: superreview?(roc)
Attachment #273311 - Flags: review?(roc)
er, the table of breakable (section 5) is wrong, we should break between [c] and (16 or 17 or 18).

              1 [a] 7  8  9 [b]15 16 18 COMPLEX [c]

        1     X  X  X  X  X  X  X  X  X  X       X
      [a]        X                                
        7        X  X                             
        8        X              X                 
        9        X                                
      [b]        X                                
       15        X        X     X     X          X
       16        X                 X  X          X
       18        X              X  X  X          X
  COMPLEX        X                       T        
      [c]        X              X  X  X          X

should be:

              1 [a] 7  8  9 [b]15 16 18 COMPLEX [c]

        1     X  X  X  X  X  X  X  X  X  X       X
      [a]        X                                
        7        X  X                             
        8        X              X                 
        9        X                                
      [b]        X                                
       15        X        X     X     X          X
       16        X                 X  X          X
       18        X              X  X  X          X
  COMPLEX        X                       T        
      [c]        X                               X

But I think that we need special handling for period (for file path) and single/double quote(They can be both open/close parenthesis).
er, I'm being confused...

please ignore comment 13.

the table should be:

              1 [a] 7  8  9 [b]15 16 18 COMPLEX [c]

        1     X  X  X  X  X  X  X  X  X  X       X
      [a]        X                               X
        7        X  X                             
        8        X              X                 
        9        X                                
      [b]        X                                
       15        X        X     X     X          X
       16        X                 X  X          X
       18        X              X  X  X          X
  COMPLEX        X                       T        
      [c]        X              X  X  X          X

We don't need to break after [c] when next is character/numeric... And we should not break before [c] when prev is [a], right?
> 1. The class name translating of JIS X 4051 is so hard for me. I need help of
> native speakers.

What you have there is pretty good except I don't know what some of them are. In particular can you explain a bit more:
+      5: Middle dot
+      8: Prefix (ellipsis) symbol
+      9: Suffix (ellipsis) symbol
+     19: Split line note (Warichu) begin quote
+     20: Split line note (Warichu) end quote
?

Yes, make OHM a character.

 #define NUMERIC_CLASS  6 // JIS x4051 class 15 is now map to simplified class 6
 #define CHARACTER_CLASS  8 // JIS x4051 class 18 is now map to simplified class 8
+#define PUNCTUATION_CLASS 0xA // see bug 389056

Also create #defines for all our compressed classes. I'll help you come up with names.
(In reply to comment #15)
> > 1. The class name translating of JIS X 4051 is so hard for me. I need help of
> > native speakers.
> 
> What you have there is pretty good except I don't know what some of them are.
> In particular can you explain a bit more:
> +      5: Middle dot

e.g., &middot;

> +      8: Prefix (ellipsis) symbol

e.g., $, &yen; and "No."

> +      9: Suffix (ellipsis) symbol

e.g., cent sign, % and degree sign

> +     19: Split line note (Warichu) begin quote
> +     20: Split line note (Warichu) end quote

http://www.w3.org/TR/2003/CR-css3-text-20030514/#text-combine-prop
Warichu is a ruby layout pattern. It will be styled as "text-combine: lines;".
19 is open parenthesis of Warichu, 20 is close it. We don't support these class.
I'm rearranging the definition table (jisx4501class.txt). bug 388096 will be fixed by the next patch.
Blocks: 388096
I misunderstood the class 7 of JIS X 4051. It is not a class for hyphens. It means that the characters are breakable before and after, but they are not breakable between same class characters.
Attachment #273615 - Attachment is obsolete: true
And this patch changes the class of '<' and '>' to new punctuation class.
Don't break around '(' and ')' and ';'.
Cleaning up the code with Mozilla standard coding style.
Attachment #273852 - Attachment is obsolete: true
er, the current approach for '(' and ')' are wrong.

data:text/html,<div%20style="width:1px;">a[([b])]c</div>

This case, cannot clear. I'll try to fix it...
Attached patch Patch rv2.0 (obsolete) — Splinter Review
O.K. This patch works fine for me.

I create new two classes, [c] and [d]. [c] is character like open parenthesis. [d] is character like close parenthesis (or punctuation). By the two class, '><' problem is cleared. And also '[([' problem too. And contextual analysis is changed to very simple.
Attachment #273944 - Attachment is obsolete: true
Attachment #274126 - Flags: superreview?(roc)
Attachment #274126 - Flags: review?(roc)
Attached patch Patch rv2.0.1 (obsolete) — Splinter Review
sorry, for the spam.

this is better names.
Attachment #274126 - Attachment is obsolete: true
Attachment #274128 - Flags: superreview?(roc)
Attachment #274128 - Flags: review?(roc)
Attachment #274126 - Flags: superreview?(roc)
Attachment #274126 - Flags: review?(roc)
It looks OK, but I don't see how it fixes this bug. The text
..."
is just a run of punctuation of class [d], and we can break between those characters. Right?

http://groups.google.com/group/mozilla.dev.tech.layout/browse_thread/thread/47c49202f2d13f94
Er, what I was going to say is, I think we should discuss this on this thread:
http://groups.google.com/group/mozilla.dev.tech.layout/browse_thread/thread/47c49202f2d13f94
before we go ahead. Also I'd like to find out what the jp-critical bugs are that require breaking of Latin-1 text around punctuation.
Flags: blocking1.9? → blocking1.9+
Component: Layout → Layout: Fonts and Text
Attached patch Patch rv3.0 (obsolete) — Splinter Review
This changes many character's behaviors by the feedbacks.

1. Added the new class which is not breakable around them even if there are breakable. But if there is SPACE, it is breakable. This behavior is defined in UAX#14.

2. [] is not breakable now if around characters are character/numeric class.

3. / and \ are not breakable now. But if a word has two or more /s or \s, they are breakable except the first one. And they are breakable *before* them at the time. Therefore, the path text always has / or \. I believe that this is best way. But / is not breakable everytime if next is numeric for date format. (2007/01/01)

4. DEGREE SIGN is not breakable if after character is character class. But it's still breakable if after character is numeric, for compatibility.

5. The hyphen is not breakable if the text doesn't have character class, or next is numeric class. Therefore, we cannot break 2007-Aug-07

6. Pound/Yen sign are not same behavior with IE7.
Attachment #274128 - Attachment is obsolete: true
Attachment #275307 - Flags: superreview?(roc)
Attachment #275307 - Flags: review?(roc)
Attachment #274128 - Flags: superreview?(roc)
Attachment #274128 - Flags: review?(roc)
Attached patch Patch rv3.0.1 (obsolete) — Splinter Review
Oops, sorry for the mistake and the spam.
Attachment #275307 - Attachment is obsolete: true
Attachment #275308 - Flags: superreview?(roc)
Attachment #275308 - Flags: review?(roc)
Attachment #275307 - Flags: superreview?(roc)
Attachment #275307 - Flags: review?(roc)
Attached file testcases #1 (obsolete) —
Hmm... I'm working for this testcase... I hate smiley :-(
Attachment #275308 - Attachment is obsolete: true
Attachment #275308 - Flags: superreview?(roc)
Attachment #275308 - Flags: review?(roc)
Attached file testcase #2 (obsolete) —
Attachment #275854 - Attachment is obsolete: true
Attached patch Patch rv4.0 (obsolete) — Splinter Review
I suppress the breaking in most ASCII characters if the around characters are characters or numerics.

This patch is changed from previous version:

1. Add new class that is for punctuation like character, it is same as character class, but it is needed for hypen processing.

2. The date format is checked the word length, therefore, the date format is not broken, but the chemical context is breakable.

3. If the last character of a word is CLASS_CLOSE or CLASS_CLOSE_LIKE_CHARACTER, we never break before it.

4. '%' is not breakable. But if the 3rd before or after character is '%', It may be '%' encoded URL param, therefore, it is broken before only then.

5. '&' and ';' is not breakable, But if there are one or more '='s before it, they are broken before them. It may be a param of URL.

6. '-' is not breakable, But if the both sides are numeric or character class, it is broken after.

Note that they are checked the context only in a word, therefore, the all testcases are cleared by this patch.
Attachment #275972 - Attachment is obsolete: true
Attachment #275975 - Flags: superreview?(roc)
Attachment #275975 - Flags: review?(roc)
Are these testcases added to the reftests?
It's difficult issue, because the line-breaking spec may be fluid now...
# If some languages users have serious issue, we may give up some non-important cases.

But for non-intended regressions, it's a good thing...
Sure, you can change the tests as you change the desired behaviour; the tests are there to tell you when you make changes that _aren't_ desired. :)
roc:

Should we land it for getting new feedbacks?
Attached patch Patch rv5.0 (work in progress) (obsolete) — Splinter Review
work in progress for new "safety" approach.
Attachment #275975 - Attachment is obsolete: true
Attachment #275975 - Flags: superreview?(roc)
Attachment #275975 - Flags: review?(roc)
Attached patch Patch rv5.1 (obsolete) — Splinter Review
This should be 'safety' for western language.

1. Using CLASS_OPEN_LIKE_CHARACTER and CLASS_CLOSE_LIKE_CHARACTER for western punctuations.

2. If a breakable point is 'near' from start of word, end of word and previous breakable point in western language context, we use GetPairForWestern instead of GetPair. It is not breaking between numeric, western characters, CLASS_OPEN_LIKE_CHARACTER and CLASS_CLOSE_LIKE_CHARACTER.
# now, I defined that the 'near' is 6 characters. By this, the date format is not broken. (E.g., 01/01/2007 and 2007-01-01)

3. If the word has CJK characters, we always use GetPair. So, CJK context is always broken by JIS X 4051 rules.
Attachment #279309 - Attachment is obsolete: true
Attachment #279339 - Flags: superreview?(roc)
Attachment #279339 - Flags: review?(roc)
Attached file testcase #3 (obsolete) —
Attachment #275972 - Attachment is obsolete: true
Comment on attachment 279339 [details] [diff] [review]
Patch rv5.1

er, this has a mistake, please wait next.
Attachment #279339 - Attachment is obsolete: true
Attachment #279339 - Flags: superreview?(roc)
Attachment #279339 - Flags: review?(roc)
Attached patch Patch rv5.2 (obsolete) — Splinter Review
Attachment #279352 - Flags: superreview?(roc)
Attachment #279352 - Flags: review?(roc)
+IS_WORDSEPARATOR(PRUnichar u)
+{
+  return (0x0009 <= u && u <= 0x000D) || u == 0x0020 || u == 0x0085 ||
+         u == 0x00A0 || u == 0x1680 || u == 0x180E ||
+         (0x2000 <= u && u <= 0x200B) || u == 0x2028 || u == 0x2029 ||
+         u == 0x202F || u == 0x205F || u == 0x3000;
+}

Could you add comments or #defines explaining what all these are? Also for the first range, if we're going to include 0x000B and 0x000C, why not include all characters <= 0x0020?

+GetPairForWestern(PRInt8 c1, PRInt8 c2)

I think this name is misleading. It's not "western" in particular because the other GetPair is also used to break "Western" words if they're long enough. This function is more like GetPairConservative; we use it where we want to err on the side of *not* breaking. This would mean changes to some of the comments.

+  void SetIndex(PRUint32 aIndex) {
+    NS_ASSERTION(mIndex < aIndex || mIndex == aIndex && mIndex == 0,
+                 "not supported to index decreasing");

Call this AdvanceIndexTo. The comment can say "the index cannot decrease." You could just test mIndex <= aIndex.

+  PRBool UseJISX4051() {

We're not using JISx4051, really, so we shouldn't add new references to it, right?

+  PRUnichar mPrevOfHyphen;

Call it "mCharBeforeHyphen" and add a comment explaining exactly what it is.

+  PRPackedBool mHasCJKChar;

Call it mWordHasCJKChar and document exactly what it means.

+  PRUint32 mCountOfSlash;
+  PRUint32 mCountOfBackSlash;
+  PRUint32 mCountOfEqual;

Instead of storing these, why not just calculate them when they're needed, since they're hardly ever needed?

+    if (prevIsNum && nextIsNum)
+      return CLASS_NUMERIC;
+    // If both sides are numeric or character, this hyphen should be breakable.

Hmm ... it looks like if both sides are numeric, we don't allow breaking?

+    // This text has two or more (BACK)SLASHs, this may be file path or URL.

Should be "If this text..."

+    // This text has two or more (BACK)SLASHs, this may be file path or URL.
+    if (aState.UseJISX4051()) {
+      PRInt32 count =
+        (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
+      if (count > 1)
+        return CLASS_OPEN;
+    }

Why do we need this? We would still break after a slash without this code, right?

I also don't see why we need special code for % characters.

+    // If this may be a separator of params of URL, we should break after.
+    if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
+      return CLASS_CLOSE;

Why don't we make these CLASS_CLOSE always?
(In reply to comment #42)
> +  PRBool UseJISX4051() {
> 
> We're not using JISx4051, really, so we shouldn't add new references to it,
> right?

Do you have the idea for the better name?

> +    if (prevIsNum && nextIsNum)
> +      return CLASS_NUMERIC;
> +    // If both sides are numeric or character, this hyphen should be
> breakable.
> 
> Hmm ... it looks like if both sides are numeric, we don't allow breaking?

Right. Do we need to break?

> +    // This text has two or more (BACK)SLASHs, this may be file path or URL.
> +    if (aState.UseJISX4051()) {
> +      PRInt32 count =
> +        (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
> +      if (count > 1)
> +        return CLASS_OPEN;
> +    }
> 
> Why do we need this? We would still break after a slash without this code,
> right?

No, U_SLASH and U_BACKSLASH is CLASS_CHARACTER now. Therefore, they are not breakable. We need to change the CLASS when we need to break.
 
> I also don't see why we need special code for % characters.
> 
> +    // If this may be a separator of params of URL, we should break after.
> +    if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
> +      return CLASS_CLOSE;
> 
> Why don't we make these CLASS_CLOSE always?

I think U_AMPERSAND and U_SEMICOLON should not be breakable in normal sentence, right? E.g., "A&B" in western language or "&nbsp;" in the testcase.
(In reply to comment #43)
> (In reply to comment #42)
> > +  PRBool UseJISX4051() {
> > 
> > We're not using JISx4051, really, so we shouldn't add new references to it,
> > right?
> 
> Do you have the idea for the better name?

How about UseConservativeBreaking() (which is the opposite of what UseJISX4051 returns now)?

> > +    if (prevIsNum && nextIsNum)
> > +      return CLASS_NUMERIC;
> > +    // If both sides are numeric or character, this hyphen should be
> > breakable.
> > 
> > Hmm ... it looks like if both sides are numeric, we don't allow breaking?
> 
> Right. Do we need to break?

Ok, the comment needs to be more clear. I guess it should say "if one side is numeric and the other is a character, the hyphen should be breakable".

> > +    // This text has two or more (BACK)SLASHs, this may be file path or URL.
> > +    if (aState.UseJISX4051()) {
> > +      PRInt32 count =
> > +        (cur == U_SLASH) ? aState.CountOfSlash() : aState.CountOfBackSlash();
> > +      if (count > 1)
> > +        return CLASS_OPEN;
> > +    }
> > 
> > Why do we need this? We would still break after a slash without this code,
> > right?
> 
> No, U_SLASH and U_BACKSLASH is CLASS_CHARACTER now. Therefore, they are not
> breakable. We need to change the CLASS when we need to break.

OK

> > I also don't see why we need special code for % characters.
> > 
> > +    // If this may be a separator of params of URL, we should break after.
> > +    if (aState.UseJISX4051AtNext() && aState.CountOfEqual() > 0)
> > +      return CLASS_CLOSE;
> > 
> > Why don't we make these CLASS_CLOSE always?
> 
> I think U_AMPERSAND and U_SEMICOLON should not be breakable in normal sentence,
> right? E.g., "A&B" in western language or "&nbsp;" in the testcase.

OK, although both of those would not be broken thanks to DONT_BREAK_RANGE.

BTW "DONT_BREAK_RANGE" would be better named "CONSERVATIVE_BREAK_RANGE" since we can actually break in that region...

I don't like this special-case (approximate) detection of URLs and file paths. But I don't have a better idea right now.
Attached patch Patch rv5.3 (obsolete) — Splinter Review
> +IS_WORDSEPARATOR(PRUnichar u)
> +{
> +  return (0x0009 <= u && u <= 0x000D) || u == 0x0020 || u == 0x0085 ||
> +         u == 0x00A0 || u == 0x1680 || u == 0x180E ||
> +         (0x2000 <= u && u <= 0x200B) || u == 0x2028 || u == 0x2029 ||
> +         u == 0x202F || u == 0x205F || u == 0x3000;
> +}
> 
> Could you add comments or #defines explaining what all these are? Also for the
> first range, if we're going to include 0x000B and 0x000C, why not include all
> characters <= 0x0020?

They are listed in Unicode spec as white spaces.
http://www.unicode.org/Public/UNIDATA/PropList.txt
Attachment #279352 - Attachment is obsolete: true
Attachment #279495 - Flags: superreview?(roc)
Attachment #279495 - Flags: review?(roc)
Attachment #279352 - Flags: superreview?(roc)
Attachment #279352 - Flags: review?(roc)
> HasCharacterInWordAlrady

HasCharacterInWordAlready

+  void Reset() {

Call this "NextWord" instead ... "Reset" sounds like it will reset everything to the initial state.

IS_WORDSEPARATOR is really complicated. I don't think we need to care about all those spaces. How about we just limit it to 0x0A, 0x0D, 0x20, 0xA0, 0x200B, and 0x3000? Then we could change nsLineBreaker::IsSpace to match, and then simplify your nsLineBreakder code by removing the word-separator related code from ContextState, since we know IS_WORDSEPARATOR characters will not occur in the parameter to GetJISx4051Breaks. Likewise you would change WordMove so that it stops at IS_WORDSEPARATOR instead of IS_SPACE. What do you think of that?

I still don't understand why we need to break before %.

I also still don't like the URL-detection stuff that's going on here, although I still haven't thought of a better way.
+  PRBool UseConservativeBreakingAtNext() {

Instead of writing the same function twice, how about taking a PRUint32 parameter that represents the offset from the current position that you want to test (which will be 1 or 0)?

Is semicolon really a URL parameter separator?
Sorry, for the delay.

(In reply to comment #46)
> I still don't understand why we need to break before %.

Most Japanese character is 3 bytes in UTF-8. Therefore, the encoded Japanese text is:

(the count of Japanese character) * 3 bytes * 3 (%xx)

So, the encoded string is usually very long word.

(In reply to comment #47)
> Is semicolon really a URL parameter separator?

HTML4 spec said:

http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2

> We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.

But I don't know whether such implementations are on the Earth :-(
(In reply to comment #46)
> IS_WORDSEPARATOR is really complicated. I don't think we need to care about all
> those spaces. How about we just limit it to 0x0A, 0x0D, 0x20, 0xA0, 0x200B, and
> 0x3000? Then we could change nsLineBreaker::IsSpace to match, and then simplify
> your nsLineBreakder code by removing the word-separator related code from
> ContextState, since we know IS_WORDSEPARATOR characters will not occur in the
> parameter to GetJISx4051Breaks. Likewise you would change WordMove so that it
> stops at IS_WORDSEPARATOR instead of IS_SPACE. What do you think of that?

Sounds good. But if I change nsLineBreaker::IsSpace, it seems that I need to change other codes. Because the current code assume that arround |nsLineBreaker::IsSpace| can break. But 0x0A is not so...

And we should keep to check |0x2000 <= u && u <= 0x200B|. Because they can be in Western text. We should keep the spec in Western context, I think.
OK, drop 0x0A from IS_WORDSEPARATOR/IsSpace. 0x0D can stay; I think it's OK to allow a soft break after that. The text-frame code won't ever use a soft break at the very start of a line.

> And we should keep to check |0x2000 <= u && u <= 0x200B|

OK

Let's drop breaking after ';', I've never seen that as a URL parameter separator.

Breaking before a % is OK.
(In reply to comment #50)
> OK, drop 0x0A from IS_WORDSEPARATOR/IsSpace. 0x0D can stay; I think it's OK to
> allow a soft break after that. The text-frame code won't ever use a soft break
> at the very start of a line.

hmm... If I drop 0x0A from them, we cannot clear the test of 48, 49 and 50 in the latest testcase, is it O.K.? It seems that they are realistic cases...
Attached patch Patch rv5.4 (obsolete) — Splinter Review
how about this?

In UseConservativeBreaking, the no-breakable spaces are searched directly. (Only when the text has one or more non-breakable spaces.)

# and 0x2007 is not a space, it's no-breakable, sorry.
Attachment #279495 - Attachment is obsolete: true
Attachment #280777 - Flags: superreview?(roc)
Attachment #280777 - Flags: review?(roc)
Attachment #279495 - Flags: superreview?(roc)
Attachment #279495 - Flags: review?(roc)
> test of 48, 49 and 50 in the latest testcase

What testcase?

+           u == 0x000A ||                  // LINE FEED

Don't treat this as nsLineBreaker::IsSpace or IS_SPACE

+static inline PRBool
 IS_SPACE(PRUnichar u)

Can we move this to a static inline function in nsILineBreaker? Then it could be shared with nsLineBreaker.h so we don't have to duplicate code and keep it consistent.

There is a problem here which is that spaces which combine with an adjacent combining mark are not really breakable. I guess I will try to solve this at a higher level by changing the text that gets input to the linebreaker and textrun.

+    for (PRUint32 i = index; index - CONSERVATIVE_BREAK_RANGE < i;) {
+      if (IS_NO_BREAKABLE_SPACE(GetCharAt(--i)))

Rewrite this to not use predecrement, please. I find it confusing. Also here:
+      if (GetCharAt(--i) == aCh)
+      PRUnichar ch = GetCharAt(--i);

mHasNoBreakableSpace should be "mHasNonbreakableSpace" and IS_NO_BREAKABLE_SPACE should be IS_NONBREAKABLE_SPACE. ("mHasNoBreakableSpace" would mean that the word has no breakable spaces in it, which is always true. Also, "non" is not an English word on its own so we shouldn't treat it as one.)

+    // If this text has no-breakable space, we need to check whether the index
+    // is near it.

Delete "If" --- at this point, we know the text has nonbreakable space.

This is looking good.
Comment on attachment 280777 [details] [diff] [review]
Patch rv5.4

clearing request pending comments
Attachment #280777 - Flags: superreview?(roc)
Attachment #280777 - Flags: review?(roc)
Attached patch Patch rv5.5 (obsolete) — Splinter Review
>> test of 48, 49 and 50 in the latest testcase
> 
> What testcase?

attachment 279353 [details].
Attachment #280777 - Attachment is obsolete: true
Attachment #281318 - Flags: superreview?(roc)
Attachment #281318 - Flags: review?(roc)
Attached patch Patch rv5.5.1Splinter Review
er, sorry, I killed wrong line...
Attachment #281318 - Attachment is obsolete: true
Attachment #281322 - Flags: superreview?(roc)
Attachment #281322 - Flags: review?(roc)
Attachment #281318 - Flags: superreview?(roc)
Attachment #281318 - Flags: review?(roc)
Comment on attachment 281322 [details] [diff] [review]
Patch rv5.5.1

OK I like this. Let's land it.

Can you turn that testcase into a reftest?
Attachment #281322 - Flags: superreview?(roc)
Attachment #281322 - Flags: superreview+
Attachment #281322 - Flags: review?(roc)
Attachment #281322 - Flags: review+
checked-in the patch.

(In reply to comment #58)
> Can you turn that testcase into a reftest?

O.K. I'll try to create it tonight.
ah, my working files were lost by a trouble of my system... :-(
please wait the reftests. sorry.
Roc:

(In reply to comment #48)
> (In reply to comment #46)
> (In reply to comment #47)
> > Is semicolon really a URL parameter separator?
> 
> HTML4 spec said:
> 
> http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.2
> 
> > We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
> 
> But I don't know whether such implementations are on the Earth :-(

I found such implementation:
http://git.kernel.org/
Attached patch fix some nits. (obsolete) — Splinter Review
I found 3 bugs at creating reftests.

1. '?' is defined as CLASS_CLOSE, it should be CLASS_CLOSE_LIKE_CHARACTER for prohibiting the breaking between alphabets.

2. ContextState::GetPreviousNonHyphenCharacter returns PR_FALSE if mIndex == 0, it should be U_NULL.

3. The class of HYPHEN and FIGURE DASH is deferent from HYPHEN/MINUS in ASCII. They should be same class for same behavior.

And for comment 61, we should break after SEMICOLON if it is a part of param of URL. We should respect the HTML spec. And PHP can use ';' as '&'. See http://www.php.net/manual/en/ini.core.php#ini.arg-separator.input
Attachment #281998 - Flags: superreview?(roc)
Attachment #281998 - Flags: review?(roc)
Attachment #281998 - Flags: approval1.9?
er, And EN DASH was not added to IS_HYPHEN, it should be hyphen in UAX#14.
Attached patch fix some nits. v2. (obsolete) — Splinter Review
And the leaders (U+2024 ONE DOT LEADER, U+2025 TWO DOT LEADER and U+2026 HORIZONTAL ELLIPSIS) should be CLASS_CLOSE_LIKE_CHARACTER. (They should be same class as FULL STOP of ASCII.)
Attachment #281998 - Attachment is obsolete: true
Attachment #281999 - Flags: superreview?(roc)
Attachment #281999 - Flags: review?(roc)
Attachment #281999 - Flags: approval1.9?
Attachment #281998 - Flags: superreview?(roc)
Attachment #281998 - Flags: review?(roc)
Attachment #281998 - Flags: approval1.9?
Attached patch fix some nits. v3. (obsolete) — Splinter Review
hmmm.... The for loops were changed the meaning after roc's review...
I hate the unsigned index loop :-(
Attachment #281999 - Attachment is obsolete: true
Attachment #282034 - Flags: superreview?(roc)
Attachment #282034 - Flags: review?(roc)
Attachment #282034 - Flags: approval1.9?
Attachment #281999 - Flags: superreview?(roc)
Attachment #281999 - Flags: review?(roc)
Attachment #281999 - Flags: approval1.9?
Comment on attachment 282034 [details] [diff] [review]
fix some nits. v3.

you could write those for loops as

     for (PRUint32 i = mIndex; i > 0; --i) {
       if (GetCharAt(i - 1) == aCh)
         return PR_TRUE;
     }
Attachment #282034 - Flags: superreview?(roc)
Attachment #282034 - Flags: superreview+
Attachment #282034 - Flags: review?(roc)
Attachment #282034 - Flags: review+
Attachment #282034 - Flags: approval1.9?
Attachment #282034 - Flags: approval1.9+
sorry, for the delay.

This is a first post for the testing. If I find more tests, I'll post is in new bug.
Attachment #283979 - Flags: superreview?(roc)
Attachment #283979 - Flags: review?(roc)
I updated the wiki to latest line-breaking spec. And I also removed the fixed issues documents.

http://wiki.mozilla.org/Gecko:Line_Breaking
What's the deeper sense of the "don't break near &nbsp;" rule?

Isn't it reasonable to break after the hyphen here:
xxx&nbsp;xxx-yyyyyy ?
(In reply to comment #70)
> Isn't it reasonable to break after the hyphen here:
> xxx&nbsp;xxx-yyyyyy ?

no, 'xxx-' is too short. non-breakable spaces might be word separator.
But xxx&nbsp;xxx isn't too short, and xxx&nbsp;xxx-yyyyyy is long enough to destroy layout.
Something like this should definitly break after "-":

xxxxxxxxxxxxxxxxx&nbsp;xxxx-yyyyyyyyyyyyyyyy

Furthermore, I dislike "don't break between hyphens".
Very long chains of "-" should break IMHO (min. 6 characters?).
And something like this should break also:
-------------------------------------------------------------xxxxxxxxxxxxx---------------------------------------------------------------------------------------------
(In reply to comment #72)
> But xxx&nbsp;xxx isn't too short, and xxx&nbsp;xxx-yyyyyy is long enough to
> destroy layout.

I have a question, do you think what cases? It seems that we cannot care your suggestion without hyphen. Isn't it a very special case? Note that we cannot support all cases on Earth.

> Furthermore, I dislike "don't break between hyphens".
> Very long chains of "-" should break IMHO (min. 6 characters?).

I think that this case really needs prioritized line-breaking. So, we should not care this case on Gecko 1.9.

Note that I know the current spec is not enough for many cases. Before this bug's patch was checked-in, very many ASCII characters were breakable. But it broke many text layout of Western language. See the bugs which was blocked by this bug.

Current approach is designed with following:

1. Don't break western text layout compatibility with Fx2.
2. Should break in URI for Japanese marketing.
3. Should break after hyphen for some Western language.
4. Don't use a risky way.
I don't think we should break long chains of consecutive hyphens. Someone's probably using them for graphic effect. It's best to err on the side of caution.
Comment on attachment 283979 [details] [diff] [review]
reftests for line-breaking

+<!-- BACK SLASH is used for YEN SIGN and WON SING in Japan and Korea -->

WON SIGN?

These look good! Thanks!
Attachment #283979 - Flags: superreview?(roc)
Attachment #283979 - Flags: superreview+
Attachment #283979 - Flags: review?(roc)
Attachment #283979 - Flags: review+
Attachment #283979 - Flags: approval1.9+
can this be marked FIXED?
er, exactly. Sorry.

-> FIXED
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: in-testsuite+
(In reply to comment #74)
> I don't think we should break long chains of consecutive hyphens. Someone's
> probably using them for graphic effect. 

That's unlikely since IE, Safari3 and Opera9.2 break between any "--". (Opera 9.5, latest weekly, doesn't)

> It's best to err on the side of caution.

Isn't it best to reduce browser incompatibilities?




Depends on: 805456
You need to log in before you can comment on or make changes to this bug.