Closed Bug 255990 Opened 20 years ago Closed 17 years ago

Characters below U+0100 are not subject to line-breaking rules at all

Categories

(Core :: Layout: Text and Fonts, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla1.9alpha8

People

(Reporter: jshin1987, Assigned: masayuki)

References

(Blocks 1 open bug)

Details

(Keywords: intl, jp-critical)

Attachments

(4 files, 23 obsolete files)

1.83 KB, text/html
Details
78.92 KB, image/jpeg
Details
55.79 KB, application/xhtml+xml
Details
33.98 KB, patch
Details | Diff | Splinter Review
http://xpz.chinguya.net/forums/data/xpzdoc/firefox/firefox_bug3.gif

Case A and case B in the screenshot at the URL above are almost identical and
lines should be broken in both cases (beause there are line breaking
opportunities). However, in case B, lines are not broken. I wondered why A and B
are treated differently. 

I found that in intl/lwbrk/src/nsJISX4501.cpp, JIS X 4501 rules are not applied
unless there are CJK characters in input strings. I got rid of that condition
check, but there's no difference. Then, I took a look at call sites
(layout/html/base/src/nsTextTransformer.cpp). There are two code paths in the
file, one for 'ASCII' (but it seems to be for any characters below U+0100) and
the other for 'Unicode'. In the former,  only very simplistic line breaking
rules are applied and no call is made to APIs of nsILineBreaker. I guess 'ASCII'
path is for optimization, but I'm not sure how much time we save while totally
messing up line breaking for 'ASCII' strings. (JIS X 4051 is not perfect and we
should implement UTR #14, but JIS X 4051 is still a lot better than nothing).

We need to measure the performance with the 'ASCII' code path eliminated.
Attached file a test case
This has two sets of test cases with four cases in each set. With the current
Mozilla code, JIS X 4051 rules are only applied to strings with East Asian
characters so that all cases but the fourth one in both sets go outside the
'enclosing' div. 

With my patch to to remove the CJK (East Asian) check in nsJIS4501.cpp that I
mentioned in comment #0 applied, case 3 as well as case 4 is rendered
differently from cases 1 and 2.
Assignee: nobody → jshin
Status: NEW → ASSIGNED
This patch does two things:

1. get rid of CJK check in nsJISX4501.cpp. With this, once nsILineBreaker APIs
are invoked, they're applied regardless of whether there's a CJK character or
not
2. change the line breaking rule around 'slash' per UTR #14 (this was what I
wanted to do initially, but it didn't work for ASCII only strings so that I
took a look at call sites)
This screenshot demonstrates that nsILineBreaker APIs are not invoked for
strings entirely made of characters below U+0100 (1st and 2nd cases)
> I guess 'ASCII'
> path is for optimization, but I'm not sure how much time we save while totally
> messing up line breaking for 'ASCII' strings. 

Robert, what do you think of this? 
>This screenshot demonstrates that nsILineBreaker APIs are not invoked for
>strings entirely made of characters below U+0100 (1st and 2nd cases)

nsTextFrame only attempts linebreaking after the first "word" (modulo that the
first word isn't the fisrt one on the line). As soon as the first word is done,
it turns its |canBreakBefore| the following words to true. 

In your screenshot with ASCII only text, all that text is considered as a single
(first) word. No surprise therefore that linebreaking isn't even attempted.

If you want to add a line breaking opportunity at '/', you might want to tweak
|nsTextTransformer::GetNextWord| so that it stops after seeing '/', thereby
allowing that fragment to be seeing as a "word" by nsTextFrame::MeasureText.
Note: I don't anticipate an excessive perf degradation on GfxWin & GfxGTK
because there is another optimization to measure several words at once on these
platforms based on an estimated number of chars that can fit.
(In reply to comment #5)

> nsTextFrame only attempts linebreaking after the first "word" (modulo that the
> first word isn't the fisrt one on the line). As soon as the first word is done,
> it turns its |canBreakBefore| the following words to true. 
> 
> In your screenshot with ASCII only text, all that text is considered as a single
> (first) word. No surprise therefore that linebreaking isn't even attempted.

Even after the first word, nsILineBreaker APIs are never called for 8bit
code-path as far as I can tell. 
That is, nsTextFrame break lines ONLY at whitespaces for strings made up of
characters below U+0100. 
 
> If you want to add a line breaking opportunity at '/', you might want to tweak
> |nsTextTransformer::GetNextWord| so that it stops after seeing '/', thereby
> allowing that fragment to be seeing as a "word" by nsTextFrame::MeasureText.


It's not just '/'. There are a whole lot of line break opportunities (?, +, =,
etc)  in test cases attached, but none of them is made use of because '8bit
code-path' never invokes nsILineBreak APIs.  
If you make the ASCII path of |GetNextWord/ScanNormalAsciiText| to stop after
the whitespace or friends ('/', '+', etc, which can presumably be folded in an
API), why would you need other nsILineBreak APIs there (for ASCII text)? It
seems to me that the very fact that |GetNextWord| is stopping should be enough
to cater for the ASCII situation. Indeed all such stops would be the (only)
linebreaking opportunities (for ASCII text).
As a stop-gap measure, what you suggested might be sufficient. However, even for
ASCII-only text, it's not that simple. See UAX #14
(http://www.unicode.org/reports/tr14/) or our code (JIS 4051 implementation :
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
The file is misnamed. It should have been nsIJSx4051LineBreaker.cpp) 

Attached patch patch v2 (obsolete) — Splinter Review
I implemented what rbs suggested in nsTextTransformer.cpp while keeping
attachment 156431 [details] [diff] [review] for characters above U+0100 which are not CJK.
Attachment #156431 - Attachment is obsolete: true
Are you happy with the perf? The patch only affects ScanNormalAsciiText_F. Are
you leaving ScanNormalAsciiText_B out of this intentionally?

For a solution that would cover all the cases in a definitive manner, you could
keep the prev ch, the current ch, and fold the isPrevCharBreakable test in
nsILineBreak::BreakInBetween(&prevch, 1, &ch, 1) -- or a custom fast API for
this case. What do you think? That should address your concerns for UAX #14
(assuming nsILineBreak is already up to sratch and that the API is not a perf hog).
*** Bug 277610 has been marked as a duplicate of this bug. ***
Blocks: uax14
Blocks: shy
Does this have any connection with Bug 56652?
Attached patch patch v3 (obsolete) — Splinter Review
This does what rbs suggested except that it does only for Scan..Ascii_F.
|Scan...Ascii_B| already has some mechanism to stop at punctuation marks so
that I decided to leave it alone for now.
Attachment #164887 - Attachment is obsolete: true
Attached patch patch v3 (alt) (obsolete) — Splinter Review
I tried to extend the range of characters handled by a new method
(CanBreakBetweenASCII) to Latin1, but perhaps I shouldn't for now...
Jungshik:

Don't you request review for v3?
That patch doesn't work any more because of a recent change in line breaking
code. It also has other problems. I made a new patch last week, but haven't had
the time to test it fully (it works as far as the line beraking is concerned),
but I have yet to measure how much it slows down the rendering of pure ASCII pages. 
O.K. thanks.

FYI. Mozilla Japan thinks that this bug is very important. Because in Japan,
this bug is very famous as one of annoying bugs. I hope that this bug has high
priority in your works. We hope that this bug is fixed before Firefox 2.0 release.
I tried to build an optimized build with JPROF enabled only to realize that
'setupProfilingStuff()' [1] is missing so that I can't build that way.
Certainly, it used to work before.. I really need to do profiling ... I'll ask
around as to what went wrong with JPROF. 



[1]  http://lxr.mozilla.org/seamonkey/source/toolkit/xre/nsAppRunner.cpp#1997
Attachment #188832 - Attachment is obsolete: true
Attachment #188833 - Attachment is obsolete: true
(In reply to comment #19)
> I tried to build an optimized build with JPROF enabled only to realize that
> 'setupProfilingStuff()' [1] is missing so that I can't build that way.

Ooops. I thought I had tools/jprof in my tree, but it appears that I haven't
checked it out since I overhauled my source tree the other day. I'm now building
an optimized build.
Attached patch update (obsolete) — Splinter Review
still work in progress. I really need to measure the perf.
Attachment #197981 - Attachment is obsolete: true
With the latest patch, line breaks work as I intended. As for SHY (soft-hyphen), lines can be broken before and after it (I need to change the rule so that lines can break only after SHY). However, on Windows where SHY is 'blackholed', it doesn't show up at all no matter where it is (i.e. even at the end of a line. I tried to prevent SHY from being pulled into 'blackhole' at the end of a text-run, but it didn't work so far.). On the other hand, SHY is visible everywhere on other platforms where bug 205387 hasn't been fixed (so that SHY is not turned into 'nothingness'). 

> I tried to prevent SHY from being pulled into 'blackhole' at the
> end of a text-run, but it didn't work so far.

What if you substitute it with MINUS -- i.e., it in the event where the linebreaker does break after SHY, it is transformed into MINUS on that spot (to pick a visible glyph), so that you don't have to go down to gfx to be stucked in its 'blackhole'.
(In reply to comment #24)
> > I tried to prevent SHY from being pulled into 'blackhole' at the
> > end of a text-run, but it didn't work so far.
> 
> What if you substitute it with MINUS -- i.e., it in the event where the
> linebreaker does break after SHY, it is transformed into MINUS on that spot (to
> pick a visible glyph), so that you don't have to go down to gfx to be stucked
> in its 'blackhole'.

I thought about it, but didn't dare to look into nsTextFrame code ;-). Perhaps what I can do is :

1. Replace SHY with hyphen-minus at the end of a line probably here 
(http://lxr.mozilla.org/seamonkey/source/layout/generic/nsTextFrame.cpp#3171)

2. On platforms where bug 205387 is not fixed (I hoped fixing bug 121540 on Mac OS X would fix it for free there, but at least patch 7r1  didn't...) , I also have to get rid of SHY's at places other than at the end of a line. This part may have a performance implication....  Would it help to flag whether SHY is present or not during the 'scanning'(? transforming) phase and use the flag in |nsTextFrame::RenderString|? Sure, it would help, but the question is whether it's worth the effort. I'll measure 'pageload'  time and see how it goes.
 


(In reply to comment #24)
> What if you substitute it with MINUS -- i.e., it in the event where the
> linebreaker does break after SHY, it is transformed into MINUS on that spot (to
> pick a visible glyph), so that you don't have to go down to gfx to be stucked
> in its 'blackhole'.

I think that it's good idea for rendering. However, if we do it on nsTextTransformer, isn't it changing the text at copying to clipboard, right? I think that we should not replace it on nsTextTransformer, we should replace on nsTextFrame::PrepareUnicodeText.
# Note that we can get that the nsTextFrame is the tail of the line or not so by
# |IsEndOfLine(mState)|.
(In reply to comment #26)


> think that we should not replace it on nsTextTransformer, we should replace on
> nsTextFrame::PrepareUnicodeText.

afaict, rbs meant nsTextFrame (not nsTextTransformer). Also see his comments in bug 205387. After writing my previous comment, I also thought about doing something in nsTextFrame::PrepareUnicodeText (instad of nsTextFrame::RederString).
(In reply to comment #23)

> end of a line. I tried to prevent SHY from being pulled into 'blackhole' at the
> end of a text-run, but it didn't work so far.). 

It turned out that there was a typo in my code leading to a compile time error in gfx/src/windows (when I ran make in 'gfx layout/generic layout/build browser/app'). With that typo fixed, I ran it again and found that SHY behaved as it should on Gfx:Win! However, what's suggested by rbs and Masayuki is a lot better than hacking Gfx code so that I'll try to do that (I've not been successful so far.)

BTW, bug 205387 is likely to be fixed on Mac OS X very soon. 
Target Milestone: --- → mozilla1.8.1
This is not for review nor for check-in because I wanna take an XP approach in nsTextFrame instead of hackish fixes in Gfx implementations. I'm just uploading it here as a check-point. 

Perhaps, it's a good idea to fix only what this bug is about here (leaving alone SHY) and then fix the SHY problem separately in bug 9101.
Attachment #212391 - Attachment is obsolete: true
You should be aware that we need to remake nsTextFrame on the trunk, and in fact plans for this are well under way. See
http://wiki.mozilla.org/Gecko2:NewTextAPI
and
http://wiki.mozilla.org/Mozilla2:GFXTextRun

So, major hacks to nsTextFrame are really only useful if they're worth doing just for the 1.8.1 branch.
(In reply to comment #30)
> You should be aware that we need to remake nsTextFrame on the trunk, and in

Don't worry. I'm very well aware of that (I have been reminded of that work almost 10 times by now). I thought I wrote this work is for 1.8.x, but apparently I didn't other than targetting this bug at 1.8.1. This  and SHY issue are very critical and should be fixed for 1.8.1/FF2. Simply, we can't wait until FF3.
Attached patch 1.8.x branch patch (obsolete) — Splinter Review
This is for 1.8.x branch. SHY-handling part is not included, but SHY almost works in that lines can break after SHY although it's 'discarded' regardless of its location (that is, even at the end of a line, it's not visible).

I wonder how much pageload degradation can be taken. According to my primilinary pageload measurements, my optimized trunk build with this patch (well, trunk is deCOMtanimated so that my trunk patch is a little different) takes 1.9% longer than the latest trunk build I got from mozilla.org (1691ms vs 1659ms). Although I used exactly the same set of build options, compiers are different so that it'll be better to compare my own build with the patch with one without the patch. The pageload perf. differnece is likely to be a little bigger for 1.8.x than for trunk because of extra overhead due to 'COMtanimation' in 1.8.x
Jungshik:

We cannot change the interface of nsI*.h. See
http://developer.mozilla.org/devnews/index.php/2006/01/30/tree-rules-for-the-18-branch/
The 'pseudo-interface' means nsI*.h that is not created by IDL.
(In reply to comment #33)

Thanks for remining me of that. 
 
> We cannot change the interface of nsI*.h. See
> http://developer.mozilla.org/devnews/index.php/2006/01/30/tree-rules-for-the-18-branch/
> The 'pseudo-interface' means nsI*.h that is not created by IDL.

Sigh ... This long standing bug and bug 9101 are about to be solved, but this sorta got in the way...
even adding a new method is not allowed? (adding it at the end might work? ....)  Probably, it's possible to cram a new method into |BreakInBetween| but with some perf. loss and additional ugliness... 

I'll write to drivers@ to ask what to do here before going any further.
mconnor said, adding new interface instead of changing is allowed on 1.8 branch.
(In reply to comment #35)
> mconnor said, adding new interface instead of changing is allowed on 1.8
> branch.
That's also what I heard on IRC and by mail. I'll add nsILineBreaker2 or nsILineBreaker_MOZILLA_1_8_BRANCH inheriting nsILineBreaker with one new method added. 

Attachment #212751 - Attachment is obsolete: true
With attachment 212907 [details] [diff] [review], my 1.8 branch build on Linux (with the same build options as 1.8 branch build at moz.org) is 1.6% slower than mozilla.org's 1.8 branch build in terms of pageload time. (1670ms vs 1643ms : Avg. Median) Test pages are 100% English so that some slow-down was expected. I guess the difference in pageload time of non-Western-European pages is even smaller than that or my patch might be even a little faster. 
Attached patch updated branch patch (obsolete) — Splinter Review
The previous patch added nsILineBreakerFactory2 unnecessarily, which has a lot of ripple effects in layout/content/editor. It also changed |nsIDocument| which I was not supposed to change. Per IRC discussion, in this patch, I just added nsILineBreaker_MOZILLA_1_8_BRANCH inheriting nsILineBreaker and use QI in nsTextTransformer ctor to get nsIL...1_8_BRANCH. I could have QI'd elsewhere, but doing it in nsTextTransformer ctor seems to be least expensive perf-wise. 

I haven't yet measured perf. with this patch although I made about 10 measurements with and without the previous patch. The resuts are mixed. Most of time my optimized build with the patch is slower by a few percent points, but in one case it's faster by a few percent points than my optimized build without the patch. (in terms of pageload time: 30+ web sites all in ASCII).
Attachment #212907 - Attachment is obsolete: true
This is a general problem of breaks. Where should breaks occur when a width specification would require a break, or when a browser window size would exclude some text? 

I suggest that a break should be placed at the last point in the text line where it does not split up a word or number. (note that commas, periods, "E", "+", "i", and "-" may form part of a number). 

If there is no such place in the text, I would break after the last character which can be displayed, without any inserted hyphen. I suggest it gets very messy if you try to hyphenate on the basis of sylables, especially with multiple languages

This problem shows up in tables but it could show up for any block where there is a width specified, inherited, or implied.

Note that absolute widths may specify block widths exceeding the window, which should be respected.

tOM
Masayuki, I really love to fix this bug before FF 2.0, but I'm afraid I can't squeeze out time to fix this bug for FF 2.0 (even if  it's not too late to put this in for FF 2.0). 

My current patch does resolve most of problems (including 'SHY' issue : bug 9101 where bug 205387 is fixed) more or less(for FF3, we need to take a better approach, but for FF2, this 'stop-gap' patch should do), but I don't have time to test and measure performance. Besides, I was held back by that test routines in layout/generic/nsTextTransformer.cpp (that needs to be revised to match a new line breaking rule or to be made non fatal). 

In short, if you wanna take over this bug, please go ahead. 
(In reply to comment #41)
> In short, if you wanna take over this bug, please go ahead. 

O.K. If I can, I will do it.
Assignee: jshin1987 → masayuki
Status: ASSIGNED → NEW
Priority: -- → P1
Status: NEW → ASSIGNED
I looked the code, but for SHY problem, we need large change in nsTextFrame and nsLineLayout. I think that we should fix it after refactoring it. I'll attach the new patch without SHY problem.
Comment on attachment 212686 [details] [diff] [review]
a new patch with an ad-hoc Gfx:Win fix for SHY

Jungshik:
> +  else if(aChar1 == U_SLASH)
> +    c1 = NUMERIC_CLASS;

What means this?
I cannot understand this.
Attached patch Patch rv1.0 for URI LB (obsolete) — Splinter Review
Jshin:
Could you review the patch?
I cannot found the problem of comment 44.
And we don't need to change the |ScanNormalAsciiText_B|. Because it is the reverse of |ScanNormalAsciiText_F_ForWordBreak| that is using only for selection per word. We don't need to change these functions.
Attachment #212686 - Attachment is obsolete: true
Attachment #213337 - Attachment is obsolete: true
Attachment #226990 - Flags: review?(jshin1987)
Attached patch Patch rv1.1 for URI LB (obsolete) — Splinter Review
That breaks NBSP. This patch cares it.
Attachment #226990 - Attachment is obsolete: true
Attachment #226993 - Flags: review?(jshin1987)
Attachment #226990 - Flags: review?(jshin1987)
Attached patch Patch rv1.2 for URI LB (obsolete) — Splinter Review
The previous patch breaks some statistics tables. Because comma and period break lines in numeric. And IE doesn't break by these characters in URI. I think that we should not break by these characters in Latin1 text. (Normally, Latin1 text may have space after comma or period in most cases.)
Attachment #226993 - Attachment is obsolete: true
Attachment #227004 - Flags: review?(jshin1987)
Attachment #226993 - Flags: review?(jshin1987)
Big textframe changes are coming, so don't start working on textframe refactorings yet ...
(In reply to comment #48)
> Big textframe changes are coming, so don't start working on textframe
> refactorings yet ...

Of course, I don't start the work. Current patch that fixes only without SHY problem is changing only nsTextTransformer in layout. Or, Will be changed the nsTextTransformer by the refactoring?
A change that small would be OK.

Realistically, it may be too late to land this for FF2. this is going to be risky.
(In reply to comment #50)
> Realistically, it may be too late to land this for FF2. this is going to be
> risky.

That may be right. But this bug is very unpopularity in Japan. Because some sites are broken by this bug. Therefore, I should do my best until time is over.
(In reply to comment #45)

> I cannot found the problem of comment 44.

 What did you mean by the above sentence? 

(In reply to comment #43)
> I looked the code, but for SHY problem, we need large change in nsTextFrame and
> nsLineLayout. 

  For FF3, we need to do it 'the right way', but for FF2 (if we can make it), just passing SHY intact to nsLineBreaker (instead of killing it in nsLayout) and treating it the same way as a hyphen is treated  in nsLineBreaker 'almost' works. (on Windows and Mac OS X, it should work as far as rendering is concerned.) 

Nonetheless, it may be a good idea to solve one problem at a time.

(In reply to comment #47)

> The previous patch breaks some statistics tables. Because comma and period
> break lines in numeric. 

 What are some statistics tables? Did you mean the line break checking function?
That shouldn't fail if you don't touch SHY, should it? 

> And IE doesn't break by these characters in URI. I
> think that we should not break by these characters in Latin1 text. (Normally,
> Latin1 text may have space after comma or period in most cases.)

I guess UTR #14 doesn't agree with IE. 

BTW, please, do measure the performance. One of the primary reasons this bug had been  held up was that I couldn't find time to measure performance. While you're at it, you'd better make a patch for 1.8.x branch (similar to attachment 213337 [details] [diff] [review]) and measure its performance as well.
(In reply to comment #52)
> (In reply to comment #45)
> 
> > I cannot found the problem of comment 44.
> 
>  What did you mean by the above sentence?

Sorry, for my strange English. I cannot understand why the code is needed. Because I cannot find any problem by removing the code.

> (In reply to comment #43)
> > I looked the code, but for SHY problem, we need large change in nsTextFrame and
> > nsLineLayout. 
> 
>   For FF3, we need to do it 'the right way', but for FF2 (if we can make it),
> just passing SHY intact to nsLineBreaker (instead of killing it in nsLayout)
> and treating it the same way as a hyphen is treated  in nsLineBreaker 'almost'
> works. (on Windows and Mac OS X, it should work as far as rendering is
> concerned.) 

I think that your idea is a risky. Your gfx hacks break some cases that are with letter-spacing, word-spacing, justification and small caps.

> (In reply to comment #47)
> 
> > The previous patch breaks some statistics tables. Because comma and period
> > break lines in numeric. 
> 
>  What are some statistics tables? Did you mean the line break checking
> function?
> That shouldn't fail if you don't touch SHY, should it? 

The period and comma break a line after them. But I think that they should not break the line. Because they are used by in numeric, domain and file name. Please wait next my comment. I'll check UAX14.

> > And IE doesn't break by these characters in URI. I
> > think that we should not break by these characters in Latin1 text. (Normally,
> > Latin1 text may have space after comma or period in most cases.)
> 
> I guess UTR #14 doesn't agree with IE. 
> 
> BTW, please, do measure the performance. One of the primary reasons this bug
> had been  held up was that I couldn't find time to measure performance. While
> you're at it, you'd better make a patch for 1.8.x branch (similar to attachment
> 213337 [edit]) and measure its performance as well.

How to measure the performance? Using tools? Or insert the code?
(In reply to comment #53)
> >  What are some statistics tables?

Sorry, I forgot this question. I meant a statistical table that includes numerics in small table cells, e.g., access log viewer. I confirmed that the patch breaks the layout of Webalizer.
http://www.mrunix.net/webalizer/
I confirmed the UAX14. I understand why you write special case code for SLASH. But I think that we need same logic in the case of COLON, SEMICOLON, PERIOD and COMMA. I'll attach new patch soon.
Attached patch Patch rv1.3 (obsolete) — Splinter Review
updating. I think that we should update |ContextualAnalysis| for UAX#14.
Attachment #227004 - Attachment is obsolete: true
Attachment #227210 - Flags: review?(jshin1987)
Attachment #227004 - Flags: review?(jshin1987)
http://lxr.mozilla.org/mozilla/source/tools/performance/layout/perf-doc.html
I found this document. But I cannot build the Viewer.exe for the testing.

Jungshik: What did you use for the performance testing?
(In reply to comment #57)
> http://lxr.mozilla.org/mozilla/source/tools/performance/layout/perf-doc.html
> I found this document. But I cannot build the Viewer.exe for the testing.
> 
> Jungshik: What did you use for the performance testing?

I emailed you with information on pageload test. I thought you had done that test in another bug so that you'd know about it. 

(In reply to comment #55)
> I confirmed the UAX14. I understand why you write special case code for SLASH.
> But I think that we need same logic in the case of COLON, SEMICOLON, PERIOD and
> COMMA. I'll attach new patch soon.

Thank you. I'll take a look at it on Wednesday night.

Thank you, Jungshik.

I tested the performance.

Current build:

Avg. Median : 1243 msec
Average     : 1487 msec

Avg. Median : 1181 msec
Average     : 1473 msec

Avg. Median : 1184 msec
Average     : 1510 msec

Patched build:

Avg. Median : 1181 msec
Average     : 1416 msec

Avg. Median : 1205 msec
Average     : 1398 msec

Avg. Median : 1185 msec
Average     : 1610 msec

I tested on Win2k with cairo/optimized/static build. And disabling the image. Because if it is enabled, the score is blurred.

I cannot find the performance regression on this test.
intl/lwbrk/src/jisx4501class.h is generated from some other files, so shouldn't you be changing those other files instead?
See
http://groups.google.com/group/mozilla.dev.tech.layout/browse_frm/thread/bd63e8fe7d7a9baa/d9bcf8b608798f75#d9bcf8b608798f75
When I wrote that, I wasn't thinking about this bug. But the approach here --- make JISX4051 rules *always* apply --- would actually fix a lot of my problems, because then there's only a limited amount of context to worry about (two characters on either side of the break point, because of contextual analysis, except for Thai) and no cross-line analysis should be needed.

This doesn't fix all the problems. We still need to modify textframes to find those extra characters if they're not available (e.g. situations like
  <b>IDEOGRAPH</b>,<b>3000</b> where we need to do contextual analysis on text from three different textframes/text nodes. But at least that should be pretty rare so we can afford to do something slow.

I'm pretty worried that always applying JISX4051 rules is a huge change. ASCII text with punctuation that used to never break can now break. But it could be the right thing to do, and now is a good time to do experiments like that on trunk.
> This doesn't fix all the problems. We still need to modify textframes to find
> those extra characters if they're not available (e.g. situations like
>   <b>IDEOGRAPH</b>,<b>3000</b> where we need to do contextual analysis on text
> from three different textframes/text nodes. But at least that should be pretty
> rare so we can afford to do something slow.

Part of this problem is that the linebreaker sometimes just uses U_NULL when context is not available. Instead it should signal the caller that it needs more context. We should really do this for your new Latin1 method too, and pass in strings just like BreakInBetween --- right?

If you like, I can fix this part in bug 343445.
(In reply to comment #61)
> I'm pretty worried that always applying JISX4051 rules is a huge change. ASCII
> text with punctuation that used to never break can now break. But it could be
> the right thing to do, and now is a good time to do experiments like that on
> trunk.

I suspect that a "righter" thing to do if we're already making huge changes is to implement http://www.unicode.org/reports/tr14/

See bug 56652, especially bug 56652 comment 18
We would just make an 'ad-hoc' patch for 1.8 branch while making a 'righter' (or even 'the right') patch for trunk (UAX #14) if we didn't have to test any patch for 1.8 branch on trunk first. 

Given that FF3 is about a year away from now, it might  not be so bad to use trunk as a temporary test bed for 1.8 branch (before writing 'the right' patch for trunk). However, if the chance of getting this 'ad-hoc' patch in for 1.8 branch is virtually zero (because it's too late in the cycle), we'd better just go straight to 'the right' patch on trunk as Simon wrote.

One may ask: why not make 'the right' patch for trunk and try to port it back to 1.8 branch? That might be possible, but given the FF2 release schedule, I doubt it'll happen.

(In reply to comment #64)
> we'd better just go straight to 'the right' patch on trunk as Simon wrote.

Yes, finally, we should use 'the right' patch. But bug 56652 is not active. And it  was opened 6 years ago. I think we should use 'ad-hoc' way until fixed it.

# NOTE: 'mixi' is a most popular SNS in Japan.
# That has over 3 million members in March 2006.
# Some pages in the mixi are broken by this bug.
# We need to fix this bug ASAP.
(In reply to comment #65)

I may not have been clear in my comment.

> I think we should use 'ad-hoc' way until fixed it.

That's what I wrote we should do in my previous comment *unless* there's very little chance of landing this ad-hoc patch  for FF2 (1.8 branch)

As for the perf. measure, what you did is probably sufficient (I'm worried a bit about turning off images, though), but you may want to see bug 162361 comment #127. 
(In reply to comment #66)
> (In reply to comment #65)
> 
> I may not have been clear in my comment.
> 
> > I think we should use 'ad-hoc' way until fixed it.
> 
> That's what I wrote we should do in my previous comment *unless* there's very
> little chance of landing this ad-hoc patch  for FF2 (1.8 branch)

I think it for trunk too. (so, even if there is no chance for Fx2.) Because we don't have a guaranty that we can fix it on Fx3. (at least, the patched build 'righter' than current build.)
UAX#14 is mostly similar to our implementation of JISX4051 in terms of the amount of context required, except for rules LB9-11a. Those rules say we should not break in a run of spaces if there are certain punctuation characters at the beginning (and possibly the end) of the run. This is quite bad because these characters are common and we may have to scan many text nodes to find the character after the spaces (if any), and if there is such a character, back up and break earlier. But then, not breaking around spaces could be confusing for some authors and users. We need to think about the best way to deal with this. But that's not really relevant to this bug...

I think we should go ahead with this approach on trunk. If it causes us problems with ASCII text, the problems would probably apply to UAX#14 as well so we get useful information. It also happens to make things easier for me! I think it is probably going to be too risky for FF2 though.
(In reply to comment #68)
> I think it is probably going to be too risky for FF2 though.

I can understand your worry. But there is a beta2 stage on this August. Cannot test this patch on it by beta testers? I also think that this patch is risk, but it is very valuable too.
# I don't know how many people tests beta releases.
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

r=jshin

Sorry again for the delay.
Attachment #227210 - Flags: review?(jshin1987) → review+
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

Thank you, jshin.

Roc, would you review this? Or is rbs better reviewer?
Attachment #227210 - Flags: superreview?(roc)
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

I'll review it because I've recently been poking around in this code and I want to see it land soonish.

+     // We don't need to check next character. Because SLASH breaks only after.
+     if (IS_ASCII_DIGIT(next))

I think you mean "check prev character", right?

One thing I think you probably should do is to write a microbenchmark testcase that just tests the linebreaker methods over some large, common texts many times, so we can measure the performance of that quite accurately. If this causes Tp regressions when we land it on trunk, then we should back out and try that experiment. In any case I think we should try that experiment before we land on branch.
Attachment #227210 - Flags: superreview?(roc) → superreview+
Attached patch Patch rv1.3.1 (obsolete) — Splinter Review
Thank you, roc. I'll try to create the testcase.
Attachment #227210 - Attachment is obsolete: true
Attachment #229086 - Flags: superreview+
Attachment #229086 - Flags: review+
Whiteboard: Pending check-in
(In reply to comment #0)
> http://xpz.chinguya.net/forums/data/xpzdoc/firefox/firefox_bug3.gif
> 
> Case A and case B in the screenshot at the URL above are almost identical and
...

Screenshot no longer works.

(In reply to comment #72)

> One thing I think you probably should do is to write a microbenchmark testcase
> that just tests the linebreaker methods over some large, common texts many
> times, so we can measure the performance of that quite accurately. 

 It might not be a good idea to determine whether to accept this change (solely) based on what you suggested. For sure, that way we can isolate the performance of our line breaker from other noises, but what we need to know is whether this change has any  performance impact on the rendering of 'real-world' web pages rather than some 'synthetic' web pages. That's why I'm mildly worried about testing with image rendering off. 

If this change is all about performance, we shouldn't allow any performance loss, but it's more about being correct so that I believe a 'mild' (if any) performance loss observed with 'synthetic' test cases (if that loss is not observed in more realistic test cases) shouldn't stop us from making this change.

I created the testcases:
http://www.d-toybox.com/mozilla/testpages/testcasesforbenchmark.zip

This has 3 pages.
1. Only URL text.
2. Only English text.
3. Only Japanese text.

The results on my environment(Win2k). (I tested 10 times.)

current build(AVG/MED):
1. 10640.5/10750.0
2. 18679.6/18663.5
3.  8254.6/ 8312.0

patched build(AVG/MED):
1. 29126.7/29148.5
2. 18862.5/18804.5
3.  8120.3/ 8102.0

In case 1 (URL text), the performance slows down. And in case 2 and case 3, the values are very similar.

I don't think that this result has serious regression in the performance. Because the case 1 is a very special case. That has very many URL text in a paragraph.(The file size is over 5M.)
(In reply to comment #75)
> That's why I'm mildly worried about testing with image rendering off. 

The test pages of the test tool are rendered quirks mode, so, the images of the pages has broken-image-box in the layout. Therefore, I think that there are no difference between the layouts(image rendering on vs off).
Checked-in to Trunk. Let's watch the tp difference in the tinderbox.
Whiteboard: Pending check-in → Checked-in to Trunk, Let's watch Tp difference.
Looks like this made luna and nye go orange (test failure).
Thank you, josh.

Error log:

> Running DomToTextConversionTest ...
> Begin: Thu Jul 13 12:26:01 2006
> cmd = perl TestOutSinks.pl
> End:   Thu Jul 13 12:26:04 2006
> ----------- Output from DomToTextConversionTest ------------- 
>   Testing simple copy case ...
>   Testing simple html to plaintext formatting ...
>   Testing non-wrapped plaintext in preformatted mode ...
>   Testing mail quoting ...
>   Testing misc. HTML output with format=flowed ...
>   Testing format=flowed output ...
>   Comparison failed at char 90:
>   -----
>   This is a mail with a couple of long lines and 
>   then a sig. This is used as test of the f
>   -----
>   Format=flowed test failed.
>   Testing HTML Table to Text ...
>   
>   ERROR: DOM SERIALIZER TEST FAILED:  simplemail.out
>   See http://www.mozilla.org/editor/serializer-tests.html for help.
> ----------- End Output from DomToTextConversionTest --------- 
> Error: DomToTextConversionTest exited with status 1
> Error: DomToTextConversionTest: exited with status 1
> 
> No More Errors
Whiteboard: Checked-in to Trunk, Let's watch Tp difference.
Attached patch Patch for test tool (obsolete) — Splinter Review
This fixes the test failure.

Now, we can break the line before/after '='.

# But I cannot test before check-in.
Attachment #229151 - Flags: review?(mrbkap)
Comment on attachment 229151 [details] [diff] [review]
Patch for test tool

I guess this is OK. Why can't you run the tests? You should be able to go into your <objdir>/dist/bin/ directory and type ./TestOutSinks.pl (or 'perl TestOutsSinks.pl' to test.
Attachment #229151 - Flags: review?(mrbkap) → review+
(In reply to comment #83)
> (From update of attachment 229151 [details] [diff] [review] [edit])
> I guess this is OK. Why can't you run the tests? You should be able to go into
> your <objdir>/dist/bin/ directory and type ./TestOutSinks.pl (or 'perl
> TestOutsSinks.pl' to test.
> 

Thank you for the review. But I cannot test it... The build needs '--enable-tests'? But the script returns the error that message means '.' is not a command or executable problem. (I don't know what message is displayed on Windows-en.)

I checked-in the patch again. sorry for the risk. I'm tracking the tinderbox now.
(In reply to comment #84)
> not a command or executable problem. (I don't know what message is displayed on

executable program.
re-backed-out. The test is failed again. I don't have no idea...

# I'll sleep soon, sorry.
Masayuki, according to the log there is a space character after the '=' character in the output that I didn't see in rev. 1.5 of simplemail.out.
(In reply to comment #75)
> (In reply to comment #72)
>  It might not be a good idea to determine whether to accept this change
> (solely) based on what you suggested. For sure, that way we can isolate the
> performance of our line breaker from other noises, but what we need to know
> is whether this change has any  performance impact on the rendering of
> 'real-world' web pages rather than some 'synthetic' web pages. That's why I'm
> mildly worried about testing with image rendering off.

I don't want to reject this change. If the Tp numbers get hurt, and I think that is quite likely based on the results of comment #76, then we should try to optimize instead of throwing the patch out.
It would be useful to also benchmark text from a regular English Web page.
Attached patch Patch fro test tool (obsolete) — Splinter Review
Thank you, jag. carrying over the review flag.
Attachment #229151 - Attachment is obsolete: true
Attachment #229213 - Flags: review+
checked-in.

# I add the build flag '--enable-tests', the script run on my environment too.
# But the tool returns error, "Unable to create a parser : 0x80040154".
# I'm trying risky check-in, sorry.
now, tinderbox is green. Let's watch the tp.
Whiteboard: Let's watch tp in Trunk tinderbox
Line breaks are very different now on this page: http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=PhoenixTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=day&mindate=&maxdate=&cvsroot=/cvsroot
when I use the default setting of a new profile and I compare a 1.9a1_2006071316 build with a 1.9a1_2006071322.
I'm telling it here because I don't know if it's intended.
This very likely caused about 5ms Tp regression on btek.

I think we should look hard at how this can be optimized for ASCII text. One thing that might help is to get rid of the virtual call, perhaps by moving linebreaker to layout. We should also look at optimizing the JISX4051 path for ASCII characters. We may be able to define some specialized, small, fast tables just for cases where both characters in CanBreakBetween are ASCII (< 128).
(In reply to comment #93)
> Line breaks are very different now on this page:

Yeah, of course. The current trunk build is similar to IE than old.
# But should not we break before/after '-' if the side of it is a numeric??
Attached patch Patch for optimizing rv1.0 (obsolete) — Splinter Review
Roc, how about this patch?

result:
(1) URL
AVG. 29896.8 -> 29288.9
MED. 30031.5 -> 29250.0

(2) English
AVG. 19370.8 -> 18887.5
MED. 19375.0 -> 18843.5
Attachment #229267 - Flags: superreview?(roc)
Attachment #229267 - Flags: review?(roc)
Whiteboard: Let's watch tp in Trunk tinderbox → Needing optimization
I tested with our test tool.

current:
Avg. Median : 1186 msec
Average     : 1568 msec
Minimum     : 530 msec
Maximum     : 10336 msec

patched:
Avg. Median : 1179 msec
Average     : 1606 msec
Minimum     : 530 msec
Maximum     : 10937 msec
Attached patch Patch for '+', '=' and '-' (obsolete) — Splinter Review
Now, '+', '=' and '-' allows to breaking the line on both before and after it.
But if there is not enough space (e.g., table layout), a line only has these characters. I.e.,

2000-01
-
01T00:00

I think that they should not allow to break on before. And hyphen is used by date separator some locales. Like slash, we should not break on both before/after if the around characters are numeric.
Attachment #229273 - Flags: review?(jshin1987)
I'm not sure exactly what you're comparing and what the numbers mean in comments #96 and #97. Probably we shold file followup bugs for optimization or further tuning of the line-break algorithm.
(In reply to comment #97)
> I tested with our test tool.

> Avg. Median : 1179 msec
> Average     : 1606 msec
> Minimum     : 530 msec
> Maximum     : 10937 msec

Min is half of median and max is 6-10 times as large ? 
Do you believe these numbers (and differences in them) mean anything? 


(In reply to comment #100)

> Min is half of median and max is 6-10 times as large ? 
> Do you believe these numbers (and differences in them) mean anything? 

Ignore the above. Sorry I was confused as to what min and max mean having run my last page load test several months ago. If I were you, I would run the test 10 times with each run loading each test page 5 times. From each of 10 test runs, I'd take the average median and apply t-test to 2 sets (one with patch and the other without patch) of those 10 numbers.

comment 96 is the result of the testcases of comment 76. They are very large simple testcases. The score means page loading time in milliseconds.

comment 97 is the result of the page loading times test tool of mozilla.org. I think that the average time is not reliable. Because the maximum score is very large in my environment. But instead of it, I think that median is reliable.

# I'll file new bugs.
Attachment #229267 - Attachment is obsolete: true
Attachment #229267 - Flags: superreview?(roc)
Attachment #229267 - Flags: review?(roc)
Attachment #229273 - Attachment is obsolete: true
Attachment #229273 - Flags: review?(jshin1987)
I filed bug 344778 and bug 244780.
Whiteboard: Needing optimization
(In reply to comment #103)
> I filed bug 344778 and bug 244780.
> 

Sorry. bug 344778 and bug 344780.
Is it a bug or a feature that it now breaks after a "."? For an example see any Bonsai query, it breaks like this in the "Who" column:
"neil%
parkwaycc.
co.uk"
In IE 6 it breaks like this:
"neil%
parkwaycc.co.uk"

(In reply to comment #105)
> Is it a bug or a feature that it now breaks after a "."? For an example see any
> Bonsai query, it breaks like this in the "Who" column:
> "neil%
> parkwaycc.
> co.uk"
> In IE 6 it breaks like this:
> "neil%
> parkwaycc.co.uk"
> 

We are discussing it in bug 344780. Please comment to it.
Unfortunately I think we can't target this for FF2. There are obviously going to be several followup issues which need to be sorted out and we just don't have time.

On trunk, perhaps we should just go ahead and implement UAX#14.
(In reply to comment #107)
> Unfortunately I think we can't target this for FF2. There are obviously going
> to be several followup issues which need to be sorted out and we just don't
> have time.

Yeah, right. There are very many issues, more than I thought.

> On trunk, perhaps we should just go ahead and implement UAX#14.

Yes. But the three regressions are good issues after implement UAX#14. I think that they may help to implement UAX#14.
Target Milestone: mozilla1.8.1 → mozilla1.9alpha
> Yes. But the three regressions are good issues after implement UAX#14. I think
> that they may help to implement UAX#14.

I'm not sure what you mean. Do you mean that after we implement UAX#14 those bugs will still exist, so we will still need to fix them?

I think we should actually back this patch out to get back to our old behaviour for now. Then let's figure out what changes need to be made to the UAX#14 spec (if any) to make it suitable for the Web. Then implement that. Does that sound OK?
(In reply to comment #109)
> > Yes. But the three regressions are good issues after implement UAX#14. I think
> > that they may help to implement UAX#14.
> 
> I'm not sure what you mean. Do you mean that after we implement UAX#14 those
> bugs will still exist, so we will still need to fix them?

Yes. (But of course, they may not be same.) I think that the our experience will help for UAX#14 works.

> I think we should actually back this patch out to get back to our old behaviour
> for now. Then let's figure out what changes need to be made to the UAX#14 spec
> (if any) to make it suitable for the Web. Then implement that. Does that sound
> OK?

I think that we should use current trunk build and check/fix the regressions. E.g., we might not be able to find the Germany problem in bug 344780 if we didn't check-in the patch. I think that the many eyes will help us.

# Note that I'm reading UAX#14 for better implementation,
# But I don't know I (or somebody) can implement new line breaker before Fx3.
# The current behavior is better than Fx2 if we cannot fix bug 56652 in Fx3.
And, nobody working on bug 56652 :-(
I don't think we should approach this by fixing bugs until people are happy. We need a written specification that controls exactly how our linebreaking works. The linebreaker code is optimized for speed, so it's not so easy to understand. Without a spec, we can't tell the difference between bugs in the linebreaker code and intended behaviour. If we just keep on hacking the code to fix bugs then we won't have a spec.

I think probably our spec should just be UAX#14 plus a list of changes. We should try to keep our changes to the minimum. This is probably a situation where whatever we do, someone will be unhappy; following UAX#14 lets us tell the unhappy people "sorry, it's the standard".

I also think that we shouldn't keep on making changes over time. This is an important compatibility issue. We should try to get it good enough for FF3 that we don't have to change it again for a long time.
Roc, but Mozilla Japan needs this fix. Therefore, I have another compromise idea.
1. back out the patch.
2. create simple UAX#14 Line Breaker, but that supports only below U+100. so we use it only in |nsTextTransformer::ScanNormalAsciiText_F|.

I think that if support range is only below U+100, I can create it quickly, and bug 344816 is not occurred, so we can lower the possibility of regression to minimal. How about this approach?
It sounds like you're looking for a fix which can go into FF2, but really, that is not realistic unfortunately :-(. 
(In reply to comment #114)
> It sounds like you're looking for a fix which can go into FF2,

No, I have already given up it.
I want only a guarantee that this bug is fixed on Fx3.
# I think that bug 56652 needs very long time, I don't think that the bug is fixed before Fx3.
(In reply to comment #115)
> I don't think that the bug is fixed before Fx3.

If it will be released in 2007Q1.
(In reply to comment #117)
> Why is it so hard?

The creation may not be difficult (but I'm not so.). But I think that the testing is very difficult. Because all languages are affected. Therefore, I think that the bug should be fixed in alpha stage or early beta stage, because the risk is very large.
(In reply to comment #118)
> (In reply to comment #117)
> > Why is it so hard?
> 
> The creation may not be difficult (but I'm not so.). But I think that the
> testing is very difficult. Because all languages are affected. Therefore, I
> think that the bug should be fixed in alpha stage or early beta stage, because
> the risk is very large.

I agree it's risky and we should land before 1.9beta1 but that isn't likely to happen this year. Can't we implement UAX#14 some time in the next few months?
(In reply to comment #119)
> Can't we implement UAX#14 some time in the next few months?

I don't know. I have IME disable working on Mac now. After it, I can work on it. But the current work needs more time. And unfortunately, I don't know that somebody works on UAX#14, now.
(In reply to comment #120)
> I have IME disable working on Mac now.

Oops, and I have some cairo/thebes work too.
How about this: let's undo the linebreaker changes now. Over the next few months if you have time to work on UAX#14 and we can get it done, that's great. If we start getting close to 1.9beta1 and we don't think we'll have UAX#14 working, then we implement a solution restricted to ASCII characters. Does that sounds reasonable?
(In reply to comment #122)
> How about this: let's undo the linebreaker changes now. Over the next few
> months if you have time to work on UAX#14 and we can get it done, that's great.
> If we start getting close to 1.9beta1 and we don't think we'll have UAX#14
> working, then we implement a solution restricted to ASCII characters. Does that
> sounds reasonable?
> 

O.K. Good! I back out the patch.
(In reply to comment #123)
> O.K. Good! I back out the patch.

Oops, sorry, I cannot back-out now, I'll back out after 3 or 4 hours.
# If you want to back-out now, please do it.
Target Milestone: mozilla1.9alpha → mozilla1.9beta
After applying patch, I found a site that won't break after normal space (0x20) and the text overlapped to the next cell.
http://de.selfhtml.org/html/referenz/attribute.htm#div

Screen Shot:
http://img254.imageshack.us/img254/9805/unbenannt28ck.jpg

Could you please check this out?
The testcases in this bug all WFM now on the trunk. Is this fixed or is more work still (and hence more testcases) needed?
Sorry for the spam, the testcases were behaving nicely because of my screen width. When I resized it, I still see the issue.
I still see the bug in the original testcase.

This would be very easy to fix now, and probably the performance impact would be minimized too since line break opportunites are stored persistently in textruns. To fix this now you'd just have to change nsLineBreaker to always call nsILineBreaker instead of only when a CJK character is detected.
roc:

Can I work on this now?
Sure. But before you write code, can you explain exactly what your plan is?
I want to change:

1. change the character class definition for some 8 bit characters (like '/'). But it should not change the serializeres behavior (see bug 344816), for this issue, I need to check the code before writing the code.

2. change the nsLineBreaker::AppendText 8bit version which should use nsContentUtils::LineBreaker()->GetJISx4051Breaks() if needed. But for performance, I may use the approach which is same as bug 229645. But it makes to down the maintenance performance...
(In reply to comment #133)
> in 2.
> bug 229645.

it is bug 344778.

Under what circumstances do you want to call GetJISx4051Breaks() for 8-bit text?

In bug 336959 there's a patch to replace the nsILineBreaker implementation with Pango on Linux (and a desire to do the same with Uniscribe and ATSUI). If we call GetJISx4051Breaks() always (bad name!) then these two patches together will mean using platform linebreaking always. What do you think about that? I'm not all that comfortable with the idea but I don't know what's better.
thank you roc.

On gecke1.9 final, the spec of line-breaking depends on platform? Is it a good choice for us? I don't know these three platforms use what specs.

O.K. I'll only change the GetJISx4051Breaks being called from nsLineBreaker::AppendText 8 bit verstion. And if these platforms doesn't have compatibility for ASCII, we can override it.
> On gecke1.9 final, the spec of line-breaking depends on platform? Is it a good
> choice for us?

No, that's why I'm not comfortable with it. But for some languages like Thai we really have to use the platform libraries. The question is: how can we divide the functionality so we use platform libraries for line breaking of certain character ranges (e.g. Thai) and use our own code for the rest? The current suggestion is "use the platform libraries to break a whitespace-delimited word if and only if it contains at least one CJK or Thai character" but that doesn't seem so good.

> O.K. I'll only change the GetJISx4051Breaks being called from
> nsLineBreaker::AppendText 8 bit verstion.

So you're going to make the 8-bit path call GetJISx4051Breaks for each word?
(In reply to comment #137)
> The question is: how can we divide
> the functionality so we use platform libraries for line breaking of certain
> character ranges (e.g. Thai) and use our own code for the rest?

I agree to this. In future (and if possible), we should implement them ourselves.

> > O.K. I'll only change the GetJISx4051Breaks being called from
> > nsLineBreaker::AppendText 8 bit verstion.
> 
> So you're going to make the 8-bit path call GetJISx4051Breaks for each word?

I'm not certain. We should not use same line breaking rules in rendering and serializer. Because for layout, we should take a priority for IE (and others) compatible. However, for serializer, we don't need to mind the compatibility. Especially, URL should not separate to two or more words in a mail.

Therefore, I think that the approach of bug 344778 that is using static function only for layout may be good. (for the serializer issue, for the performance) And it should be used from 16bit path too if a character is in ASCII.
Attached patch Retry to fix (work in progress) (obsolete) — Splinter Review
first shot of the retry.
Attachment #229086 - Attachment is obsolete: true
Attachment #229213 - Attachment is obsolete: true
Attached patch Retry to fix (work in progress) (obsolete) — Splinter Review
I changed some character's class from previous patch. I think that this is better.
Attachment #271370 - Attachment is obsolete: true
Attached file wrap specs for latest patch (obsolete) —
This table is a specs of each browsers.

The value 'A' means the line breakable After the character, and 'B' means Before. 'BA' means Before and After.

(C) which is the tail of the browser name means Character. (N) means Numeric. This means that they are around the character. E.g., "a$a" is a testcase for (C), "0$0" is a testcase for (N).

Fx3 means the latest patch's behavior. The blue cells are different from Fx2.

I use IE7 spec in some characters. But there are incompatible characters for IE7 yet. I think that there is no impact in layout compatibility.
pref tests result:

trunk build:
1.
Avg. Median : 1146 msec
Average     : 1348 msec
Minimum     : 459 msec
Maximum     : 4523 msec

2.
Avg. Median : 1155 msec
Average     : 1351 msec
Minimum     : 464 msec
Maximum     : 4500 msec

3.
Avg. Median : 1153 msec
Average     : 1349 msec
Minimum     : 467 msec
Maximum     : 4464 msec

patched build:
1.
Avg. Median : 1154 msec
Average     : 1349 msec
Minimum     : 463 msec
Maximum     : 4451 msec

2.
Avg. Median : 1154 msec
Average     : 1355 msec
Minimum     : 468 msec
Maximum     : 4485 msec

3.
Avg. Median : 1156 msec
Average     : 1356 msec
Minimum     : 467 msec
Maximum     : 4499 msec

Is there a little regression?
Attached patch Patch rv2.0 (obsolete) — Splinter Review
Let's go to review process. This patch cleared reftests. And works fine as Thunderbird's patch too.

This patch doesn't affect to serializer. Because nsILineBreaker::Prev/nsILineBreaker::Next are not breaking the ASCII word except when the text has CJK characters. This behavior is important, because if we break the URL to several words, the receiver MUAs cannot recover the original URL from the separated URL. So, we should break the URL only on Layout.

I Added some range to nsLineBreaker::IsComplexChar. Because they can break the lines, they should be complex characters.

This patch also breaks the line after SLASH if the next character is not numerics. *And* if the SLASH is a first character, it doesn't break line after SLASH. Because SLASH is used for root path, in this time, the file path will be broken as "/" and "foo/" and "bar/"... I think that they should be "/foo/" and "bar/". This change will help the LXR.

And I removed the declarations for U+2018 - U+201F from jisx4501class.txt. I think that them declarations are wrong. We should use default declaration in Unicode Database. (see bug 344780 comment 20)

And this patch prefers the IE compatibility. Because there are not standard specs.
Attachment #271375 - Attachment is obsolete: true
Attachment #271378 - Flags: superreview?(roc)
Attachment #271378 - Flags: review?(roc)
the previous table has a mistake, sorry.
Attachment #271377 - Attachment is obsolete: true
Comment on attachment 271378 [details] [diff] [review]
Patch rv2.0

+    return !((0x0030 <= u && u <= 0x0039) ||
+             (0x0041 <= u && u <= 0x005A) ||
+             (0x0061 <= u && u <= 0x007A) ||
+             (0x00C0 <= u && u <= 0x00D6) ||
+             (0x00D8 <= u && u <= 0x00F6) ||
+             (0x00F8 <= u && u <= 0x00FF));

I'd just use the first three clauses, the other ones aren't common enough to bother optimizing for I think.

Okay, this looks OK.

I want your specification table checked into CVS too, with the explanation you gave in comment #141. Thanks!!
Attachment #271378 - Flags: superreview?(roc)
Attachment #271378 - Flags: superreview+
Attachment #271378 - Flags: review?(roc)
Attachment #271378 - Flags: review+
Oh, a few more comments...

I'd like dbaron to say if he's OK with this sort of change.

You're making '=' an A ... why not disallow breaking around it entirely like IE and Opera, if you're going to change it?
(In reply to comment #146)
> Oh, a few more comments...
> 
> I'd like dbaron to say if he's OK with this sort of change.
> 
> You're making '=' an A ... why not disallow breaking around it entirely like IE
> and Opera, if you're going to change it?

If URL params too long, e.g.,

http://example.com/foo.cgi?abc=def&abc=def&abc=def&abc=def&abc=def&abc=def&abc=def&

the after of '?' cannot be broken. I think that we should break after '=' or '&'.
Attachment #271378 - Attachment is obsolete: true
checked-in.

-> FIXED
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Depends on: 388155
Depends on: 388602
This is causing very frequent assertions (bug 388602) that should be fixed promptly.
Depends on: 389056
Comment on attachment 272021 [details] [diff] [review]
patch for check-in

>     if (aSink) {
>       breakState[offset] = mAfterSpace && !isSpace &&
>         (aFlags & (offset == 0 ? BREAK_ALLOW_INITIAL : BREAK_ALLOW_INSIDE));
>     }
>     mAfterSpace = isSpace;
> 
>     if (isSpace) {
>-      // The current word can't have any complex characters inside it
>-      // because this is 8-bit text, so just ignore it
>+      if (offset > wordStart && wordHasComplexChar) {
>+        if (aFlags & BREAK_ALLOW_INSIDE) {
>+          // Save current start-of-word state because GetJISx4051Breaks will
>+          // set it to false
>+          PRPackedBool currentStart = breakState[wordStart];
Unfortunately this access to breakState is not protected by aSink and I often get these assertions trying to click on links or type in textareas:
###!!! ASSERTION: invalid array index: 'i < Length()', file ../../dist/include/xpcom\nsTArray.h, line 318
Also see Bug 388602 on that assertion.
Masayuki: why are you treating < and > as delimiters to be broken before/after? Was this a JISx4051 suggestion?
No, it was already defined as so. (Gecko1.8.x has the definition too.)
yeah, but for 1.8.x it was hardly ever used. Now it's used a lot. If there's no particular reason for it, we should probably not break around them ... IE doesn't and they shouldn't really be treated as delimiters.
I'll change it on bug 389056.
Flags: in-testsuite?
Depends on: 398165
Blocks: 177565
jwalden+bmo:

The testcases are on tree now. See bug 389056.
Woohoo!  Good stuff!  In the future, please feel free to flip the flag yourself (with the explanatory comment, of course), however.  :-)
Flags: in-testsuite? → in-testsuite+
You need to log in before you can comment on or make changes to this bug.