The default bug view has changed. See this FAQ.

Characters below U+0100 are not subject to line-breaking rules at all

RESOLVED FIXED in mozilla1.9alpha8

Status

()

Core
Layout: Text
P1
normal
RESOLVED FIXED
13 years ago
10 years ago

People

(Reporter: Jungshik Shin, Assigned: masayuki)

Tracking

(Depends on: 1 bug, Blocks: 1 bug, {intl, jp-critical})

Trunk
mozilla1.9alpha8
intl, jp-critical
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 23 obsolete attachments)

1.83 KB, text/html
Details
78.92 KB, image/jpeg
Details
55.79 KB, application/xhtml+xml
Details
33.98 KB, patch
Details | Diff | Splinter Review
(Reporter)

Description

13 years ago
http://xpz.chinguya.net/forums/data/xpzdoc/firefox/firefox_bug3.gif

Case A and case B in the screenshot at the URL above are almost identical and
lines should be broken in both cases (beause there are line breaking
opportunities). However, in case B, lines are not broken. I wondered why A and B
are treated differently. 

I found that in intl/lwbrk/src/nsJISX4501.cpp, JIS X 4501 rules are not applied
unless there are CJK characters in input strings. I got rid of that condition
check, but there's no difference. Then, I took a look at call sites
(layout/html/base/src/nsTextTransformer.cpp). There are two code paths in the
file, one for 'ASCII' (but it seems to be for any characters below U+0100) and
the other for 'Unicode'. In the former,  only very simplistic line breaking
rules are applied and no call is made to APIs of nsILineBreaker. I guess 'ASCII'
path is for optimization, but I'm not sure how much time we save while totally
messing up line breaking for 'ASCII' strings. (JIS X 4051 is not perfect and we
should implement UTR #14, but JIS X 4051 is still a lot better than nothing).

We need to measure the performance with the 'ASCII' code path eliminated.
(Reporter)

Comment 1

13 years ago
Created attachment 156430 [details]
a test case

This has two sets of test cases with four cases in each set. With the current
Mozilla code, JIS X 4051 rules are only applied to strings with East Asian
characters so that all cases but the fourth one in both sets go outside the
'enclosing' div. 

With my patch to to remove the CJK (East Asian) check in nsJIS4501.cpp that I
mentioned in comment #0 applied, case 3 as well as case 4 is rendered
differently from cases 1 and 2.
(Reporter)

Updated

13 years ago
Assignee: nobody → jshin
Status: NEW → ASSIGNED
(Reporter)

Comment 2

13 years ago
Created attachment 156431 [details] [diff] [review]
a patch to demonstrate the problem

This patch does two things:

1. get rid of CJK check in nsJISX4501.cpp. With this, once nsILineBreaker APIs
are invoked, they're applied regardless of whether there's a CJK character or
not
2. change the line breaking rule around 'slash' per UTR #14 (this was what I
wanted to do initially, but it didn't work for ASCII only strings so that I
took a look at call sites)
(Reporter)

Comment 3

13 years ago
Created attachment 156435 [details]
screenshot of the test case with attachment 156431 [details] [diff] [review] applied

This screenshot demonstrates that nsILineBreaker APIs are not invoked for
strings entirely made of characters below U+0100 (1st and 2nd cases)
(Reporter)

Comment 4

13 years ago
> I guess 'ASCII'
> path is for optimization, but I'm not sure how much time we save while totally
> messing up line breaking for 'ASCII' strings. 

Robert, what do you think of this? 

Comment 5

13 years ago
>This screenshot demonstrates that nsILineBreaker APIs are not invoked for
>strings entirely made of characters below U+0100 (1st and 2nd cases)

nsTextFrame only attempts linebreaking after the first "word" (modulo that the
first word isn't the fisrt one on the line). As soon as the first word is done,
it turns its |canBreakBefore| the following words to true. 

In your screenshot with ASCII only text, all that text is considered as a single
(first) word. No surprise therefore that linebreaking isn't even attempted.

If you want to add a line breaking opportunity at '/', you might want to tweak
|nsTextTransformer::GetNextWord| so that it stops after seeing '/', thereby
allowing that fragment to be seeing as a "word" by nsTextFrame::MeasureText.

Comment 6

13 years ago
Note: I don't anticipate an excessive perf degradation on GfxWin & GfxGTK
because there is another optimization to measure several words at once on these
platforms based on an estimated number of chars that can fit.
(Reporter)

Comment 7

13 years ago
(In reply to comment #5)

> nsTextFrame only attempts linebreaking after the first "word" (modulo that the
> first word isn't the fisrt one on the line). As soon as the first word is done,
> it turns its |canBreakBefore| the following words to true. 
> 
> In your screenshot with ASCII only text, all that text is considered as a single
> (first) word. No surprise therefore that linebreaking isn't even attempted.

Even after the first word, nsILineBreaker APIs are never called for 8bit
code-path as far as I can tell. 
That is, nsTextFrame break lines ONLY at whitespaces for strings made up of
characters below U+0100. 
 
> If you want to add a line breaking opportunity at '/', you might want to tweak
> |nsTextTransformer::GetNextWord| so that it stops after seeing '/', thereby
> allowing that fragment to be seeing as a "word" by nsTextFrame::MeasureText.


It's not just '/'. There are a whole lot of line break opportunities (?, +, =,
etc)  in test cases attached, but none of them is made use of because '8bit
code-path' never invokes nsILineBreak APIs.  

Comment 8

13 years ago
If you make the ASCII path of |GetNextWord/ScanNormalAsciiText| to stop after
the whitespace or friends ('/', '+', etc, which can presumably be folded in an
API), why would you need other nsILineBreak APIs there (for ASCII text)? It
seems to me that the very fact that |GetNextWord| is stopping should be enough
to cater for the ASCII situation. Indeed all such stops would be the (only)
linebreaking opportunities (for ASCII text).
(Reporter)

Comment 9

13 years ago
As a stop-gap measure, what you suggested might be sufficient. However, even for
ASCII-only text, it's not that simple. See UAX #14
(http://www.unicode.org/reports/tr14/) or our code (JIS 4051 implementation :
http://lxr.mozilla.org/seamonkey/source/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
The file is misnamed. It should have been nsIJSx4051LineBreaker.cpp) 

(Reporter)

Comment 10

13 years ago
Created attachment 164887 [details] [diff] [review]
patch v2

I implemented what rbs suggested in nsTextTransformer.cpp while keeping
attachment 156431 [details] [diff] [review] for characters above U+0100 which are not CJK.
Attachment #156431 - Attachment is obsolete: true

Comment 11

13 years ago
Are you happy with the perf? The patch only affects ScanNormalAsciiText_F. Are
you leaving ScanNormalAsciiText_B out of this intentionally?

For a solution that would cover all the cases in a definitive manner, you could
keep the prev ch, the current ch, and fold the isPrevCharBreakable test in
nsILineBreak::BreakInBetween(&prevch, 1, &ch, 1) -- or a custom fast API for
this case. What do you think? That should address your concerns for UAX #14
(assuming nsILineBreak is already up to sratch and that the API is not a perf hog).
(Reporter)

Comment 12

12 years ago
*** Bug 277610 has been marked as a duplicate of this bug. ***
(Reporter)

Updated

12 years ago
Blocks: 95067
(Reporter)

Updated

12 years ago
Blocks: 9101

Comment 13

12 years ago
Does this have any connection with Bug 56652?
(Reporter)

Comment 14

12 years ago
Created attachment 188832 [details] [diff] [review]
patch v3

This does what rbs suggested except that it does only for Scan..Ascii_F.
|Scan...Ascii_B| already has some mechanism to stop at punctuation marks so
that I decided to leave it alone for now.
Attachment #164887 - Attachment is obsolete: true
(Reporter)

Comment 15

12 years ago
Created attachment 188833 [details] [diff] [review]
patch v3 (alt)

I tried to extend the range of characters handled by a new method
(CanBreakBetweenASCII) to Latin1, but perhaps I shouldn't for now...
(Assignee)

Comment 16

12 years ago
Jungshik:

Don't you request review for v3?
(Reporter)

Comment 17

12 years ago
That patch doesn't work any more because of a recent change in line breaking
code. It also has other problems. I made a new patch last week, but haven't had
the time to test it fully (it works as far as the line beraking is concerned),
but I have yet to measure how much it slows down the rendering of pure ASCII pages. 
(Assignee)

Comment 18

12 years ago
O.K. thanks.

FYI. Mozilla Japan thinks that this bug is very important. Because in Japan,
this bug is very famous as one of annoying bugs. I hope that this bug has high
priority in your works. We hope that this bug is fixed before Firefox 2.0 release.
(Reporter)

Comment 19

12 years ago
I tried to build an optimized build with JPROF enabled only to realize that
'setupProfilingStuff()' [1] is missing so that I can't build that way.
Certainly, it used to work before.. I really need to do profiling ... I'll ask
around as to what went wrong with JPROF. 



[1]  http://lxr.mozilla.org/seamonkey/source/toolkit/xre/nsAppRunner.cpp#1997
(Reporter)

Comment 20

12 years ago
Created attachment 197981 [details] [diff] [review]
patch v4 (still work in progress)
Attachment #188832 - Attachment is obsolete: true
Attachment #188833 - Attachment is obsolete: true
(Reporter)

Comment 21

12 years ago
(In reply to comment #19)
> I tried to build an optimized build with JPROF enabled only to realize that
> 'setupProfilingStuff()' [1] is missing so that I can't build that way.

Ooops. I thought I had tools/jprof in my tree, but it appears that I haven't
checked it out since I overhauled my source tree the other day. I'm now building
an optimized build.
(Assignee)

Updated

11 years ago
Keywords: jp-critical
(Reporter)

Comment 22

11 years ago
Created attachment 212391 [details] [diff] [review]
update

still work in progress. I really need to measure the perf.
Attachment #197981 - Attachment is obsolete: true
(Reporter)

Comment 23

11 years ago
With the latest patch, line breaks work as I intended. As for SHY (soft-hyphen), lines can be broken before and after it (I need to change the rule so that lines can break only after SHY). However, on Windows where SHY is 'blackholed', it doesn't show up at all no matter where it is (i.e. even at the end of a line. I tried to prevent SHY from being pulled into 'blackhole' at the end of a text-run, but it didn't work so far.). On the other hand, SHY is visible everywhere on other platforms where bug 205387 hasn't been fixed (so that SHY is not turned into 'nothingness'). 

Comment 24

11 years ago
> I tried to prevent SHY from being pulled into 'blackhole' at the
> end of a text-run, but it didn't work so far.

What if you substitute it with MINUS -- i.e., it in the event where the linebreaker does break after SHY, it is transformed into MINUS on that spot (to pick a visible glyph), so that you don't have to go down to gfx to be stucked in its 'blackhole'.
(Reporter)

Comment 25

11 years ago
(In reply to comment #24)
> > I tried to prevent SHY from being pulled into 'blackhole' at the
> > end of a text-run, but it didn't work so far.
> 
> What if you substitute it with MINUS -- i.e., it in the event where the
> linebreaker does break after SHY, it is transformed into MINUS on that spot (to
> pick a visible glyph), so that you don't have to go down to gfx to be stucked
> in its 'blackhole'.

I thought about it, but didn't dare to look into nsTextFrame code ;-). Perhaps what I can do is :

1. Replace SHY with hyphen-minus at the end of a line probably here 
(http://lxr.mozilla.org/seamonkey/source/layout/generic/nsTextFrame.cpp#3171)

2. On platforms where bug 205387 is not fixed (I hoped fixing bug 121540 on Mac OS X would fix it for free there, but at least patch 7r1  didn't...) , I also have to get rid of SHY's at places other than at the end of a line. This part may have a performance implication....  Would it help to flag whether SHY is present or not during the 'scanning'(? transforming) phase and use the flag in |nsTextFrame::RenderString|? Sure, it would help, but the question is whether it's worth the effort. I'll measure 'pageload'  time and see how it goes.
 


(Assignee)

Comment 26

11 years ago
(In reply to comment #24)
> What if you substitute it with MINUS -- i.e., it in the event where the
> linebreaker does break after SHY, it is transformed into MINUS on that spot (to
> pick a visible glyph), so that you don't have to go down to gfx to be stucked
> in its 'blackhole'.

I think that it's good idea for rendering. However, if we do it on nsTextTransformer, isn't it changing the text at copying to clipboard, right? I think that we should not replace it on nsTextTransformer, we should replace on nsTextFrame::PrepareUnicodeText.
# Note that we can get that the nsTextFrame is the tail of the line or not so by
# |IsEndOfLine(mState)|.
(Reporter)

Comment 27

11 years ago
(In reply to comment #26)


> think that we should not replace it on nsTextTransformer, we should replace on
> nsTextFrame::PrepareUnicodeText.

afaict, rbs meant nsTextFrame (not nsTextTransformer). Also see his comments in bug 205387. After writing my previous comment, I also thought about doing something in nsTextFrame::PrepareUnicodeText (instad of nsTextFrame::RederString).
(Reporter)

Comment 28

11 years ago
(In reply to comment #23)

> end of a line. I tried to prevent SHY from being pulled into 'blackhole' at the
> end of a text-run, but it didn't work so far.). 

It turned out that there was a typo in my code leading to a compile time error in gfx/src/windows (when I ran make in 'gfx layout/generic layout/build browser/app'). With that typo fixed, I ran it again and found that SHY behaved as it should on Gfx:Win! However, what's suggested by rbs and Masayuki is a lot better than hacking Gfx code so that I'll try to do that (I've not been successful so far.)

BTW, bug 205387 is likely to be fixed on Mac OS X very soon. 
Target Milestone: --- → mozilla1.8.1
(Reporter)

Comment 29

11 years ago
Created attachment 212686 [details] [diff] [review]
a new patch with an ad-hoc Gfx:Win fix for SHY

This is not for review nor for check-in because I wanna take an XP approach in nsTextFrame instead of hackish fixes in Gfx implementations. I'm just uploading it here as a check-point. 

Perhaps, it's a good idea to fix only what this bug is about here (leaving alone SHY) and then fix the SHY problem separately in bug 9101.
Attachment #212391 - Attachment is obsolete: true
You should be aware that we need to remake nsTextFrame on the trunk, and in fact plans for this are well under way. See
http://wiki.mozilla.org/Gecko2:NewTextAPI
and
http://wiki.mozilla.org/Mozilla2:GFXTextRun

So, major hacks to nsTextFrame are really only useful if they're worth doing just for the 1.8.1 branch.
(Reporter)

Comment 31

11 years ago
(In reply to comment #30)
> You should be aware that we need to remake nsTextFrame on the trunk, and in

Don't worry. I'm very well aware of that (I have been reminded of that work almost 10 times by now). I thought I wrote this work is for 1.8.x, but apparently I didn't other than targetting this bug at 1.8.1. This  and SHY issue are very critical and should be fixed for 1.8.1/FF2. Simply, we can't wait until FF3.
(Reporter)

Comment 32

11 years ago
Created attachment 212751 [details] [diff] [review]
1.8.x branch patch

This is for 1.8.x branch. SHY-handling part is not included, but SHY almost works in that lines can break after SHY although it's 'discarded' regardless of its location (that is, even at the end of a line, it's not visible).

I wonder how much pageload degradation can be taken. According to my primilinary pageload measurements, my optimized trunk build with this patch (well, trunk is deCOMtanimated so that my trunk patch is a little different) takes 1.9% longer than the latest trunk build I got from mozilla.org (1691ms vs 1659ms). Although I used exactly the same set of build options, compiers are different so that it'll be better to compare my own build with the patch with one without the patch. The pageload perf. differnece is likely to be a little bigger for 1.8.x than for trunk because of extra overhead due to 'COMtanimation' in 1.8.x
(Assignee)

Comment 33

11 years ago
Jungshik:

We cannot change the interface of nsI*.h. See
http://developer.mozilla.org/devnews/index.php/2006/01/30/tree-rules-for-the-18-branch/
The 'pseudo-interface' means nsI*.h that is not created by IDL.
(Reporter)

Comment 34

11 years ago
(In reply to comment #33)

Thanks for remining me of that. 
 
> We cannot change the interface of nsI*.h. See
> http://developer.mozilla.org/devnews/index.php/2006/01/30/tree-rules-for-the-18-branch/
> The 'pseudo-interface' means nsI*.h that is not created by IDL.

Sigh ... This long standing bug and bug 9101 are about to be solved, but this sorta got in the way...
even adding a new method is not allowed? (adding it at the end might work? ....)  Probably, it's possible to cram a new method into |BreakInBetween| but with some perf. loss and additional ugliness... 

I'll write to drivers@ to ask what to do here before going any further.
(Assignee)

Comment 35

11 years ago
mconnor said, adding new interface instead of changing is allowed on 1.8 branch.
(Reporter)

Comment 36

11 years ago
(In reply to comment #35)
> mconnor said, adding new interface instead of changing is allowed on 1.8
> branch.
That's also what I heard on IRC and by mail. I'll add nsILineBreaker2 or nsILineBreaker_MOZILLA_1_8_BRANCH inheriting nsILineBreaker with one new method added. 

(Reporter)

Comment 37

11 years ago
Created attachment 212907 [details] [diff] [review]
1.8 branch patch with a new interface (still WIP)
Attachment #212751 - Attachment is obsolete: true
(Reporter)

Comment 38

11 years ago
With attachment 212907 [details] [diff] [review], my 1.8 branch build on Linux (with the same build options as 1.8 branch build at moz.org) is 1.6% slower than mozilla.org's 1.8 branch build in terms of pageload time. (1670ms vs 1643ms : Avg. Median) Test pages are 100% English so that some slow-down was expected. I guess the difference in pageload time of non-Western-European pages is even smaller than that or my patch might be even a little faster. 
(Reporter)

Comment 39

11 years ago
Created attachment 213337 [details] [diff] [review]
updated branch patch

The previous patch added nsILineBreakerFactory2 unnecessarily, which has a lot of ripple effects in layout/content/editor. It also changed |nsIDocument| which I was not supposed to change. Per IRC discussion, in this patch, I just added nsILineBreaker_MOZILLA_1_8_BRANCH inheriting nsILineBreaker and use QI in nsTextTransformer ctor to get nsIL...1_8_BRANCH. I could have QI'd elsewhere, but doing it in nsTextTransformer ctor seems to be least expensive perf-wise. 

I haven't yet measured perf. with this patch although I made about 10 measurements with and without the previous patch. The resuts are mixed. Most of time my optimized build with the patch is slower by a few percent points, but in one case it's faster by a few percent points than my optimized build without the patch. (in terms of pageload time: 30+ web sites all in ASCII).
Attachment #212907 - Attachment is obsolete: true

Comment 40

11 years ago
This is a general problem of breaks. Where should breaks occur when a width specification would require a break, or when a browser window size would exclude some text? 

I suggest that a break should be placed at the last point in the text line where it does not split up a word or number. (note that commas, periods, "E", "+", "i", and "-" may form part of a number). 

If there is no such place in the text, I would break after the last character which can be displayed, without any inserted hyphen. I suggest it gets very messy if you try to hyphenate on the basis of sylables, especially with multiple languages

This problem shows up in tables but it could show up for any block where there is a width specified, inherited, or implied.

Note that absolute widths may specify block widths exceeding the window, which should be respected.

tOM
(Reporter)

Comment 41

11 years ago
Masayuki, I really love to fix this bug before FF 2.0, but I'm afraid I can't squeeze out time to fix this bug for FF 2.0 (even if  it's not too late to put this in for FF 2.0). 

My current patch does resolve most of problems (including 'SHY' issue : bug 9101 where bug 205387 is fixed) more or less(for FF3, we need to take a better approach, but for FF2, this 'stop-gap' patch should do), but I don't have time to test and measure performance. Besides, I was held back by that test routines in layout/generic/nsTextTransformer.cpp (that needs to be revised to match a new line breaking rule or to be made non fatal). 

In short, if you wanna take over this bug, please go ahead. 
(Assignee)

Comment 42

11 years ago
(In reply to comment #41)
> In short, if you wanna take over this bug, please go ahead. 

O.K. If I can, I will do it.
(Assignee)

Updated

11 years ago
Assignee: jshin1987 → masayuki
Status: ASSIGNED → NEW
Priority: -- → P1
(Assignee)

Updated

11 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 43

11 years ago
I looked the code, but for SHY problem, we need large change in nsTextFrame and nsLineLayout. I think that we should fix it after refactoring it. I'll attach the new patch without SHY problem.
(Assignee)

Comment 44

11 years ago
Comment on attachment 212686 [details] [diff] [review]
a new patch with an ad-hoc Gfx:Win fix for SHY

Jungshik:
> +  else if(aChar1 == U_SLASH)
> +    c1 = NUMERIC_CLASS;

What means this?
I cannot understand this.
(Assignee)

Comment 45

11 years ago
Created attachment 226990 [details] [diff] [review]
Patch rv1.0 for URI LB

Jshin:
Could you review the patch?
I cannot found the problem of comment 44.
And we don't need to change the |ScanNormalAsciiText_B|. Because it is the reverse of |ScanNormalAsciiText_F_ForWordBreak| that is using only for selection per word. We don't need to change these functions.
Attachment #212686 - Attachment is obsolete: true
Attachment #213337 - Attachment is obsolete: true
Attachment #226990 - Flags: review?(jshin1987)
(Assignee)

Comment 46

11 years ago
Created attachment 226993 [details] [diff] [review]
Patch rv1.1 for URI LB

That breaks NBSP. This patch cares it.
Attachment #226990 - Attachment is obsolete: true
Attachment #226993 - Flags: review?(jshin1987)
Attachment #226990 - Flags: review?(jshin1987)
(Assignee)

Comment 47

11 years ago
Created attachment 227004 [details] [diff] [review]
Patch rv1.2 for URI LB

The previous patch breaks some statistics tables. Because comma and period break lines in numeric. And IE doesn't break by these characters in URI. I think that we should not break by these characters in Latin1 text. (Normally, Latin1 text may have space after comma or period in most cases.)
Attachment #226993 - Attachment is obsolete: true
Attachment #227004 - Flags: review?(jshin1987)
Attachment #226993 - Flags: review?(jshin1987)
Big textframe changes are coming, so don't start working on textframe refactorings yet ...
(Assignee)

Comment 49

11 years ago
(In reply to comment #48)
> Big textframe changes are coming, so don't start working on textframe
> refactorings yet ...

Of course, I don't start the work. Current patch that fixes only without SHY problem is changing only nsTextTransformer in layout. Or, Will be changed the nsTextTransformer by the refactoring?
A change that small would be OK.

Realistically, it may be too late to land this for FF2. this is going to be risky.
(Assignee)

Comment 51

11 years ago
(In reply to comment #50)
> Realistically, it may be too late to land this for FF2. this is going to be
> risky.

That may be right. But this bug is very unpopularity in Japan. Because some sites are broken by this bug. Therefore, I should do my best until time is over.
(Reporter)

Comment 52

11 years ago
(In reply to comment #45)

> I cannot found the problem of comment 44.

 What did you mean by the above sentence? 

(In reply to comment #43)
> I looked the code, but for SHY problem, we need large change in nsTextFrame and
> nsLineLayout. 

  For FF3, we need to do it 'the right way', but for FF2 (if we can make it), just passing SHY intact to nsLineBreaker (instead of killing it in nsLayout) and treating it the same way as a hyphen is treated  in nsLineBreaker 'almost' works. (on Windows and Mac OS X, it should work as far as rendering is concerned.) 

Nonetheless, it may be a good idea to solve one problem at a time.

(In reply to comment #47)

> The previous patch breaks some statistics tables. Because comma and period
> break lines in numeric. 

 What are some statistics tables? Did you mean the line break checking function?
That shouldn't fail if you don't touch SHY, should it? 

> And IE doesn't break by these characters in URI. I
> think that we should not break by these characters in Latin1 text. (Normally,
> Latin1 text may have space after comma or period in most cases.)

I guess UTR #14 doesn't agree with IE. 

BTW, please, do measure the performance. One of the primary reasons this bug had been  held up was that I couldn't find time to measure performance. While you're at it, you'd better make a patch for 1.8.x branch (similar to attachment 213337 [details] [diff] [review]) and measure its performance as well.
(Assignee)

Comment 53

11 years ago
(In reply to comment #52)
> (In reply to comment #45)
> 
> > I cannot found the problem of comment 44.
> 
>  What did you mean by the above sentence?

Sorry, for my strange English. I cannot understand why the code is needed. Because I cannot find any problem by removing the code.

> (In reply to comment #43)
> > I looked the code, but for SHY problem, we need large change in nsTextFrame and
> > nsLineLayout. 
> 
>   For FF3, we need to do it 'the right way', but for FF2 (if we can make it),
> just passing SHY intact to nsLineBreaker (instead of killing it in nsLayout)
> and treating it the same way as a hyphen is treated  in nsLineBreaker 'almost'
> works. (on Windows and Mac OS X, it should work as far as rendering is
> concerned.) 

I think that your idea is a risky. Your gfx hacks break some cases that are with letter-spacing, word-spacing, justification and small caps.

> (In reply to comment #47)
> 
> > The previous patch breaks some statistics tables. Because comma and period
> > break lines in numeric. 
> 
>  What are some statistics tables? Did you mean the line break checking
> function?
> That shouldn't fail if you don't touch SHY, should it? 

The period and comma break a line after them. But I think that they should not break the line. Because they are used by in numeric, domain and file name. Please wait next my comment. I'll check UAX14.

> > And IE doesn't break by these characters in URI. I
> > think that we should not break by these characters in Latin1 text. (Normally,
> > Latin1 text may have space after comma or period in most cases.)
> 
> I guess UTR #14 doesn't agree with IE. 
> 
> BTW, please, do measure the performance. One of the primary reasons this bug
> had been  held up was that I couldn't find time to measure performance. While
> you're at it, you'd better make a patch for 1.8.x branch (similar to attachment
> 213337 [edit]) and measure its performance as well.

How to measure the performance? Using tools? Or insert the code?
(Assignee)

Comment 54

11 years ago
(In reply to comment #53)
> >  What are some statistics tables?

Sorry, I forgot this question. I meant a statistical table that includes numerics in small table cells, e.g., access log viewer. I confirmed that the patch breaks the layout of Webalizer.
http://www.mrunix.net/webalizer/
(Assignee)

Comment 55

11 years ago
I confirmed the UAX14. I understand why you write special case code for SLASH. But I think that we need same logic in the case of COLON, SEMICOLON, PERIOD and COMMA. I'll attach new patch soon.
(Assignee)

Comment 56

11 years ago
Created attachment 227210 [details] [diff] [review]
Patch rv1.3

updating. I think that we should update |ContextualAnalysis| for UAX#14.
Attachment #227004 - Attachment is obsolete: true
Attachment #227210 - Flags: review?(jshin1987)
Attachment #227004 - Flags: review?(jshin1987)
(Assignee)

Comment 57

11 years ago
http://lxr.mozilla.org/mozilla/source/tools/performance/layout/perf-doc.html
I found this document. But I cannot build the Viewer.exe for the testing.

Jungshik: What did you use for the performance testing?
(Reporter)

Comment 58

11 years ago
(In reply to comment #57)
> http://lxr.mozilla.org/mozilla/source/tools/performance/layout/perf-doc.html
> I found this document. But I cannot build the Viewer.exe for the testing.
> 
> Jungshik: What did you use for the performance testing?

I emailed you with information on pageload test. I thought you had done that test in another bug so that you'd know about it. 

(In reply to comment #55)
> I confirmed the UAX14. I understand why you write special case code for SLASH.
> But I think that we need same logic in the case of COLON, SEMICOLON, PERIOD and
> COMMA. I'll attach new patch soon.

Thank you. I'll take a look at it on Wednesday night.

(Assignee)

Comment 59

11 years ago
Thank you, Jungshik.

I tested the performance.

Current build:

Avg. Median : 1243 msec
Average     : 1487 msec

Avg. Median : 1181 msec
Average     : 1473 msec

Avg. Median : 1184 msec
Average     : 1510 msec

Patched build:

Avg. Median : 1181 msec
Average     : 1416 msec

Avg. Median : 1205 msec
Average     : 1398 msec

Avg. Median : 1185 msec
Average     : 1610 msec

I tested on Win2k with cairo/optimized/static build. And disabling the image. Because if it is enabled, the score is blurred.

I cannot find the performance regression on this test.
intl/lwbrk/src/jisx4501class.h is generated from some other files, so shouldn't you be changing those other files instead?
See
http://groups.google.com/group/mozilla.dev.tech.layout/browse_frm/thread/bd63e8fe7d7a9baa/d9bcf8b608798f75#d9bcf8b608798f75
When I wrote that, I wasn't thinking about this bug. But the approach here --- make JISX4051 rules *always* apply --- would actually fix a lot of my problems, because then there's only a limited amount of context to worry about (two characters on either side of the break point, because of contextual analysis, except for Thai) and no cross-line analysis should be needed.

This doesn't fix all the problems. We still need to modify textframes to find those extra characters if they're not available (e.g. situations like
  <b>IDEOGRAPH</b>,<b>3000</b> where we need to do contextual analysis on text from three different textframes/text nodes. But at least that should be pretty rare so we can afford to do something slow.

I'm pretty worried that always applying JISX4051 rules is a huge change. ASCII text with punctuation that used to never break can now break. But it could be the right thing to do, and now is a good time to do experiments like that on trunk.
> This doesn't fix all the problems. We still need to modify textframes to find
> those extra characters if they're not available (e.g. situations like
>   <b>IDEOGRAPH</b>,<b>3000</b> where we need to do contextual analysis on text
> from three different textframes/text nodes. But at least that should be pretty
> rare so we can afford to do something slow.

Part of this problem is that the linebreaker sometimes just uses U_NULL when context is not available. Instead it should signal the caller that it needs more context. We should really do this for your new Latin1 method too, and pass in strings just like BreakInBetween --- right?

If you like, I can fix this part in bug 343445.
(In reply to comment #61)
> I'm pretty worried that always applying JISX4051 rules is a huge change. ASCII
> text with punctuation that used to never break can now break. But it could be
> the right thing to do, and now is a good time to do experiments like that on
> trunk.

I suspect that a "righter" thing to do if we're already making huge changes is to implement http://www.unicode.org/reports/tr14/

See bug 56652, especially bug 56652 comment 18
(Reporter)

Comment 64

11 years ago
We would just make an 'ad-hoc' patch for 1.8 branch while making a 'righter' (or even 'the right') patch for trunk (UAX #14) if we didn't have to test any patch for 1.8 branch on trunk first. 

Given that FF3 is about a year away from now, it might  not be so bad to use trunk as a temporary test bed for 1.8 branch (before writing 'the right' patch for trunk). However, if the chance of getting this 'ad-hoc' patch in for 1.8 branch is virtually zero (because it's too late in the cycle), we'd better just go straight to 'the right' patch on trunk as Simon wrote.

One may ask: why not make 'the right' patch for trunk and try to port it back to 1.8 branch? That might be possible, but given the FF2 release schedule, I doubt it'll happen.

(Assignee)

Comment 65

11 years ago
(In reply to comment #64)
> we'd better just go straight to 'the right' patch on trunk as Simon wrote.

Yes, finally, we should use 'the right' patch. But bug 56652 is not active. And it  was opened 6 years ago. I think we should use 'ad-hoc' way until fixed it.

# NOTE: 'mixi' is a most popular SNS in Japan.
# That has over 3 million members in March 2006.
# Some pages in the mixi are broken by this bug.
# We need to fix this bug ASAP.
(Reporter)

Comment 66

11 years ago
(In reply to comment #65)

I may not have been clear in my comment.

> I think we should use 'ad-hoc' way until fixed it.

That's what I wrote we should do in my previous comment *unless* there's very little chance of landing this ad-hoc patch  for FF2 (1.8 branch)

As for the perf. measure, what you did is probably sufficient (I'm worried a bit about turning off images, though), but you may want to see bug 162361 comment #127. 
(Assignee)

Comment 67

11 years ago
(In reply to comment #66)
> (In reply to comment #65)
> 
> I may not have been clear in my comment.
> 
> > I think we should use 'ad-hoc' way until fixed it.
> 
> That's what I wrote we should do in my previous comment *unless* there's very
> little chance of landing this ad-hoc patch  for FF2 (1.8 branch)

I think it for trunk too. (so, even if there is no chance for Fx2.) Because we don't have a guaranty that we can fix it on Fx3. (at least, the patched build 'righter' than current build.)
UAX#14 is mostly similar to our implementation of JISX4051 in terms of the amount of context required, except for rules LB9-11a. Those rules say we should not break in a run of spaces if there are certain punctuation characters at the beginning (and possibly the end) of the run. This is quite bad because these characters are common and we may have to scan many text nodes to find the character after the spaces (if any), and if there is such a character, back up and break earlier. But then, not breaking around spaces could be confusing for some authors and users. We need to think about the best way to deal with this. But that's not really relevant to this bug...

I think we should go ahead with this approach on trunk. If it causes us problems with ASCII text, the problems would probably apply to UAX#14 as well so we get useful information. It also happens to make things easier for me! I think it is probably going to be too risky for FF2 though.
(Assignee)

Comment 69

11 years ago
(In reply to comment #68)
> I think it is probably going to be too risky for FF2 though.

I can understand your worry. But there is a beta2 stage on this August. Cannot test this patch on it by beta testers? I also think that this patch is risk, but it is very valuable too.
# I don't know how many people tests beta releases.
(Reporter)

Comment 70

11 years ago
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

r=jshin

Sorry again for the delay.
Attachment #227210 - Flags: review?(jshin1987) → review+
(Assignee)

Comment 71

11 years ago
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

Thank you, jshin.

Roc, would you review this? Or is rbs better reviewer?
Attachment #227210 - Flags: superreview?(roc)
Comment on attachment 227210 [details] [diff] [review]
Patch rv1.3

I'll review it because I've recently been poking around in this code and I want to see it land soonish.

+     // We don't need to check next character. Because SLASH breaks only after.
+     if (IS_ASCII_DIGIT(next))

I think you mean "check prev character", right?

One thing I think you probably should do is to write a microbenchmark testcase that just tests the linebreaker methods over some large, common texts many times, so we can measure the performance of that quite accurately. If this causes Tp regressions when we land it on trunk, then we should back out and try that experiment. In any case I think we should try that experiment before we land on branch.
Attachment #227210 - Flags: superreview?(roc) → superreview+
(Assignee)

Comment 73

11 years ago
Created attachment 229086 [details] [diff] [review]
Patch rv1.3.1

Thank you, roc. I'll try to create the testcase.
Attachment #227210 - Attachment is obsolete: true
Attachment #229086 - Flags: superreview+
Attachment #229086 - Flags: review+
(Assignee)

Updated

11 years ago
Whiteboard: Pending check-in

Comment 74

11 years ago
(In reply to comment #0)
> http://xpz.chinguya.net/forums/data/xpzdoc/firefox/firefox_bug3.gif
> 
> Case A and case B in the screenshot at the URL above are almost identical and
...

Screenshot no longer works.

(Reporter)

Comment 75

11 years ago
(In reply to comment #72)

> One thing I think you probably should do is to write a microbenchmark testcase
> that just tests the linebreaker methods over some large, common texts many
> times, so we can measure the performance of that quite accurately. 

 It might not be a good idea to determine whether to accept this change (solely) based on what you suggested. For sure, that way we can isolate the performance of our line breaker from other noises, but what we need to know is whether this change has any  performance impact on the rendering of 'real-world' web pages rather than some 'synthetic' web pages. That's why I'm mildly worried about testing with image rendering off. 

If this change is all about performance, we shouldn't allow any performance loss, but it's more about being correct so that I believe a 'mild' (if any) performance loss observed with 'synthetic' test cases (if that loss is not observed in more realistic test cases) shouldn't stop us from making this change.

(Assignee)

Comment 76

11 years ago
I created the testcases:
http://www.d-toybox.com/mozilla/testpages/testcasesforbenchmark.zip

This has 3 pages.
1. Only URL text.
2. Only English text.
3. Only Japanese text.

The results on my environment(Win2k). (I tested 10 times.)

current build(AVG/MED):
1. 10640.5/10750.0
2. 18679.6/18663.5
3.  8254.6/ 8312.0

patched build(AVG/MED):
1. 29126.7/29148.5
2. 18862.5/18804.5
3.  8120.3/ 8102.0

In case 1 (URL text), the performance slows down. And in case 2 and case 3, the values are very similar.

I don't think that this result has serious regression in the performance. Because the case 1 is a very special case. That has very many URL text in a paragraph.(The file size is over 5M.)
(Assignee)

Comment 77

11 years ago
(In reply to comment #75)
> That's why I'm mildly worried about testing with image rendering off. 

The test pages of the test tool are rendered quirks mode, so, the images of the pages has broken-image-box in the layout. Therefore, I think that there are no difference between the layouts(image rendering on vs off).
(Assignee)

Comment 78

11 years ago
Checked-in to Trunk. Let's watch the tp difference in the tinderbox.
Whiteboard: Pending check-in → Checked-in to Trunk, Let's watch Tp difference.

Comment 79

11 years ago
Looks like this made luna and nye go orange (test failure).
(Assignee)

Comment 80

11 years ago
Thank you, josh.

Error log:

> Running DomToTextConversionTest ...
> Begin: Thu Jul 13 12:26:01 2006
> cmd = perl TestOutSinks.pl
> End:   Thu Jul 13 12:26:04 2006
> ----------- Output from DomToTextConversionTest ------------- 
>   Testing simple copy case ...
>   Testing simple html to plaintext formatting ...
>   Testing non-wrapped plaintext in preformatted mode ...
>   Testing mail quoting ...
>   Testing misc. HTML output with format=flowed ...
>   Testing format=flowed output ...
>   Comparison failed at char 90:
>   -----
>   This is a mail with a couple of long lines and 
>   then a sig. This is used as test of the f
>   -----
>   Format=flowed test failed.
>   Testing HTML Table to Text ...
>   
>   ERROR: DOM SERIALIZER TEST FAILED:  simplemail.out
>   See http://www.mozilla.org/editor/serializer-tests.html for help.
> ----------- End Output from DomToTextConversionTest --------- 
> Error: DomToTextConversionTest exited with status 1
> Error: DomToTextConversionTest: exited with status 1
> 
> No More Errors
(Assignee)

Comment 81

11 years ago
backed-out.
(Assignee)

Updated

11 years ago
Whiteboard: Checked-in to Trunk, Let's watch Tp difference.
(Assignee)

Comment 82

11 years ago
Created attachment 229151 [details] [diff] [review]
Patch for test tool

This fixes the test failure.

Now, we can break the line before/after '='.

# But I cannot test before check-in.
Attachment #229151 - Flags: review?(mrbkap)
Comment on attachment 229151 [details] [diff] [review]
Patch for test tool

I guess this is OK. Why can't you run the tests? You should be able to go into your <objdir>/dist/bin/ directory and type ./TestOutSinks.pl (or 'perl TestOutsSinks.pl' to test.
Attachment #229151 - Flags: review?(mrbkap) → review+
(Assignee)

Comment 84

11 years ago
(In reply to comment #83)
> (From update of attachment 229151 [details] [diff] [review] [edit])
> I guess this is OK. Why can't you run the tests? You should be able to go into
> your <objdir>/dist/bin/ directory and type ./TestOutSinks.pl (or 'perl
> TestOutsSinks.pl' to test.
> 

Thank you for the review. But I cannot test it... The build needs '--enable-tests'? But the script returns the error that message means '.' is not a command or executable problem. (I don't know what message is displayed on Windows-en.)

I checked-in the patch again. sorry for the risk. I'm tracking the tinderbox now.
(Assignee)

Comment 85

11 years ago
(In reply to comment #84)
> not a command or executable problem. (I don't know what message is displayed on

executable program.
(Assignee)

Comment 86

11 years ago
re-backed-out. The test is failed again. I don't have no idea...

# I'll sleep soon, sorry.

Comment 87

11 years ago
Masayuki, according to the log there is a space character after the '=' character in the output that I didn't see in rev. 1.5 of simplemail.out.
(In reply to comment #75)
> (In reply to comment #72)
>  It might not be a good idea to determine whether to accept this change
> (solely) based on what you suggested. For sure, that way we can isolate the
> performance of our line breaker from other noises, but what we need to know
> is whether this change has any  performance impact on the rendering of
> 'real-world' web pages rather than some 'synthetic' web pages. That's why I'm
> mildly worried about testing with image rendering off.

I don't want to reject this change. If the Tp numbers get hurt, and I think that is quite likely based on the results of comment #76, then we should try to optimize instead of throwing the patch out.
It would be useful to also benchmark text from a regular English Web page.
(Assignee)

Comment 90

11 years ago
Created attachment 229213 [details] [diff] [review]
Patch fro test tool

Thank you, jag. carrying over the review flag.
Attachment #229151 - Attachment is obsolete: true
Attachment #229213 - Flags: review+
(Assignee)

Comment 91

11 years ago
checked-in.

# I add the build flag '--enable-tests', the script run on my environment too.
# But the tool returns error, "Unable to create a parser : 0x80040154".
# I'm trying risky check-in, sorry.
Blocks: 343445
(Assignee)

Comment 92

11 years ago
now, tinderbox is green. Let's watch the tp.
Whiteboard: Let's watch tp in Trunk tinderbox
Line breaks are very different now on this page: http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=PhoenixTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=day&mindate=&maxdate=&cvsroot=/cvsroot
when I use the default setting of a new profile and I compare a 1.9a1_2006071316 build with a 1.9a1_2006071322.
I'm telling it here because I don't know if it's intended.
This very likely caused about 5ms Tp regression on btek.

I think we should look hard at how this can be optimized for ASCII text. One thing that might help is to get rid of the virtual call, perhaps by moving linebreaker to layout. We should also look at optimizing the JISX4051 path for ASCII characters. We may be able to define some specialized, small, fast tables just for cases where both characters in CanBreakBetween are ASCII (< 128).
(Assignee)

Comment 95

11 years ago
(In reply to comment #93)
> Line breaks are very different now on this page:

Yeah, of course. The current trunk build is similar to IE than old.
# But should not we break before/after '-' if the side of it is a numeric??
(Assignee)

Comment 96

11 years ago
Created attachment 229267 [details] [diff] [review]
Patch for optimizing rv1.0

Roc, how about this patch?

result:
(1) URL
AVG. 29896.8 -> 29288.9
MED. 30031.5 -> 29250.0

(2) English
AVG. 19370.8 -> 18887.5
MED. 19375.0 -> 18843.5
Attachment #229267 - Flags: superreview?(roc)
Attachment #229267 - Flags: review?(roc)
(Assignee)

Updated

11 years ago
Whiteboard: Let's watch tp in Trunk tinderbox → Needing optimization
(Assignee)

Comment 97

11 years ago
I tested with our test tool.

current:
Avg. Median : 1186 msec
Average     : 1568 msec
Minimum     : 530 msec
Maximum     : 10336 msec

patched:
Avg. Median : 1179 msec
Average     : 1606 msec
Minimum     : 530 msec
Maximum     : 10937 msec
(Assignee)

Comment 98

11 years ago
Created attachment 229273 [details] [diff] [review]
Patch for '+', '=' and '-'

Now, '+', '=' and '-' allows to breaking the line on both before and after it.
But if there is not enough space (e.g., table layout), a line only has these characters. I.e.,

2000-01
-
01T00:00

I think that they should not allow to break on before. And hyphen is used by date separator some locales. Like slash, we should not break on both before/after if the around characters are numeric.
Attachment #229273 - Flags: review?(jshin1987)
I'm not sure exactly what you're comparing and what the numbers mean in comments #96 and #97. Probably we shold file followup bugs for optimization or further tuning of the line-break algorithm.
(Reporter)

Comment 100

11 years ago
(In reply to comment #97)
> I tested with our test tool.

> Avg. Median : 1179 msec
> Average     : 1606 msec
> Minimum     : 530 msec
> Maximum     : 10937 msec

Min is half of median and max is 6-10 times as large ? 
Do you believe these numbers (and differences in them) mean anything? 


(Reporter)

Comment 101

11 years ago
(In reply to comment #100)

> Min is half of median and max is 6-10 times as large ? 
> Do you believe these numbers (and differences in them) mean anything? 

Ignore the above. Sorry I was confused as to what min and max mean having run my last page load test several months ago. If I were you, I would run the test 10 times with each run loading each test page 5 times. From each of 10 test runs, I'd take the average median and apply t-test to 2 sets (one with patch and the other without patch) of those 10 numbers.

(Assignee)

Comment 102

11 years ago
comment 96 is the result of the testcases of comment 76. They are very large simple testcases. The score means page loading time in milliseconds.

comment 97 is the result of the page loading times test tool of mozilla.org. I think that the average time is not reliable. Because the maximum score is very large in my environment. But instead of it, I think that median is reliable.

# I'll file new bugs.
(Assignee)

Updated

11 years ago
Depends on: 344778
(Assignee)

Updated

11 years ago
Attachment #229267 - Attachment is obsolete: true
Attachment #229267 - Flags: superreview?(roc)
Attachment #229267 - Flags: review?(roc)
(Assignee)

Updated

11 years ago
Depends on: 344780
(Assignee)

Updated

11 years ago
Attachment #229273 - Attachment is obsolete: true
Attachment #229273 - Flags: review?(jshin1987)
(Assignee)

Comment 103

11 years ago
I filed bug 344778 and bug 244780.
Whiteboard: Needing optimization
(Assignee)

Comment 104

11 years ago
(In reply to comment #103)
> I filed bug 344778 and bug 244780.
> 

Sorry. bug 344778 and bug 344780.
(Assignee)

Updated

11 years ago
Depends on: 344816
Is it a bug or a feature that it now breaks after a "."? For an example see any Bonsai query, it breaks like this in the "Who" column:
"neil%
parkwaycc.
co.uk"
In IE 6 it breaks like this:
"neil%
parkwaycc.co.uk"

(Assignee)

Comment 106

11 years ago
(In reply to comment #105)
> Is it a bug or a feature that it now breaks after a "."? For an example see any
> Bonsai query, it breaks like this in the "Who" column:
> "neil%
> parkwaycc.
> co.uk"
> In IE 6 it breaks like this:
> "neil%
> parkwaycc.co.uk"
> 

We are discussing it in bug 344780. Please comment to it.
Unfortunately I think we can't target this for FF2. There are obviously going to be several followup issues which need to be sorted out and we just don't have time.

On trunk, perhaps we should just go ahead and implement UAX#14.
(Assignee)

Comment 108

11 years ago
(In reply to comment #107)
> Unfortunately I think we can't target this for FF2. There are obviously going
> to be several followup issues which need to be sorted out and we just don't
> have time.

Yeah, right. There are very many issues, more than I thought.

> On trunk, perhaps we should just go ahead and implement UAX#14.

Yes. But the three regressions are good issues after implement UAX#14. I think that they may help to implement UAX#14.
Target Milestone: mozilla1.8.1 → mozilla1.9alpha
> Yes. But the three regressions are good issues after implement UAX#14. I think
> that they may help to implement UAX#14.

I'm not sure what you mean. Do you mean that after we implement UAX#14 those bugs will still exist, so we will still need to fix them?

I think we should actually back this patch out to get back to our old behaviour for now. Then let's figure out what changes need to be made to the UAX#14 spec (if any) to make it suitable for the Web. Then implement that. Does that sound OK?
(Assignee)

Comment 110

11 years ago
(In reply to comment #109)
> > Yes. But the three regressions are good issues after implement UAX#14. I think
> > that they may help to implement UAX#14.
> 
> I'm not sure what you mean. Do you mean that after we implement UAX#14 those
> bugs will still exist, so we will still need to fix them?

Yes. (But of course, they may not be same.) I think that the our experience will help for UAX#14 works.

> I think we should actually back this patch out to get back to our old behaviour
> for now. Then let's figure out what changes need to be made to the UAX#14 spec
> (if any) to make it suitable for the Web. Then implement that. Does that sound
> OK?

I think that we should use current trunk build and check/fix the regressions. E.g., we might not be able to find the Germany problem in bug 344780 if we didn't check-in the patch. I think that the many eyes will help us.

# Note that I'm reading UAX#14 for better implementation,
# But I don't know I (or somebody) can implement new line breaker before Fx3.
# The current behavior is better than Fx2 if we cannot fix bug 56652 in Fx3.
(Assignee)

Comment 111

11 years ago
And, nobody working on bug 56652 :-(
I don't think we should approach this by fixing bugs until people are happy. We need a written specification that controls exactly how our linebreaking works. The linebreaker code is optimized for speed, so it's not so easy to understand. Without a spec, we can't tell the difference between bugs in the linebreaker code and intended behaviour. If we just keep on hacking the code to fix bugs then we won't have a spec.

I think probably our spec should just be UAX#14 plus a list of changes. We should try to keep our changes to the minimum. This is probably a situation where whatever we do, someone will be unhappy; following UAX#14 lets us tell the unhappy people "sorry, it's the standard".

I also think that we shouldn't keep on making changes over time. This is an important compatibility issue. We should try to get it good enough for FF3 that we don't have to change it again for a long time.
(Assignee)

Comment 113

11 years ago
Roc, but Mozilla Japan needs this fix. Therefore, I have another compromise idea.
1. back out the patch.
2. create simple UAX#14 Line Breaker, but that supports only below U+100. so we use it only in |nsTextTransformer::ScanNormalAsciiText_F|.

I think that if support range is only below U+100, I can create it quickly, and bug 344816 is not occurred, so we can lower the possibility of regression to minimal. How about this approach?
It sounds like you're looking for a fix which can go into FF2, but really, that is not realistic unfortunately :-(. 
(Assignee)

Comment 115

11 years ago
(In reply to comment #114)
> It sounds like you're looking for a fix which can go into FF2,

No, I have already given up it.
I want only a guarantee that this bug is fixed on Fx3.
# I think that bug 56652 needs very long time, I don't think that the bug is fixed before Fx3.
(Assignee)

Comment 116

11 years ago
(In reply to comment #115)
> I don't think that the bug is fixed before Fx3.

If it will be released in 2007Q1.
Why is it so hard?
(Assignee)

Comment 118

11 years ago
(In reply to comment #117)
> Why is it so hard?

The creation may not be difficult (but I'm not so.). But I think that the testing is very difficult. Because all languages are affected. Therefore, I think that the bug should be fixed in alpha stage or early beta stage, because the risk is very large.
(In reply to comment #118)
> (In reply to comment #117)
> > Why is it so hard?
> 
> The creation may not be difficult (but I'm not so.). But I think that the
> testing is very difficult. Because all languages are affected. Therefore, I
> think that the bug should be fixed in alpha stage or early beta stage, because
> the risk is very large.

I agree it's risky and we should land before 1.9beta1 but that isn't likely to happen this year. Can't we implement UAX#14 some time in the next few months?
(Assignee)

Comment 120

11 years ago
(In reply to comment #119)
> Can't we implement UAX#14 some time in the next few months?

I don't know. I have IME disable working on Mac now. After it, I can work on it. But the current work needs more time. And unfortunately, I don't know that somebody works on UAX#14, now.
(Assignee)

Comment 121

11 years ago
(In reply to comment #120)
> I have IME disable working on Mac now.

Oops, and I have some cairo/thebes work too.
How about this: let's undo the linebreaker changes now. Over the next few months if you have time to work on UAX#14 and we can get it done, that's great. If we start getting close to 1.9beta1 and we don't think we'll have UAX#14 working, then we implement a solution restricted to ASCII characters. Does that sounds reasonable?
(Assignee)

Comment 123

11 years ago
(In reply to comment #122)
> How about this: let's undo the linebreaker changes now. Over the next few
> months if you have time to work on UAX#14 and we can get it done, that's great.
> If we start getting close to 1.9beta1 and we don't think we'll have UAX#14
> working, then we implement a solution restricted to ASCII characters. Does that
> sounds reasonable?
> 

O.K. Good! I back out the patch.
(Assignee)

Comment 124

11 years ago
(In reply to comment #123)
> O.K. Good! I back out the patch.

Oops, sorry, I cannot back-out now, I'll back out after 3 or 4 hours.
# If you want to back-out now, please do it.
No hurry.
(Assignee)

Comment 126

11 years ago
backed-out.
(Assignee)

Updated

11 years ago
Target Milestone: mozilla1.9alpha → mozilla1.9beta

Comment 127

10 years ago
After applying patch, I found a site that won't break after normal space (0x20) and the text overlapped to the next cell.
http://de.selfhtml.org/html/referenz/attribute.htm#div

Screen Shot:
http://img254.imageshack.us/img254/9805/unbenannt28ck.jpg

Could you please check this out?
The testcases in this bug all WFM now on the trunk. Is this fixed or is more work still (and hence more testcases) needed?
Sorry for the spam, the testcases were behaving nicely because of my screen width. When I resized it, I still see the issue.
I still see the bug in the original testcase.

This would be very easy to fix now, and probably the performance impact would be minimized too since line break opportunites are stored persistently in textruns. To fix this now you'd just have to change nsLineBreaker to always call nsILineBreaker instead of only when a CJK character is detected.
(Assignee)

Comment 131

10 years ago
roc:

Can I work on this now?
Sure. But before you write code, can you explain exactly what your plan is?
(Assignee)

Comment 133

10 years ago
I want to change:

1. change the character class definition for some 8 bit characters (like '/'). But it should not change the serializeres behavior (see bug 344816), for this issue, I need to check the code before writing the code.

2. change the nsLineBreaker::AppendText 8bit version which should use nsContentUtils::LineBreaker()->GetJISx4051Breaks() if needed. But for performance, I may use the approach which is same as bug 229645. But it makes to down the maintenance performance...
(Assignee)

Comment 134

10 years ago
(In reply to comment #133)
> in 2.
> bug 229645.

it is bug 344778.

Under what circumstances do you want to call GetJISx4051Breaks() for 8-bit text?

In bug 336959 there's a patch to replace the nsILineBreaker implementation with Pango on Linux (and a desire to do the same with Uniscribe and ATSUI). If we call GetJISx4051Breaks() always (bad name!) then these two patches together will mean using platform linebreaking always. What do you think about that? I'm not all that comfortable with the idea but I don't know what's better.
(Assignee)

Comment 136

10 years ago
thank you roc.

On gecke1.9 final, the spec of line-breaking depends on platform? Is it a good choice for us? I don't know these three platforms use what specs.

O.K. I'll only change the GetJISx4051Breaks being called from nsLineBreaker::AppendText 8 bit verstion. And if these platforms doesn't have compatibility for ASCII, we can override it.
> On gecke1.9 final, the spec of line-breaking depends on platform? Is it a good
> choice for us?

No, that's why I'm not comfortable with it. But for some languages like Thai we really have to use the platform libraries. The question is: how can we divide the functionality so we use platform libraries for line breaking of certain character ranges (e.g. Thai) and use our own code for the rest? The current suggestion is "use the platform libraries to break a whitespace-delimited word if and only if it contains at least one CJK or Thai character" but that doesn't seem so good.

> O.K. I'll only change the GetJISx4051Breaks being called from
> nsLineBreaker::AppendText 8 bit verstion.

So you're going to make the 8-bit path call GetJISx4051Breaks for each word?
(Assignee)

Comment 138

10 years ago
(In reply to comment #137)
> The question is: how can we divide
> the functionality so we use platform libraries for line breaking of certain
> character ranges (e.g. Thai) and use our own code for the rest?

I agree to this. In future (and if possible), we should implement them ourselves.

> > O.K. I'll only change the GetJISx4051Breaks being called from
> > nsLineBreaker::AppendText 8 bit verstion.
> 
> So you're going to make the 8-bit path call GetJISx4051Breaks for each word?

I'm not certain. We should not use same line breaking rules in rendering and serializer. Because for layout, we should take a priority for IE (and others) compatible. However, for serializer, we don't need to mind the compatibility. Especially, URL should not separate to two or more words in a mail.

Therefore, I think that the approach of bug 344778 that is using static function only for layout may be good. (for the serializer issue, for the performance) And it should be used from 16bit path too if a character is in ASCII.
(Assignee)

Comment 139

10 years ago
Created attachment 271370 [details] [diff] [review]
Retry to fix (work in progress)

first shot of the retry.
Attachment #229086 - Attachment is obsolete: true
Attachment #229213 - Attachment is obsolete: true
(Assignee)

Comment 140

10 years ago
Created attachment 271375 [details] [diff] [review]
Retry to fix (work in progress)

I changed some character's class from previous patch. I think that this is better.
Attachment #271370 - Attachment is obsolete: true
(Assignee)

Comment 141

10 years ago
Created attachment 271377 [details]
wrap specs for latest patch

This table is a specs of each browsers.

The value 'A' means the line breakable After the character, and 'B' means Before. 'BA' means Before and After.

(C) which is the tail of the browser name means Character. (N) means Numeric. This means that they are around the character. E.g., "a$a" is a testcase for (C), "0$0" is a testcase for (N).

Fx3 means the latest patch's behavior. The blue cells are different from Fx2.

I use IE7 spec in some characters. But there are incompatible characters for IE7 yet. I think that there is no impact in layout compatibility.
(Assignee)

Comment 142

10 years ago
pref tests result:

trunk build:
1.
Avg. Median : 1146 msec
Average     : 1348 msec
Minimum     : 459 msec
Maximum     : 4523 msec

2.
Avg. Median : 1155 msec
Average     : 1351 msec
Minimum     : 464 msec
Maximum     : 4500 msec

3.
Avg. Median : 1153 msec
Average     : 1349 msec
Minimum     : 467 msec
Maximum     : 4464 msec

patched build:
1.
Avg. Median : 1154 msec
Average     : 1349 msec
Minimum     : 463 msec
Maximum     : 4451 msec

2.
Avg. Median : 1154 msec
Average     : 1355 msec
Minimum     : 468 msec
Maximum     : 4485 msec

3.
Avg. Median : 1156 msec
Average     : 1356 msec
Minimum     : 467 msec
Maximum     : 4499 msec

Is there a little regression?
(Assignee)

Comment 143

10 years ago
Created attachment 271378 [details] [diff] [review]
Patch rv2.0

Let's go to review process. This patch cleared reftests. And works fine as Thunderbird's patch too.

This patch doesn't affect to serializer. Because nsILineBreaker::Prev/nsILineBreaker::Next are not breaking the ASCII word except when the text has CJK characters. This behavior is important, because if we break the URL to several words, the receiver MUAs cannot recover the original URL from the separated URL. So, we should break the URL only on Layout.

I Added some range to nsLineBreaker::IsComplexChar. Because they can break the lines, they should be complex characters.

This patch also breaks the line after SLASH if the next character is not numerics. *And* if the SLASH is a first character, it doesn't break line after SLASH. Because SLASH is used for root path, in this time, the file path will be broken as "/" and "foo/" and "bar/"... I think that they should be "/foo/" and "bar/". This change will help the LXR.

And I removed the declarations for U+2018 - U+201F from jisx4501class.txt. I think that them declarations are wrong. We should use default declaration in Unicode Database. (see bug 344780 comment 20)

And this patch prefers the IE compatibility. Because there are not standard specs.
Attachment #271375 - Attachment is obsolete: true
Attachment #271378 - Flags: superreview?(roc)
Attachment #271378 - Flags: review?(roc)
(Assignee)

Comment 144

10 years ago
Created attachment 271380 [details]
wrap specs for latest patch

the previous table has a mistake, sorry.
Attachment #271377 - Attachment is obsolete: true
Comment on attachment 271378 [details] [diff] [review]
Patch rv2.0

+    return !((0x0030 <= u && u <= 0x0039) ||
+             (0x0041 <= u && u <= 0x005A) ||
+             (0x0061 <= u && u <= 0x007A) ||
+             (0x00C0 <= u && u <= 0x00D6) ||
+             (0x00D8 <= u && u <= 0x00F6) ||
+             (0x00F8 <= u && u <= 0x00FF));

I'd just use the first three clauses, the other ones aren't common enough to bother optimizing for I think.

Okay, this looks OK.

I want your specification table checked into CVS too, with the explanation you gave in comment #141. Thanks!!
Attachment #271378 - Flags: superreview?(roc)
Attachment #271378 - Flags: superreview+
Attachment #271378 - Flags: review?(roc)
Attachment #271378 - Flags: review+
Oh, a few more comments...

I'd like dbaron to say if he's OK with this sort of change.

You're making '=' an A ... why not disallow breaking around it entirely like IE and Opera, if you're going to change it?
(Assignee)

Comment 147

10 years ago
(In reply to comment #146)
> Oh, a few more comments...
> 
> I'd like dbaron to say if he's OK with this sort of change.
> 
> You're making '=' an A ... why not disallow breaking around it entirely like IE
> and Opera, if you're going to change it?

If URL params too long, e.g.,

http://example.com/foo.cgi?abc=def&abc=def&abc=def&abc=def&abc=def&abc=def&abc=def&

the after of '?' cannot be broken. I think that we should break after '=' or '&'.
OK
(Assignee)

Comment 149

10 years ago
Created attachment 272021 [details] [diff] [review]
patch for check-in
Attachment #271378 - Attachment is obsolete: true
(Assignee)

Comment 150

10 years ago
checked-in.

-> FIXED
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
(Assignee)

Updated

10 years ago
Blocks: 218580
(Assignee)

Updated

10 years ago
Blocks: 136839

Updated

10 years ago
Depends on: 388155

Updated

10 years ago
Depends on: 388602
This is causing very frequent assertions (bug 388602) that should be fixed promptly.

Updated

10 years ago
Depends on: 389056
Comment on attachment 272021 [details] [diff] [review]
patch for check-in

>     if (aSink) {
>       breakState[offset] = mAfterSpace && !isSpace &&
>         (aFlags & (offset == 0 ? BREAK_ALLOW_INITIAL : BREAK_ALLOW_INSIDE));
>     }
>     mAfterSpace = isSpace;
> 
>     if (isSpace) {
>-      // The current word can't have any complex characters inside it
>-      // because this is 8-bit text, so just ignore it
>+      if (offset > wordStart && wordHasComplexChar) {
>+        if (aFlags & BREAK_ALLOW_INSIDE) {
>+          // Save current start-of-word state because GetJISx4051Breaks will
>+          // set it to false
>+          PRPackedBool currentStart = breakState[wordStart];
Unfortunately this access to breakState is not protected by aSink and I often get these assertions trying to click on links or type in textareas:
###!!! ASSERTION: invalid array index: 'i < Length()', file ../../dist/include/xpcom\nsTArray.h, line 318
Also see Bug 388602 on that assertion.
Masayuki: why are you treating < and > as delimiters to be broken before/after? Was this a JISx4051 suggestion?
(Assignee)

Comment 155

10 years ago
No, it was already defined as so. (Gecko1.8.x has the definition too.)
yeah, but for 1.8.x it was hardly ever used. Now it's used a lot. If there's no particular reason for it, we should probably not break around them ... IE doesn't and they shouldn't really be treated as delimiters.
(Assignee)

Updated

10 years ago
Depends on: 389595
(Assignee)

Comment 157

10 years ago
I'll change it on bug 389056.
(Assignee)

Updated

10 years ago
Depends on: 390920
Flags: in-testsuite?
(Assignee)

Updated

10 years ago
Depends on: 399321

Updated

10 years ago
Depends on: 398165

Updated

10 years ago
Blocks: 177565
(Assignee)

Comment 158

10 years ago
jwalden+bmo:

The testcases are on tree now. See bug 389056.
Woohoo!  Good stuff!  In the future, please feel free to flip the flag yourself (with the explanatory comment, of course), however.  :-)
Flags: in-testsuite? → in-testsuite+
You need to log in before you can comment on or make changes to this bug.