Closed Bug 336959 Opened 18 years ago Closed 17 years ago

Line Breaking with Pango/Uniscribe

Categories

(Core :: Internationalization, enhancement)

x86
Linux
enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: thep, Assigned: thep)

References

(Blocks 2 open bugs)

Details

Attachments

(13 files, 13 obsolete files)

13.21 KB, patch
roc
: review+
Details | Diff | Splinter Review
13.21 KB, patch
Details | Diff | Splinter Review
13.28 KB, patch
Details | Diff | Splinter Review
13.23 KB, patch
roc
: review+
Details | Diff | Splinter Review
13.17 KB, patch
Details | Diff | Splinter Review
6.24 KB, patch
Details | Diff | Splinter Review
13.05 KB, patch
Details | Diff | Splinter Review
21.32 KB, patch
Details | Diff | Splinter Review
34.11 KB, patch
roc
: review+
Details | Diff | Splinter Review
34.50 KB, patch
roc
: review+
Details | Diff | Splinter Review
551 bytes, patch
dbaron
: review+
dbaron
: superreview+
Details | Diff | Splinter Review
826 bytes, patch
roc
: review+
Details | Diff | Splinter Review
1.10 KB, patch
roc
: review+
damons
: approval1.9+
Details | Diff | Splinter Review
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060419 Epiphany/2.14 Firefox/2.0a1
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8) Gecko/20060419 Epiphany/2.14 Firefox/2.0a1

As pango has already been used for rendering, it's also possible to use the pango_break() facility for a generic line breaking interface. This may be a sensible way for some solution like libthai [http://libthai.sourceforge.net] to be plugged in through Pango framework, along with solutions for other languages in the future.

To fullfill this approach on Windows, where Pango is rarely available, Uniscribe is just a natural choice.

As a result, this will make Gecko attempt to utilize line breaking API's that are already available on the platforms it runs on, in addition to its built-in engines.

Reproducible: Always
Attached patch An experimental Pango patch (obsolete) — Splinter Review
I propose this patch (against Trunk) as a proof of concept. To try it, you need libthai and pango-libthai packages from http://libthai.sourceforge.net, or libthai0 and pango-libthai if you are using Debian sid/etch or Ubuntu dapper.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Oops. I forgot to include a change in a Makefile in previous patch. This one should work.
Attachment #221147 - Attachment is obsolete: true
related: Bug 157967 - Make Gecko interoperate better with advanced typography systems such as ATSUI, Uniscribe, Pango & STSF
Known problem: performance is comparably worse than when using built-in engines, such as what proposed in bug 7969. This is because the itemization task is done twice, once in nsJISx4501LineWrapper, and again in nsPangoLineWrapper (for non-western and non-CJK languages). So, the overhead is doubled.
Can we put "[line-breaking]" in "Status Whiteboard" filed ?

(just as Bug #157967 "Make Gecko interoperate better with advanced typography systems such as ATSUI, Uniscribe, Pango & STSF")
Blocks: thai
I have adjusted the patch to reduce steps for performance's sake:

- Repleace nsJISx4051LineBreaker when Pango is enabled, instead of tapping
  calls from it, to remove duplicated tasks of scripts analysis in both
  engines. I also believe Pango's capabilities are sophisticated enough for
  the replacement.

- Simply use pango_get_log_attrs(), instead of two-fold pango_itemize() and 
  pango_break(). This reduces one more step, and simplifies logical attributes
  accesses a bit.

- Add GetJISx4051Breaks() method, according to recent change in nsILineBreaker.

With pango 1.16 built with libthai support (to be available soon in Debian
experimental and probably Ubuntu feisty+1), you won't need pango-libthai
to test Thai line breaking any more.

Performance seems acceptable now with this patch.
Attachment #221202 - Attachment is obsolete: true
roc, how would this relate to your textrun work? If at all, just guessing here.
The new textframe code uses only GetJISx4051Breaks from nsILineBreaker (via content/base/src/nsLineBreaker.cpp). You can build it by setting MOZ_ENABLE_NEW_TEXTFRAME = 1 in layout/generic/Makefile.in. (nsILineBreaker's old methods are still used by the content serializer and some editor functions but we could probably easily fix that.) You probably should enable the new textframe and do your work with it because that's what will ship in 1.9.

The new textframe should actually work better for you in the end because it is designed to handle languages like Thai that require unbounded context for linebreaking. Basically nsLineBreaker gets access to the entire paragraph (sometimes less, when the text is broken up by non-text elements such as form elements or images). The current implementation first breaks things up by spaces and then if the text contains CJK characters it calls GetJISx4051Breaks to break up a run of non-spaces. I think we should expand that check in nsLineBreaker to include characters from other languages like Thai that break between non-spaces. Then your patch here should work well. We should probably rename GetJISx4051Breaks though. If this approach works for you then I think we should land your patch soon.

It's a bit unfortunate that we're doing Pango itemization in gfxPangoFontGroup for measurement and rendering --- in many cases this will mean itemization is happening twice for the same string. The problem is that the strings we send down for rendering aren't always the same as the strings you want to use for line breaking. In particular the strings for line breaking can be longer because the strings for rendering stop at font changes and direction changes. If this turns out to be a performance issue for you I think we should probably look at some kind of caching scheme for the results of itemization so if we're lucky enough to see the same string in both places we can reuse the Pango itemization.
Thanks for the hints. I have tried the new text frame with my Pango patch as you suggested. It works well with my Thai sample text, by a quick hack on IS_CJK_CHAR() in content/base/src/nsLineBreaker.cpp to cover Thai range (We want a better change than this, of course):

Index: nsLineBreaker.cpp
===================================================================
RCS file: /cvsroot/mozilla/content/base/src/nsLineBreaker.cpp,v
retrieving revision 1.3
diff -u -r1.3 nsLineBreaker.cpp
--- nsLineBreaker.cpp   24 Mar 2007 11:07:36 -0000      1.3
+++ nsLineBreaker.cpp   30 Mar 2007 03:50:09 -0000
@@ -60,7 +60,8 @@
 static inline int
 IS_CJK_CHAR(PRUnichar u)
 {
-  return (0x1100 <= u && u <= 0x11ff) ||
+  return (0x0e01 <= u && u <= 0x0e5f) ||
+         (0x1100 <= u && u <= 0x11ff) ||
          (0x2e80 <= u && u <= 0xd7ff) ||
          (0xf900 <= u && u <= 0xfaff) ||
          (0xff00 <= u && u <= 0xffef);

Performance improves a lot!  Judging from the time it takes to render a large off-line sample page, it now feels like openning normal documents with ordinary text editors, compared to a long pause with 100% CPU load in the old text frame. When I previously said the performance was acceptable, I meant average pages. But with new text frame, it applies to large pages as well.

However, I find it hangs on some certain real page, with the console message:

*** glibc detected *** free(): invalid next size (fast): 0x0000000001f175c0 ***

But that may not be related to this bug, as it occurs with or without the Pango patch. It deserves another bug instead. If you like to try, the page in problem is: http://linux.thai.net/planet/ (It's a dynamic page with constant updates, so the bug may depend on when it's viewed.)
I've found a little flaw in my previous patch about range checking in iterator. It's never triggered the tests so far, but it should be fixed anyway.
Attachment #260003 - Attachment is obsolete: true
(In reply to comment #10)
> It's never triggered the tests so far, but it should be fixed anyway.

I mean, it's never triggered "in" the tests so far.
It looks like pango_get_log_attrs sets up mLogAttrs with one PangoLogAttr per character, not necessarily one per UTF16 code.  These don't correspond if there are surrogates present in the UTF16.

mAttrsLen can be determined with something like g_utf8_strlen(text, 0) + 1, but matching up indices between arrays is going to be a little more fun.

Bug #378695 is related to this issue.
Pretty fun. I have updated the patch so it iterates UTF-16 strings with UTF16CharEnumerator. Thanks for your comment.
Attachment #260102 - Attachment is obsolete: true
Comment on attachment 263111 [details] [diff] [review]
Updated patch to handle UTF-16 properly

May I request for a review?
Attachment #263111 - Flags: review?(roc)
-  return (0x1100 <= u && u <= 0x11ff) ||
+  return (0x0e01 <= u && u <= 0x0e5f) ||

Could you add some comments indicating which characters these ranges correspond to?

+ * The Initial Developer of the Original Code is
+ * Netscape Communications Corporation.
+ * Portions created by the Initial Developer are Copyright (C) 1998
+ * the Initial Developer. All Rights Reserved.

You should change this :-)

With the new text frame, nsPangoLineBreaker::GetJISx4051Breaks is the only method that really matters. (We should rename it, actually, to just GetBreaks ... if you want to do a patch for that, that'd be great! :-) ) I think you could simplify the code to make Next, Prev and BreakInBetween be implemented in terms of that method. Use nsAutoTArray to allocate buffers so you can avoid heap allocations in the common case, like this:

  nsAutoTArray<PRPackedBool,200> buffer;
  if (!buffer.AppendElements(N))
    return ...; // out of memory
  ... buffer.Elements() ...

If you do that, I don't think you need nsPangoLogAttrList. You can just write the iteration inline since there will only be one iteration.

This code doesn't handle 0 bytes in the string, because pango_break stops at nulls. You need an outer loop over the bytes, like here:
http://lxr.mozilla.org/seamonkey/source/gfx/thebes/src/gfxPangoFonts.cpp#1056
The same issue probably arises with g_utf16_to_utf8 etc. Maybe use NS_ConvertUTF16toUTF8 instead.

+  for (nsPangoLogAttrList::Iterator i = logAttrs[0]; *i; ++i)
+  {
+    const PRUnichar* q = p;
+    UTF16CharEnumerator::NextChar(&p, aText + aLen, &err);
+    if (err)
+      break;
+    while (q < p - 1)
+      aBreakBefore[q++ - aText] = PR_FALSE;
+    aBreakBefore[q - aText] = (*i)->is_line_break;
+  }

I think there's a better way to write this loop. Treat it as a while loop where you keep an index into the logAttrs array and a index to the current PRUnichar. You just need to increment the current PRUnichar index an extra time when you see a low surrogate. You don't need to use UTF16CharEnumerator. Also I think this code is putting the line break opportunity in the middle of the surrogate pair, but it should go before the surrogate pair.

+  pango_get_log_attrs(text, -1, -1, pango_context_get_language(pPangoContext),

So this is basically using the default language, right? Would there be any benefit to passing more accurate language information? We do have some and we could change the API to pass it in.
Attached patch Patch with rearranged code (obsolete) — Splinter Review
Thank you very much for your instructive comments. In this patch, I have tried to follow all, except for the method renaming and the last question regarding Pango language parameter. Yes, source looks much simpler so far. :-)
Attachment #263111 - Attachment is obsolete: true
Attachment #263111 - Flags: review?(roc)
(In reply to comment #15)

> With the new text frame, nsPangoLineBreaker::GetJISx4051Breaks is the only
> method that really matters. (We should rename it, actually, to just GetBreaks
> ... if you want to do a patch for that, that'd be great! :-) )

That would mean patching many other sources. Maybe, it's better done in another patch?

> +  pango_get_log_attrs(text, -1, -1, pango_context_get_language(pPangoContext),
> 
> So this is basically using the default language, right? Would there be any
> benefit to passing more accurate language information? We do have some and we
> could change the API to pass it in.

That should help reduce unnecessary map building. However, this reminds me of using pango_language_get_default() and getting rid of dependency on GDK. I'll adjust the patch soon, to get rid of unnecessary CXXFLAGS and LDFLAGS in some places. And when the API is changed, the call can then be immediately replaced with the passed in info.
Attachment #265483 - Attachment is obsolete: true
> Maybe, it's better done in another patch?

Yes, definitely.

+  if (!breakState.AppendElements(aLen + 1))

This doesn't need to be + 1 ... we won't fill in more than aLen elements

+    if (breakState[++aPos])

... and this tests breakState[aLen], but it shouldn't

+      if (NS_IS_LOW_SURROGATE(aText[u16Offset++]))
+        aBreakBefore[u16Offset++] = PR_FALSE;

I'd prefer the updates to u16Offset to be done in their own statements ... it's clearer.
Masayuki, how do you feel about Pango's line breaking for CJK?
(In reply to comment #19)
> +  if (!breakState.AppendElements(aLen + 1))
> 
> This doesn't need to be + 1 ... we won't fill in more than aLen elements

So, we won't cover the case when the last character is a word end, which can be wrapped after? That's why Pango needs len + 1 elements of PangoLogAttr. Well, but if we have different convention here, I'll fix it.

> +    if (breakState[++aPos])
> 
> ... and this tests breakState[aLen], but it shouldn't

This is related to above point. Same question.

> +      if (NS_IS_LOW_SURROGATE(aText[u16Offset++]))
> +        aBreakBefore[u16Offset++] = PR_FALSE;
> 
> I'd prefer the updates to u16Offset to be done in their own statements ... it's
> clearer.

Fine. I can break it into series of statements in the next patch.
> So, we won't cover the case when the last character is a word end, which can be
> wrapped after?

GetJIS4051Breaks only fills in aLen elements. Also, nsLineBreaker doesn't look at the first element. GetJIS4051Breaks is only supposed to compute breaks *inside* the whitespace-delimited "word". Breaks at the ends will be handled by nsLineBreaker. (There almost always will be break opportunities before and after the "word", except when the word is next to the start or end of a paragraph, or a replaced element such as an image.)

When Next() gets to the end of the text, I don't think we should return "yes, you can break after this text" ... we don't know what the next character is, it could be ZWNJ, right? We should always return NS_LINEBREAKER_NEED_MORE_TEXT.
OK. So, this is the updated patch.
Attachment #265490 - Attachment is obsolete: true
Comment on attachment 265499 [details] [diff] [review]
Patch with no access to extra break pos

great! I think we should go ahead.
Attachment #265499 - Flags: superreview+
Attachment #265499 - Flags: review+
Checked in! Thanks very much!!!

I'll close this bug. If you (or someone else) wants to do Uniscribe or ATSUI in a similar way, that can be a separate bug.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Reopening. I backed the patch out because it busted Tinderbox. pango_language_get_default was introduced in Pango 1.16 which is too new for us.

Should we go for the GDK dependency or is there another way to get a good default language?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Can we just get current locale and convert it with pango_language_from_string()?

Is there a safe API in Mozilla to get current locale?
We don't want page rendering to depend on the user's locale settings.

We have language information in the page, but language can vary across characters, even characters in the same word.

Maybe we could use pango_language_from_string("en"), if that does the right thing? Or maybe just pass in an empty string "" if that gives us a usable language?
I think the language information in the page is closest to what we need. pango_get_log_attr() does analyze script/language for every character, too. The passed-in parameter is just the default value.

Yes, we can temporarily use pango_language_from_string("en") to fix tinderbox, and then adjust (and rename) the API to get that info later.
Re-submit the patch, with pango_language_get_default() replaced.
I'm with roc here, we really shouldn't make our rendering depend on the user settings. That would sound like a regional hack to our rendering behaviour, and as such, something we should really only do if we have a well-known benefit.

Page language would be closer, but still put us into the area where we can't reliably test what layout other users are going to see for very similar pages. And I have met people doing studies on the decleration habits of lang="", most pages don't have it, and of those pages that do, most are wrong. So it's not going to be a constructive hint in the first place.

As a side question, should we cache the value of pango_get_language_from_string("en") outside the tight loop?
(In reply to comment #31)
> I'm with roc here, we really shouldn't make our rendering depend on the user
> settings. That would sound like a regional hack to our rendering behaviour, and
> as such, something we should really only do if we have a well-known benefit.

I'm not against roc regarding the locale dependency issue, either. The use of locale was just from my first thought to write some code as pango_language_get_default() replacement. (Internally, it determines language from system's LC_CTYPE.)

Basing on locale is just a heuristic. For example, chance is that user will open web pages of his/her own language, in case the script can be used to write more than one language. But that's just a guess after all, in case we have no other choice.

> Page language would be closer, but still put us into the area where we can't
> reliably test what layout other users are going to see for very similar pages.
> And I have met people doing studies on the decleration habits of lang="", most
> pages don't have it, and of those pages that do, most are wrong. So it's not
> going to be a constructive hint in the first place.

Although it's the expected behavior? When the page author finds the page is line-wrapped improperly, a natural guess is to set the language, just like usual experience from using word processors, isn't it? I think it would be a surprise to get no difference when setting the right language to the page.

> As a side question, should we cache the value of
> pango_get_language_from_string("en") outside the tight loop?

Umm.. It may depend on the decision to depend on the page language. If we do, that caching would be removed soon anyway. Besides, I think pango_language_from_string() also does some caching already.
We might as well make pango_language_from_string get cached in a static. So,

  static PangoLanguage* language = pango_language_from_string("en");

With that change, let's reland the patch.

The problem with using the page language is that every DOM element can have a different language. So the text passed into the linebreaker can be composed of different languages. Does Pango have a way to handle that?
I hope the language we choose here is not very important ... for the CJK and Thai cases where Pango is being invoked, I expect Pango to normally just ignore the language and figure things out based on the Unicode character ranges in the text. We shouldn't add the complexity of passing down DOM language information unless we have a testcase where we actually need it.
Before we really add the cache, let me remind again that pango_language_from_string() also caches this with a hash table.

Is it worth doing the cache again? Probably, because "en" is a very short string. Passing through hashing may not help much.

Regarding Pango's processing of languages/scripts in strings, it's like this, from my understanding:

The passed-in language determines a map of engines. The script characterized by Unicode character range then determines the engine to load from the language's map. Note the distinction between language and script here. German and French can have different maps, although they are both written with Latin script, for example. Then, for scripts that are not in the language map, there is another "default language" map for each script.

So, the difference between passing a fixed language "en" and the value from the real context here is that only "default language" for each character will be assumed in the former case, while "language-specific" engines will be invoked in the latter.

Yes, for CJK and Thai, that difference doesn't matter. Because Thai script is only used to write Thai language, and CJK scripts, despite some characters sharing among different languages, seem to have common rules as reflected in JISx4051. Just to keep in mind that our decision here is based on some assumptions.
Updated patch, with PangoLanguage caching added.
+  static PangoLanguage* language = nsnull;
+  if (nsnull == language)
+    language = pango_language_from_string("en");

You can just write
  static PangoLanguage* language = pango_language_from_string("en");

It will only happen once.
To get my thinking right here, does pango support syllabification? Hopefully picking the right of the two  translations of Silbentrennung. What I mean is, breaking the line in the middle of a word? Then I'd get the importance of feeding in the right language.

That doesn't change the sad fact that most authors wouldn't think about adding language to their pages in the first place, or even fix their CMS to not send lang="en". :-(

Anyway, we don't support syllabification at all, so that's not our problem. The lack of language declaration is probably a good argument to not start supporting it, too.
(In reply to comment #37)
> +  static PangoLanguage* language = nsnull;
> +  if (nsnull == language)
> +    language = pango_language_from_string("en");
> 
> You can just write
>   static PangoLanguage* language = pango_language_from_string("en");
> 
> It will only happen once.

Err.. I just thought of portability when writing. I have a habit of not initializing static variables with non-load-time data.
(In reply to comment #38)
> To get my thinking right here, does pango support syllabification? Hopefully
> picking the right of the two  translations of Silbentrennung. What I mean is,
> breaking the line in the middle of a word? Then I'd get the importance of
> feeding in the right language.

My argument does not apply to European languages, of course. The German and French case was just for illustration's sake. There can be many combinations of languages and scripts in other regions of the world. Pango's design seems to address such conditions.

> That doesn't change the sad fact that most authors wouldn't think about adding
> language to their pages in the first place, or even fix their CMS to not send
> lang="en". :-(
> 
> Anyway, we don't support syllabification at all, so that's not our problem. The
> lack of language declaration is probably a good argument to not start
> supporting it, too.

Right. Our current assumption is valid at some degree. What I mean to say is that there can be cases still unknown to us regarding languages of the world.
All right. I give up my portability concern.
I'm not aware of any system we build on where this was ever a problem. I vaguely recall some system in the past may have had locking problems in a multithreaded context but this will only be called on the main thread anyway.
My concern may apply to old systems, where compilers/loaders are primitive. Some compiler may even reject such initialization. To me, the if-block pattern of initialization is commonplace in other projects' codes. And I think it's safe to do so. Besides portability issue, performance also counts. The extra comparison with zero is not so expensive, while lazy initializations in many places can sum up to reduced load time, as unnecessary initializations are not done. So, it has become my habit.

But as the call is not so significant here, I'm OK with either way. Please go on.
checked in again.
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
> Some compiler may even reject such initialization.

The C++ standard requires this to work, so those compilers would be badly broken.

> lazy initializations in many places can sum up to reduced load time

This is lazy initialization. The C++ standard says that the initialization happens the first time the program reaches the line where the static variable is declared.
Backed out. This broke several tinderbox tests, perhaps because we haven't switched over to new-textframe yet...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Looks like the nsTextTransformer self-tests were failing on fxdbug-linux-tbox:

 ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  nsTextTransformer: self test 6 failed
  nsTextTransformer: self test 7 failed
  ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  ###!!! ASSERTION: Illegal value (length > position): 'aLen > aPos', file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/intl/lwbrk/src/nsPangoLineBreaker.cpp, line 102
  nsTextTransformer: self test 8 failed
  nsTextTransformer: self test 9 failed
  nsTextTransformer: self test 10 failed
  nsTextTransformer: self test 11 failed
  nsTextTransformer: self test 12 failed
  nsTextTransformer: self test 13 failed
  nsTextTransformer: self test 14 failed
  ###!!! ABORT: file /builds/tinderbox/Fx-Trunk-Debug/Linux_2.6.9-42.ELsmp_Depend/mozilla/layout/generic/nsTextTransformer.cpp, line 1794
 
Line 102 is in nsPangoLineBreaker::Prev(). Probably, aLen == aPos are passed? Does changing the assertion to aLen >= aPos help?

Moreover, I find an excessive assertion (aLen > aPos) in nsPangoLineBreaker::GetJISx4051Breaks(). That should be removed as well.
It should be valid for Prev() to pass aPos == aLen. And aPos is undefined in GetJISx4051Breaks().

Also, I wonder if Pango's is_line_break logical attribute matches what the self tests assumes.
If you make a debug build with the old textframe, then run it with a breakpoint in the nsTextTransformer self tests, you should be able to verify that they run and pass.
For test 6, with text = "Œnce xy\tzy",
 - Pango's is_line_break gives "Œnce |xy\t|zy"
 - the self test expects "Œnce| |xy|\t|zy"
where '|' represents line break position. Is that really expected for line breaking?
If so, the fix could be simply replacing "is_line_break" with "is_line_break || is_word_end".
The Pango behaviour is closer to what we expect with the new textframe. If test 6 is the only test that fails, we should just comment it out. This code will not be used after we turn on the new textframe anyway.
Not only test 6, the adjustment quiets all failed tests (6-14).
Isn't that adjustment wrong, though? Isn't is_line_break exactly what we want?
I think so. I'm just not sure about difference in conventions between Pango and Mozilla codes here. If Pango's result is preferred, let's obsolete my last patch and change the self tests instead.
Do you think it's still worth fixing the expected values of the self tests? Or would that cause failure on non-pango builds anyway, and the tests should just be commented out?
I think we should just not check this in until new textframe is turned on and nsTextTransformer is removed. That should be soon.
Comment on attachment 265957 [details] [diff] [review]
Patch with line break condition adjusted

Obsolete the patch.
Attachment #265957 - Attachment is obsolete: true
IS_CJK_CHAR really should be renamed to something like IS_COMPLEX_LINEBREAK_CHAR. A misleading function name will lead to problems sooner or later.
(In reply to comment #60)
> IS_CJK_CHAR really should be renamed to something like
> IS_COMPLEX_LINEBREAK_CHAR. A misleading function name will lead to problems
> sooner or later.

agreed.
(In reply to comment #60)
> IS_CJK_CHAR really should be renamed to something like
> IS_COMPLEX_LINEBREAK_CHAR. A misleading function name will lead to problems
> sooner or later.

That would also mean renaming nsLineBreaker::mCurrentWordContainsCJK as well as some local variables, or it would leave doubts to code readers how come they are related.

But mCurrentWordContainsComplexLinebreakChar is a very long name. Would there be a meaningful shorter name?
Functions and variables renaming; assertion fixes; initial PangoLanguage.
Oops. I forgot other backed out parts.
Attachment #267096 - Attachment is obsolete: true
I would just call it mCurrentWordIsComplex and IS_COMPLEX_CHAR. Throw in some comments to explain what "complex" means in this context.
Attached patch Updated patchSplinter Review
I see you've already changed that in trunk. So, I update the summarized patch against trunk accordingly.
Attachment #267098 - Attachment is obsolete: true
Comment on attachment 268080 [details] [diff] [review]
Updated patch

Now that the new text layout has been enabled in trunk for a while, I hope it's time to check-in.
Attachment #268080 - Flags: review?(roc)
now I'm working on bug 255990 which breaks an URL to multi words.

This patch can break the URL to several words? If so, we should not use this approach. Because there are no standard specs for line breaking. Therefore, we should care for compatibility with other browsers (especially WinIE).

And the current "our" line breaking spac is not JIS X 4051. based on it, but we customized the character classes.

Therefore, this patch breaks compatibility with Linux and others. It is not good for us, it will make to confuse the bug trackers.

I think that the native line breaker should be only used for poor part like Thai.
And note that we need to handle the non-natural languages for line breaking. e.g., URL, file path and fragment of code. We cannot use JIS X 4051 and UAX#14 directly.
(In reply to comment #68)
> now I'm working on bug 255990 which breaks an URL to multi words.
> 
> This patch can break the URL to several words? If so, we should not use this
> approach. Because there are no standard specs for line breaking. Therefore, we
> should care for compatibility with other browsers (especially WinIE).

Breaking URL requires your patch in bug 255990 so nsPangoLineBreaker gets called for the long continuous Latin chunk. After that, Pango's line breaker will break the URL based on UAX #14.

> And the current "our" line breaking spac is not JIS X 4051. based on it, but we
> customized the character classes.
> 
> Therefore, this patch breaks compatibility with Linux and others. It is not
> good for us, it will make to confuse the bug trackers.
> 
> I think that the native line breaker should be only used for poor part like
> Thai.

Fine. I wasn't aware of your attempt to enhance nsJISx4051LineBreaker when filing this bug. For me, Pango's line breaking is acceptable, as it has UAX #14 as its base, and I assumed its coverage can replace nsJISx4051LineBreaker functionality.

Well, if compatibility really matters here for line breaking behavior, and if we have enough time before FF3, your approach may be the right thing to do. With this, I have some issues:

A. Customization of some Thai characters needs to be added in bug 255990. (Some Thai characters prohibit line breaks before them, even with leading space, for example.)

B. Only call Thai line break routine for Thai chunks. This can be done either via platform's line breakers (Pango, Uniscribe, ATSUI), or by calling Thai library (libthai) directly. (Currently, it provides implementation for Thai line breaking in Pango.) For the latter choice, LibThai source is available under LGPL. (http://libthai.sf.net/) So, it's possible to compile it on other platforms than Linux as well. This could bypass Pango analysis layer and gain a little performance. But build process would need to be rearranged somehow to achieve that.

What do you think?

(For me, it means getting back to bug 7969, with bug 255990 combined.)
Note, however, that Thai line breaking is very much in need. You can see how long we have been waiting for bug 7969 to be fixed. This single bug has hindered all Firefox promotions in Thailand for years. So, if extending nsJISx4051LineBreaker takes too long, may I request for the Pango patch to be chosen for FF3?
I think that UAX #14 may not be best choice for web page rendering. Because it is not specified in CSS specs that UAs should use UAX #14. E.g., The normal paragraphs which has enough width for the content can be rendered as naturally for many people if we use UAX #14. But HTML has table which cell's width is auto in most times. If the many characters can break the line, there are too narrow cells than WinIE. For inhibiting this issue, we need near behavior of IE. So, it's not UAX #14 even if WinIE uses UAX #14 :-(

However, I think that the usability of Gecko should be preferred than compatibility for WinIE. Therefore, I agree that nsJISx4051LineBreaker uses native libraries in several language. (And I hope to land it before Fx3 too :-))
Here's an idea for how to handle this:
-- define a set of "difficult" characters (let's just say Thai for now)
-- in nsJISx4051LineBreaker::GetJISx4051Breaks, if the string contains any difficult characters, then call the platform API (say Pango) on the whole string and use the platform breakability values to determine whether there is a break opportunity before or after each "difficult" character. The break results for positions not adjacent to a "difficult" character remain controlled by the built-in breaking code.

How does that sound? Will it work well for Thai?
I think that it is a best solution for now.

Now, I fixed bug 255990. It breaks |nsPangoLineBreaker|, sorry. (I added a method to the |nsILineBreaker|, but I didn't change |nsPangoLineBreaker|.

If we use comment 73's solution, do we need to use nsPangoLineBreaker? Should we implement more simple modules for it? But nsPangoLineBreaker may be needed by some users who build themselves. (e.g., they may want more performance than compatibility with WinIE.)
I don't think we need to support using Pango for everything. If someone *really* wants to do that they can patch their own build however they like.

What I suggest is to create a native platform linebreaker interface in gfxPlatform.h (because we already use Pango etc in gfx). It really only needs one method, taking a Unicode string and an array of booleans and filling in the array. Then we can call that from intl's linebreaker when difficult characters are detected.
So, it seems line breaking compatibility inside Thai chunks is not an issue here, when it's handled by platform breakers. (I don't think it matters, either.)

If so, let's go on with that scheme.
I now have a first version, which does not build, because intl is built in lower tier (necko) than gfx (gecko). So, gfx headers are not available yet when building intl. Would you have some suggestion? Can it be moved?
OK. I try moving intl from necko to gecko, and it builds. This is a first, untested patch. I can't test it now because current trunk build segfaults on load.
I hope that you use -u8p for readability of the patch :-)
I don't think we can move intl from necko to gecko. Sorry.

How about making the platform linebreaker interface its own C++ interface, XPCOM compatible by inheriting from nsISupports, put the interface header file in xpcom/system, and make the implementation an XPCOM service?
(In reply to comment #80)
> I hope that you use -u8p for readability of the patch :-)

OK. I'll do that next time. Thanks.
 
(In reply to comment #81)
> I don't think we can move intl from necko to gecko. Sorry.
> 
> How about making the platform linebreaker interface its own C++ interface,
> XPCOM compatible by inheriting from nsISupports, put the interface header file
> in xpcom/system, and make the implementation an XPCOM service?

And put the platform implementations in, say, intl/lwbrk/components?
(In reply to comment #82)

> And put the platform implementations in, say, intl/lwbrk/components?

Probably, it's better built in toolkit tier under, say, toolkit/system, rather than hacked to get back into intl/lwbrk/components again after gecko tier. I'll prepare another patch based on this.
How about putting the component in widget or gfx, where we already have platform specific code? In gfx we already use Pango so that's probably not a bad idea.
Actually no. All this messing around is not worth it. Just put the Pango code in its own file in intl/lwbrk/src and use ifdef MOZ_ENABLE_PANGO to control whether it gets built or whether a stub implementation that does nothing gets used instead.
Yeah, I was about to ask about this option. Besides code complexity, I worried about the chance of missing Pango component in some installation even if Pango is enabled. Performance implication of too many abstractions in tight loop is also another issue.

So, let's get back.
Umm.. OTOH, while working on bug 7969, I also used to have a wish that there had been some pluggable interface for Thai line breakers. Samphan proposed in bug 7969 comment 41 to replace, rather than extend, nsJISx4501LineBreaker with his engine in an extension just because it lacked such generic channel. My initial patch in this bug was also based on this reason.

That may be one supporting reason, if any, for componentization, as there are many Thai line break engines available, in addition to platform breakers.

Not to push this approach, though. Just FYI.
That plugging could happen at the Pango level, couldn't it?
Yes, if Pango is intalled on *all* platforms.
Oh, I thought you were talking about plugging in multiple Thai line breakers that are already available on the platform.

Why would we ever want to ship multiple Thai line breakers ourselves?
No, it's not about shipping it ourselves. I mean extensibility with extensions. In fact, there have been at least 3-4 custom solutions available while waiting for bug 7969. Most Thai users do not use mozilla.org-released Firefox as is, nor binary packages provided by distros [with Ubuntu as an exception], as the lack of proper line breaking rendered it unusable. And the need was so high that many developers came up with their own solutions on Linux, Windows, and Mac. Some users may still prefer their own favorite one even if Pango support is shipped with standard mozilla.org release, esp. for non-Unix platforms.

I cannot guess what would happen after Pango (or Uniscribe/ATSUI) support is included upstream. Some may not like the quality it provides, esp. when they cannot do anything with it for the cases of Uniscribe/ATSUI. Some may try to push their solutions upstream after that, or they may just continue overriding nsJISx4501LineBreaker, who knows? And if it's the case, extensibility for Thai line breaker can minimize the incompatibility.

But I can't speak for them, anyway. So I said I didn't try to push it. We can go on with hard-wiring and wait for further reports.
This patch uses MOZ_ENABLE_PANGO as make flag to choose Thai line breaker implementation. It falls back to rulebrk.c for non-Pango builds.

Patch still untested, because trunk default build still segfaults on start up.
masayuki, note on JISx4051 table: I think it doesn't make sense to have Thai characters breakable with all classes. For example, we shouldn't break between Thai character and quotation mark, exclamation mark, question mark, and parentheses. It makes more sense to treat it like Latin by default. (This includes no breaking between Thai characters, too. So it still makes some sense even without the "special handling".)

In above patch, I modified the loops with the Thai breakability assumption removed.
> trunk default build still segfaults on start up.

It does? What bug is this?
On my Debian sid amd64 box:

$ dist/bin/firefox
dist/bin/run-mozilla.sh: line 131: 18712 Segmentation fault      "$prog" ${1+"$@"}

Normally, I "make install" it before running, but this time, it complains:

/home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall -t -m 644 xrecore.h nsXULAppAPI.h /home/foxy/include/firefox-3.0a7pre/xulapp
/home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall -t -m 644 _xpidlgen/xulapp.xpt /home/foxy/lib/firefox-3.0a7pre/components
/home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall -t -m 644 nsINativeAppSupport.idl nsIXULRuntime.idl /home/foxy/share/idl/firefox-3.0a7pre
/home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall -t -m 644 _xpidlgen/nsINativeAppSupport.h _xpidlgen/nsIXULRuntime.h /home/foxy/include/firefox-3.0a7pre/xulapp
/home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall -R -m 644  /home/foxy/lib/firefox-3.0a7pre
usage: /home/thep/cvs/mozilla_cvs/mozilla/config/nsinstall [-C cwd] [-L linkprefix] [-m mode] [-o owner] [-g group]
                                                           [-DdltR] file [file ...] directory
make[3]: *** [install] Error 2
make[3]: Leaving directory `/home/thep/cvs/mozilla_cvs/mozilla/toolkit/xre'
make[2]: *** [install] Error 2
make[2]: Leaving directory `/home/thep/cvs/mozilla_cvs/mozilla/toolkit'
make[1]: *** [install] Error 2
make[1]: Leaving directory `/home/thep/cvs/mozilla_cvs/mozilla'
make: *** [install] Error 2
Can you get a stack trace for that crash? Or a regression range? You should file a bug in any case.
Oddly enough, when I use --enable-debug to get more info, it doesn't crash! But the "make install" problem remains.

Anyway, I can now run it from the uninstalled copy. And the patch works.
In this patch, I also modify JIS x4051 pair table for Thai, so that it's not broken before closing parenthesis and after openning parenthesis, etc. With this, Thai punctuation marks and digits are also classified in separation from alphabets.
(In reply to comment #97)
> Oddly enough, when I use --enable-debug to get more info, it doesn't crash! But
> the "make install" problem remains.

To be sure, I just "make clean" and rebuild everything again. It now doesn't crash with or without --enable-debug. So, there must have been something wrong in my working copy in former builds.

But the "make install" problem should be real. I'll file a bug soon.
"make install" is hardly ever used, I'm not surprised it doesn't work.

There's no point in keeping Next, Prev and BreakInBetween working with their separate implementations. BreakInBetween is never used anymore, just delete it from the interface (and rev the UUID). I suggest you replace Prev and Next with implementations that just select the space-delimited word containing the requested aPos, call GetJISx4051Breaks on that word, and compute their results from the returned boolean array. These are only called by stuff like content serialization now, where performance is not an issue as long as it's still O(N). At some point we should get rid of the callers to Prev and Next, and have them use nsLineBreaker.h instead, but that's probably more work than you want to do right now.

-     c = CLASS_THAI;
+     if (0x0E50 <= u && u <= 0x0E59) {
+       c = 6;  // Thai digits
+     } else {
+       switch (u)
+       {
+         case 0x0E3F: // Thai Currency Symbol Baht
+	   c = 0; break;
+         case 0x0E2F: // Thai Paiyan Noi
+         case 0x0E46: // Thai Mai Yamok
+         case 0x0E5A: // Thai Angkhan Khu
+         case 0x0E5B: // Thai Khomut
+           c = 1; break;
+         case 0x0E4F: // Thai Fongman
+           c = 8; break;

Could this be done better using GETCLASSFROMTABLE and a new table?

+      breakState.AppendElements(end - cur);

Check for OOM here.

+nsGetComplexLineBreaks(const PRUnichar* aText, PRUint32 aLength,
+                       PRPackedBool* aBreakBefore);

Call this NS_GetComplexLineBreaks, otherwise it looks like a type to me.

+      while (end < aLength && CLASS_THAI == GetClass(aChars[end]))
+        ++end;
+      for (PRUint32 i = cur; i < end; ++i)
+        aBreakBefore[i] = breakState[i - cur];

I think we generally try to use {} around any statement-blocks that aren't break, continue or return.

+      cur = end;

I think this is wrong. 'cur' gets incremented after this and aBreakBefore[end] won't be set. You probably want to set cur to end - 1 instead.

I think this code sets aBreakBefore[cur] to always be false, right? Should aBreakBefore[cur] and aBreakBefore[end] be set to the class-based break values, or be taken from the complex line breaker? If you want to use the complex line breaker for those values, then I think you should pass the whole string to the complex line breaker so it has all the context it might need. In this case you'll want to make sure we only call the complex breaker once.

Instead of making this stuff Thai-specific, should we rename CLASS_THAI to CLASS_COMPLEX and similarly for other names? Aren't there other languages that need similar processing, like Khmer and Lao? (We don't have to support those now but if we can make it easy to support them later without doing much extra work, that'd be nice.)
(In reply to comment #100)

> There's no point in keeping Next, Prev and BreakInBetween working with their
> separate implementations. BreakInBetween is never used anymore, just delete it
> from the interface (and rev the UUID).

I'll do so in next patch.

> I suggest you replace Prev and Next with
> implementations that just select the space-delimited word containing the
> requested aPos, call GetJISx4051Breaks on that word, and compute their results
> from the returned boolean array. These are only called by stuff like content
> serialization now, where performance is not an issue as long as it's still
> O(N). At some point we should get rid of the callers to Prev and Next, and have
> them use nsLineBreaker.h instead, but that's probably more work than you want
> to do right now.

Oh, yes. I tried to be less intrusive in last patch. But the change to Prev() and Next() you suggest is logical.

> -     c = CLASS_THAI;
> +     if (0x0E50 <= u && u <= 0x0E59) {
> +       c = 6;  // Thai digits
> +     } else {
> +       switch (u)
> +       {
> +         case 0x0E3F: // Thai Currency Symbol Baht
> +          c = 0; break;
> +         case 0x0E2F: // Thai Paiyan Noi
> +         case 0x0E46: // Thai Mai Yamok
> +         case 0x0E5A: // Thai Angkhan Khu
> +         case 0x0E5B: // Thai Khomut
> +           c = 1; break;
> +         case 0x0E4F: // Thai Fongman
> +           c = 8; break;
> 
> Could this be done better using GETCLASSFROMTABLE and a new table?

This would involve changing intl/lwbrk/tools/jisx4501class.txt, right? I'll try that.

> +      breakState.AppendElements(end - cur);
> 
> Check for OOM here.

OK. Probably, it should fallback to somthing logical rather than just return from the function? (It's void function, and no way to return the status.)

> +nsGetComplexLineBreaks(const PRUnichar* aText, PRUint32 aLength,
> +                       PRPackedBool* aBreakBefore);
> 
> Call this NS_GetComplexLineBreaks, otherwise it looks like a type to me.

OK.

> +      while (end < aLength && CLASS_THAI == GetClass(aChars[end]))
> +        ++end;
> +      for (PRUint32 i = cur; i < end; ++i)
> +        aBreakBefore[i] = breakState[i - cur];
> 
> I think we generally try to use {} around any statement-blocks that aren't
> break, continue or return.

OK, including this as well, right?

       while (cur > begin)
         if (breakState[--cur - begin])
           return cur;

> +      cur = end;
> 
> I think this is wrong. 'cur' gets incremented after this and aBreakBefore[end]
> won't be set. You probably want to set cur to end - 1 instead.

You're right. It should have been end - 1.

> I think this code sets aBreakBefore[cur] to always be false, right? Should
> aBreakBefore[cur] and aBreakBefore[end] be set to the class-based break values,
> or be taken from the complex line breaker? If you want to use the complex line
> breaker for those values, then I think you should pass the whole string to the
> complex line breaker so it has all the context it might need. In this case
> you'll want to make sure we only call the complex breaker once.

With cur = end - 1 fix above, aBreakBefore[end] should be set in next round. And for aBreakBefore[cur], it should be already set by this code before entering Thai chunk:

    if (cur > 0) {
      NS_ASSERTION(CLASS_COMPLEX != lastClass || CLASS_COMPLEX != cl,
                   "Loop should have prevented two adjacent Thai chars here");
      allowBreak = GetPair(lastClass, cl);
    } else {
      allowBreak = PR_FALSE;
    }
    aBreakBefore[cur] = allowBreak;

Note that it's forced to be PR_FALSE only at string beginning. (This may require Prev() and Next() to set it themselves before calling GetJISx4051Breaks() if I modify it as described above.)

> Instead of making this stuff Thai-specific, should we rename CLASS_THAI to
> CLASS_COMPLEX and similarly for other names? Aren't there other languages that
> need similar processing, like Khmer and Lao? (We don't have to support those
> now but if we can make it easy to support them later without doing much extra
> work, that'd be nice.)

That's fine. But note that this implies that we will enforce a requirement of a middle layer like Pango to separate Thai/Lao/Khmer texts before calling appropriate language engines. But we can imitate that by a separate breaker wrapper if actually in need. (I'm talking about the lack of supports for such languages in Windows and Mac, due to their small market sizes. So, Uniscribe/ATSUI may not be sufficient for them, and some other solutions may be in need.)
(In reply to comment #101)
> (In reply to comment #100)
> > Could this be done better using GETCLASSFROMTABLE and a new table?
> 
> This would involve changing intl/lwbrk/tools/jisx4501class.txt, right? I'll
> try that.

You'll need to modify http://mxr.mozilla.org/seamonkey/source/intl/lwbrk/tools/anzx4501.pl too (see where it prints out the arrays).

> > +      breakState.AppendElements(end - cur);
> > 
> > Check for OOM here.
> 
> OK. Probably, it should fallback to somthing logical rather than just return
> from the function? (It's void function, and no way to return the status.)

Yeah, fill the array in with either true or false, whichever makes more sense to you.

> > I think we generally try to use {} around any statement-blocks that aren't
> > break, continue or return.
> 
> OK, including this as well, right?
> 
>        while (cur > begin)
>          if (breakState[--cur - begin])
>            return cur;

No, that's a return.

> With cur = end - 1 fix above, aBreakBefore[end] should be set in next round.

Right.

> And for aBreakBefore[cur], it should be already set by this code before
> entering Thai chunk:
> 
>     if (cur > 0) {
>       NS_ASSERTION(CLASS_COMPLEX != lastClass || CLASS_COMPLEX != cl,
>                    "Loop should have prevented two adjacent Thai chars here");
>       allowBreak = GetPair(lastClass, cl);
>     } else {
>       allowBreak = PR_FALSE;
>     }
>     aBreakBefore[cur] = allowBreak;

Nope, because your "i" loop is overwriting it. You probably want to start your i loop at cur+1.

> > Instead of making this stuff Thai-specific, should we rename CLASS_THAI to
> > CLASS_COMPLEX and similarly for other names? Aren't there other languages
> > that need similar processing, like Khmer and Lao? (We don't have to support
> > those now but if we can make it easy to support them later without doing
> > much extra work, that'd be nice.)
> 
> That's fine. But note that this implies that we will enforce a requirement of > a middle layer like Pango to separate Thai/Lao/Khmer texts before calling
> appropriate language engines. But we can imitate that by a separate breaker
> wrapper if actually in need. (I'm talking about the lack of supports for such
> languages in Windows and Mac, due to their small market sizes. So,
> Uniscribe/ATSUI may not be sufficient for them, and some other solutions may
> be in need.)

Yeah. We can write platform-specific NS_GetComplexLineBreaks implementations that do whatever is necessary for the platform.
Patch updated according to above comments. Classes for Lao, which is also in 0EXX range, are also added.
Attachment #272399 - Attachment is obsolete: true
+           (0x0e01 <= u && u <= 0x0e5f) || // Thai

Is this now Thai+Lao?

We should make GetClass entirely table driven. We could have a 256-entry table that just maps the high byte of the character to a class table. (Each entry could be one byte, indexing an array of class tables.) I think we'd need 5 more class tables as well as the pointer table but that's only 5*128+256 = 896 more bytes. We don't need to do that in this bug though.

You could share the implementation of Next and Prev in a static function that takes a direction flag (I like to make it an integer which is either -1 or 1). The loop over the breakState array could be written

  do {
    aPos += aDirection;
  } while (aPos >= begin && aPos < end && !breakState[aPos - begin]);

+    // out of memory, behave as if there were no complex line breaker
+    for (PRUint32 i = 0; i < aLength; ++i)
+      aBreakBefore[i] = IS_SPACE(aText[i]);

I wouldn't try this hard just for out-of-memory. Just set aBreakBefore to all PR_FALSE. Anyway, IS_SPACE can't be true because IS_COMPLEX is true for all characters passed in here, right?

Nearly there! Just one more iteration required I hope :-)
(In reply to comment #104)
> +           (0x0e01 <= u && u <= 0x0e5f) || // Thai
> 
> Is this now Thai+Lao?

Umm.. Yes, partially for Lao, by the classes info just added for completeness. Let's pass Lao in for the superficial support, then.

> We should make GetClass entirely table driven. We could have a 256-entry table
> that just maps the high byte of the character to a class table. (Each entry
> could be one byte, indexing an array of class tables.) I think we'd need 5 more
> class tables as well as the pointer table but that's only 5*128+256 = 896 more
> bytes. We don't need to do that in this bug though.

Right. I'm afraid it would need regression test for CJK as well, which I'm not so sure I can handle it well.

> You could share the implementation of Next and Prev in a static function that
> takes a direction flag (I like to make it an integer which is either -1 or 1).
> The loop over the breakState array could be written
> 
>   do {
>     aPos += aDirection;
>   } while (aPos >= begin && aPos < end && !breakState[aPos - begin]);

Oops. My loop condition in Prev() "while (aPos > 0)" was actually wrong. :P

However, since aPos and begin are unsigned, I think it can only be "aPos > begin", or we risk underflow on decrementing when begin == 0. Besides, breakState[aPos - begin] is always false when aPos == begin from GetJISx4051Breaks() assumption. So, when aPos reaches zero, it can be word start or really unbreakable. We should return NS_LINEBREAKER_NEED_MORE_TEXT in such case. But if aPos reaches non-zero begin, it's chunk start and thus breakable.

The shared function can't be static, though, I think. We need to call GetJISx4051Breaks() method.

> +    // out of memory, behave as if there were no complex line breaker
> +    for (PRUint32 i = 0; i < aLength; ++i)
> +      aBreakBefore[i] = IS_SPACE(aText[i]);
> 
> I wouldn't try this hard just for out-of-memory. Just set aBreakBefore to all
> PR_FALSE. Anyway, IS_SPACE can't be true because IS_COMPLEX is true for all
> characters passed in here, right?

I tried to decouple it by generalizing it. But, yes, it could be assumed there is no space passed in.
> However, since aPos and begin are unsigned

You can convert them to signed...

> So, when aPos reaches zero, it can be word
> start or really unbreakable.

Right, so at the end you need an extra conditional to decide what result to return.

> The shared function can't be static, though, I think. We need to call
> GetJISx4051Breaks() method.

Sure OK.
With various constraints, the loop looks quite ugly:

    while (aDirection > 0 || aPos > begin) {
      aPos += aDirection;
      if (aPos >= end || breakState[aPos - begin])
        break;
    }

The constraints are:

a) (aPos > begin) needs to be checked before decrementing, to prevent unsigned wrap-around. This converts the loop from do-while to while-do.

b) (aPos == end) can be true in case of Prev(), while (aPos == begin) can also be true for Next(). So, mere (aPos > begin) loop condition will break the edge case for Next(), and checking (aPos < end - 1) before incrementing will break the edge case for Prev(). The former introduces (aDirection > 0 ||) to the loop, and the latter tears the (aPos >= end) loop condition to be done after incrementing.

If above loop is still readable to you, I'll go on with it.
(In reply to comment #106)
> > However, since aPos and begin are unsigned
> 
> You can convert them to signed...

Hmm.. The function returns PRInt32 anyway. So, we don't need to worry about the range. OK. I'll retry with that.
Attachment #272597 - Attachment is patch: true
Attachment #272597 - Attachment mime type: text/x-patch → text/plain
Comment on attachment 272597 [details] [diff] [review]
updated patch, with WordMove() shared implementation

+  for (end = aPos + 1; end < aLen && !IS_SPACE(aText[end]); ++end)

Cast aLen to PRInt32(aLen) here to avoid a gcc warning.

Otherwise looks good. Attach the final patch and get someone to land it for you. I will if no-one else is able.

That person should CVS remove nsPangoLineBreaker.h/.cpp too.
Attachment #272597 - Flags: superreview+
Attachment #272597 - Flags: review+
wait, it looks like that the prev/next can break ASCII words even if the text doesn't have CJK characters?
see bug 344816. we should not break ASCII words in Serializers.
I'm not sure I understand the problem. You mean Prev() and Next() should try to analyze complex line breaks only when at least one CJK/complex character is found in the string?
Yes, we should not break the lines in Prev/Next if they doesn't have CJK/complext charcters. But logically, this approach is wrong. I think that we should add a param for GetJISx4051Breaks like my patch for bug 344816.

I want to hear the roc's suggestion.
The change not to break non-CJK/complex texts should be simple: just borrow the fall back code in WordMove() when it's the case. If GetJISx4051Breaks() is not used by serializers, no need to touch it, I think.
right; in WordMove(), after you've computed begin and end (or possibly while you're computing them), scan over the word and see if any characters are IS_CJK_CHAR. If none are, don't allow breaking in the word and don't call GetJISx4051Breaks. GetJISx4051Breaks does not need to change.

One thing I realized is that the term "complex" is overused; we use it in nsLineBreaker.h to mean any character requiring nsILineBreaker (pseudo-JIS4051) processing. With this patch we're using it again to mean characters that are too complex for JIS4051 and require platform-supplied language-specific processing (Thai/Lao). Theppitak, I bet you're tired of making changes to this patch --- sorry --- but when you fix the WordMove() issue, could you rename the use of "complex" here to something else? How about CLASS_EXTENDED, NS_GetExtendedLineBreaks etc?
> scan over the word and see if any characters are IS_CJK_CHAR

That would give us compatibility with existing behaviour, but you could also check for IS_COMPLEX_CHAR/IS_EXTENDED_CHAR if you want the serializer to start breaking Thai/Lao.
In fact, I share some mind with Masayuki that being explicit can make things clearer. I wonder if the test cases in Bug 255990 are addressed with this complex chars detection. (I saw some Korean characters in URLs in attachment 156435 [details].)

An idea I get after rethinking is to reopen Bug 344816 and adjust Masayuki's patch after closing this bug. Is that sound?

Regarding the term "complex", well, I think the precise meaning of CTL doesn't include CJK. What's really "complex" are languages like Thai, Lao, Khmer, and even Arabic. But if you insist on changing it, I can do it in another patch, of course. (I'll not be tired, until it's committed. Because I'll continue being much more tired afterwards if it isn't.)
(In reply to comment #119)
> An idea I get after rethinking is to reopen Bug 344816 and adjust Masayuki's
> patch after closing this bug. Is that sound?

I worry that the things that the URLs are broken at pasting block the testing (or daily use :-) ) of tb nightly builds.
If so, I should follow roc's comment for compatibility with existing behavior. You may address it later by reopenning Bug 344816 if you still like to add the extra argument.
> If so, I should follow roc's comment for compatibility with existing behavior.
> You may address it later by reopenning Bug 344816 if you still like to add the
> extra argument.

I agree.

Ok, don't change the COMPLEX names for now.
Comment on attachment 272730 [details] [diff] [review]
updated patch with complex line break avoided for non-JISx4051 chars

OK let's land this!
Attachment #272730 - Flags: superreview+
Attachment #272730 - Flags: review+
checked in!
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
This patch breaks compilation on linux x86_64 with the following error message:

/usr/bin/ld: ../lwbrk/src/liblwbrk_s.a(nsPangoBreaker.o): relocation R_X86_64_PC32 against `pango_language_from_string' can not be used when making a shared object; recompile with -fPIC

That's weird, because I was using Debian sid amd64 when preparing & testing the patch..
What sort of build, Kai? static? libxul?
SeaMonkey, debug, not static, not libxul

mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/../obj-sm-debug
mk_add_options MOZ_CO_PROJECT=suite
ac_add_options --enable-application=suite
ac_add_options --enable-extensions=default,tasks
ac_add_options --disable-tests

ac_add_options --enable-default-toolkit=cairo-gtk2
ac_add_options --enable-pango
ac_add_options --enable-svg
ac_add_options --enable-canvas

ac_add_options --disable-elf-dynstr-gc

ac_add_options --without-system-nspr
ac_add_options --without-system-nss
ac_add_options --without-system-jpeg
ac_add_options --without-system-zlib
ac_add_options --without-system-png

ac_add_options --enable-debug
ac_add_options --enable-optimize=no

mk_add_options MOZ_MAKE_FLAGS=-j3


Hmm, maybe I should no longer use cairo-gtk2?
I forgot why I selected that in the past.
Let me try gtk2.
checking if app-specific confvars.sh exists... /home/kaie/moz/head-312/mozilla/suite/confvars.sh
configure: error: Toolkit must be cairo-gtk2.
*** Fix above errors and then restart with               "make -f client.mk build"
make[1]: *** [configure] Error 1
You probably need to add the pango headers to config/system-headers.  See, for example, bug 385692.
Presumably this will fix it, but I didn't do any testing.
Attachment #273111 - Flags: review?(roc)
Your patch doesn't seem to help me:

gmake[6]: Entering directory `/home/kaie/moz/head-312/obj-sm-debug/intl/lwbrk/src'
nsSampleWordBreaker.cpp
nsJISx4501LineBreaker.cpp
nsSemanticUnitScanner.cpp
gmake[6]: *** No rule to make target `nsPangoBreaker.o', needed by `liblwbrk_s.a'.  Stop.
gmake[6]: *** Waiting for unfinished jobs....
/home/kaie/moz/head-312/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp:45:30: error: nsComplexBreaker.h: No such file or directory
/home/kaie/moz/head-312/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp: In member function ‘virtual void nsJISx4051LineBreaker::GetJISx4051Breaks(const PRUnichar*, PRUint32, PRPackedBool*)’:
/home/kaie/moz/head-312/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp:517: error: ‘NS_GetComplexLineBreaks’ was not declared in this scope
gmake[6]: *** [nsJISx4501LineBreaker.o] Error 1
Hm, it should address the problem in comment 126, I don't know what's up with the error you just pasted.
(In reply to comment #134)
> Hm, it should address the problem in comment 126, I don't know what's up with
> the error you just pasted.
> 

It's a new error that seems to be introduced by your patch.

When I compile trunk as is, I get the error I reported in comment 126.

When I compile trunk with the patch proposed in comment 132, I get the error from comment 133.

When I back out the patch originally checked in for this bug, my build completes ok.
Ah, okay.  Maybe I should actually try compiling on this x86_64 system I have sitting right here.  :)
Hm, I can't reproduce any build bustage on my x86_64 build.
Comment on attachment 273111 [details] [diff] [review]
add pango/pango-break.h to system-headers [checked in]

That patch did fix it for me (I wrote the same patch before seeing this), but I needed to clobber nsPangoBreaker.o to get it to take effect.
Attachment #273111 - Flags: superreview+
Attachment #273111 - Flags: review+
Comment on attachment 273111 [details] [diff] [review]
add pango/pango-break.h to system-headers [checked in]

Checked in.  Turns out I have gcc 4.1.2, which is why I couldn't reproduce this.
Attachment #273111 - Attachment description: add pango/pango-break.h to system-headers → add pango/pango-break.h to system-headers [checked in]
Attachment #273111 - Flags: review?(roc)
I just find that a return line is missing in my last patch. In nsPangoBreaker.cpp, the out-of-memory fallback should return immediately when it's done.
(In reply to comment #138)
> That patch did fix it for me (I wrote the same patch before seeing this), but I
> needed to clobber nsPangoBreaker.o to get it to take effect.

Confirming, now it works for me, too. Thanks.
Bug 389520 has been filed for Mac.
Bug #390048 has been filed for Window
While the last patch is not checked-in yet, I update it for consistency with the patches proposed in the other two bugs.
Comment on attachment 274594 [details] [diff] [review]
Patch with pre-memset() and then return on failure

To really close this bug, could either this patch or Attachment #273382 [details] [diff] be checked-in?
Attachment #274594 - Flags: review?(roc)
Attachment #274594 - Flags: superreview+
Attachment #274594 - Flags: review?(roc)
Attachment #274594 - Flags: review+
Attachment #274594 - Flags: approval1.9?
Attachment #274594 - Flags: approval1.9? → approval1.9+
Assignee: smontagu → thep
QA Contact: amyy → i18n
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: