Closed Bug 425915 Opened 16 years ago Closed 4 years ago

word boundary detection for Thai text

Categories

(Core :: Layout: Text and Fonts, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: arthit, Assigned: jfkthame)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

Attachments

(10 files, 8 obsolete files)

7.90 KB, patch
roc
: review+
Details | Diff | Splinter Review
8.15 KB, patch
roc
: review+
Details | Diff | Splinter Review
9.22 KB, patch
roc
: review+
Details | Diff | Splinter Review
15.19 KB, patch
roc
: review+
Details | Diff | Splinter Review
15.08 KB, patch
roc
: review+
Details | Diff | Splinter Review
14.94 KB, patch
roc
: review+
Details | Diff | Splinter Review
59 bytes, text/x-review-board-request
Details
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
Word boundary detection is needed for various text-related functions in Firefox,
including:
- navigation/selection/editing by keyboard: caret moving by word
- selection by mouse: double-clicking on text should select word (not the whole text between whitespaces)


Navigation by keyboard case:

input text:  ทดสอบการทำงาน ภาษาไทย-ลาว

expected caret stops (estimated), from left to right: ทดสอบ|การ|ทำงาน| ภาษา|ไทย-|ลาว|

current caret stops, from left to right: ทดสอบการทำงาน| ภาษาไทย-|ลาว|

| = caret position



Selection by mouse double-clicking case:

text:  ทดสอบการทำงาน

double-click any point over "ทดสอบ" text should highlight the whole word "ทดสอบ" and not others*


*also depends on dictionary, if we use dictionary-based approach for word boundary detection.

found on Mac OS X, Linux, Windows
Blocks: thai
Severity: normal → enhancement
FYI.

Safari 3.1.2 on Mac OS X has no problem on this test case, both keyboard and mouse selection.
Component: Layout: CTL → Layout: Text
QA Contact: arthit → layout.fonts-and-text
Prabhat, are you still working on this?

Can someone test this case with Firefox 4, and confirm the problem still exists?
Assignee: prabhat.hegde → nobody
Windows XP / Firefox 4 beta 6

Still work not correct
Still exists on 4.0b6 Ubuntu 10.04
Firefox 4.0b8pre on Ubuntu 10.10 has problem on this case.
goffity

Still exists Firefox 3.6.11 Ubuntu 10.04
TonHor ,, i found this bug ,too

Firefox 3.6.11 on Windows 7 in this test case
I use Firefox 4 beta on Window 7.
this bug not fix.
The people here at Barcamp Bangkok are saying this bug is a major problem in daily-usage of Firefox. Enhancement -> Major.
Severity: enhancement → major
Still exists Firefox 3.6.11 Windows 7
David or Masayuki, do you know where in the codebase we'd need to change in order to fix this? A contributor has a Thai word-boundary detection algorithm he's implemented in Java, and is interested in converting it to something we can use.
In the code in intl/lwbrk/src/, there is integration with platform-specific line-breaking code for line breaking in certain languagse.  See the ifdefs in the Makefile.in in that directory that choose an appropriate platform implementation of NS_GetComplexLineBreaks.

However, it sounds like this bug isn't really about line breaking, but rather about selection movement and extension; I think the relevant codepath may be that code in nsSelection.cpp calls nsIFrame::PeekOffset which in turn calls nsTextFrame::PeekOffsetWord.
I'd also note that we probably don't want more than one way of doing Thai word boundary detection; we already have a setup for relying on the platform for it for line breaking, and we should probably use the same code here.  (If it needs improvement or replacement, that should probably be a separate bug.)
Arthit said earlier today that libthai has code for word boundary detection, but that we're not using it. Maybe the solution is to have the relevant selection code use libthai instead of whatever it's using now?
roc may have some more advice.
Cc'ing Theppitak, who wrote the linebreaking code for Thai. Perhaps he knows the best way forward here, or would be willing to write a patch for this?
(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) (away: 9/6 - 9/10, JST) from comment #15)
> Isn't using nsIWordBreaker?
> http://mxr.mozilla.org/mozilla-central/source/layout/generic/
> nsTextFrameThebes.cpp#5705
> 
> Its implementation is here:
> http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/
> nsSampleWordBreaker.cpp

Having tried some quick change for verification, I can confirm that it is
the right place to do (with the former one moved to
layout/generic/nsTextFrame.cpp, line 6827). 

However, the current interface is obviously inefficient for the costly Thai
word analysis:

layout/generic/nsTextFrame.cpp, line 6823:
---8<---
  nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
  for (int32_t i = 0; i <= textLen; ++i) {
    int32_t indexInText = i + textStart;
    mWordBreaks[i] |=
      wordBreaker->BreakInBetween(aContext.get(), indexInText,
                                  aContext.get() + indexInText,
                                  aContext.Length() - indexInText);
  }
---8<---

The other methods, FindWord() and NextWord(), look more practical, although
something like nsJISx4051LineBreaker::GetJISx4051Breaks() is the most
optimal.

> I guess that you need to use some native APIs on each platform like line
> breaker for such complex languages. See bug 336959 and
> https://wiki.mozilla.org/Gecko:Line_Breaking#Complex_scripts

Yes, we can even call the existing NS_GetComplexLineBreaks(), as there is
no significant difference between line break and word break for Thai.
This first patch uses NextWord() instead of BreakInBetween(), and only patch the NextWord() method, not the others. Just for further discussion.
Thanks Theppitak!

Masayuki, are you the right person to take a look at Theppitak's approach in this patch?
Flags: needinfo?(masayuki)
This patch implements Thai word break in all methods. All are tested with ad hoc modifications on ClusterIterator c-tor. (All three methods are possible to use. Only NextWord() is chosen in the proposed patch.)
Attachment #8485374 - Attachment is obsolete: true
Comment on attachment 8485802 [details] [diff] [review]
Updated patch, with all methods implemented

Requesting for patch review.
Attachment #8485802 - Flags: review?(roc)
(In reply to Dietrich Ayala (:dietrich) from comment #20)
> Thanks Theppitak!
> 
> Masayuki, are you the right person to take a look at Theppitak's approach in
> this patch?

I'm not so familiar around the complex script part. However, if nobody will review it, I'll try.

Theppitack:

Please create diff files with |-U 8 -p|. That makes patches easy to review. Anyway, thank you for your work.

# And sorry for the delay to reply due to my vacation and Japanese vacation season.
Flags: needinfo?(masayuki)
Attached patch Reformatted patch (obsolete) — Splinter Review
I've reformatted the patch as Masayuki suggested. Please review it. Thanks.
Attachment #8485802 - Attachment is obsolete: true
Attachment #8485802 - Flags: review?(roc)
Attachment #8489813 - Flags: review?(masayuki)
Comment on attachment 8489813 [details] [diff] [review]
Reformatted patch

Roc or smontague is better reviewer. If they are too busy, I'll try, though.
Attachment #8489813 - Flags: review?(masayuki) → review?(roc)
Comment on attachment 8489813 [details] [diff] [review]
Reformatted patch

Review of attachment 8489813 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +30,5 @@
> +  uint8_t c2 = this->GetClass(aText2[0]);
> +  if (kWbClassThaiLetter == c1 && kWbClassThaiLetter == c2) {
> +    nsAutoString text(aText1, aTextLen1);
> +    text.Append(aText2, aTextLen2);
> +    uint8_t *breakBefore = new uint8_t[aTextLen1 + aTextLen2];

Use
  nsTArray<uint8_t> breakBefore;
  breakBefore.SetLength(aTextLen1 + aTextLen2);
then pass breakBefore.Elements(). The implicit destructor will clean up for you.

You could make it an nsAutoTArray to avoid some heap allocations.

@@ +130,5 @@
>       }
>    }
>    if(kWbClassThaiLetter == c)
>    {
> +    uint8_t *breakBefore = new uint8_t[range.mEnd - range.mBegin];

Same here.

@@ +169,5 @@
>         break;
>    }
>    if(kWbClassThaiLetter == c1)
>    {
> +    uint8_t *breakBefore = new uint8_t[aLen - aPos];

and here.

@@ +177,5 @@
> +      i++;
> +    }
> +    delete[] breakBefore;
> +    if (i < cur - aPos)
> +      return aPos + i;

This still isn't a very good interface. It's quadratic in the number of words in the text.

Seems to me we need a new method in nsIWordBreaker that does exactly what the loop in nsTextFrame does: fills in a word-breaks array for the entire string. Which, conveniently, is what NS_GetComplexLineBreaks does.

Can you do that? If not, needinfo me and I'll try to find someone else to help.
Attachment #8489813 - Flags: review?(roc) → review-
Attached patch Patch using nsTArray (obsolete) — Splinter Review
This patch uses nsTArray as roc suggested. It looks cleaner now.

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #26)
> Seems to me we need a new method in nsIWordBreaker that does exactly what
> the loop in nsTextFrame does: fills in a word-breaks array for the entire
> string. Which, conveniently, is what NS_GetComplexLineBreaks does.
> 
> Can you do that? If not, needinfo me and I'll try to find someone else to
> help.

I'll try to do it first. Here's what I think needed to do:
- Add a new method GetWordBreaks() to nsIWordBreaker (with uuid regenerated).
- Add the method in nsSampleWordBreaker and implement it.
- Modify the ClusterIterator c-tor to call the new method instead.
Attachment #8489813 - Attachment is obsolete: true
This patch now adds the new nsIWordBreaker::GetWordBreaks() method, implements it in nsSampleWordBreaker, and modifies ClusterIterator to use it.
Attachment #8490581 - Attachment is obsolete: true
Attachment #8490656 - Flags: review?(roc)
Comment on attachment 8490656 [details] [diff] [review]
Patch with new GetWordBreaks() method

Review of attachment 8490656 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsIWordBreaker.h
@@ +33,5 @@
>                                 uint32_t aOffset) = 0;
>    virtual int32_t NextWord(const char16_t* aText, uint32_t aLen, 
>                             uint32_t aPos) = 0;
> +  virtual void GetWordBreaks(const char16_t* aText, uint32_t aLen,
> +                             bool* aBreakBefore) = 0;

Document that some elements of aBreakBefore are set to true and the rest are untouched.

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +202,5 @@
> +  {
> +    for (curEnd = curBegin + 1; curEnd < aLen; curEnd++)
> +    {
> +      c2 = this->GetClass(aText[curEnd]);
> +      if (c2 != c1) 

remove trailing whitespace

@@ +214,5 @@
> +                              breakBefore.Elements());
> +      for (uint32_t i = curBegin + 1; i < curEnd; i++)
> +      { 
> +        aBreakBefore[i] = breakBefore[i - curBegin];
> +      } 

remove trailing whitespace
Attachment #8490656 - Flags: review?(roc) → review+
Attached patch Edited patchSplinter Review
Edited patch according to roc's comments.
Attachment #8490729 - Flags: review?(roc)
build busted and test failures: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=cf586bd6a24e

Thep, could you please take a look?
Assignee: nobody → thep
Status: NEW → ASSIGNED
(In reply to Ekanan Ketunuti from comment #31)
> build busted and test failures:
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=cf586bd6a24e
> 
> Thep, could you please take a look?

Oh, I've ignored textStart value in ClusterItarator c-tor in the latest patch. Fixing soon.
Attached patch Correcting build & test failures (obsolete) — Splinter Review
This patch fixes:
- The incorrect filling of word breaks without textStart concerned, which could cause wrong behavior when aDirection < 0.
- The missing inclusion of nsTArray.h

test_bug496275.html now passes in my box. I'm not sure about others.
I was wrong when saying that test_bug496275.html was passed in my box. I just forgot how to really run mochitest when saying so.

This patch:
- fixes test_bug496275.html failure, the difference is that aBreakBefore[0] was missed compared to the original BreakInBetween() solution;
- fixes compiler warning about signedness of i in nsSampleWordBreaker::NextWord();
- instead of directly assigning aBreakBefore elements in nsSampleWordBreaker::GetWordBreak(), only turns on the true elements and leave the others untouched (by using the OR operation), to really conform to what documented;
- does not assign aBreakBefore[aLen] by guarding the curEnd index.
Attachment #8492025 - Attachment is obsolete: true
Comment on attachment 8492232 [details] [diff] [review]
Fix more warnings & mochitest failure

This doesn't look better than previous one. https://tbpl.mozilla.org/?tree=Try&rev=c960bbf840e6
Attached patch Really fix mochitest failure (obsolete) — Splinter Review
OK. It must have worked by chance due to my debugging process.

It appears ClusterIterator::mWordBreaks[] must be calculated up to the position *next to* the last character, not just at the last one. (Hence the "i <= textLen", not "i < textLen", loop condition in the old code)

So I've adjusted the amount of calculated data as such. (The guard against aBreakBefore[aLen] writing in the last patch is also reverted.) It should really pass the mochitest now.
Attachment #8492232 - Attachment is obsolete: true
Comment on attachment 8492541 [details] [diff] [review]
Really fix mochitest failure

ok, i've pushed to try. https://tbpl.mozilla.org/?tree=Try&rev=d701648021c0

if that all comes back clean, you'll want to upload a new patch with hg metadata (incl your name, email, commit msg) as seen per https://developer.mozilla.org/en-US/docs/Creating_a_patch_that_can_be_checked_in and re-request roc for review again (for sure).
Attached patch Formatted patch for review (obsolete) — Splinter Review
Patch formatted for review & commit. Please review again. Thanks.
Attachment #8492541 - Attachment is obsolete: true
Attachment #8492627 - Flags: review?(roc)
Comment on attachment 8492627 [details] [diff] [review]
Formatted patch for review

Review of attachment 8492627 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsIWordBreaker.h
@@ +40,5 @@
> +   * Note that aBreakBefore[] is assumed to be initially filled with false.
> +   * Only word break positions are set to true, and the rest are untouched.
> +   *
> +   * Note also that aBreakBefore is assumed to be of size aLen + 1,
> +   * to include the position after the last character. */

Since aBreakBefore[aLen] is always set to true, I think it would be simpler to say that aBreakBefore has aLen elements. Then in   ClusterIterator::ClusterIterator set mWordBreaks[textStart + textLen] to true.

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +131,5 @@
>       }
>    }
>    if(kWbClassThaiLetter == c)
>    {
> +    nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

@@ +174,5 @@
>         break;
>    }
>    if(kWbClassThaiLetter == c1)
>    {
> +    nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

@@ +209,5 @@
> +        break;
> +    }
> +    if (kWbClassThaiLetter == c1)
> +    {
> +      nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

::: layout/generic/nsTextFrame.cpp
@@ +6837,5 @@
>      mFrag->AppendTo(str, textOffset, textLen);
>      aContext.Insert(str, 0);
>    }
>    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> +  nsTArray<bool> breakBefore;

Make this nsAutoTArray<bool,100>

@@ +6839,5 @@
>    }
>    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> +  nsTArray<bool> breakBefore;
> +  breakBefore.SetLength(aContext.Length() + 1);
> +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));

Call PodZero

@@ +6843,5 @@
> +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> +  wordBreaker->GetWordBreaks(aContext.get(), aContext.Length(),
> +                             breakBefore.Elements());
> +  memcpy(mWordBreaks.Elements(), breakBefore.Elements() + textStart,
> +         textLen + 1);

Call PodCopy
Attachment #8492627 - Flags: review?(roc) → review-
Updated patch after roc's review.

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #40)
> ::: intl/lwbrk/nsIWordBreaker.h
> @@ +40,5 @@
> > +   * Note that aBreakBefore[] is assumed to be initially filled with false.
> > +   * Only word break positions are set to true, and the rest are untouched.
> > +   *
> > +   * Note also that aBreakBefore is assumed to be of size aLen + 1,
> > +   * to include the position after the last character. */
> 
> Since aBreakBefore[aLen] is always set to true, I think it would be simpler
> to say that aBreakBefore has aLen elements. Then in  
> ClusterIterator::ClusterIterator set mWordBreaks[textStart + textLen] to
> true.

OK, done. But I think it should be mWordBreaks[textLen] instead.

> Make this nsAutoTArray<uint8_t,100>

Done in all the four commented places, plus another one in BreakInBetween().

> @@ +6839,5 @@
> >    }
> >    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> > +  nsTArray<bool> breakBefore;
> > +  breakBefore.SetLength(aContext.Length() + 1);
> > +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> 
> Call PodZero
> 
> @@ +6843,5 @@
> > +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> > +  wordBreaker->GetWordBreaks(aContext.get(), aContext.Length(),
> > +                             breakBefore.Elements());
> > +  memcpy(mWordBreaks.Elements(), breakBefore.Elements() + textStart,
> > +         textLen + 1);
> 
> Call PodCopy

Both are done. Should I do so to line 6822, too?:

  memset(mWordBreaks.Elements(), false, (textLen + 1)*sizeof(bool));

(In fact, there are other two places in PropertyProvider::GetHyphenationBreaks() at line 3233 and 3260, but I'm afraid that would add too much to the scope of the patch.)
Attachment #8492627 - Attachment is obsolete: true
Attachment #8492842 - Flags: review?(roc)
Comment on attachment 8492842 [details] [diff] [review]
Updated patch for review & commit

Review of attachment 8492842 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8492842 - Flags: review?(roc) → review+
Comment on attachment 8492842 [details] [diff] [review]
Updated patch for review & commit

Thep, m-oth still orange. https://tbpl.mozilla.org/?tree=Try&rev=b509f97ec344
Still, getting pretty close. :)
I have problem running both tests in my box (I built it outside source tree):

At top build dir:

$ $TOP_SRC_DIR/mach mochitest-plain test_eventemitter_basic.html
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain _tests/testing/mochitest/chrome/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain test_virtualcursor_text.html 
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain _tests/testing/mochitest/a11y/accessible/tests/mochitest/pivot/test_virtualcursor_text.html 
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

Any hint? Thanks.
I think the command you're looking for would probably be

$ $TOP_SRC_DIR/mach mochitest-plain $TOP_SRC_DIR/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

i.e. you should provide the path to the test file in the source tree.

(Or run

$ ./mach mochitest-plain toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

from the top of the SOURCE tree, assuming your mozconfig file tells it where to find the objdir; I think that's the usual/expected approach.)
(In reply to Jonathan Kew (:jfkthame) from comment #45)
> I think the command you're looking for would probably be
> 
> $ $TOP_SRC_DIR/mach mochitest-plain
> $TOP_SRC_DIR/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
> 
> i.e. you should provide the path to the test file in the source tree.
> 
> (Or run
> 
> $ ./mach mochitest-plain
> toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
> 
> from the top of the SOURCE tree, assuming your mozconfig file tells it where
> to find the objdir; I think that's the usual/expected approach.)

I did that, too, with no success. Note that I succesfully ran test_bug496275.html in the previous round. But the same method does not apply to the last two tests.
Oh, these aren't "plain" mochitests... for test_eventemitter_basic.html, you'd need to run mochitest-chrome, and for test_virtualcursor_text.html, it'd be mochitest-a11y.

Or take the easy way out - just use "mach mochitest <path-to-test-file>" and let mach figure out which flavor of test it is:

$ ./mach mochitest toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

$ ./mach mochitest accessible/tests/mochitest/pivot/test_virtualcursor_text.html
OK. I can run the mochitests successfully now. Thanks for your advice.

Unlike what's seen on tbpl, test_eventemitter_basic.html passes.
But test_virtualcursor_text.html fails alike. I'll check the latter, then.
Not related to the a11y mochitest failures, this patch fixes another problem I have found myself: word-wise backword movements in previous patches occasionally move too far. This was because ClusterIterator::NextCluster() sometimes moved past word break position without returning.

I have also added a mochitest for this bug. Note that the test will be failed if the change to ClusterIterator::NextCluster() is not applied.

I'd like to get the additional change reviewed before working further on the a11y mochitest failures.
Attachment #8495019 - Flags: review?(roc)
Having thought on it more thoroughly, now I think I have done it more correctly.

Here's the aContext contents and the text portion of interest
(C = context, T = text of interest):

- when aDirection > 0:

             |<--textLen-->|
  +----------+-------------+
  |CCCCCCCCCC|TTTTTTTTTTTTT|
  +----------+-------------+
              ^             ^
              textStart     textStart+textLen

- when aDirection < 0:

  |<--textLen-->|
  +-------------+----------+
  |TTTTTTTTTTTTT|CCCCCCCCCC|
  +-------------+----------+
   ^             ^
   textStart     textStart+textLen

- generic model:

             |<--textLen-->|
  +----------+-------------+----------+
  |CCCCCCCCCC|TTTTTTTTTTTTT|CCCCCCCCCC|
  +----------+-------------+----------+
              ^             ^
              textStart     textStart+textLen

GetWordBreaks() calculates word break positions on the total aContext string of length aContext.Length() (no additional slot at the end).

Then, we copy the resulting word break positions from textStart to textStart+textLen, i.e. of length textLen+1, into mWordBreaks.

Note that even though GetWordBreaks() only calculates at most aContext.Length() positions, we need to allocate aContext.Length()+1 slots for breakBefore array, so we can still safely copy textLen+1 slots when the text of interest is at the end of aContext. The default false value as initialized is in accordance to what the old code using BreakInBetween() did.

That's how it should really work. The previous guesses with true values on position 0 and textLen+1 were based on special cases, which don't always apply in general.

Please review again. Thanks.
Attachment #8496395 - Flags: review?(roc)
OK. In ClusterIterator c-tor, mWordBreaks[0] or mWordBreaks[textLen] may be initialized with true, and using PodCopy will overwrite it. To behave exactly the same as previous code, we need to OR the word break positions instead of assigning them.

This passes the two failed mochitests. I can't run the os-x mochitest, though. I'm using Linux.

Please review it again. Thanks.
Attachment #8496946 - Flags: review?(roc)
Comment on attachment 8496946 [details] [diff] [review]
Fix more mochitest failures

still seeing mochitest failure: https://tbpl.mozilla.org/?tree=Try&rev=eb160289a0b5 

test_wordboundary.html:
  498 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextBeforeOffset (word end): wrong text (got 'hello', expected: 'hello'), offset: 6, id: 'e2' ; - expected FAIL
  500 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextBeforeOffset (word end): wrong end offset(got '5', expected: '5'), offset: 6, id: 'e2' ; - expected FAIL
  601 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong text (got ' ', expected: ' '), offset: 5, id: 'e2' ; - expected FAIL
  602 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong start offset(got '5', expected: '5'), offset: 5, id: 'e2' ; - expected FAIL
  613 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong text (got ' ', expected: ' '), offset: 6, id: 'e2' ; - expected FAIL
  614 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong start offset(got '5', expected: '5'), offset: 6, id: 'e2' ; - expected FAIL
  715 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAfterOffset (word end): wrong text (got ' ', expected: ' '), offset: 5, id: 'e2' ; - expected FAIL
  716 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAfterOffset (word end): wrong start offset(got '5', expected: '5'), offset: 5, id: 'e2' ; - expected FAIL
https://tbpl.mozilla.org/php/getParsedLog.php?id=49156080&tree=Try


test_movement_by_words.html on OS X:
  156 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=true in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  257 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=true in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  358 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=false in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  459 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=false in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
https://tbpl.mozilla.org/php/getParsedLog.php?id=49174837&tree=Try

test_movement_by_words.html on B2G:
  328 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=true in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  429 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=true in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  530 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=false in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  631 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=false in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
https://tbpl.mozilla.org/php/getParsedLog.php?id=49169174&tree=Try

test_bug425915.html on B2G:
  157 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0d: Wrong anchor offset. - got 11, expected 8
  159 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0d: Wrong focus offset. - got 11, expected 8
  161 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0e: Wrong anchor offset. - got 16, expected 11
  163 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0e: Wrong focus offset. - got 16, expected 11
  165 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0f: Wrong anchor offset. - got 19, expected 16
  167 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0f: Wrong focus offset. - got 19, expected 16
  169 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0g: Wrong anchor offset. - got 28, expected 19
  171 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0g: Wrong focus offset. - got 28, expected 19
  173 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0h: Wrong anchor offset. - got 35, expected 28
  175 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0h: Wrong focus offset. - got 35, expected 28
  177 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0i: Wrong anchor offset. - got 40, expected 32
  179 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0i: Wrong focus offset. - got 40, expected 32
  181 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0j: Wrong anchor offset. - got 40, expected 35
  183 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0j: Wrong focus offset. - got 40, expected 35
  193 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1b: Wrong anchor offset. - got 29, expected 32
  195 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1b: Wrong focus offset. - got 29, expected 32
  197 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1c: Wrong anchor offset. - got 20, expected 29
  199 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1c: Wrong focus offset. - got 20, expected 29
  201 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1d: Wrong anchor offset. - got 17, expected 20
  203 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1d: Wrong focus offset. - got 17, expected 20
https://tbpl.mozilla.org/php/getParsedLog.php?id=49169174&tree=Try
I've been busy preparing for a conference right now. I'll be back to work on it soon after that.
Hi Theppitak!

I confirmed today with some Mozillians in Bangkok that this bug still exists. Do you have the time to update the patch to contemporary mozilla-central and trying again?

Thanks!

Dietrich
Flags: needinfo?(thep)
My machine that was used to build mozilla has been out of order. And I'm currently engaged with some job right now. So, I'm afraid I can't work on it yet.

Meanwhile, I'm happy if somebody could finish it.
How about we give Theppitak level one access so he can build and run tests on the mozilla infrastructure?

Theppitak, if you happy to finish this please follow https://www.mozilla.org/en-US/about/governance/policies/commit/.
Would this patch fix the issue shown here, where Google Docs is breaking at the wrong place?

https://i.imgur.com/gNiX5fm.png

Or is that a different bug?
(In reply to Dietrich Ayala (:dietrich) from comment #59)
> Would this patch fix the issue shown here, where Google Docs is breaking at
> the wrong place?
> 
> https://i.imgur.com/gNiX5fm.png
> 
> Or is that a different bug?

it's definitely different bug.
I got pinged on this bug by the thai community on our l10n workshop, and I've spent a few to see if I can get the patch to apply and build. The answer is yes, so I've uploaded the result to review-board and pushed to try.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c742f505f548cdfcf18d728872b52e861b3256cc

I know from local testing that the actual test in this patch fails now, but probably because we restructured stuff in test infra.

From testing at the event, the double-click selection works for thai, it doesn't work for lao, though.

I wonder if three years later, with icu in our code, we can actually implement this in a way that works for all lanuages with this problem?

Note, I won't be able to actually do anything useful beyond C++ nits, but this should give folks a head start.
... doesn't even builds on windows, in https://hg.mozilla.org/try/rev/7861406f8916b00b65a1bd737ca9566f74ac40dd#l2.144


15:07:23     INFO -  Warning: C4805 in z:\build\build\src\intl\lwbrk\nsSampleWordBreaker.cpp: '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation
15:07:23     INFO -  z:/build/build/src/intl/lwbrk/nsSampleWordBreaker.cpp(218): warning C4805: '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation

(In reply to Axel Hecht [:Pike] from comment #62)
> I wonder if three years later, with icu in our code, we can actually
> implement this in a way that works for all lanuages with this problem?

jfkthame, would you know?
Flags: needinfo?(jfkthame)
(In reply to Axel Hecht [:Pike] from comment #63)
> ... doesn't even builds on windows, in
> https://hg.mozilla.org/try/rev/7861406f8916b00b65a1bd737ca9566f74ac40dd#l2.
> 144
> 
> 
> 15:07:23     INFO -  Warning: C4805 in
> z:\build\build\src\intl\lwbrk\nsSampleWordBreaker.cpp: '|=': unsafe mix of
> type 'bool' and type 'unsigned char' in operation
> 15:07:23     INFO - 
> z:/build/build/src/intl/lwbrk/nsSampleWordBreaker.cpp(218): warning C4805:
> '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation

That should be easily fixable, e.g. by casting the bool to a uint8_t.

> (In reply to Axel Hecht [:Pike] from comment #62)
> > I wonder if three years later, with icu in our code, we can actually
> > implement this in a way that works for all lanuages with this problem?
> 
> jfkthame, would you know?

Well.... ICU has line- and word-breaking support, though currently I believe we exclude that from our build. So enabling it would increase our code (and data) size, as it'll pull in the Thai dictionary that is used to identify word boundaries. I'm not sure what the total size impact would be, but given that we're still waiting to shoehorn ICU into the Android build, I'd be hesitant to do it right now.

If we do (eventually) move to an ICU-backed solution, that should work for whatever languages ICU supports -- I think that includes several of the SEAsian languages that have similar issues. The same would be true, though, for a solution based on platform-native breaking (like we currently use for line-break): although the patch here specifically checks for Thai letters, it could perfectly well do the same for Lao, Khmer, Burmese, etc., and pass those through to the platform breaker as well; then we'd get support for whatever languages the host platform implements.

Eventually, a re-write of both line- and word-breaking to use ICU services consistently across all platforms (and eliminate platform-specific code here) would probably be a good thing. But until we tackle that, something along the lines of the patch here is probably the easiest interim solution.
Flags: needinfo?(jfkthame)
Depends on: 1423593

Our monthly meetings with the Thai community keeps bringing up this bug. This issue is a pretty major usability issue and hinders bringing new contributors to the community.

Word boundary detection is used for various text-related functions
in Firefox, notably caret keyboard navigation, mouse double-click
selection, and three-tap dictionary lookup on macOS. The current
"naive" word breaker use boundary between different scripts, which
would break languages such as Thai, Mandarin, and Japanese (mixed
Hiragana and Han character [Kanji]) into only large chuck of text.

This patch utilizes complex script services provided by platform
that are already being used by moz::intl::LineBreaker, adapting
similar code for moz::intl::WordBreaker.

I'd like to bump up this issue as it affects Mandarin and Japanese hiragana as well. Double-clicking, option+ arrow key caret movement, and most importantly, dictionary lookup, are all selecting the whole chunk of text.

Additional Scenarios for Mandarin and Japanese

Han characters generally doesn’t work well:

山林開放後|,|我們與動物的距離 (current behavior)
山林|開放|後|,|我們|與|動物|的|距離 (expected)

Japanese words are more complicated, as there could be Han and Hiragana in a single word (but not Katakana):

雨|と|気温上昇|で|雪解|けが|進|む (current behavior, grouping word by script type)
雨|と|気温|上昇|で|雪解け|が|進む (expected, native behavior in macOS)

Current Status

I’ve refreshed the patch on Phabricator, built and tested against macOS 10.14. It works for Thai examples provided above, but due to liberal line-breaking rules in East Asian scripts, NS_GetComplexLineBreaks basically means break-all.

Proposed Solution

Probably related to Bug 1275486. For complex scripts and Han/Hiragana characters, I think we need to pass different parameter to platform text engine, which in macOS’ case would be kCFStringTokenizerUnitWord instead of kCFStringTokenizerUnitSentence. This would probably be implemented in a function similar to NS_GetComplexLineBreaks (say, NS_GetComplexWordBoundary).

Thoughts?

Ideally, we would have a behavior that is (a) the same across platforms, (b) the same across browsers, and (c) something that we could standardize. How does what you propose fit these criteria?

... and I suppose it's also worth asking how it aligns with what CSS Text Level 3 specifies.

@dbaron Thanks for pointing these out.

My understanding is that LineBreaker is used to implement CSS word-break and line-break, hence it is properly documented and carefully distinguish approaches on different scripts. It is the class that calls into platform-specific language service for better line breaking (see intl/lwbrk/LineBreaker.cpp line 1072–84).

WordBreaker, however, is neither used in rendering, nor does it follow any web standard previously. A full grep in source shows that it is only used for caret, find in page, spellchecker, dictionary lookup, and other UI functions. Thus, it is important for them to be consistent with native platform behaviors. The previous suggestion is basically giving alphabets a fast track, and pass the rest to the system.

That's right, according to my understanding. The terminology is a bit confusing, because e.g. the CSS word-break property is one of the things that controls line breaking for layout, but that's not what Gecko's WordBreaker is for -- it provides word boundary analysis for various user-interaction purposes. I don't think CSS Text really has anything to say about this.

What might become relevant here is that there's a proposal to add a text segmentation API to JavaScript, which would provide word-boundary services (among other things). If this happens, it will probably involve reimplementing what WordBreaker does in a more thorough and better-standardized way, possibly based on ICU or a Rust implementation of UAX 29. If/when that happens, it's likely to supersede anything done here.

However, that's still under discussion and the way forward is not entirely clear. In the meantime, I think that if we can improve the behavior for languages such as Thai by means of a fairly straightforward patch to the existing code, this would be a valuable interim solution. It's unfortunate that Thep's work several years ago came so close to being ready to land, but we weren't able to get over the finish line at that time.

I've just pushed the current patch from phabricator to tryserver to see how things look across platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=68211de31fd686e359ce3147900ae4fe65420954.

@jfkthame Thank you for summarize it up. UAX 29 is a pretty good read, and from this note under 4.1.1:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

I'd say if we ever implement UAX 29, it would be a proper replacement for the current WordBreaker script guessing hack only. Unless we’re going full ICU with dictionaries, we still need to pass them to the platform just like you mentioned.

I’ve never used try server before! I’ll see if I could make Japanese text work, based from the current test results.

May related:

W3C Southeast Asian Layout task force (sealreq) is now developing document for complex text layout languages (Thai, Lao, Khmer) here https://github.com/w3c/sealreq/

Poren: I took your patch and made some minor fixes to resolve crashes/failures that showed up on tryserver, and then I have also updated it to handle the various Southeast Asian writing systems that do not use interword spaces (Khmer, Lao, etc) in addition to Thai, based on the Unicode script property of the characters. This seems to be working pretty well in my (limited) testing.

I omitted sending Han & Hiragana text through the complex breaker, as I'm not sure using line-break positions as word boundaries will work so well there; it tends to be too eager to break. Maybe that would still be better than the current behavior, but it's less clear to me. Really, we need to call a slightly different platform API (or integrate a real text segmentation component that knows about Chinese & Japanese). Anyhow, what I suggest is that in this bug we just target the SEAsian alphabets (like Thai) that do not have word spaces; if we have ideas for how to handle Japanese/Chinese better, we can do a followup bug about that.

If you'd like to test this version of the patch, and let me know of any further issues/suggestions, that would be really helpful - thanks!

Flags: needinfo?(ren.chiang)
Assignee: thep → jfkthame
Flags: needinfo?(thep)

Just to note: there are other aspects of the WordBreaker code that should really be updated, too -- e.g. it doesn't handle surrogate pairs, which means it won't properly recognize supplementary-plane characters. But for clarity, I haven't included that here; this patch just aims to address the specific issue of "borrowing" the complex line-breaker behavior for these SEAsian scripts, without disrupting anything else. Other fixes/improvements should be separate bugs (or may be superseded by a new text-segmentation component, if/when we adopt an ICU- or Rust-based implementation for that).

Attachment #9141047 - Attachment description: Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. → Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato
Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/73938a98d8fc
Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato
https://hg.mozilla.org/integration/autoland/rev/e6483e810796
Add Thai movement-by-word testcases. r=m_kato
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77

Nice to see this fixed, thanks so much Jonathan!

It's been ~10 years since Barcamp Bangkok, fond memories of working with local hackers there and helping to file and push forward these types of issues that matter so much to daily use of Firefox in Thai language :D

Jonathan: Thank you so much, glad to see it merged!

I finally had time to revisit the codebase (given the epidemic), and not until now did I realize why you mentioned the upcoming IntlSegmenter component. It totally makes sense now. 😅

I'll start a new bug to track possible mitigations for CJK scripts, as IntlSegmenter probably won’t land for at least half a year? Bug 345823 seems stale.

Flags: needinfo?(ren.chiang)
Regressions: 1672269
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: