425915 - word boundary detection for Thai text

Reporter

Description

•

16 years ago

Word boundary detection is needed for various text-related functions in Firefox,
including:
- navigation/selection/editing by keyboard: caret moving by word
- selection by mouse: double-clicking on text should select word (not the whole text between whitespaces)


Navigation by keyboard case:

input text:  ทดสอบการทำงาน ภาษาไทย-ลาว

expected caret stops (estimated), from left to right: ทดสอบ|การ|ทำงาน| ภาษา|ไทย-|ลาว|

current caret stops, from left to right: ทดสอบการทำงาน| ภาษาไทย-|ลาว|

| = caret position



Selection by mouse double-clicking case:

text:  ทดสอบการทำงาน

double-click any point over "ทดสอบ" text should highlight the whole word "ทดสอบ" and not others*


*also depends on dictionary, if we use dictionary-based approach for word boundary detection.

found on Mac OS X, Linux, Windows

Arthit Suriyawongkul

Reporter

Updated

•

16 years ago

Blocks: thai

Severity: normal → enhancement

Keng Susumpow

Comment 1

•

16 years ago

FYI.

Safari 3.1.2 on Mac OS X has no problem on this test case, both keyboard and mouse selection.

timeless

Updated

•

16 years ago

Component: Layout: CTL → Layout: Text

QA Contact: arthit → layout.fonts-and-text

Dietrich Ayala (:dietrich)

Comment 2

•

14 years ago

Prabhat, are you still working on this?

Can someone test this case with Firefox 4, and confirm the problem still exists?

Assignee: prabhat.hegde → nobody

HudchewMan

Comment 3

•

14 years ago

Windows XP / Firefox 4 beta 6

Still work not correct

Chakrit Wichian

Comment 4

•

14 years ago

Still exists on 4.0b6 Ubuntu 10.04

Kamonpop Jarujit

Comment 5

•

14 years ago

Firefox 4.0b8pre on Ubuntu 10.10 has problem on this case.

goffity

Comment 6

•

14 years ago

goffity

Still exists Firefox 3.6.11 Ubuntu 10.04

tonhor

Comment 7

•

14 years ago

TonHor ,, i found this bug ,too

Firefox 3.6.11 on Windows 7 in this test case

Pongsak Sarapukdee

Comment 8

•

14 years ago

I use Firefox 4 beta on Window 7.
this bug not fix.

Dietrich Ayala (:dietrich)

Comment 9

•

14 years ago

The people here at Barcamp Bangkok are saying this bug is a major problem in daily-usage of Firefox. Enhancement -> Major.

Severity: enhancement → major

1412

Comment 10

•

14 years ago

Still exists Firefox 3.6.11 Windows 7

Dietrich Ayala (:dietrich)

Comment 11

•

14 years ago

David or Masayuki, do you know where in the codebase we'd need to change in order to fix this? A contributor has a Thai word-boundary detection algorithm he's implemented in Java, and is interested in converting it to something we can use.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 12

•

14 years ago

In the code in intl/lwbrk/src/, there is integration with platform-specific line-breaking code for line breaking in certain languagse.  See the ifdefs in the Makefile.in in that directory that choose an appropriate platform implementation of NS_GetComplexLineBreaks.

However, it sounds like this bug isn't really about line breaking, but rather about selection movement and extension; I think the relevant codepath may be that code in nsSelection.cpp calls nsIFrame::PeekOffset which in turn calls nsTextFrame::PeekOffsetWord.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 13

•

14 years ago

I'd also note that we probably don't want more than one way of doing Thai word boundary detection; we already have a setup for relying on the platform for it for line breaking, and we should probably use the same code here.  (If it needs improvement or replacement, that should probably be a separate bug.)

Dietrich Ayala (:dietrich)

Comment 14

•

14 years ago

Arthit said earlier today that libthai has code for word boundary detection, but that we're not using it. Maybe the solution is to have the relevant selection code use libthai instead of whatever it's using now?

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 15

•

14 years ago

Isn't using nsIWordBreaker?
http://mxr.mozilla.org/mozilla-central/source/layout/generic/nsTextFrameThebes.cpp#5705

Its implementation is here:
http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/nsSampleWordBreaker.cpp

I guess that you need to use some native APIs on each platform like line breaker for such complex languages. See bug 336959 and https://wiki.mozilla.org/Gecko:Line_Breaking#Complex_scripts

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 16

•

14 years ago

roc may have some more advice.

Dietrich Ayala (:dietrich)

Comment 17

•

14 years ago

Cc'ing Theppitak, who wrote the linebreaking code for Thai. Perhaps he knows the best way forward here, or would be willing to write a patch for this?

Theppitak Karoonboonyanan

Comment 18

•

10 years ago

(In reply to Masayuki Nakano (:masayuki) (Mozilla Japan) (away: 9/6 - 9/10, JST) from comment #15)
> Isn't using nsIWordBreaker?
> http://mxr.mozilla.org/mozilla-central/source/layout/generic/
> nsTextFrameThebes.cpp#5705
> 
> Its implementation is here:
> http://mxr.mozilla.org/mozilla-central/source/intl/lwbrk/src/
> nsSampleWordBreaker.cpp

Having tried some quick change for verification, I can confirm that it is
the right place to do (with the former one moved to
layout/generic/nsTextFrame.cpp, line 6827). 

However, the current interface is obviously inefficient for the costly Thai
word analysis:

layout/generic/nsTextFrame.cpp, line 6823:
---8<---
  nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
  for (int32_t i = 0; i <= textLen; ++i) {
    int32_t indexInText = i + textStart;
    mWordBreaks[i] |=
      wordBreaker->BreakInBetween(aContext.get(), indexInText,
                                  aContext.get() + indexInText,
                                  aContext.Length() - indexInText);
  }
---8<---

The other methods, FindWord() and NextWord(), look more practical, although
something like nsJISx4051LineBreaker::GetJISx4051Breaks() is the most
optimal.

> I guess that you need to use some native APIs on each platform like line
> breaker for such complex languages. See bug 336959 and
> https://wiki.mozilla.org/Gecko:Line_Breaking#Complex_scripts

Yes, we can even call the existing NS_GetComplexLineBreaks(), as there is
no significant difference between line break and word break for Thai.

Theppitak Karoonboonyanan

Comment 19

•

10 years ago

Attached patch First patch, using NextWord() method instead of BreakInBetween() (obsolete) — Details — Splinter Review

This first patch uses NextWord() instead of BreakInBetween(), and only patch the NextWord() method, not the others. Just for further discussion.

Dietrich Ayala (:dietrich)

Comment 20

•

10 years ago

Thanks Theppitak!

Masayuki, are you the right person to take a look at Theppitak's approach in this patch?

Flags: needinfo?(masayuki)

Theppitak Karoonboonyanan

Comment 21

•

10 years ago

Attached patch Updated patch, with all methods implemented (obsolete) — Details — Splinter Review

This patch implements Thai word break in all methods. All are tested with ad hoc modifications on ClusterIterator c-tor. (All three methods are possible to use. Only NextWord() is chosen in the proposed patch.)

Attachment #8485374 - Attachment is obsolete: true

Theppitak Karoonboonyanan

Comment 22

•

10 years ago

Comment on attachment 8485802 [details] [diff] [review]
Updated patch, with all methods implemented

Requesting for patch review.

Attachment #8485802 - Flags: review?(roc)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 23

•

10 years ago

(In reply to Dietrich Ayala (:dietrich) from comment #20)
> Thanks Theppitak!
> 
> Masayuki, are you the right person to take a look at Theppitak's approach in
> this patch?

I'm not so familiar around the complex script part. However, if nobody will review it, I'll try.

Theppitack:

Please create diff files with |-U 8 -p|. That makes patches easy to review. Anyway, thank you for your work.

# And sorry for the delay to reply due to my vacation and Japanese vacation season.

Flags: needinfo?(masayuki)

Theppitak Karoonboonyanan

Comment 24

•

10 years ago

Attached patch Reformatted patch (obsolete) — Details — Splinter Review

I've reformatted the patch as Masayuki suggested. Please review it. Thanks.

Attachment #8485802 - Attachment is obsolete: true

Attachment #8485802 - Flags: review?(roc)

Attachment #8489813 - Flags: review?(masayuki)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 25

•

10 years ago

Comment on attachment 8489813 [details] [diff] [review]
Reformatted patch

Roc or smontague is better reviewer. If they are too busy, I'll try, though.

Attachment #8489813 - Flags: review?(masayuki) → review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 26

•

10 years ago

Comment on attachment 8489813 [details] [diff] [review]
Reformatted patch

Review of attachment 8489813 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +30,5 @@
> +  uint8_t c2 = this->GetClass(aText2[0]);
> +  if (kWbClassThaiLetter == c1 && kWbClassThaiLetter == c2) {
> +    nsAutoString text(aText1, aTextLen1);
> +    text.Append(aText2, aTextLen2);
> +    uint8_t *breakBefore = new uint8_t[aTextLen1 + aTextLen2];

Use
  nsTArray<uint8_t> breakBefore;
  breakBefore.SetLength(aTextLen1 + aTextLen2);
then pass breakBefore.Elements(). The implicit destructor will clean up for you.

You could make it an nsAutoTArray to avoid some heap allocations.

@@ +130,5 @@
>       }
>    }
>    if(kWbClassThaiLetter == c)
>    {
> +    uint8_t *breakBefore = new uint8_t[range.mEnd - range.mBegin];

Same here.

@@ +169,5 @@
>         break;
>    }
>    if(kWbClassThaiLetter == c1)
>    {
> +    uint8_t *breakBefore = new uint8_t[aLen - aPos];

and here.

@@ +177,5 @@
> +      i++;
> +    }
> +    delete[] breakBefore;
> +    if (i < cur - aPos)
> +      return aPos + i;

This still isn't a very good interface. It's quadratic in the number of words in the text.

Seems to me we need a new method in nsIWordBreaker that does exactly what the loop in nsTextFrame does: fills in a word-breaks array for the entire string. Which, conveniently, is what NS_GetComplexLineBreaks does.

Can you do that? If not, needinfo me and I'll try to find someone else to help.

Attachment #8489813 - Flags: review?(roc) → review-

Theppitak Karoonboonyanan

Comment 27

•

10 years ago

Attached patch Patch using nsTArray (obsolete) — Details — Splinter Review

This patch uses nsTArray as roc suggested. It looks cleaner now.

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #26)
> Seems to me we need a new method in nsIWordBreaker that does exactly what
> the loop in nsTextFrame does: fills in a word-breaks array for the entire
> string. Which, conveniently, is what NS_GetComplexLineBreaks does.
> 
> Can you do that? If not, needinfo me and I'll try to find someone else to
> help.

I'll try to do it first. Here's what I think needed to do:
- Add a new method GetWordBreaks() to nsIWordBreaker (with uuid regenerated).
- Add the method in nsSampleWordBreaker and implement it.
- Modify the ClusterIterator c-tor to call the new method instead.

Attachment #8489813 - Attachment is obsolete: true

Theppitak Karoonboonyanan

Comment 28

•

10 years ago

Attached patch Patch with new GetWordBreaks() method — Details — Splinter Review

This patch now adds the new nsIWordBreaker::GetWordBreaks() method, implements it in nsSampleWordBreaker, and modifies ClusterIterator to use it.

Attachment #8490581 - Attachment is obsolete: true

Attachment #8490656 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 29

•

10 years ago

Comment on attachment 8490656 [details] [diff] [review]
Patch with new GetWordBreaks() method

Review of attachment 8490656 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsIWordBreaker.h
@@ +33,5 @@
>                                 uint32_t aOffset) = 0;
>    virtual int32_t NextWord(const char16_t* aText, uint32_t aLen, 
>                             uint32_t aPos) = 0;
> +  virtual void GetWordBreaks(const char16_t* aText, uint32_t aLen,
> +                             bool* aBreakBefore) = 0;

Document that some elements of aBreakBefore are set to true and the rest are untouched.

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +202,5 @@
> +  {
> +    for (curEnd = curBegin + 1; curEnd < aLen; curEnd++)
> +    {
> +      c2 = this->GetClass(aText[curEnd]);
> +      if (c2 != c1) 

remove trailing whitespace

@@ +214,5 @@
> +                              breakBefore.Elements());
> +      for (uint32_t i = curBegin + 1; i < curEnd; i++)
> +      { 
> +        aBreakBefore[i] = breakBefore[i - curBegin];
> +      } 

remove trailing whitespace

Attachment #8490656 - Flags: review?(roc) → review+

Theppitak Karoonboonyanan

Comment 30

•

10 years ago

Attached patch Edited patch — Details — Splinter Review

Edited patch according to roc's comments.

Attachment #8490729 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

10 years ago

Attachment #8490729 - Flags: review?(roc) → review+

Ekanan Ketunuti

Comment 31

•

10 years ago

build busted and test failures: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=cf586bd6a24e

Thep, could you please take a look?

Assignee: nobody → thep

Status: NEW → ASSIGNED

Theppitak Karoonboonyanan

Comment 32

•

10 years ago

(In reply to Ekanan Ketunuti from comment #31)
> build busted and test failures:
> https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=cf586bd6a24e
> 
> Thep, could you please take a look?

Oh, I've ignored textStart value in ClusterItarator c-tor in the latest patch. Fixing soon.

Theppitak Karoonboonyanan

Comment 33

•

10 years ago

Attached patch Correcting build & test failures (obsolete) — Details — Splinter Review

This patch fixes:
- The incorrect filling of word breaks without textStart concerned, which could cause wrong behavior when aDirection < 0.
- The missing inclusion of nsTArray.h

test_bug496275.html now passes in my box. I'm not sure about others.

Ekanan Ketunuti

Comment 34

•

10 years ago

Try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=cbe97091d49a

Theppitak Karoonboonyanan

Comment 35

•

10 years ago

Attached patch Fix more warnings & mochitest failure (obsolete) — Details — Splinter Review

I was wrong when saying that test_bug496275.html was passed in my box. I just forgot how to really run mochitest when saying so.

This patch:
- fixes test_bug496275.html failure, the difference is that aBreakBefore[0] was missed compared to the original BreakInBetween() solution;
- fixes compiler warning about signedness of i in nsSampleWordBreaker::NextWord();
- instead of directly assigning aBreakBefore elements in nsSampleWordBreaker::GetWordBreak(), only turns on the true elements and leave the others untouched (by using the OR operation), to really conform to what documented;
- does not assign aBreakBefore[aLen] by guarding the curEnd index.

Attachment #8492025 - Attachment is obsolete: true

Ekanan Ketunuti

Comment 36

•

10 years ago

Comment on attachment 8492232 [details] [diff] [review]
Fix more warnings & mochitest failure

This doesn't look better than previous one. https://tbpl.mozilla.org/?tree=Try&rev=c960bbf840e6

Theppitak Karoonboonyanan

Comment 37

•

10 years ago

Attached patch Really fix mochitest failure (obsolete) — Details — Splinter Review

OK. It must have worked by chance due to my debugging process.

It appears ClusterIterator::mWordBreaks[] must be calculated up to the position *next to* the last character, not just at the last one. (Hence the "i <= textLen", not "i < textLen", loop condition in the old code)

So I've adjusted the amount of calculated data as such. (The guard against aBreakBefore[aLen] writing in the last patch is also reverted.) It should really pass the mochitest now.

Attachment #8492232 - Attachment is obsolete: true

Ekanan Ketunuti

Comment 38

•

10 years ago

Comment on attachment 8492541 [details] [diff] [review]
Really fix mochitest failure

ok, i've pushed to try. https://tbpl.mozilla.org/?tree=Try&rev=d701648021c0

if that all comes back clean, you'll want to upload a new patch with hg metadata (incl your name, email, commit msg) as seen per https://developer.mozilla.org/en-US/docs/Creating_a_patch_that_can_be_checked_in and re-request roc for review again (for sure).

Theppitak Karoonboonyanan

Comment 39

•

10 years ago

Attached patch Formatted patch for review (obsolete) — Details — Splinter Review

Patch formatted for review & commit. Please review again. Thanks.

Attachment #8492541 - Attachment is obsolete: true

Attachment #8492627 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 40

•

10 years ago

Comment on attachment 8492627 [details] [diff] [review]
Formatted patch for review

Review of attachment 8492627 [details] [diff] [review]:
-----------------------------------------------------------------

::: intl/lwbrk/nsIWordBreaker.h
@@ +40,5 @@
> +   * Note that aBreakBefore[] is assumed to be initially filled with false.
> +   * Only word break positions are set to true, and the rest are untouched.
> +   *
> +   * Note also that aBreakBefore is assumed to be of size aLen + 1,
> +   * to include the position after the last character. */

Since aBreakBefore[aLen] is always set to true, I think it would be simpler to say that aBreakBefore has aLen elements. Then in   ClusterIterator::ClusterIterator set mWordBreaks[textStart + textLen] to true.

::: intl/lwbrk/nsSampleWordBreaker.cpp
@@ +131,5 @@
>       }
>    }
>    if(kWbClassThaiLetter == c)
>    {
> +    nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

@@ +174,5 @@
>         break;
>    }
>    if(kWbClassThaiLetter == c1)
>    {
> +    nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

@@ +209,5 @@
> +        break;
> +    }
> +    if (kWbClassThaiLetter == c1)
> +    {
> +      nsTArray<uint8_t> breakBefore;

Make this nsAutoTArray<uint8_t,100>

::: layout/generic/nsTextFrame.cpp
@@ +6837,5 @@
>      mFrag->AppendTo(str, textOffset, textLen);
>      aContext.Insert(str, 0);
>    }
>    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> +  nsTArray<bool> breakBefore;

Make this nsAutoTArray<bool,100>

@@ +6839,5 @@
>    }
>    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> +  nsTArray<bool> breakBefore;
> +  breakBefore.SetLength(aContext.Length() + 1);
> +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));

Call PodZero

@@ +6843,5 @@
> +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> +  wordBreaker->GetWordBreaks(aContext.get(), aContext.Length(),
> +                             breakBefore.Elements());
> +  memcpy(mWordBreaks.Elements(), breakBefore.Elements() + textStart,
> +         textLen + 1);

Call PodCopy

Attachment #8492627 - Flags: review?(roc) → review-

Theppitak Karoonboonyanan

Comment 41

•

10 years ago

Attached patch Updated patch for review & commit — Details — Splinter Review

Updated patch after roc's review.

(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #40)
> ::: intl/lwbrk/nsIWordBreaker.h
> @@ +40,5 @@
> > +   * Note that aBreakBefore[] is assumed to be initially filled with false.
> > +   * Only word break positions are set to true, and the rest are untouched.
> > +   *
> > +   * Note also that aBreakBefore is assumed to be of size aLen + 1,
> > +   * to include the position after the last character. */
> 
> Since aBreakBefore[aLen] is always set to true, I think it would be simpler
> to say that aBreakBefore has aLen elements. Then in  
> ClusterIterator::ClusterIterator set mWordBreaks[textStart + textLen] to
> true.

OK, done. But I think it should be mWordBreaks[textLen] instead.

> Make this nsAutoTArray<uint8_t,100>

Done in all the four commented places, plus another one in BreakInBetween().

> @@ +6839,5 @@
> >    }
> >    nsIWordBreaker* wordBreaker = nsContentUtils::WordBreaker();
> > +  nsTArray<bool> breakBefore;
> > +  breakBefore.SetLength(aContext.Length() + 1);
> > +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> 
> Call PodZero
> 
> @@ +6843,5 @@
> > +  memset(breakBefore.Elements(), false, breakBefore.Length()*sizeof(bool));
> > +  wordBreaker->GetWordBreaks(aContext.get(), aContext.Length(),
> > +                             breakBefore.Elements());
> > +  memcpy(mWordBreaks.Elements(), breakBefore.Elements() + textStart,
> > +         textLen + 1);
> 
> Call PodCopy

Both are done. Should I do so to line 6822, too?:

  memset(mWordBreaks.Elements(), false, (textLen + 1)*sizeof(bool));

(In fact, there are other two places in PropertyProvider::GetHyphenationBreaks() at line 3233 and 3260, but I'm afraid that would add too much to the scope of the patch.)

Attachment #8492627 - Attachment is obsolete: true

Attachment #8492842 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 42

•

10 years ago

Comment on attachment 8492842 [details] [diff] [review]
Updated patch for review & commit

Review of attachment 8492842 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

Attachment #8492842 - Flags: review?(roc) → review+

Ekanan Ketunuti

Comment 43

•

10 years ago

Comment on attachment 8492842 [details] [diff] [review]
Updated patch for review & commit

Thep, m-oth still orange. https://tbpl.mozilla.org/?tree=Try&rev=b509f97ec344
Still, getting pretty close. :)

Theppitak Karoonboonyanan

Comment 44

•

10 years ago

I have problem running both tests in my box (I built it outside source tree):

At top build dir:

$ $TOP_SRC_DIR/mach mochitest-plain test_eventemitter_basic.html
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain _tests/testing/mochitest/chrome/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain test_virtualcursor_text.html 
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

$ $TOP_SRC_DIR/mach mochitest-plain _tests/testing/mochitest/a11y/accessible/tests/mochitest/pivot/test_virtualcursor_text.html 
From _tests: Kept 20139 existing; Added/updated 0; Removed 0 files and 0 directories.
No tests could be found in the path specified. Please specify a path that is a test file or is a directory containing tests.

Any hint? Thanks.

Jonathan Kew [:jfkthame]

Assignee

Comment 45

•

10 years ago

I think the command you're looking for would probably be

$ $TOP_SRC_DIR/mach mochitest-plain $TOP_SRC_DIR/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

i.e. you should provide the path to the test file in the source tree.

(Or run

$ ./mach mochitest-plain toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

from the top of the SOURCE tree, assuming your mozconfig file tells it where to find the objdir; I think that's the usual/expected approach.)

Theppitak Karoonboonyanan

Comment 46

•

10 years ago

(In reply to Jonathan Kew (:jfkthame) from comment #45)
> I think the command you're looking for would probably be
> 
> $ $TOP_SRC_DIR/mach mochitest-plain
> $TOP_SRC_DIR/toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
> 
> i.e. you should provide the path to the test file in the source tree.
> 
> (Or run
> 
> $ ./mach mochitest-plain
> toolkit/devtools/tests/mochitest/test_eventemitter_basic.html
> 
> from the top of the SOURCE tree, assuming your mozconfig file tells it where
> to find the objdir; I think that's the usual/expected approach.)

I did that, too, with no success. Note that I succesfully ran test_bug496275.html in the previous round. But the same method does not apply to the last two tests.

Jonathan Kew [:jfkthame]

Assignee

Comment 47

•

10 years ago

Oh, these aren't "plain" mochitests... for test_eventemitter_basic.html, you'd need to run mochitest-chrome, and for test_virtualcursor_text.html, it'd be mochitest-a11y.

Or take the easy way out - just use "mach mochitest <path-to-test-file>" and let mach figure out which flavor of test it is:

$ ./mach mochitest toolkit/devtools/tests/mochitest/test_eventemitter_basic.html

$ ./mach mochitest accessible/tests/mochitest/pivot/test_virtualcursor_text.html

Theppitak Karoonboonyanan

Comment 48

•

10 years ago

OK. I can run the mochitests successfully now. Thanks for your advice.

Unlike what's seen on tbpl, test_eventemitter_basic.html passes.
But test_virtualcursor_text.html fails alike. I'll check the latter, then.

Theppitak Karoonboonyanan

Comment 49

•

10 years ago

Attached patch Fix word-back behavior — Details — Splinter Review

Not related to the a11y mochitest failures, this patch fixes another problem I have found myself: word-wise backword movements in previous patches occasionally move too far. This was because ClusterIterator::NextCluster() sometimes moved past word break position without returning.

I have also added a mochitest for this bug. Note that the test will be failed if the change to ClusterIterator::NextCluster() is not applied.

I'd like to get the additional change reviewed before working further on the a11y mochitest failures.

Attachment #8495019 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

10 years ago

Attachment #8495019 - Flags: review?(roc) → review+

Theppitak Karoonboonyanan

Comment 50

•

10 years ago

Attached patch Fix a11y mochitest — Details — Splinter Review

Having thought on it more thoroughly, now I think I have done it more correctly.

Here's the aContext contents and the text portion of interest
(C = context, T = text of interest):

- when aDirection > 0:

             |<--textLen-->|
  +----------+-------------+
  |CCCCCCCCCC|TTTTTTTTTTTTT|
  +----------+-------------+
              ^             ^
              textStart     textStart+textLen

- when aDirection < 0:

  |<--textLen-->|
  +-------------+----------+
  |TTTTTTTTTTTTT|CCCCCCCCCC|
  +-------------+----------+
   ^             ^
   textStart     textStart+textLen

- generic model:

             |<--textLen-->|
  +----------+-------------+----------+
  |CCCCCCCCCC|TTTTTTTTTTTTT|CCCCCCCCCC|
  +----------+-------------+----------+
              ^             ^
              textStart     textStart+textLen

GetWordBreaks() calculates word break positions on the total aContext string of length aContext.Length() (no additional slot at the end).

Then, we copy the resulting word break positions from textStart to textStart+textLen, i.e. of length textLen+1, into mWordBreaks.

Note that even though GetWordBreaks() only calculates at most aContext.Length() positions, we need to allocate aContext.Length()+1 slots for breakBefore array, so we can still safely copy textLen+1 slots when the text of interest is at the end of aContext. The default false value as initialized is in accordance to what the old code using BreakInBetween() did.

That's how it should really work. The previous guesses with true values on position 0 and textLen+1 were based on special cases, which don't always apply in general.

Please review again. Thanks.

Attachment #8496395 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

10 years ago

Attachment #8496395 - Flags: review?(roc) → review+

Comment hidden (obsolete)

Try: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b7bf18766a90

Ekanan Ketunuti

Comment 52

•

10 years ago

https://tbpl.mozilla.org/?tree=Try&rev=b7bf18766a90
And now you've got
  mochitest-oth perma-fails: https://tbpl.mozilla.org/php/getParsedLog.php?id=49036358&tree=Try
  os-x mochitest-5 fails: https://tbpl.mozilla.org/php/getParsedLog.php?id=49036804&tree=Try

Theppitak Karoonboonyanan

Comment 53

•

10 years ago

Attached patch Fix more mochitest failures — Details — Splinter Review

OK. In ClusterIterator c-tor, mWordBreaks[0] or mWordBreaks[textLen] may be initialized with true, and using PodCopy will overwrite it. To behave exactly the same as previous code, we need to OR the word break positions instead of assigning them.

This passes the two failed mochitests. I can't run the os-x mochitest, though. I'm using Linux.

Please review it again. Thanks.

Attachment #8496946 - Flags: review?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

10 years ago

Attachment #8496946 - Flags: review?(roc) → review+

Ekanan Ketunuti

Comment 54

•

10 years ago

Comment on attachment 8496946 [details] [diff] [review]
Fix more mochitest failures

still seeing mochitest failure: https://tbpl.mozilla.org/?tree=Try&rev=eb160289a0b5 

test_wordboundary.html:
  498 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextBeforeOffset (word end): wrong text (got 'hello', expected: 'hello'), offset: 6, id: 'e2' ; - expected FAIL
  500 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextBeforeOffset (word end): wrong end offset(got '5', expected: '5'), offset: 6, id: 'e2' ; - expected FAIL
  601 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong text (got ' ', expected: ' '), offset: 5, id: 'e2' ; - expected FAIL
  602 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong start offset(got '5', expected: '5'), offset: 5, id: 'e2' ; - expected FAIL
  613 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong text (got ' ', expected: ' '), offset: 6, id: 'e2' ; - expected FAIL
  614 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAtOffset (word end): wrong start offset(got '5', expected: '5'), offset: 6, id: 'e2' ; - expected FAIL
  715 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAfterOffset (word end): wrong text (got ' ', expected: ' '), offset: 5, id: 'e2' ; - expected FAIL
  716 INFO TEST-UNEXPECTED-PASS | chrome://mochitests/content/a11y/accessible/tests/mochitest/text/test_wordboundary.html | getTextAfterOffset (word end): wrong start offset(got '5', expected: '5'), offset: 5, id: 'e2' ; - expected FAIL
https://tbpl.mozilla.org/php/getParsedLog.php?id=49156080&tree=Try


test_movement_by_words.html on OS X:
  156 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=true in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  257 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=true in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  358 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=false in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  459 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=false in "<b>Log out</b> roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
https://tbpl.mozilla.org/php/getParsedLog.php?id=49174837&tree=Try

test_movement_by_words.html on B2G:
  328 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=true in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  429 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=true in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  530 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=false, stopAtPunctuation=false in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
  631 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_movement_by_words.html | Left movement broken with eatSpace=true, stopAtPunctuation=false in "<b>Log out</b>  roc"; sel.anchorNode.parentNode=[object HTMLDivElement] - got 2, expected 1
https://tbpl.mozilla.org/php/getParsedLog.php?id=49169174&tree=Try

test_bug425915.html on B2G:
  157 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0d: Wrong anchor offset. - got 11, expected 8
  159 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0d: Wrong focus offset. - got 11, expected 8
  161 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0e: Wrong anchor offset. - got 16, expected 11
  163 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0e: Wrong focus offset. - got 16, expected 11
  165 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0f: Wrong anchor offset. - got 19, expected 16
  167 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0f: Wrong focus offset. - got 19, expected 16
  169 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0g: Wrong anchor offset. - got 28, expected 19
  171 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0g: Wrong focus offset. - got 28, expected 19
  173 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0h: Wrong anchor offset. - got 35, expected 28
  175 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0h: Wrong focus offset. - got 35, expected 28
  177 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0i: Wrong anchor offset. - got 40, expected 32
  179 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0i: Wrong focus offset. - got 40, expected 32
  181 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0j: Wrong anchor offset. - got 40, expected 35
  183 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 0j: Wrong focus offset. - got 40, expected 35
  193 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1b: Wrong anchor offset. - got 29, expected 32
  195 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1b: Wrong focus offset. - got 29, expected 32
  197 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1c: Wrong anchor offset. - got 20, expected 29
  199 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1c: Wrong focus offset. - got 20, expected 29
  201 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1d: Wrong anchor offset. - got 17, expected 20
  203 INFO TEST-UNEXPECTED-FAIL | /tests/layout/generic/test/test_bug425915.html | test 1d: Wrong focus offset. - got 17, expected 20
https://tbpl.mozilla.org/php/getParsedLog.php?id=49169174&tree=Try

Theppitak Karoonboonyanan

Comment 55

•

10 years ago

I've been busy preparing for a conference right now. I'll be back to work on it soon after that.

Dietrich Ayala (:dietrich)

Comment 56

•

7 years ago

Hi Theppitak!

I confirmed today with some Mozillians in Bangkok that this bug still exists. Do you have the time to update the patch to contemporary mozilla-central and trying again?

Thanks!

Dietrich

Flags: needinfo?(thep)

Theppitak Karoonboonyanan

Comment 57

•

7 years ago

My machine that was used to build mozilla has been out of order. And I'm currently engaged with some job right now. So, I'm afraid I can't work on it yet.

Meanwhile, I'm happy if somebody could finish it.

Ekanan Ketunuti

Comment 58

•

7 years ago

How about we give Theppitak level one access so he can build and run tests on the mozilla infrastructure?

Theppitak, if you happy to finish this please follow https://www.mozilla.org/en-US/about/governance/policies/commit/.

Dietrich Ayala (:dietrich)

Comment 59

•

7 years ago

Would this patch fix the issue shown here, where Google Docs is breaking at the wrong place?

https://i.imgur.com/gNiX5fm.png

Or is that a different bug?

Ekanan Ketunuti

Comment 60

•

7 years ago

(In reply to Dietrich Ayala (:dietrich) from comment #59)
> Would this patch fix the issue shown here, where Google Docs is breaking at
> the wrong place?
> 
> https://i.imgur.com/gNiX5fm.png
> 
> Or is that a different bug?

it's definitely different bug.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/132664/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/132664/

Axel Hecht [:Pike]

Comment 62

•

7 years ago

I got pinged on this bug by the thai community on our l10n workshop, and I've spent a few to see if I can get the patch to apply and build. The answer is yes, so I've uploaded the result to review-board and pushed to try.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c742f505f548cdfcf18d728872b52e861b3256cc

I know from local testing that the actual test in this patch fails now, but probably because we restructured stuff in test infra.

From testing at the event, the double-click selection works for thai, it doesn't work for lao, though.

I wonder if three years later, with icu in our code, we can actually implement this in a way that works for all lanuages with this problem?

Note, I won't be able to actually do anything useful beyond C++ nits, but this should give folks a head start.

Axel Hecht [:Pike]

Comment 63

•

7 years ago

... doesn't even builds on windows, in https://hg.mozilla.org/try/rev/7861406f8916b00b65a1bd737ca9566f74ac40dd#l2.144


15:07:23     INFO -  Warning: C4805 in z:\build\build\src\intl\lwbrk\nsSampleWordBreaker.cpp: '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation
15:07:23     INFO -  z:/build/build/src/intl/lwbrk/nsSampleWordBreaker.cpp(218): warning C4805: '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation

(In reply to Axel Hecht [:Pike] from comment #62)
> I wonder if three years later, with icu in our code, we can actually
> implement this in a way that works for all lanuages with this problem?

jfkthame, would you know?

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Assignee

Comment 64

•

7 years ago

(In reply to Axel Hecht [:Pike] from comment #63)
> ... doesn't even builds on windows, in
> https://hg.mozilla.org/try/rev/7861406f8916b00b65a1bd737ca9566f74ac40dd#l2.
> 144
> 
> 
> 15:07:23     INFO -  Warning: C4805 in
> z:\build\build\src\intl\lwbrk\nsSampleWordBreaker.cpp: '|=': unsafe mix of
> type 'bool' and type 'unsigned char' in operation
> 15:07:23     INFO - 
> z:/build/build/src/intl/lwbrk/nsSampleWordBreaker.cpp(218): warning C4805:
> '|=': unsafe mix of type 'bool' and type 'unsigned char' in operation

That should be easily fixable, e.g. by casting the bool to a uint8_t.

> (In reply to Axel Hecht [:Pike] from comment #62)
> > I wonder if three years later, with icu in our code, we can actually
> > implement this in a way that works for all lanuages with this problem?
> 
> jfkthame, would you know?

Well.... ICU has line- and word-breaking support, though currently I believe we exclude that from our build. So enabling it would increase our code (and data) size, as it'll pull in the Thai dictionary that is used to identify word boundaries. I'm not sure what the total size impact would be, but given that we're still waiting to shoehorn ICU into the Android build, I'd be hesitant to do it right now.

If we do (eventually) move to an ICU-backed solution, that should work for whatever languages ICU supports -- I think that includes several of the SEAsian languages that have similar issues. The same would be true, though, for a solution based on platform-native breaking (like we currently use for line-break): although the patch here specifically checks for Thai letters, it could perfectly well do the same for Lao, Khmer, Burmese, etc., and pass those through to the platform breaker as well; then we'd get support for whatever languages the host platform implements.

Eventually, a re-write of both line- and word-breaking to use ICU services consistently across all platforms (and eliminate platform-specific code here) would probably be a good thing. But until we tackle that, something along the lines of the patch here is probably the easiest interim solution.

Flags: needinfo?(jfkthame)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

7 years ago

Depends on: 1423593

Peiying Mo [:CocoMo]

Comment 65

•

5 years ago

Our monthly meetings with the Thai community keeps bringing up this bug. This issue is a pretty major usability issue and hinders bringing new contributors to the community.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

5 years ago

Priority: -- → P2

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

5 years ago

Priority: P2 → P1

Poren Chiang [:RSChiang]

Comment 66

•

4 years ago

Attached file Bug 425915 - Use complex breaker for Thai and CJK scripts. — Details

Word boundary detection is used for various text-related functions
in Firefox, notably caret keyboard navigation, mouse double-click
selection, and three-tap dictionary lookup on macOS. The current
"naive" word breaker use boundary between different scripts, which
would break languages such as Thai, Mandarin, and Japanese (mixed
Hiragana and Han character [Kanji]) into only large chuck of text.

This patch utilizes complex script services provided by platform
that are already being used by moz::intl::LineBreaker, adapting
similar code for moz::intl::WordBreaker.

Poren Chiang [:RSChiang]

Comment 67

•

4 years ago

I'd like to bump up this issue as it affects Mandarin and Japanese hiragana as well. Double-clicking, option+ arrow key caret movement, and most importantly, dictionary lookup, are all selecting the whole chunk of text.

Additional Scenarios for Mandarin and Japanese

Han characters generally doesn’t work well:

山林開放後|，|我們與動物的距離 (current behavior)
山林|開放|後|，|我們|與|動物|的|距離 (expected)

Japanese words are more complicated, as there could be Han and Hiragana in a single word (but not Katakana):

雨|と|気温上昇|で|雪解|けが|進|む (current behavior, grouping word by script type)
雨|と|気温|上昇|で|雪解け|が|進む (expected, native behavior in macOS)

Current Status

I’ve refreshed the patch on Phabricator, built and tested against macOS 10.14. It works for Thai examples provided above, but due to liberal line-breaking rules in East Asian scripts, NS_GetComplexLineBreaks basically means break-all.

Proposed Solution

Probably related to Bug 1275486. For complex scripts and Han/Hiragana characters, I think we need to pass different parameter to platform text engine, which in macOS’ case would be kCFStringTokenizerUnitWord instead of kCFStringTokenizerUnitSentence. This would probably be implemented in a function similar to NS_GetComplexLineBreaks (say, NS_GetComplexWordBoundary).

Thoughts?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 68

•

4 years ago

Ideally, we would have a behavior that is (a) the same across platforms, (b) the same across browsers, and (c) something that we could standardize. How does what you propose fit these criteria?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 69

•

4 years ago

... and I suppose it's also worth asking how it aligns with what CSS Text Level 3 specifies.

Poren Chiang [:RSChiang]

Comment 70

•

4 years ago

@dbaron Thanks for pointing these out.

My understanding is that LineBreaker is used to implement CSS word-break and line-break, hence it is properly documented and carefully distinguish approaches on different scripts. It is the class that calls into platform-specific language service for better line breaking (see intl/lwbrk/LineBreaker.cpp line 1072–84).

WordBreaker, however, is neither used in rendering, nor does it follow any web standard previously. A full grep in source shows that it is only used for caret, find in page, spellchecker, dictionary lookup, and other UI functions. Thus, it is important for them to be consistent with native platform behaviors. The previous suggestion is basically giving alphabets a fast track, and pass the rest to the system.

Jonathan Kew [:jfkthame]

Assignee

Comment 71

•

4 years ago

That's right, according to my understanding. The terminology is a bit confusing, because e.g. the CSS word-break property is one of the things that controls line breaking for layout, but that's not what Gecko's WordBreaker is for -- it provides word boundary analysis for various user-interaction purposes. I don't think CSS Text really has anything to say about this.

What might become relevant here is that there's a proposal to add a text segmentation API to JavaScript, which would provide word-boundary services (among other things). If this happens, it will probably involve reimplementing what WordBreaker does in a more thorough and better-standardized way, possibly based on ICU or a Rust implementation of UAX 29. If/when that happens, it's likely to supersede anything done here.

However, that's still under discussion and the way forward is not entirely clear. In the meantime, I think that if we can improve the behavior for languages such as Thai by means of a fairly straightforward patch to the existing code, this would be a valuable interim solution. It's unfortunate that Thep's work several years ago came so close to being ready to land, but we weren't able to get over the finish line at that time.

I've just pushed the current patch from phabricator to tryserver to see how things look across platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=68211de31fd686e359ce3147900ae4fe65420954.

Poren Chiang [:RSChiang]

Comment 72

•

4 years ago

@jfkthame Thank you for summarize it up. UAX 29 is a pretty good read, and from this note under 4.1.1:

For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.

I'd say if we ever implement UAX 29, it would be a proper replacement for the current WordBreaker script guessing hack only. Unless we’re going full ICU with dictionaries, we still need to pass them to the platform just like you mentioned.

I’ve never used try server before! I’ll see if I could make Japanese text work, based from the current test results.

Arthit Suriyawongkul

Reporter

Comment 73

•

4 years ago

May related:

W3C Southeast Asian Layout task force (sealreq) is now developing document for complex text layout languages (Thai, Lao, Khmer) here https://github.com/w3c/sealreq/

Jonathan Kew [:jfkthame]

Assignee

Comment 74

•

4 years ago

Attached file Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato — Details

Jonathan Kew [:jfkthame]

Assignee

Comment 75

•

4 years ago

Poren: I took your patch and made some minor fixes to resolve crashes/failures that showed up on tryserver, and then I have also updated it to handle the various Southeast Asian writing systems that do not use interword spaces (Khmer, Lao, etc) in addition to Thai, based on the Unicode script property of the characters. This seems to be working pretty well in my (limited) testing.

I omitted sending Han & Hiragana text through the complex breaker, as I'm not sure using line-break positions as word boundaries will work so well there; it tends to be too eager to break. Maybe that would still be better than the current behavior, but it's less clear to me. Really, we need to call a slightly different platform API (or integrate a real text segmentation component that knows about Chinese & Japanese). Anyhow, what I suggest is that in this bug we just target the SEAsian alphabets (like Thai) that do not have word spaces; if we have ideas for how to handle Japanese/Chinese better, we can do a followup bug about that.

If you'd like to test this version of the patch, and let me know of any further issues/suggestions, that would be really helpful - thanks!

Flags: needinfo?(ren.chiang)

Jonathan Kew [:jfkthame]

Assignee

Updated

•

4 years ago

Assignee: thep → jfkthame

Flags: needinfo?(thep)

Jonathan Kew [:jfkthame]

Assignee

Comment 76

•

4 years ago

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=79e8fbea0da178e91f4b10855b53a62ab7cd40f3

Jonathan Kew [:jfkthame]

Assignee

Comment 77

•

4 years ago

Just to note: there are other aspects of the WordBreaker code that should really be updated, too -- e.g. it doesn't handle surrogate pairs, which means it won't properly recognize supplementary-plane characters. But for clarity, I haven't included that here; this patch just aims to address the specific issue of "borrowing" the complex line-breaker behavior for these SEAsian scripts, without disrupting anything else. Other fixes/improvements should be separate bugs (or may be superseded by a new text-segmentation component, if/when we adopt an ICU- or Rust-based implementation for that).

Phabricator Automation

Updated

•

4 years ago

Attachment #9141047 - Attachment description: Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. → Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato

Jonathan Kew [:jfkthame]

Assignee

Comment 78

•

4 years ago

Attached file Bug 425915 - Add Thai movement-by-word testcases. r=m_kato — Details

Depends on D71206

Pulsebot

Comment 79

•

4 years ago

Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/73938a98d8fc
Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato
https://hg.mozilla.org/integration/autoland/rev/e6483e810796
Add Thai movement-by-word testcases. r=m_kato

Cristian Brindusan [:cbrindusan]

Comment 80

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/73938a98d8fc
https://hg.mozilla.org/mozilla-central/rev/e6483e810796

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox77: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla77

Dietrich Ayala (:dietrich)

Comment 81

•

4 years ago

Nice to see this fixed, thanks so much Jonathan!

It's been ~10 years since Barcamp Bangkok, fond memories of working with local hackers there and helping to file and push forward these types of issues that matter so much to daily use of Firefox in Thai language :D

Poren Chiang [:RSChiang]

Comment 82

•

4 years ago

Jonathan: Thank you so much, glad to see it merged!

I finally had time to revisit the codebase (given the epidemic), and not until now did I realize why you mentioned the upcoming IntlSegmenter component. It totally makes sense now. 😅

I'll start a new bug to track possible mitigations for CJK scripts, as IntlSegmenter probably won’t land for at least half a year? Bug 345823 seems stale.

Flags: needinfo?(ren.chiang)

Makoto Kato [:m_kato]

Updated

•

4 years ago

Regressions: 1672269

First patch, using NextWord() method instead of BreakInBetween() 10 years ago Theppitak Karoonboonyanan 1.69 KB, patch		Details \| Diff \| Splinter Review
Updated patch, with all methods implemented 10 years ago Theppitak Karoonboonyanan 3.15 KB, patch		Details \| Diff \| Splinter Review
Reformatted patch 10 years ago Theppitak Karoonboonyanan 4.94 KB, patch	roc : review-	Details \| Diff \| Splinter Review
Patch using nsTArray 10 years ago Theppitak Karoonboonyanan 4.94 KB, patch		Details \| Diff \| Splinter Review
Patch with new GetWordBreaks() method 10 years ago Theppitak Karoonboonyanan 7.90 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Edited patch 10 years ago Theppitak Karoonboonyanan 8.15 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Correcting build & test failures 10 years ago Theppitak Karoonboonyanan 8.44 KB, patch		Details \| Diff \| Splinter Review
Fix more warnings & mochitest failure 10 years ago Theppitak Karoonboonyanan 8.51 KB, patch		Details \| Diff \| Splinter Review
Really fix mochitest failure 10 years ago Theppitak Karoonboonyanan 8.49 KB, patch		Details \| Diff \| Splinter Review
Formatted patch for review 10 years ago Theppitak Karoonboonyanan 8.76 KB, patch	roc : review-	Details \| Diff \| Splinter Review
Updated patch for review & commit 10 years ago Theppitak Karoonboonyanan 9.22 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Fix word-back behavior 10 years ago Theppitak Karoonboonyanan 15.19 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Fix a11y mochitest 10 years ago Theppitak Karoonboonyanan 15.08 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Fix more mochitest failures 10 years ago Theppitak Karoonboonyanan 14.94 KB, patch	roc : review+	Details \| Diff \| Splinter Review
Bug 425915 - word boundary detection for Thai text 7 years ago Axel Hecht [:Pike] 59 bytes, text/x-review-board-request		Details
Bug 425915 - Use complex breaker for Thai and CJK scripts. 4 years ago Poren Chiang [:RSChiang] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 425915 - Use complex line breaker to identify word boundaries in SEAsian languages without interword spaces. r=m_kato 4 years ago Jonathan Kew [:jfkthame] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 425915 - Add Thai movement-by-word testcases. r=m_kato 4 years ago Jonathan Kew [:jfkthame] 47 bytes, text/x-phabricator-request		Details \| Review