Last Comment Bug 7969 - need dictionary based Thai line breaker and intergrate into line/word breaker
: need dictionary based Thai line breaker and intergrate into line/word breaker
Status: RESOLVED WONTFIX
: helpwanted, intl
Product: Core
Classification: Components
Component: Internationalization (show other bugs)
: Trunk
: All All
P3 normal with 10 votes (vote)
: Future
Assigned To: Simon Montagu :smontagu
: Yuying Long
: Makoto Kato [:m_kato]
Mentors:
: 162940 257935 (view as bug list)
Depends on:
Blocks: thai line-breaking 257935 299423
  Show dependency treegraph
 
Reported: 1999-06-10 23:18 PDT by Frank Tang
Modified: 2008-10-01 14:36 PDT (History)
21 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch for mozilla to use libinthai for Thai word break (8.38 KB, patch)
2002-02-14 05:27 PST, samphan
no flags Details | Diff | Splinter Review
The libinthai word breaking library (211.95 KB, application/octet-stream)
2002-02-14 05:36 PST, samphan
no flags Details
example for non-separation in colums with Thai characters (26.39 KB, text/html)
2004-12-20 02:02 PST, webmaster
no flags Details
Patch to use ICU for line-breaking for Thai (5.32 KB, patch)
2005-01-27 05:13 PST, Samphan Raruenrom
no flags Details | Diff | Splinter Review
New version of patch to use ICU for line-breaking (5.85 KB, patch)
2005-02-07 20:54 PST, Samphan Raruenrom
no flags Details | Diff | Splinter Review
Patch to use ICU for Thai line-breaking, statically link on Windows (11.86 KB, patch)
2005-06-16 05:22 PDT, Samphan Raruenrom
no flags Details | Diff | Splinter Review
Patch to reduce the size of ICU for Thai line-breaking (5.89 KB, patch)
2005-06-16 05:28 PDT, Samphan Raruenrom
no flags Details | Diff | Splinter Review
Patch to optionally use ICU for Thai line breaking (5.60 KB, patch)
2005-06-30 18:20 PDT, Samphan Raruenrom
jshin1987: review-
Details | Diff | Splinter Review
Patch to add Thai word break support to Mozilla/Firefox using libthai (8.03 KB, patch)
2005-09-05 06:08 PDT, Pattara Kiatisevi
no flags Details | Diff | Splinter Review
Updated libthai patch: use NSPR instead of dlopen() to load libthai (7.83 KB, patch)
2006-02-09 01:49 PST, Pattara Kiatisevi
no flags Details | Diff | Splinter Review
Updated libthai patch (7.72 KB, patch)
2006-04-17 02:19 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Rewritten LibThai Patch (13.92 KB, patch)
2006-05-02 04:29 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Rewritten libthai patch, rediffed against 1.8 branch (14.21 KB, patch)
2006-05-02 07:56 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Alternative patch, with libthai dependency on --enable-libthai configure switch (15.63 KB, patch)
2006-05-08 04:43 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch against trunk to link to libthai with --enable-libthai option (7.32 KB, patch)
2006-07-25 11:30 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
intl/lwbrk/src/nsLibThaiLineBreaker.h (2.40 KB, text/plain)
2006-07-25 11:35 PDT, Theppitak Karoonboonyanan
no flags Details
intl/lwbrk/src/nsLibThaiLineBreaker.cpp (4.65 KB, text/plain)
2006-07-25 11:37 PDT, Theppitak Karoonboonyanan
no flags Details
intl/lwbrk/src/nsLibThaiLineBreaker.h -- with cache (3.68 KB, text/plain)
2006-07-26 05:32 PDT, Theppitak Karoonboonyanan
no flags Details
intl/lwbrk/src/nsLibThaiLineBreaker.cpp -- with cache (5.65 KB, text/plain)
2006-07-26 05:34 PDT, Theppitak Karoonboonyanan
no flags Details
Patch against trunk to call mozlibthai component, with --enable-libthai option (8.44 KB, patch)
2006-07-27 13:54 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch for adding the mozlibthai component (21.12 KB, patch)
2006-07-27 13:57 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch against trunk for mozlibthai component (1/2) (8.11 KB, patch)
2006-07-28 01:38 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch against trunk for mozlibthai component (2/2) (21.00 KB, patch)
2006-07-28 01:39 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch to lwbrk so it calls mozlibthai component at appropriate places (8.60 KB, patch)
2006-10-07 08:23 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review
Patch to add mozlibthai component (21.14 KB, patch)
2006-10-07 08:24 PDT, Theppitak Karoonboonyanan
no flags Details | Diff | Splinter Review

Description User image Frank Tang 1999-08-23 14:23:59 PDT
change to M15 since this is post beta
Comment 1 User image Frank Tang 2000-02-16 01:03:09 PST
mark it M20
Comment 2 User image Frank Tang 2000-06-05 23:50:21 PDT
line / work breaking related issue. Reassign to shanjian and cc erik/nhotta
Comment 3 User image Frank Tang 2000-07-28 15:05:55 PDT
mark it future
Comment 4 User image markpeak 2000-12-02 07:16:08 PST
Line breaking issue is big issue for Thai. Browser without Line breaking 
(Netscape 4) display text worse than IE5 (with line-breaking) so much.
This also impontant in Word Processing too. (should be in composer too).
What will you all do with this bug?
Comment 5 User image Shanjian Li 2001-01-24 16:06:39 PST
We need help to resolve this problem. Could you provide us help in this area? 
I can explain the current code to you if necessary.
Comment 6 User image Arthit Suriyawongkul 2001-06-10 21:07:10 PDT
for a thai dictionary for wordbreaker, i think samphan@thai.net can help you.
Comment 7 User image Arthit Suriyawongkul 2002-02-09 03:21:17 PST
What is the current status of this issue ?
Comment 8 User image samphan 2002-02-14 05:27:51 PST
Created attachment 69458 [details] [diff] [review]
patch for mozilla to use libinthai for Thai word break

Here's the patch for dictionary-based Thai line breaker for Mozilla.
The work is not finished and have been tested on Linux only.
There're some problems with it such as :-
- On Linux the dictionary will be loaded as "/usr/share/inthai/inthai.lex"
- On other platform it'll be loaded as "inthai.lex" so it must be in 
  current directory! See mozilla/intl/inthai/src/PackedTrie.cpp

Set your browser to "http://opensource.thai.net" if you want some test data in
Thai.
Comment 9 User image samphan 2002-02-14 05:36:58 PST
Created attachment 69459 [details]
The libinthai word breaking library

- cd to mozilla/.. and tar -xzvf libinthai-moz.tar.gz
  The library will be installed in mozilla/intl/inthai
- Copy the dictionary file mozilla/intl/inthai/data/inthai.lex to
  /usr/share/inthai for Linux or the executable directory for Win.
  (you must mkdir /usr/share/inthai first).
Comment 10 User image samphan 2002-02-14 06:13:29 PST
Sorry, I forgot to tell that the above patch is made on mozilla 0.9.8 source.
Comment 11 User image Roland Mainz 2002-02-28 01:49:26 PST
How does the functionality of this patch compare to the functionality of
Mozilla's CTL(=Complex Text Layout) extension ?
Comment 12 User image Arthit Suriyawongkul 2002-03-08 03:27:27 PST
as of my knowledge,

CTL provide the functionalities of:
- render Thai characters (cell-base, 4 levels of characters in 1 cell)
- cell-base caret travelling, delete, backspace
- cell-base highlight selection

There's NO wordbreak functionality in CTL (at least, for now).

Prabhat may better clarify this.
Comment 13 User image Frank Tang 2002-05-09 15:46:48 PDT
please do not attach application/octect-stream type in bugzilla attachment. 
Comment 14 User image Prabhat Hegde 2002-05-28 15:14:34 PDT
I looked at this patch. With some minor changes we could integrate this 
to co-exist with current CTL code. One issue for discussion could be if
anyone in intl team is planning to integrate ICU (which i believe also
provides good thai word-breaking support). Could someone in intl team 
comment on this?

Comment 15 User image samphan 2002-06-02 23:39:14 PDT
I'd love to hear about using ICU in mozilla. IMHO, this will solve many problems.

> ICU (which i believe also provides good thai word-breaking support)

Thai dictionary-based word-breaker in ICU is not good enough. People from
IBM/Sun Thailand who use/test ICU with Thai language can tell you that. The
problems come from :-
1) Greedy, longest-matching algorithm use in ICU can't handle ambiguous word
sequences at all (it was mentioned as over-kill). Modern algorithms in Thai use
maximal-matching algorithm which use dynamic programming technic to handle
ambiguous sequences efficiently.
2) ICU algorithm don't handle unknown words (words not in the dictionary) very
well. Unknown words happend a lot in real Thai texts so it's *very* serious.
3) It's not been tested enough with Thai so there may be cases that cause error.
One case I can think of is the use of '.' in Thai to denote abbreviations. The
'.' could be threated as part of Thai word or as punctuation (some use it to
denote end of sentences) depend on context and dictionary. The dictionary should
also contain list of abbreviations for this to work.

On the other hand, IMO, ICU's BreakIterator API is the way to go. Libinthai
actually adopted BreakIterator API from the beginning. The interface you see in
my Mozilla patch is the result of painfully adapting the stateful BreakIterator
interface to Mozilla's stateless breaker API. Internally libinthai is ICU
compatible. I think Mozilla should consider using BreakIterator API directly in
the future.
Comment 16 User image samphan 2002-06-02 23:50:32 PDT
> The work is not finished and have been tested on Linux only.
> There're some problems with it such as :-
> - On Linux the dictionary will be loaded as "/usr/share/inthai/inthai.lex"
> - On other platform it'll be loaded as "inthai.lex" so it must be in 
>   current directory! See mozilla/intl/inthai/src/PackedTrie.cpp

The dependency on the dictionary data file (inthai.lex) in libinthai has been
solved. The (default) main dictionary is compiled-in as a module as in ICU.
We're working on (re)building libinthai on Windows. Now libinthai also has
spelling checking/correction functionality too (spelling check and word breaking
for Thai are the two sides of the same coin).

Comment 17 User image Prabhat Hegde 2002-07-10 16:21:38 PDT
hi samphan - Please post the latest patch and library with dictionary dependency
removed so that i can get our thai team to verify this. 
thanks,
prabhat
Comment 18 User image samphan 2002-07-23 02:13:12 PDT
The new libinthai can be d/l at
ftp://opensource.thai.net/pub/office-tle/libinthai_0.9.tar.gz
I haven't look at the patches for a while but will do ASAP if you want to.
BTW, what's the current plan?
1) change current code to use ICU's BreakIterator API and
1.1) go directly to ICU now
1.2) use libinthai until Thai BreakIterator in ICU is ok
2) use libinthai with its mozilla specific API as a temporaly solution

After all, please note that this particular patch (from above) fix the 'bug'
that prevent word-breaking for Thai eventhough the rule-based code is already
check-in (mozilla/intl/lwbrk/src/rulebrk.c) and used to work in old versions.

diff -ur tmp/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
dev/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp
--- tmp/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp	Wed Jan 16 07:03:16 2002
+++ dev/mozilla/intl/lwbrk/src/nsJISx4501LineBreaker.cpp	Wed Feb 13 18:06:52 2002
@@ -234,7 +233,8 @@
   return ((0x1100 <= (u) && (u) <= 0x11ff) ||
           (0x2e80 <= (u) && (u) <= 0xd7ff) ||
           (0xf900 <= (u) && (u) <= 0xfaff) ||
-          (0xff00 <= (u) && (u) <= 0xffff) );
+          (0xff00 <= (u) && (u) <= 0xffff) ) ||
+          (0x0e00 <= (u) && (u) <= 0x0e5f);
 }
 
 static inline int
Comment 19 User image webmaster 2004-12-20 02:02:01 PST
Created attachment 169176 [details]
example for non-separation in colums with Thai characters

Simplest solution for line break will be to allow separation after every Thai
Character
A better solution will be to separate before
characters in Thai Code page 874: 0xE0 to 0xE4
or Unicode: 0x0E40 to 0x0E44
Best will be to have a complete algorithm for word separation as e.g. in MS
Internet Explorer.
Comment 20 User image Samphan Raruenrom 2005-01-27 05:09:49 PST
This patch is just for reviewing. It is for mozilla/firefox to use ICU for 
line breaking. It works on both Linux and Win32. The problem with it is that
- you need ICU installed in the target machine
- to build it is quite ad-hoc
  - on linux you have to put ICU in /usr/lib
  - on win32 you need to add icu/lib path to the LIB variable
Anyway, integrating a lite version of ICU into the tree will solve this problem.
Other than that it works quite well. The patched firefox has been released
and being used by many people.

A flaw in the patch, nsJISx4051LineBreaker::BreakInBetween() is not implemented.
It simply do this :-

if((CLASS_THAI == c1) && (CLASS_THAI == c2))
{
   *oCanBreak = 0;
}
else
{
   *oCanBreak = GetPair(c1,c2);
}

I'm not sure if this is ok but I guess that BreakInBetween() is only need for
CJK (cause the patch work for Thai) and I can't think of a way to implement 
BreakInBetween() using ICU. 
Any comment?
Comment 21 User image Samphan Raruenrom 2005-01-27 05:13:30 PST
Created attachment 172558 [details] [diff] [review]
Patch to use ICU for line-breaking for Thai
Comment 22 User image Samphan Raruenrom 2005-02-07 20:54:25 PST
Created attachment 173704 [details] [diff] [review]
New version of patch to use ICU for line-breaking

This is the new patch that fix a bug that make some space missing when it is at
the end of line in HTML source. It simply use
DictionaryBasedBreakIterator::CreateWordInstance instead of CreateLineInstance.

Many people are using the patched firefox on Linux. I also distribute a setup
program on Windows that contain patched firefox and ICU together so Windows
users can start using it.
Comment 23 User image Frank Tang 2005-03-01 23:55:43 PST
shanjian is no longer working on mozilla for 2 years and these bugs are still
here. Mark them won't fix. If you want to reopen it, find a good owner first. 
Comment 24 User image Travis Chase 2005-03-02 03:50:17 PST
Mass Re-open of Frank Tangs Won't fix debacle. Spam is his responsibility not my own
Comment 25 User image Travis Chase 2005-03-02 04:05:50 PST
Mass Re-assinging Frank Tangs old bugs that he closed won't fix and had to be
re-open. Spam is his fault not my own
Comment 26 User image Samphan Raruenrom 2005-06-16 05:22:30 PDT
Created attachment 186465 [details] [diff] [review]
Patch to use ICU for Thai line-breaking, statically link on Windows

On Windows, I built ICU with --enable-static and modified the source as shown
in the next patch. Then I statically link the ICU binaries with the patched
firefox (which is built statically with default --enable-optimize). The result
firefox.exe with Thai line-breaking feature is only 328K bigger.
This proofs that it may be possible to include ICU in Mozilla/Firefox w/o too
much size penalty. ICU size can be reduced a lot. And static linking will pull
only the code we need from ICU.
Comment 27 User image Samphan Raruenrom 2005-06-16 05:28:34 PDT
Created attachment 186467 [details] [diff] [review]
Patch to reduce the size of ICU for Thai line-breaking

This patch remove a lot of stuffs. I make sure only that Thai
DictionaryBasedBreakIterator is working.
Comment 28 User image Samphan Raruenrom 2005-06-25 05:04:38 PDT
I've just build the original firefox source with the same MOZCONFIG/setting I
used for the modified source. The result firefox.exe is a little bit smaller
than Mozilla.org official build. So I recompute the actual size that my ICU add
to firefox.exe. The modified firefox.exe is actually 528K bigger.
Comment 29 User image Jungshik Shin 2005-06-26 19:35:56 PDT
Comment on attachment 186465 [details] [diff] [review]
Patch to use ICU for Thai line-breaking, statically link on Windows

Thanks for your work, but introducing a new dependency (although only at the
compilation time) is not under my 'jurisdiction'. Perhaps, we need to figure
out a better way to make use of ICU (including a subset of them in our tree may
be an option).
Comment 30 User image Samphan Raruenrom 2005-06-26 20:06:02 PDT
1) Is there any other possible use of ICU in Mozilla now? There'll be more
chance to include ICU in the tree if there're many uses of it.

2) Is it possible to introduce a configure option, e.g. --with-icu[=icu-prefix],
to enable the use of ICU at compile time? This should not interfere with the
normal build while providing the option to use ICU in Mozilla (anywhere).
Comment 31 User image Jungshik Shin 2005-06-26 20:15:37 PDT
(In reply to comment #30)
> 1) Is there any other possible use of ICU in Mozilla now? There'll be more
> chance to include ICU in the tree if there're many uses of it.

There may be many if we decide to use ICU, but ...
 
> 2) Is it possible to introduce a configure option, e.g. --with-icu[=icu-prefix],
> to enable the use of ICU at compile time?
 
That's another possibility. Perhaps, that's a good interim solution. Can you
make a patch to implement that? 
Comment 32 User image Samphan Raruenrom 2005-06-27 01:24:40 PDT
I don't know how yet but I'll try.
I've looked at similar stuff in configure.in.
Comment 33 User image Samphan Raruenrom 2005-06-30 18:20:12 PDT
Created attachment 187900 [details] [diff] [review]
Patch to optionally use ICU for Thai line breaking

I split the patch into two parts. The patch to configure.in (add --with-icu) is
filed as another bug #299324 which this bug depends. This patch now use ICU for
Thai line-breaking only if configure with --with-icu flag (which will #define
ENABLE_ICU).
Comment 34 User image Jungshik Shin 2005-07-31 02:51:44 PDT
Comment on attachment 187900 [details] [diff] [review]
Patch to optionally use ICU for Thai line breaking

Sorry for the delay.
We have to keep the current Thai line breaker if ENABLE_ICU is NOT defined,
don't we?
Comment 35 User image Jungshik Shin 2005-07-31 02:52:38 PDT
Comment on attachment 186467 [details] [diff] [review]
Patch to reduce the size of ICU for Thai line-breaking

This is not releavant at least for now
Comment 36 User image Samphan Raruenrom 2005-08-01 03:54:55 PDT
> We have to keep the current Thai line breaker if ENABLE_ICU is NOT defined,
don't we?

No. The old rule-based Thai word-breaking code should be removed. It doesn't get
called anyway. nsJISx4501LineBreaker.cpp w/o the new patch only process CJK now
(it used to called the Thai rule-based code). That's why Thai word-breaking is
currently not working.
Comment 37 User image Pattara Kiatisevi 2005-09-05 06:08:37 PDT
Created attachment 194910 [details] [diff] [review]
Patch to add  Thai word break support to Mozilla/Firefox using libthai

You will need to install libthai (http://libthai.sf.net/) on your system for
the Thai word breaking function to work.

The code loads libthai.so dynamically so if there is no libthai in the system,
it will do nothing. The benefit is that the patch is small and doesn't hurt
other users who don't speak Thai. For Thai users, they just have to install
libthai in their system and the vanilla version Mozilla/Firefox should work
immediately, no need to re-compile in order to have Thai support.

The code uses dlopen() (tested on Linux Debian unstable +Firefox-1.0.6), to be
more cross-platform, we might switch to GLib's GModule? I have included an
experimental code in the patch but had a problem with Makefile/INCLUDE stuff so
didn't test it yet. 

For Windows, instead of libthai, we can use the word breaking function provided
by Uniscribe? I have not much idea on this so Windows guru could you help?
Comment 38 User image Pattara Kiatisevi 2005-09-05 06:30:02 PDT
The ability to break Thai words of web browsers is really needed for Thai 
users. The libthai-based patch above (modified from Samphan's ICU patch, 
tested) is a solution to add this support to Mozilla/Firefox. It loads 
libthai.so (http://libthai.sf.net/), if exists in the system, dynamically, so 
no need to do ./configure --enable-something and/or recompile Mozilla/Firefox 
for every new releases. 
 
Not sure if Mozilla will eventually go for ICU/Pango/... in the future but for 
the time being, is a solution like this acceptable?  
Comment 39 User image Pattara Kiatisevi 2005-09-14 23:28:39 PDT
Will someone in Mozilla read this thread/bug?
Comment 40 User image :Gavin Sharp [email: gavin@gavinsharp.com] 2006-01-10 08:06:06 PST
*** Bug 257935 has been marked as a duplicate of this bug. ***
Comment 41 User image Samphan Raruenrom 2006-01-11 21:05:52 PST
I have another idea. Since incorporating ICU into Mozilla code base seems to be impossible. Is it possible to write an extension for the job? 

The extension will do this :-
1) It'll implement the interface nsILineBreaker, using ICU for Thai line breaking
2) It'll register the new interface in the service NS_LWBRK_CONTRACTID, so nsDocument::GetLineBreaker() will get the new implementation from the extension instead.

I know nothing about XPCOM.
- Is this possible? Is it possible to replace the interface implementation from an extension?

What do you think? The way to go?
Comment 42 User image Pattara Kiatisevi 2006-02-09 01:49:26 PST
Created attachment 211242 [details] [diff] [review]
Updated libthai patch: use NSPR instead of dlopen() to load libthai

An updated version of libthai patch. It uses NSPR instead of dlopen() to load libthai according to the information here:
http://www.mozilla.org/projects/nspr/reference/html/prlink.html#24072
Thank you Jean-Marc Desperrier for poiting this out.

What is the proper way to seach for libthai.so from default paths specified in /etc/ld.so.conf or LD_LIBRARY_PATH? Now I just hard-code it to load from /usr/lib.

pattara
Comment 43 User image Axel Hecht [:Pike] 2006-02-11 09:31:52 PST
Without looking into all details of the bug, does
http://developer.mozilla.org/en/docs/Using_Dependent_Libraries_In_Extension_Components
help? Maybe not, but still putting the reference here.
Comment 44 User image Jungshik Shin 2006-02-15 04:51:36 PST
(In reply to comment #42)

> What is the proper way to seach for libthai.so from default paths specified in
> /etc/ld.so.conf or LD_LIBRARY_PATH? Now I just hard-code it to load from
> /usr/lib.

If you use PR_LoadLibrary, I think you don't have to do anything. NSPR does the job for you afaik. 
Comment 45 User image Pattara Kiatisevi 2006-03-26 22:43:15 PST
Comment on attachment 194910 [details] [diff] [review]
Patch to add  Thai word break support to Mozilla/Firefox using libthai

obsoleted by the libthai NSPR patch
Comment 46 User image Pattara Kiatisevi 2006-03-26 23:06:43 PST
Hi,

Months have passed. The patch is there. The patch does fix the problem. But it is left there. 

Just wondering how long will it take until Mozilla/Firefox can break Thai words correctly. Who is the right contact point for this? Where is the right place to post? If not here then what is this place for?

Does Mozilla/Firefox usually accept contribution like this from users? Is it the same for other bugs or this is the special case that nobody seems to care at all? Is it a technical problem or an administrative one? If the patch doesn't conform to some guidelines please let us know what to fix. 

Just need some responses. Please at least show me that Mozilla is an Open Source project with open collaboration from its users. Otherwise how should I tell my users when things like Thai word-breaking don't work and don't seem to get fixed in any near future?

Pattara
Comment 47 User image Theppitak Karoonboonyanan 2006-04-17 02:19:46 PDT
Created attachment 218666 [details] [diff] [review]
Updated libthai patch

Updated libthai patch as modified when submitted for inclusion in Ubuntu dapper. It now loads libthai.so.0 (with library version info specified) instead of just libthai.so, to guarantee ABI. Besides, the debug messages are now put only if DEBUG is defined.
Comment 48 User image Axel Hecht [:Pike] 2006-04-17 07:58:22 PDT
The code lacks platform ifdefs to boot.
Re Windows and ICU, going the extension way may be a good idea. A post to
mozilla.dev.platform and .i18n may be good here (.platform is probably more
important than .i18n).
Any idea about the mac? jshin, is any of this related to the indic problems we 
have?
Comment 49 User image Pattara Kiatisevi 2006-04-18 08:50:36 PDT
(In reply to comment #48)
> The code lacks platform ifdefs to boot.

you mean something like

#ifdef LINUX

[...the added part...]

#endif

?

If not could you give me an example or point to some URLs?

Would you like to help test the patch?

> Re Windows and ICU, going the extension way may be a good idea. A post to

Windows has Uniscribe, so IMHO we should just use it (i don't know how).

> mozilla.dev.platform and .i18n may be good here (.platform is probably more
> important than .i18n).
> Any idea about the mac? jshin, is any of this related to the indic problems we 
> have?

I think the final solution which should work on all platforms is to use Pango. I heard Theppitak is working on it. But meanwhile, for the sake of Thai Linux users, temporary-but-works-and-doesn't-hurt-anybody solution like libthai should not be ignored.
Comment 50 User image Theppitak Karoonboonyanan 2006-05-02 04:29:17 PDT
Created attachment 220497 [details] [diff] [review]
Rewritten LibThai Patch

I've rewritten the LibThai patch. It's against MOZILLA_1_7_BRANCH, but I think still applies to MOZILLA_1_8_BRANCH as well (but not HEAD, of course).

In this patch, I split the libthai stuffs into a separate nsLibThaiLineBreaker class and call it from nsJISx4501LineBreaker at appropriate places. (At first, I was about to add a case in nsLWBreakerFimp factory, but it seemed all the callers passed a blank parameter. So, having more than one implementation is just impossible.)

Also, a preprocessor is added so that libthai is called only on Unix platforms, although making libthai available on Windows is also possible with cygwin/mingw. But that can be added later.
Comment 51 User image Theppitak Karoonboonyanan 2006-05-02 07:56:01 PDT
Created attachment 220509 [details] [diff] [review]
Rewritten libthai patch, rediffed against 1.8 branch

OK. I managed to rediff the rewritten libthai patch against 1.8 branch, by using debian's xulrunner 1.8.0.1 pristine source.
Comment 52 User image Theppitak Karoonboonyanan 2006-05-07 01:56:49 PDT
Related info: I've opened Bug #336959 for Pango/Uniscribe approach to line breaking in Trunk.
Comment 53 User image Theppitak Karoonboonyanan 2006-05-08 04:43:17 PDT
Created attachment 221289 [details] [diff] [review]
Alternative patch, with libthai dependency on --enable-libthai configure switch

As another alternative, libthai dependency may be better for deployment in some distributions that also ship libthai.
Comment 54 User image Jungshik Shin 2006-07-18 19:34:14 PDT
I'm really sorry for dragging my feet here. Would you please make a new patch against the trunk instead of 1.8 branch? (note that nsILineBreaker is deCOMtaminated so that you need to make some changes). Without baking in the trunk for a while, it can't be landed in 1.8 branch (if it's not too late). BTW, why don't you use 'cvs diff' to make your patch? 
Comment 55 User image Theppitak Karoonboonyanan 2006-07-18 20:25:16 PDT
> BTW, why don't you use 'cvs diff' to make your patch?

Because new files were introduced, and without CVS write permission,
I couldn't 'cvs add' them before diff-ing. They would thus have been 
lost in the patch.

Regarding working with trunk or branch, it's quite a long story.
In short, it's a mix of results from Thai team's timing and from distro 
contacts which forward the bug back upstream. So, I lose the focus on
trunk. Now we seem to miss all the plans, anyway.

Sorry for the wrongly targeted patch. I shall rework it against trunk 
soon.
Comment 56 User image Jungshik Shin 2006-07-24 22:15:21 PDT
(In reply to comment #55)
> > BTW, why don't you use 'cvs diff' to make your patch?
> 
> Because new files were introduced, and without CVS write permission,
> I couldn't 'cvs add' them before diff-ing. They would thus have been 
> lost in the patch.

Ok. It doesn't matter much, but in a sense it may be more convenient for you  to upload two files (one is a patch against existing files obtained with 'cvs diff' and the other contains new files).
 
> Regarding working with trunk or branch, it's quite a long story.
....
> trunk. Now we seem to miss all the plans, anyway.

Sorry I'm partly to blame ....
Comment 57 User image Theppitak Karoonboonyanan 2006-07-25 11:30:40 PDT
Created attachment 230603 [details] [diff] [review]
Patch against trunk to link to libthai with --enable-libthai option

OK. I have reworked the patch against trunk, using the configure switch instead of PR_LoadLibrary. Using the switch makes the solution available on other platforms than XP_UNIX as well, provided that libthai is made available.

This patch is adjusted a little bit from previous ones. The existing rule-based Thai word segmentation in rulebrk.c is added back as the fallback when libthai is not linked.

Two additional files are coming.
Comment 58 User image Mike Hommey [:glandium] 2006-07-25 11:34:19 PDT
The PR_LoadLibrary thing was actually better... You don't need to rebuild mozilla to enable/disable libthai support...
Comment 59 User image Theppitak Karoonboonyanan 2006-07-25 11:35:24 PDT
Created attachment 230604 [details]
intl/lwbrk/src/nsLibThaiLineBreaker.h
Comment 60 User image Theppitak Karoonboonyanan 2006-07-25 11:37:21 PDT
Created attachment 230607 [details]
intl/lwbrk/src/nsLibThaiLineBreaker.cpp
Comment 61 User image Theppitak Karoonboonyanan 2006-07-25 12:02:59 PDT
Mike Hommey (#58) wrote:

> The PR_LoadLibrary thing was actually better... You don't need to rebuild
> mozilla to enable/disable libthai support...

One drawback is that library names and directories for various platforms will have to be listed, right?

With --enable switch, you turn it on only if libthai is available on your platform, eliminating explicit conditions for different platforms in the code.

Any other points I miss?
Comment 62 User image Axel Hecht [:Pike] 2006-07-25 12:25:07 PDT
Wouldn't the availability of libthai depend on installation instead of platform?
Comment 63 User image Theppitak Karoonboonyanan 2006-07-25 20:39:55 PDT
(In reply to comment #62)

Yes, libthai availability depends on installation. But its location and name depends on platform.

From my old patch in attachment #220509 [details] [diff] [review]:

+static const char* LibraryPaths[] = {
+  "/usr/local/lib",
+  "/usr/lib",
+  nsnull
+};
+
+nsLibThaiLineBreaker::nsLibThaiLineBreaker()
+  : mLibThai (nsnull)
+{
+  for (const char** ppLibPath = LibraryPaths; *ppLibPath; ++ppLibPath) {
+    char* pFullName = PR_GetLibraryName (*ppLibPath, "libthai.so.0");
+    if (pFullName) {
+      mLibThai = PR_LoadLibrary (pFullName);
+      PR_FreeLibraryName (pFullName);
+      if (mLibThai)
+        break;
+    }
+  }
+}

Yes, I can list all possible directories (/usr/lib, /usr/local/lib, c:/windows,
c:/Program\ Files/Firefox, etc.) and library names (libthai.so.0, libthai-0.dll, etc.) if this approach is chosen. Just feeling it's a bit awkward.

On the other hand, in distro maintainer's point of view, it might be convenient not to rebuild the whole suite to have libthai support.

This morning, I got another improvement idea: let's encapsulate all the mess and fallbacks for Thai in a class, say nsThaiLineBreaker. And the code in nsJISx4501LineBreaker would become a little bit cleaner. I will post the patch soon after testing it. But I will not switch back to PR_LoadLibrary yet at the moment, until we reach some decision.
Comment 64 User image Mike Hommey [:glandium] 2006-07-25 22:21:31 PDT
Let me reiterate what i told you in the bugreport you did on debian: put it in a component. Then, the component can be linked against libthai without problem. It won't be loaded if libthai is not present at runtime.
Comment 65 User image Theppitak Karoonboonyanan 2006-07-25 22:55:43 PDT
I know. And my reply was that the approach was currently not supported in the code base. Currently, JISx4501LineBreaker is practically the sole LineBreaker ever created to handle Unicode string in general. There is no string analysis in Pango's manner which allows multiple engines to be created for individual scripts yet.
Comment 66 User image Theppitak Karoonboonyanan 2006-07-26 05:30:02 PDT
Ahem, I decide to cancel the idea about the nsThaiLineBreaker encapsulation. Leaving the fallback in nsJISx4501LineBreaker.cpp should be more beneficial for componentization in the future. The new component just does its job. And the fallback is there on its absence. Isn't that good?

Anyway, I've added some caching to nsLibThaiLineBreaker to improve its performance. I'm posting the new version here.
Comment 67 User image Theppitak Karoonboonyanan 2006-07-26 05:32:47 PDT
Created attachment 230731 [details]
intl/lwbrk/src/nsLibThaiLineBreaker.h -- with cache
Comment 68 User image Theppitak Karoonboonyanan 2006-07-26 05:34:16 PDT
Created attachment 230732 [details]
intl/lwbrk/src/nsLibThaiLineBreaker.cpp -- with cache
Comment 69 User image Jungshik Shin 2006-07-26 08:13:49 PDT
Re: the path of libthai
How about adding a pref. entry (to be set in about:config) to point to the list of candidate paths (with the reasonable *platform-dependent* values set by default)? On Windows, we need to do a little more than that ... This is not a very clean solution, but there's a precedent of a similar approach (for libfreetype), I think.
Comment 70 User image Theppitak Karoonboonyanan 2006-07-26 08:54:50 PDT
I agree with using about:config. Could you suggest some sample code?
Comment 71 User image Arthit Suriyawongkul 2006-07-26 09:41:32 PDT
Can we put "[line-breaking]" in "Status Whiteboard" filed ?

(just as Bug #157967 "Make Gecko interoperate better with advanced typography systems such as ATSUI, Uniscribe, Pango & STSF")
Comment 72 User image Arthit Suriyawongkul 2006-07-26 09:52:04 PDT
Meta bug:

Bug #206152 [meta] line breaking bugs
Comment 73 User image Mike Hommey [:glandium] 2006-07-26 10:13:03 PDT
(In reply to comment #69)
> This is not a very clean solution, but there's a precedent of a similar approach (for libfreetype), I think.

This is indeed very unclean, and FWIW, the libfreetype thing is useless nowadays...

Comment 74 User image Jungshik Shin 2006-07-26 17:50:46 PDT
(In reply to comment #73)
> the libfreetype thing is useless nowadays...

I know that at least as well as anybody else here.

Adding a pref. entry can be considered if there's no better way than hard-coding the library path or assuming the build-time availability to be the same as the run-time availability. (the latter will work if firefox is built for , say, a Thai linux distribution)

In reply to comment #70)
> I agree with using about:config. Could you suggest some sample code?

See
http://lxr.mozilla.org/seamonkey/source/gfx/src/freetype/nsFreeType.cpp#498

for a sample.  
Comment 75 User image Mike Hommey [:glandium] 2006-07-26 22:37:39 PDT
(In reply to comment #74)
> Adding a pref. entry can be considered if there's no better way than
> hard-coding the library path or assuming the build-time availability to be the
> same as the run-time availability. (the latter will work if firefox is built
> for , say, a Thai linux distribution)

There's also the possibility to do like with the mozgnome component: if the gnome libs are present at build time, build the component, and if they are not at run time, the component won't be loaded (the latter doesn't require anything, the component loading system just can't load a component when its dependencies are not here ; the former is not hard to do).
Comment 76 User image Theppitak Karoonboonyanan 2006-07-27 01:06:58 PDT
(In reply to comment #75)

All right, that would be perfect, as both gecko and libthai will then rely on component system to get connected, rather than hard-coding or low-level configuration. Can we make that interface Thai-specific for the moment, to be called from nsJISx4501LineBreaker for Thai chunks?

Being Thai-specific means only Thai Unicode range is supported for the moment, until a proper Pango-like text analysis is implemented.

I know nothing about XPCOM, either, anyway.
Comment 77 User image Theppitak Karoonboonyanan 2006-07-27 13:54:43 PDT
Created attachment 230985 [details] [diff] [review]
Patch against trunk to call mozlibthai component, with --enable-libthai option

I managed to work out the mozlibthai component.
Comment 78 User image Theppitak Karoonboonyanan 2006-07-27 13:57:45 PDT
Created attachment 230986 [details] [diff] [review]
Patch for adding the mozlibthai component
Comment 79 User image Theppitak Karoonboonyanan 2006-07-28 01:34:14 PDT
Hmm... It seems splitting the word breaker as a component has solved many issues. Now I have adjusted the patch so that:

- nsJISx4501LineBreaker _always_ tries to load a component of generic nsIThaiLineBreaker interface, with rule-based fallback on its absence. (That is, it's not guarded by libthai build condition any more.)

- The mozlibthai component installs itself as an implementation of nsIThaiLineBreaker interface, without direct link to libi18n.so. It will be built only if libthai is available at configure time, and --disable-libthai option is not set.

This way, distributions can build firefox with libthai support and split the mozlibthai component as a companion package, without enforcing dependency on non-Thai users. Meanwhile, a separate language pack is also possible on Windows in the same manner. (Thanks to Mike Hommey for mentioning mozgnome example.)

We can generalize the interface for more South East Asian languages later. It's still Thai-specific for the moment.
Comment 80 User image Theppitak Karoonboonyanan 2006-07-28 01:38:08 PDT
Created attachment 231067 [details] [diff] [review]
Patch against trunk for mozlibthai component (1/2)

May I request for a review?
Comment 81 User image Theppitak Karoonboonyanan 2006-07-28 01:39:45 PDT
Created attachment 231068 [details] [diff] [review]
Patch against trunk for mozlibthai component (2/2)
Comment 82 User image Theppitak Karoonboonyanan 2006-10-07 08:18:22 PDT
I have some updates for the last two patches. They happened to work by accident on my system, due to some trace left between the builds. Now I have moved the IDL into the component's directory, so that the .xpt file gets built and installed. Besides, the code is also adjusted a little bit.
Comment 83 User image Theppitak Karoonboonyanan 2006-10-07 08:23:14 PDT
Created attachment 241552 [details] [diff] [review]
Patch to lwbrk so it calls mozlibthai component at appropriate places
Comment 84 User image Theppitak Karoonboonyanan 2006-10-07 08:24:41 PDT
Created attachment 241553 [details] [diff] [review]
Patch to add mozlibthai component
Comment 85 User image Theppitak Karoonboonyanan 2006-10-07 08:30:47 PDT
Note: To test the patches, you need libthai, which is available at:
  ftp://linux.thai.net/pub/thailinux/software/libthai
or libthai-dev package in Debian/Ubuntu.
Comment 86 User image chris hofmann 2006-11-10 03:27:13 PST
jshin, can you review the latest patches?  

simon, I wonder if you could help out and review to if jshin doesn't have cycles to review.

thanks
Comment 87 User image Theppitak Karoonboonyanan 2007-03-29 05:46:10 PDT
While caching up recent changes in trunk to update the patch, and weighing on alternatives, I think another way might be simpler: just replace nsJIS4051LineBreaker with full Pango break support.

I have updated the patch in Bug 336959 (Attachment 260003 [details] [diff]). And the performance seems now acceptable. So, please also consider such option.
Comment 88 User image Jean-Marc Desperrier 2007-07-20 06:52:59 PDT
According to my tests, Thai line breaking looks fixed (or at least very strongly enhanced) since the checkin of bug 255990. The recent checkin of bug 336959 did not change anything to the behavior.

If the current behavior is correct, maybe this bug can be closed ?

NB : For info, I created the following test page to help test the line breaking behavior for thai : http://jmdesp.free.fr/i18n/linebreak/LBThai.html
Is there a test in the testsuite already to check for that ?
Comment 89 User image Axel Hecht [:Pike] 2007-07-20 07:36:03 PDT
Bug 336959 is for linux only, we still need help for windows and mac.
Comment 90 User image Theppitak Karoonboonyanan 2007-07-20 08:03:50 PDT
AFAIK, bug 255990 was about applying line breaks to ASCII chars. Probably, the formerly-bypassed rule-based Thai line breaker got called as a side-effect of the patch. But that's not what we need here, after all. We need more precise approch like dictionary-based breaker.

Bug 336959 makes Thai chunks be analyzed by Pango breaker, which in turn handles Thai line breaking with dictionary-based engine--what we try to achieve in this bug.

But it's only for Unix platforms. We still need more works on Windows and Mac.
Comment 91 User image Jean-Marc Desperrier 2007-07-20 10:12:47 PDT
*** Bug 162940 has been marked as a duplicate of this bug. ***
Comment 92 User image Alexander Sack 2007-07-23 05:39:25 PDT
(In reply to comment #90)
> Bug 336959 makes Thai chunks be analyzed by Pango breaker, which in turn
> handles Thai line breaking with dictionary-based engine--what we try to achieve
> in this bug.
> 
> But it's only for Unix platforms. We still need more works on Windows and Mac.
> 

Thep, though I see that using pango would be elegant for linux, why don't we try to land something like what we currently have in ubuntu?

Comment 93 User image Axel Hecht [:Pike] 2007-07-23 06:00:33 PDT
Alexander, it would clearly help the general audience if you'd made a more precise statement on what ubuntu currently has.

Anyway, as probably said elsewhere, too, but at least on http://weblogs.mozillazine.org/roc/archives/2007/07/status_3.html, we went for using pango on linux, "and modern Pangos can hook up to libthai to make this happen." Thus I'm not sure why we would need to do additional stuff for linux/unix.
Comment 94 User image Theppitak Karoonboonyanan 2007-07-23 07:29:46 PDT
Alexander meaned my another patch currently shipped with Ubuntu. It calls libthai without Pango in between, in an XPCOM component. In other words, the last patches proposed in this bug, as inspired by Mike's suggestion.

Why not? Well, many reasons:

- You can see it has got no response so far, while the Pango patch gets reviewed and committed. But that's probably a non-scientific reason. :)

- It's roc's suggestion, which I agree at some degree. It's just simpler, while allowing supports for other languages like Lao and Khmer immediately when their Pango works are ready. (But availibility on other platforms is an affordable trade-off.)

- My patch here may still require more adjustments to be checked-in. The component communicates with nsJISx4501LineBreaker via nsIThaiLineBreaker interface, which, as suggested by its name, is still Thai-specific. A Pango-like layer will still be needed to separately analyze Lao and Khmer in the future. Actually, while I like to actually have such layer in Mozilla, the change may require architectural changes, which seem to be too huge for me to work alone without any comment to the initially proposed patch.

- Packaging is another issue. While libthai is available on Debian and Ubuntu, and probably other Linux distros as well, it's not available on Windows and Mac yet. The introduced external dependency may still require further re-arrangement, both for libthai itself and Mozilla's build system.

So, as Thai line breaking is now available in most platforms, simply exploiting them appears to be a shorter path.
Comment 95 User image Jean-Marc Desperrier 2007-07-23 08:54:04 PDT
So we need two additionnal bugs blocking this one :
- Use Uniscribe breaker to analyze Thai chunks under Windows
- Use ATSUI breaker to analyze Thai chunks under Mac OS
And if those two are done this bug will be fully solved.

It will be easier to get that in that it would have been to add libthai as an additional dependency under linux.
Comment 96 User image Jean-Marc Desperrier 2007-07-23 08:56:56 PDT
Oups,  I meant "than to add libthai as an additional dependency under windows and Mac OS"
Comment 97 User image Theppitak Karoonboonyanan 2007-07-25 01:58:23 PDT
Bug 389520 has been filed for Mac.
Comment 98 User image Keng Susumpow 2007-07-29 10:11:08 PDT
Bug #390048 has been filed for Windows.
Comment 99 User image Isriya Paireepairit 2007-08-15 16:47:25 PDT
Since Bug #336959 (Linux), Bug #389520 (Mac) and Bug #390048 (Windows) are all fixed and checked-in, the current dependent Bug #299324 (ICU) approach is obsolete. Could we change the dependent list so that this bug will get fixed?
Comment 100 User image Jean-Marc Desperrier 2007-08-21 08:50:04 PDT
In fact, it seems to me this means this bug is now fixed ? 
Can we close as fixed and thus get rid of another 4 digit bug :-) ?
Comment 101 User image Theppitak Karoonboonyanan 2007-08-24 04:21:15 PDT
(In reply to comment #100)
> In fact, it seems to me this means this bug is now fixed ? 
> Can we close as fixed and thus get rid of another 4 digit bug :-) ?

Agreed.
Comment 102 User image Simon Montagu :smontagu 2007-09-09 20:17:39 PDT
If I understand the last few comments correctly, the correct resolution here is WONTFIX, since Thai line breaking has been fixed by other means.
Comment 103 User image Wayne Mery (:wsmwk, NI for questions) 2008-10-01 14:34:53 PDT
Comment on attachment 241552 [details] [diff] [review]
Patch to lwbrk so it calls mozlibthai component at appropriate places

clear outstanding reviews

Note You need to log in before you can comment on or make changes to this bug.