229896 - (grapheme-breaker) we need a generic grapheme cluster breaker/iterator

Assignee

Description

•

21 years ago

At the moment, cursor movement, selection, and layout are done under the
assumption that each Unicode character can stand by itself. However, that no
longer holds once combining character sequences come into play. A sequence of
more than one Unicode character has to be treated as a single 'unit' in layout
(e.g. justification), cursor movement, and selection. Currently, the SunCTL code
implements this for the Devanagari and Thai scripts, but it relies on a
particular 'font encoding'. We need to generalize it to cover all Unicode
characters, based on UAX #29 (Text Boundaries) [1], and to make it 'font-neutral'.
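
To illustrate the mismatch between code points and user-perceived units, here is a minimal Python sketch. It is not the UAX #29 algorithm: it assumes that any combining mark (general category Mn, Mc, or Me) attaches to the preceding character, which is only a rough approximation of the Extend and SpacingMark rules. The names `is_extending` and `clusters` are made up for this example.

```python
import unicodedata

def is_extending(ch):
    # Assumption: treat every combining mark as extending the previous
    # character. The real UAX #29 property tables are more detailed.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def clusters(text):
    """Split text into approximate grapheme clusters (base + trailing marks)."""
    out = []
    for ch in text:
        if out and is_extending(ch):
            out[-1] += ch      # attach the mark to the current cluster
        else:
            out.append(ch)     # start a new cluster
    return out

thai = "\u0e01\u0e34\u0e49"    # KO KAI + SARA I + MAI THO
deva = "\u0915\u093f"          # KA + VOWEL SIGN I

# Code point count versus approximate cluster count.
print(len(thai), len(clusters(thai)))
print(len(deva), len(clusters(deva)))
```

In both strings the code point count is larger than the cluster count, which is why moving the cursor one code point at a time lands inside a unit.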


Making things more complicated, we need to keep them together _across_
mark-up tags. For instance, a sequence like '<em>C1</em>C2' should not be broken
into two separate 'graphemes' in selection/cursor movement if 'C1C2' forms a
single grapheme. It gets even more complicated if '<em>C1</em>C2' has to be
rendered with 'g1g2g3' (where the gi's are glyphs) [2]. Currently, Mozilla on
Windows, which relies on ExtTextOutW(), fails to render the 'C1C2' sequence
correctly because C1 and C2 are handed over to ExtTextOutW() separately in the
presence of markup (and ExtTextOutW() renders each of them separately, unable to
see the context). On the other hand, MS IE (using Uniscribe) applies the markup
to the sequence as a whole. That may not be what the author intended, but it's
still better than what Mozilla does. To deal with this and similar cases, we may
need an 'attributed string' (as found in Pango). That should be a separate
(related) bug.
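
A rough sketch of the '<em>C1</em>C2' case, in Python. It assumes the markup has already been flattened into a list of (text, style) runs, loosely in the spirit of an attributed string. This is not Pango's API, and `cluster_spans` is a hypothetical helper. Cluster detection again uses the simplifying assumption that combining marks attach to the preceding character. The point is only that boundaries are computed over the whole text, so a tag boundary between a base and its mark does not split the cluster.

```python
import unicodedata

def is_extending(ch):
    # Assumption: combining marks approximate UAX #29 Extend/SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")

def cluster_spans(runs):
    """Hypothetical helper: runs is a list of (text, style) pairs in
    document order. Returns a list of (cluster, styles) pairs, where
    styles lists every style that touches the cluster."""
    spans = []
    for text, style in runs:
        for ch in text:
            if spans and is_extending(ch):
                # The mark joins the previous cluster even if it comes
                # from a different run than the base character.
                cluster, styles = spans[-1]
                if style not in styles:
                    styles = styles + [style]
                spans[-1] = (cluster + ch, styles)
            else:
                spans.append((ch, [style]))
    return spans

# '<em>e</em>' followed by COMBINING ACUTE ACCENT and 'x' outside the tag.
runs = [("e", "em"), ("\u0301x", "plain")]
for cluster, styles in cluster_spans(runs):
    print(repr(cluster), styles)
```

A cluster that reports more than one style is exactly the case where the renderer has to decide how to apply the markup, as IE does by styling the sequence as a whole.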


[1] http://www.unicode.org/reports/tr29/
[2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m12/0370.html
(a very long thread on the Unicode mailing list. Use 'unicode-ml' and 'unicode'
when prompted for username and password)

Jungshik Shin

Assignee

Updated

•

21 years ago

Blocks: 40882

Jungshik Shin

Assignee

Comment 1

•

21 years ago

reassigning to me

Assignee: prabhat.hegde → jshin

Simon Montagu :smontagu

Updated

•

20 years ago

Blocks: 75011

Simon Montagu :smontagu

Updated

•

20 years ago

Blocks: 266899

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Updated

•

20 years ago

Blocks: 276079

Arthit Suriyawongkul

Updated

•

19 years ago

Blocks: 157546

Arthit Suriyawongkul

Updated

•

19 years ago

Blocks: thai

Arthit Suriyawongkul

Updated

•

19 years ago

Blocks: 167983

Arthit Suriyawongkul

Updated

•

19 years ago

Blocks: 100173

Samphan Raruenrom

Comment 2

•

19 years ago

Should we use ICU for this?
Pros: cross-platform; can be used for i18n (e.g. Thai) word/line breaking
Cons: it is big, but it can be made smaller:
http://icu.sourceforge.net/charts/icu4c_footprint.html

Arthit Suriyawongkul

Comment 3

•

19 years ago

Related bug:
Bug 283271: CTL cluster-based operations unsupported

Alias: grapheme-breaker

Jungshik Shin

Assignee

Updated

•

19 years ago

Blocks: 324609

Jungshik Shin

Assignee

Comment 4

•

18 years ago

*** Bug 332741 has been marked as a duplicate of this bug. ***

Jungshik Shin

Assignee

Comment 5

•

18 years ago

Copied from bug 332741

I think we need a way to iterate, both forwards and backwards, over the default
grapheme clusters in a string of text.  (Such a mechanism could live on top of
a mechanism for iterating over the characters in a piece of (UTF-8 or UTF-16)
text.)

For the definition of default grapheme clusters, see:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
and for why this is needed, see the dependent bugs, bug 329069 and bug 332739.
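
As a sketch of the layering described above (a cluster iterator sitting on top of a character iterator), here is some Python operating on UTF-16 code units. All function names are hypothetical. The cluster rule is a simplification: any combining mark (Mn, Mc, Me) is assumed to extend the preceding character, which covers only part of the default grapheme cluster table.

```python
import unicodedata

def to_utf16(text):
    """Encode a string as a list of UTF-16 code units."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def next_char(units, i):
    """Character layer: return (code point, index after it) for the character at i."""
    u = units[i]
    if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        return 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), i + 2
    return u, i + 1

def prev_char(units, i):
    """Character layer: return (code point, start index) for the character ending at i."""
    u = units[i - 1]
    if 0xDC00 <= u <= 0xDFFF and i >= 2 and 0xD800 <= units[i - 2] <= 0xDBFF:
        return 0x10000 + ((units[i - 2] - 0xD800) << 10) + (u - 0xDC00), i - 2
    return u, i - 1

def is_extending(cp):
    # Assumption: combining marks stand in for the real UAX #29 tables.
    return unicodedata.category(chr(cp)) in ("Mn", "Mc", "Me")

def next_boundary(units, i):
    """Cluster layer: first approximate cluster boundary after i (requires i < len(units))."""
    _, i = next_char(units, i)
    while i < len(units):
        cp, j = next_char(units, i)
        if not is_extending(cp):
            break
        i = j
    return i

def prev_boundary(units, i):
    """Cluster layer: last approximate cluster boundary before i (requires i > 0)."""
    while i > 0:
        cp, i = prev_char(units, i)
        if not is_extending(cp):
            break
    return i

# U+1D11E (a non-BMP character) + COMBINING ACUTE ACCENT + 'a'
units = to_utf16("\U0001d11e\u0301a")
pos = 0
while pos < len(units):
    pos = next_boundary(units, pos)
    print("forward boundary at", pos)
while pos > 0:
    pos = prev_boundary(units, pos)
    print("backward boundary at", pos)
```

Keeping the surrogate handling in `next_char`/`prev_char` means the cluster layer only ever sees code points, so the same cluster logic could sit on top of a UTF-8 character iterator instead.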

We may want to do some tricks involving glue libraries for this:  we could have
the data live in the i18n library, but have a tiny glue library (probably part
of unicharutil?) have code that loads the pointer to the data into each library
that needs it in the module ctor so we don't need virtual function calls or
cross-library calls to load the pointer.

Arthit Suriyawongkul

Comment 6

•

16 years ago

(In reply to comment #2)
> Should use ICU for this? 
> Pros: cross-platform; can be used for i18n (e.g.Thai) word/line breaking
> Cons: it is big, but it can be made smaller :
> http://icu.sourceforge.net/charts/icu4c_footprint.html

libicu is now widely available in many Linux distros.

Should we reconsider Samphan's proposal?

Christopher Blizzard (:blizzard)

Comment 7

•

16 years ago

We're doing breaking + clusters with Pango on Linux at the moment, and with the respective native APIs on Windows and the Mac as well.  Does it not meet your needs?

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 8

•

16 years ago

I think that we should not use other components for breaking. In principle, we find the break points ourselves (although that is not the case for complex languages such as Thai). For bug 249159, we need to keep to that rule.

Arthit Suriyawongkul

Comment 9

•

16 years ago

I added bug 249159 [implement 'word-break' (word-break-cjk, word-break-inside) properties of CSS3] to the "Blocks" list of this bug.

Please remove it if you think it is irrelevant.

(In reply to comment #7)
> We're doing breaking + clusters with pango on Linux at the moment and the
> respective native APIs on Windows and the Mac as well.  Does it not meet your
> needs?

Thanks, Blizzard. We see great improvements in line breaking in Firefox 3.
However, this bug is also about cursor movement and selection (highlighting).

For example, see James Clark's comment in Bug 157546 [IM: <delete> key should delete WHOLE Thai "display cell"]
https://bugzilla.mozilla.org/show_bug.cgi?id=157546#c8

Blocks: 249159

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 10

•

16 years ago

I think we should close this bug. It's too vague. Textruns give us an API to identify cluster boundaries. It's now a matter of filing and fixing specific bugs to make sure that information is used correctly.

Arthit Suriyawongkul

Comment 11

•

16 years ago

(In reply to comment #10)
> I think we should close this bug. It's too vague. Textruns give us an API to
> identify cluster boundaries. It's now a matter of filing and fixing specific
> bugs to make sure that information is used correctly.

Right. So, a request for comments:

Should this bug be closed,
changed to a "tracker bug", or
left as it is?

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 12

•

16 years ago

Let's close it. I'll mark it "fixed" since gfxTextRun basically does the job.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

timeless

Updated

•

16 years ago

Component: Layout: CTL → Layout: Text

QA Contact: arthit → layout.fonts-and-text

Categories

(Core :: Layout: Text and Fonts, defect)

Tracking

()

People

(Reporter: jshin1987, Assigned: jshin1987)

References

(Blocks 3 open bugs)

Details

(Keywords: intl)

Crash Data

Security

(public)
