Closed Bug 229896 (grapheme-breaker) Opened 21 years ago Closed 16 years ago

we need a generic grapheme cluster breaker/iterator

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

(Blocks 3 open bugs)

Details

(Keywords: intl)

At the moment, cursor movement, selection, and layout are done under the
assumption that each Unicode character can stand by itself. However, that no
longer holds once combining character sequences come into play. More than
one Unicode character has to be treated as a 'unit' in layout (e.g.
justification), cursor movement, and selection. Currently, SunCTL code implements
this for Devanagari and Thai scripts, but that relies on a particular 'font
encoding'. We need to generalize it to cover all Unicode characters based on
UAX #29 (Text Boundaries) [1] and to make it 'font-neutral'.
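
To illustrate the mismatch between code points and user-perceived units, here is a minimal Python sketch. It is not the UAX #29 algorithm: it only approximates the "do not break before an extending character" rule by treating every character in general category Mn, Mc, or Me as extending, and it ignores Hangul, CR/LF, and other rules. The names `is_extending` and `simple_clusters` are made up for this example.

```python
import unicodedata


def is_extending(ch: str) -> bool:
    """Rough stand-in for the UAX #29 extending classes: any mark (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the marks that follow it (simplified)."""
    clusters: list[str] = []
    for ch in text:
        if clusters and is_extending(ch):
            clusters[-1] += ch  # attach the mark to the preceding unit
        else:
            clusters.append(ch)  # start a new unit
    return clusters


# Latin e + combining acute, Devanagari KA + vowel sign I, Thai KO KAI + SARA I
samples = ["e\u0301", "\u0915\u093F", "\u0E01\u0E34"]
for s in samples:
    # Each sample has two code points but forms one unit under this approximation.
    print(len(s), len(simple_clusters(s)))
```

Each sample is two code points long but comes out as a single unit, which is the property cursor movement, selection, and justification would need to respect.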


Making things more complicated, we need to keep them together _across_
mark-up tags. For instance, a sequence like '<em>C1</em>C2' should not be broken
into two separate 'graphemes' in selection/cursor movement if 'C1C2' forms a
single grapheme. It gets even more complicated if '<em>C1</em>C2' has to be
rendered with 'g1g2g3' (where the gi's are glyphs) [2]. Currently, Mozilla on Windows,
which relies on ExtTextOutW(), fails to render the 'C1C2' sequence correctly because C1 and
C2 are handed over to ExtTextOutW() separately in the presence of markup (and
ExtTextOutW() renders each of them separately, not being able to see the context).
On the other hand, MS IE (using Uniscribe) applies the markup to the sequence as
a whole. That may not be what the author intended, but it's still better than
what Mozilla does. To deal with this and similar cases, we may need an 'attributed
string' (as found in Pango). That should be a separate (related) bug.
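
As a rough illustration of the 'attributed string' idea, the following Python sketch keeps the text as a list of styled runs and computes clusters over the concatenated text, recording which runs each cluster touches. It is only a sketch under assumptions: the `Run` class, its `emphasized` flag, and `clusters_with_runs` are hypothetical names, not Pango or Mozilla APIs, and the cluster rule is the simplified "marks attach to the preceding character" approximation rather than full UAX #29.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    emphasized: bool  # hypothetical attribute standing in for <em>


def clusters_with_runs(runs):
    """Return (cluster_text, run_indices) pairs computed across run boundaries."""
    result = []
    for idx, run in enumerate(runs):
        for ch in run.text:
            if result and unicodedata.category(ch).startswith("M"):
                text, owners = result[-1]
                # The mark joins the previous cluster even if it sits in another run.
                result[-1] = (text + ch, owners | {idx})
            else:
                result.append((ch, {idx}))
    return result


# '<em>C1</em>C2' with C1 = DEVANAGARI KA and C2 = DEVANAGARI VOWEL SIGN I
runs = [Run("\u0915", True), Run("\u093F", False)]
for text, owners in clusters_with_runs(runs):
    print(repr(text), sorted(owners))
```

A cluster whose run set has more than one member is exactly the case described above: the renderer has to decide which run's attributes apply to the whole unit instead of shaping each run in isolation.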


[1] http://www.unicode.org/reports/tr29/
[2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m12/0370.html
(a very long thread on the Unicode mailing list. Use 'unicode-ml' and 'unicode'
when prompted for username and password)
Blocks: 40882
reassigning to me
Assignee: prabhat.hegde → jshin
Blocks: 75011
Blocks: 266899
Blocks: 157546
Blocks: thai
Blocks: 167983
Blocks: 100173
Should we use ICU for this?
Pros: cross-platform; can be used for i18n (e.g. Thai) word/line breaking
Cons: it is big, but it can be made smaller:
http://icu.sourceforge.net/charts/icu4c_footprint.html
 
related bug,
Bug 283271 : CTL cluster-based operations unsupported
Alias: grapheme-breaker
Blocks: 324609
*** Bug 332741 has been marked as a duplicate of this bug. ***
Copied from bug 332741

I think we need a way to iterate, both forwards and backwards, over the default
grapheme clusters in a string of text.  (Such a mechanism could live on top of
a mechanism for iterating over the characters in a piece of (UTF-8 or UTF-16)
text.)

For the definition of default grapheme clusters, see:
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
http://www.unicode.org/reports/tr29/#Default_Grapheme_Cluster_Table
and for why this is needed, see the dependent bugs, bug 329069 and bug 332739.
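
A minimal sketch of what forward and backward iteration could look like, in Python for brevity. It assumes the simplified rule that any character in general category Mn, Mc, or Me extends the preceding cluster, which is only an approximation of the default grapheme cluster rules linked above. `next_boundary` and `prev_boundary` are hypothetical names, and positions are code point indices into a Python `str`, so the underlying UTF-8/UTF-16 character iteration is already handled by the language.

```python
import unicodedata


def _extends(ch: str) -> bool:
    """Simplified: marks (Mn, Mc, Me) never start a cluster."""
    return unicodedata.category(ch).startswith("M")


def next_boundary(text: str, pos: int) -> int:
    """Return the nearest cluster boundary strictly after pos (or len(text))."""
    n = len(text)
    if pos >= n:
        return n
    pos += 1
    while pos < n and _extends(text[pos]):
        pos += 1  # skip marks belonging to the current cluster
    return pos


def prev_boundary(text: str, pos: int) -> int:
    """Return the nearest cluster boundary strictly before pos (or 0)."""
    if pos <= 0:
        return 0
    pos -= 1
    while pos > 0 and _extends(text[pos]):
        pos -= 1  # walk back over marks to the base character
    return pos


text = "ae\u0301b"  # a, e + combining acute, b
caret = 0
while caret < len(text):
    caret = next_boundary(text, caret)
    print("forward to", caret)
while caret > 0:
    caret = prev_boundary(text, caret)
    print("backward to", caret)
```

With functions like these, caret movement steps over whole clusters, and deleting the cluster before the caret can be written as `text[:prev_boundary(text, caret)] + text[caret:]`.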

We may want to do some tricks involving glue libraries for this: we could have
the data live in the i18n library, and have a tiny glue library (probably part
of unicharutil?) containing code that loads the pointer to the data into each
library that needs it in the module ctor, so we don't need virtual function
calls or cross-library calls to load the pointer.
(In reply to comment #2)
> Should we use ICU for this?
> Pros: cross-platform; can be used for i18n (e.g. Thai) word/line breaking
> Cons: it is big, but it can be made smaller:
> http://icu.sourceforge.net/charts/icu4c_footprint.html

libicu is now widely available in many Linux distros.

Should we reconsider Samphan's proposal?

We're doing breaking + clusters with pango on Linux at the moment and the respective native APIs on Windows and the Mac as well.  Does it not meet your needs?
I think that we should not use other components for breaking. In principle, we find the breaking points ourselves (this is not the case for complex languages, e.g., Thai). For bug 249159, we need to keep that rule.
I put bug 249159 [implement 'word-break' (word-break-cjk, word-break-inside) properties of CSS3] as "blocks" for this bug.

Please remove it if you find it irrelevant.

(In reply to comment #7)
> We're doing breaking + clusters with pango on Linux at the moment and the
> respective native APIs on Windows and the Mac as well.  Does it not meet your
> needs?

Thanks Blizzard, we see great improvements for line-breaking in Firefox 3.
Anyway, this bug is also about the cursor movement and selection (highlighting).

For example, see James Clark's comment in Bug 157546 [IM: <delete> key should delete WHOLE Thai "display cell"]
https://bugzilla.mozilla.org/show_bug.cgi?id=157546#c8
Blocks: 249159
I think we should close this bug. It's too vague. Textruns give us an API to identify cluster boundaries. It's now a matter of filing and fixing specific bugs to make sure that information is used correctly.
(In reply to comment #10)
> I think we should close this bug. It's too vague. Textruns give us an API to
> identify cluster boundaries. It's now a matter of filing and fixing specific
> bugs to make sure that information is used correctly.

Right. So, a request for comments:

Should this bug be closed,
changed to a "tracker bug", or
left as it is?
Let's close it. I'll mark it "fixed" since gfxTextRun basically does the job.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Layout: CTL → Layout: Text
QA Contact: arthit → layout.fonts-and-text