Closed Bug 855184 Opened 13 years ago Closed 1 year ago

getText* char boundary doesn't respect clusters

Tracking

()

Status:

RESOLVED FIXED

Milestone:

128 Branch

Tracking Flags:

Tracking

Status

firefox128

---

fixed

People

(Reporter: surkov, Assigned: Jamie)

References

(Blocks 3 open bugs)

Details

(Keywords: access)

Attachments

(2 files)

Bug 855184 part 1: Add BOUNDARY_CLUSTER so a11y can query grapheme clusters, AKA user-perceived characters. 1 year ago James Teh [:Jamie] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 855184 part 2: Map IA2_TEXT_BOUNDARY_CHAR to cluster. 1 year ago James Teh [:Jamie] 48 bytes, text/x-phabricator-request		Details \| Review

alexander :surkov (:asurkov)

Reporter

Description

•

13 years ago

previously we did since we relied on nsFrame::PeekOffset(cluster), now we return mString[aIndex]. Robert, 1) can you give me please a couple examples of clusters (when character is not a cluster) 2) do I have alternative to gfxTextRun::IsClusterStart() to detect a cluster boundaries? (I have a string where I need to detect clusters in)

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 1

•

13 years ago

> 1) can you give me please a couple examples of clusters (when character is not a cluster) e.g. 'A' U+0300 > 2) do I have alternative to gfxTextRun::IsClusterStart() to detect a cluster boundaries? nsUnicodeProperties.h ClusterIterator. However, be aware that a cluster could start in one text node and continue into the following text node, so you might not find all the cluster boundaries correctly just by looking within a single text node.

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Duplicate of this bug: 1900171

James Teh [:Jamie]

Assignee

Comment 3

•

1 year ago

This was also discussed in bug 1346535 comment 13.

From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):

Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.

Unfortunately, ATK is different. See bug 1346535 comment 18:

There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »

So we either need to special case character calculation in ATK or introduce another boundary type.

Eitan, I assume Mac and Android ideally want clusters too?

Flags: needinfo?(eitan)

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Comment 4

•

1 year ago

I think the iterator class we want is now GraphemeClusterBreakIteratorUtf16 in intl/lwbrk/Segmenter.h. I'm not quite sure how we're going to handle moving backwards. GraphemeClusterBreakIteratorUtf16 only moves forwards. There is GraphemeClusterBreakReverseIteratorUtf16, but the code comments explicitly say:

Note: The reverse iterator doesn't handle conjoining Jamo and emoji.

That's definitely not going to work for us.

Eitan Isaacson [:eeejay]

Comment 5

•

1 year ago

(In reply to James Teh [:Jamie] from comment #3)

This was also discussed in bug 1346535 comment 13.

From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):

Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.

Unfortunately, ATK is different. See bug 1346535 comment 18:

There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »

So we either need to special case character calculation in ATK or introduce another boundary type.

Eitan, I assume Mac and Android ideally want clusters too?

Yeah, seems like ATK is the outlier here. I think both Android and Mac would work better with clusters. In Mac there is the expectation that one caret left/right move would correspond to AXNextTextMarkerForTextMarker/AXPreviousTextMarkerForTextMarker.

Flags: needinfo?(eitan)

James Teh [:Jamie]

Assignee

Comment 6

•

1 year ago

•

Edited

Clusters for testing:
À
🤦‍♂️
🤦🏼‍♂️
x͇͕̦̍͂͒
x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠

The last two were generated with https://glitchtextgenerator.com/

Note that there is no hard upper limit on the length of a cluster.

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Assignee: nobody → jteh

James Teh [:Jamie]

Assignee

Comment 7

•

1 year ago

Attached file Bug 855184 part 1: Add BOUNDARY_CLUSTER so a11y can query grapheme clusters, AKA user-perceived characters. — Details

Most OS APIs want a cluster when they ask for a "character", except ATK.
Rather than altering BOUNDARY_CHAR, I added a new BOUNDARY_CLUSTER.
Aside from being less risky and causing less churn, there are cases internally where we want to move a TextLeafPoint by character; e.g. to explicitly move to the next/previous Accessible or to move to the next/previous character in an abstract way without worrying about Accessible boundaries.
Calculating clusters is more expensive, so it doesn't make sense to move by cluster in those cases.

James Teh [:Jamie]

Assignee

Comment 8

•

1 year ago

Attached file Bug 855184 part 2: Map IA2_TEXT_BOUNDARY_CHAR to cluster. — Details

James Teh [:Jamie]

Assignee

Comment 9

•

1 year ago

:m_kato, it looks like the icu4x implementation of GraphemeClusterBreakIteratorUtf16::Seek always walks forward calling Next() from the start of the string, regardless of the offset we're seeking to. So, if we have a string of 100000 characters and we're seeking to offset 99000, we'll have to walk through the first 99000 characters even though we don't need to. It seems like the non-icu4x implementation is more performant. However, i gather this is a limitation of icu4x. This is rather unfortunate for a11y because we need to be able to determine cluster boundaries from arbitrary offsets in a text node, and some text nodes can be very large; e.g. large textareas. I'd prefer to avoid caching cluster boundaries because of complexity and memory usage.

Am I correct about this or is there an alternative I'm missing?

Thanks.

Flags: needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 10

•

1 year ago

Actually, ICU4X doesn't support random access. I am considering this feature on 2.0.

Flags: needinfo?(m_kato)

Pulsebot

Comment 11

•

1 year ago

Pushed by jteh@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d9690363d7ae part 1: Add BOUNDARY_CLUSTER so a11y can query grapheme clusters, AKA user-perceived characters. r=eeejay https://hg.mozilla.org/integration/autoland/rev/4d2e6f2fac60 part 2: Map IA2_TEXT_BOUNDARY_CHAR to cluster. r=eeejay

Serban Stanca [:SerbanS]

Comment 12

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/d9690363d7ae
https://hg.mozilla.org/mozilla-central/rev/4d2e6f2fac60

Status: NEW → RESOLVED

Closed: 1 year ago

status-firefox128: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 128 Branch

James Teh [:Jamie]

Assignee

Comment 13

•

1 year ago

For future reference, the GitHub issue for ICU4x break iterator random access is: https://github.com/unicode-org/icu4x/issues/3247

If we do hit bad performance problems before this is implemented, one suggestion in this discussion is to pass a substring instead of the full string. We'd have to pass multiple different substrings and be a bit creative to get what we want, but it's probably possible.

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Blocks: 1901171

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Blocks: 1901172

James Teh [:Jamie]

Assignee

Updated

•

1 year ago

Regressions: 1905065

You need to log in before you can comment on or make changes to this bug.