Closed Bug 855184 Opened 13 years ago Closed 1 year ago

getText* char boundary doesn't respect clusters

Categories

(Core :: Disability Access APIs, defect)

defect

Tracking

()

RESOLVED FIXED
128 Branch
Tracking Status
firefox128 --- fixed

People

(Reporter: surkov, Assigned: Jamie)

References

(Blocks 3 open bugs)

Details

(Keywords: access)

Attachments

(2 files)

previously we did since we relied on nsFrame::PeekOffset(cluster), now we return mString[aIndex]. Robert, 1) can you give me please a couple examples of clusters (when character is not a cluster) 2) do I have alternative to gfxTextRun::IsClusterStart() to detect a cluster boundaries? (I have a string where I need to detect clusters in)
> 1) can you give me please a couple examples of clusters (when character is not a cluster) e.g. 'A' U+0300 > 2) do I have alternative to gfxTextRun::IsClusterStart() to detect a cluster boundaries? nsUnicodeProperties.h ClusterIterator. However, be aware that a cluster could start in one text node and continue into the following text node, so you might not find all the cluster boundaries correctly just by looking within a single text node.
Severity: normal → S3
Duplicate of this bug: 1900171

This was also discussed in bug 1346535 comment 13.

From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):

Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.

Unfortunately, ATK is different. See bug 1346535 comment 18:

There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »

So we either need to special case character calculation in ATK or introduce another boundary type.

Eitan, I assume Mac and Android ideally want clusters too?

Flags: needinfo?(eitan)
See Also: → 1346535

I think the iterator class we want is now GraphemeClusterBreakIteratorUtf16 in intl/lwbrk/Segmenter.h. I'm not quite sure how we're going to handle moving backwards. GraphemeClusterBreakIteratorUtf16 only moves forwards. There is GraphemeClusterBreakReverseIteratorUtf16, but the code comments explicitly say:

Note: The reverse iterator doesn't handle conjoining Jamo and emoji.

That's definitely not going to work for us.

(In reply to James Teh [:Jamie] from comment #3)

This was also discussed in bug 1346535 comment 13.

From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):

Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.

Unfortunately, ATK is different. See bug 1346535 comment 18:

There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »

So we either need to special case character calculation in ATK or introduce another boundary type.

Eitan, I assume Mac and Android ideally want clusters too?

Yeah, seems like ATK is the outlier here. I think both Android and Mac would work better with clusters. In Mac there is the expectation that one caret left/right move would correspond to AXNextTextMarkerForTextMarker/AXPreviousTextMarkerForTextMarker.

Flags: needinfo?(eitan)

Clusters for testing:

🤦‍♂️
🤦🏼‍♂️
x͇͕̦̍͂͒
x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠

The last two were generated with https://glitchtextgenerator.com/

Note that there is no hard upper limit on the length of a cluster.

Assignee: nobody → jteh

Most OS APIs want a cluster when they ask for a "character", except ATK.
Rather than altering BOUNDARY_CHAR, I added a new BOUNDARY_CLUSTER.
Aside from being less risky and causing less churn, there are cases internally where we want to move a TextLeafPoint by character; e.g. to explicitly move to the next/previous Accessible or to move to the next/previous character in an abstract way without worrying about Accessible boundaries.
Calculating clusters is more expensive, so it doesn't make sense to move by cluster in those cases.

:m_kato, it looks like the icu4x implementation of GraphemeClusterBreakIteratorUtf16::Seek always walks forward calling Next() from the start of the string, regardless of the offset we're seeking to. So, if we have a string of 100000 characters and we're seeking to offset 99000, we'll have to walk through the first 99000 characters even though we don't need to. It seems like the non-icu4x implementation is more performant. However, i gather this is a limitation of icu4x. This is rather unfortunate for a11y because we need to be able to determine cluster boundaries from arbitrary offsets in a text node, and some text nodes can be very large; e.g. large textareas. I'd prefer to avoid caching cluster boundaries because of complexity and memory usage.

Am I correct about this or is there an alternative I'm missing?

Thanks.

Flags: needinfo?(m_kato)

Actually, ICU4X doesn't support random access. I am considering this feature on 2.0.

Flags: needinfo?(m_kato)
Pushed by jteh@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d9690363d7ae part 1: Add BOUNDARY_CLUSTER so a11y can query grapheme clusters, AKA user-perceived characters. r=eeejay https://hg.mozilla.org/integration/autoland/rev/4d2e6f2fac60 part 2: Map IA2_TEXT_BOUNDARY_CHAR to cluster. r=eeejay
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 128 Branch

For future reference, the GitHub issue for ICU4x break iterator random access is: https://github.com/unicode-org/icu4x/issues/3247

If we do hit bad performance problems before this is implemented, one suggestion in this discussion is to pass a substring instead of the full string. We'd have to pass multiple different substrings and be a bit creative to get what we want, but it's probably possible.

Blocks: 1901171
Blocks: 1901172
Regressions: 1905065
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: