getText* char boundary doesn't respect clusters
Categories
(Core :: Disability Access APIs, defect)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox128 | --- | fixed |
People
(Reporter: surkov, Assigned: Jamie)
References
(Blocks 3 open bugs)
Details
(Keywords: access)
Attachments
(2 files)
Updated•3 years ago
|
| Assignee | ||
Comment 3•1 year ago
|
||
This was also discussed in bug 1346535 comment 13.
From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):
Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.
Unfortunately, ATK is different. See bug 1346535 comment 18:
There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »
So we either need to special case character calculation in ATK or introduce another boundary type.
Eitan, I assume Mac and Android ideally want clusters too?
| Assignee | ||
Comment 4•1 year ago
|
||
I think the iterator class we want is now GraphemeClusterBreakIteratorUtf16 in intl/lwbrk/Segmenter.h. I'm not quite sure how we're going to handle moving backwards. GraphemeClusterBreakIteratorUtf16 only moves forwards. There is GraphemeClusterBreakReverseIteratorUtf16, but the code comments explicitly say:
Note: The reverse iterator doesn't handle conjoining Jamo and emoji.
That's definitely not going to work for us.
Comment 5•1 year ago
|
||
(In reply to James Teh [:Jamie] from comment #3)
This was also discussed in bug 1346535 comment 13.
From the IA2 API docs, it's pretty clear that IA2_TEXT_BOUNDARY_CHAR should map to a cluster (in Gecko terminology):
Typically, a single character is returned. In some cases more than one character is returned, for example, when a document contains field data such as a field containing a date, time, or footnote reference. In this case the caret can move over several characters in one movement of the caret.
Unfortunately, ATK is different. See bug 1346535 comment 18:
There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:
« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »So we either need to special case character calculation in ATK or introduce another boundary type.
Eitan, I assume Mac and Android ideally want clusters too?
Yeah, seems like ATK is the outlier here. I think both Android and Mac would work better with clusters. In Mac there is the expectation that one caret left/right move would correspond to AXNextTextMarkerForTextMarker/AXPreviousTextMarkerForTextMarker.
| Assignee | ||
Comment 6•1 year ago
•
|
||
Clusters for testing:
À
🤦♂️
🤦🏼♂️
x͇͕̦̍͂͒
x̧̡̬̘͓̖̲̻̻̲̠̪̻͓͙̜̂̓̊̔̀̀͗̑̀̅̀̂̚͘̕̚͘͢͜͠
The last two were generated with https://glitchtextgenerator.com/
Note that there is no hard upper limit on the length of a cluster.
| Assignee | ||
Updated•1 year ago
|
| Assignee | ||
Comment 7•1 year ago
|
||
Most OS APIs want a cluster when they ask for a "character", except ATK.
Rather than altering BOUNDARY_CHAR, I added a new BOUNDARY_CLUSTER.
Aside from being less risky and causing less churn, there are cases internally where we want to move a TextLeafPoint by character; e.g. to explicitly move to the next/previous Accessible or to move to the next/previous character in an abstract way without worrying about Accessible boundaries.
Calculating clusters is more expensive, so it doesn't make sense to move by cluster in those cases.
| Assignee | ||
Comment 8•1 year ago
|
||
| Assignee | ||
Comment 9•1 year ago
|
||
:m_kato, it looks like the icu4x implementation of GraphemeClusterBreakIteratorUtf16::Seek always walks forward calling Next() from the start of the string, regardless of the offset we're seeking to. So, if we have a string of 100000 characters and we're seeking to offset 99000, we'll have to walk through the first 99000 characters even though we don't need to. It seems like the non-icu4x implementation is more performant. However, i gather this is a limitation of icu4x. This is rather unfortunate for a11y because we need to be able to determine cluster boundaries from arbitrary offsets in a text node, and some text nodes can be very large; e.g. large textareas. I'd prefer to avoid caching cluster boundaries because of complexity and memory usage.
Am I correct about this or is there an alternative I'm missing?
Thanks.
Comment 10•1 year ago
|
||
Actually, ICU4X doesn't support random access. I am considering this feature on 2.0.
Comment 11•1 year ago
|
||
Comment 12•1 year ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/d9690363d7ae
https://hg.mozilla.org/mozilla-central/rev/4d2e6f2fac60
| Assignee | ||
Comment 13•1 year ago
|
||
For future reference, the GitHub issue for ICU4x break iterator random access is: https://github.com/unicode-org/icu4x/issues/3247
If we do hit bad performance problems before this is implemented, one suggestion in this discussion is to pass a substring instead of the full string. We'd have to pass multiple different substrings and be a bit creative to get what we want, but it's probably possible.
Description
•