Closed Bug 1346535 Opened 3 years ago Closed 2 years ago

atk_text_get_get() return value incorrect after multibyte character

Categories

(Core :: Disability Access APIs, defect, P3)

defect

Tracking

()

VERIFIED FIXED
mozilla61
Tracking Status
firefox55 --- wontfix
firefox59 --- wontfix
firefox60 --- verified
firefox61 --- verified
firefox62 --- verified

People

(Reporter: jdiggs, Assigned: samuel.thibault)

References

(Blocks 2 open bugs)

Details

Attachments

(3 files, 20 obsolete files)

259 bytes, text/plain
Details
14.79 KB, patch
Details | Diff | Splinter Review
75.29 KB, image/png
Details
Steps to reproduce:

1. Load data:text/html;charset=utf-8,<div>Hello %F0%9F%92%97 world</div>

2. Use Accerciser to find the accessible object corresponding with the <div> element.

3. In the ipython console, type the following:

-----
text = acc.queryText()
allText = text.getText(0,-1)
["%s %s" % (allText[i], text.getText(i, i+1)) for i in range(len(allText))]
-----

Expected results: The values would match.

Actual results: The values match only until the multibyte character is reached. It seems like the get text implementation is assuming single-byte chars (see example output below). Also note that performing the same test using the Gedit text editor works as expected.

['H H',
 'e e',
 'l l',
 'l l',
 'o o',
 '   ',
 '
Interesting. It appears bugzilla doesn't like multibyte characters either as I had pasted more output than the above. Trying again (the full output shows up in preview.)

==== Firefox ====

['H H',
 'e e',
 'l l',
 'l l',
 'o o',
 '   ',
 '
(Attaching output because Bugzilla doesn't want to display it, contrary to what the Preview suggests.)
Joanie what is the user impact?
Flags: needinfo?(jdiggs)
Priority: -- → P3
The user impact is that Orca will present the wrong character to the user if it doesn't hack around this.
Flags: needinfo?(jdiggs)
Make that the wrong substring, which might be a character or might not.

Related: Orca would be attempting to access content in this fashion as a result of encountering a bogus return value when trying to get text by char, word, line, or sentence at offset. So this bug here boils down to: When hacking around some other text implementation bug, Orca must also beware the presence of multibyte chars.
It seems it happens only for characters whose utf-16 encoding needs more more than one 16bit value: U+FFFD works fine, but U+10000 doesn't work. That's coherent with the shift that shows up in Joanmarie's test: the character counts as two, and for both U+FFFD is returned to accerciser.

I guess HyperTextAccessible::TextSubstring etc. offsets are utf-16 offsets then? (I guess that is coming from the Windows world) That's a real concern because that shifts everything as soon as on web pages there are characters not in unicode16, e.g. very notably emojis...
It appears Gecko indeed stores strings as double-bytes arrays [1], so layout we query for rendered text also operates with XPCOM strings [2].

Daniel, do you know if there are any issues with non unicode16 characters support in layout (RenderedText in particular)?

[1] https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Glue_classes/nsAString
[2] https://dxr.mozilla.org/mozilla-central/source/layout/generic/nsIFrame.h?q=nsIFrame%3A%3ARenderedText&redirect_type=direct#2622
Flags: needinfo?(dholbert)
I don't know - punting my needinfo to jfkthame who's more likely to know.
Flags: needinfo?(dholbert) → needinfo?(jfkthame)
(In reply to Samuel Thibault from comment #6)
> I guess HyperTextAccessible::TextSubstring etc. offsets are utf-16 offsets
> then? (I guess that is coming from the Windows world) That's a real concern
> because that shifts everything as soon as on web pages there are characters
> not in unicode16, e.g. very notably emojis...

I don't think this comes from the Windows world in particular; it's simply that the DOM interfaces are entirely built around 16-bit character codes (i.e. UTF-16 code units, NOT necessarily Unicode characters). So if HyperTextAccessible is dealing with DOM offsets and ranges, those will indeed be based on UTF-16.

Perhaps this needs to be clarified, e.g. in the nsIAccessibleText interface, which seems to refer to "offset" and "count" values without being explicit about what they're counting. I would expect an element containing a single emoji character to report a length of 2, for example. (Or even more: e.g. there could be a Fitzpatrick skin-tone modifier, resulting in a length of 4 for the single emoji. DOM interfaces don't abstract this away, and I don't expect the a11y interfaces do so either.)

So in the example from comment 0, the problem (I think) is that:

    text.getText(i, i+1)

is explicitly reading a single *code unit* from a UTF-16 string, which results in an unpaired surrogate.
Flags: needinfo?(jfkthame)
I referred to the Windows world in the sense that there, wchar_t is 16bit, while on Unix wchar_t is 32bit and thus gets completely away with the need to care about surrogates.

The example of code is at-spi-based, which manipulates unicode characters in utf-8, and so getText(i, i+1) does make sense there (and how would it know that it's a surrogate without reading it?). In at-spi, offsets etc. count unicode characters, simply. With a skin-tone modifier, that would get 2 unicode characters.

It happens that accessible/atk/nsMaiInterfaceText.cpp's getTextCB() gets only one surrogate, and when it calls NS_ConvertUTF16toUTF8 that returns U+FFFD. To make conversions properly, everything there should take care whether there are non-unicode16 charaacters, and fix lengths and offsets appropriately :/
(In reply to Samuel Thibault from comment #10)
> The example of code is at-spi-based, which manipulates unicode characters in
> utf-8, and so getText(i, i+1) does make sense there (and how would it know
> that it's a surrogate without reading it?). In at-spi, offsets etc. count
> unicode characters, simply. With a skin-tone modifier, that would get 2
> unicode characters.

It sounds like there's a mismatch, then, between the interpretation of offsets/lengths in nsIAccessibleText and the expectations of the at-spi interface. Maybe we should resolve that within HyperTextAccessible? It'll presumably come with some performance cost, though, as string indexing will no longer be straightforward. And I don't know whether it would break anything that's depending on the current behavior, where offsets in that interface relate directly to DOM offsets.
I guess it would break the iaccessible2 interface on Windows where unicode is encoded with 16bit values, yes. DocAccessibleChild.cpp also returns nsString.

It should rather be handled in nsMailInterfaceText.cpp: that's where NS_ConvertUTF16toUTF8 is called just before handing the text to atk.

I'm thinking that one way to avoid the indexing shifts issues would be to use a UTF16toUTF8 converter which converts surrogate pairs into two characters: the non-unicode16 character and a BOM (U+FEFF). And in the cases where nsMailInterfaceText.cpp gets from TextSubstring some text starting with a trailing surrogate, it would replace it with U+FEFF, and in the cases where it ends with a heading surrogate, it would call TextSubstring again with increased length to get the non-unicode16 character (or just always do it and ignore the additional character if not needed).

Other alternatives look quite invasive to me: parsing the whole text each time to compute the length? adding to nsString a variant of the Length method to avoid recomputing the size each time? And adding to it an Append variant that indexes unicode characters, etc. etc.
(In reply to Samuel Thibault from comment #12)
> I'm thinking that one way to avoid the indexing shifts issues would be to
> use a UTF16toUTF8 converter which converts surrogate pairs into two
> characters: the non-unicode16 character and a BOM (U+FEFF). And in the cases
> where nsMailInterfaceText.cpp gets from TextSubstring some text starting
> with a trailing surrogate, it would replace it with U+FEFF, and in the cases
> where it ends with a heading surrogate, it would call TextSubstring again
> with increased length to get the non-unicode16 character (or just always do
> it and ignore the additional character if not needed).

That's an interesting and creative idea. :) So to clarify, for "Hello U+1F497 world":
Retrieving the entire string would return: "Hello U+1F497U+FEFF world"
Retrieving (6, 7) would return U+1F497
Retrieving (7, 8) would return U+FEFF
That could make things a bit ugly for braille routing, though.

We also have another bug here which is that the character boundary doesn't handle characters which deal with multiple Unicode code points. For example, the sequence U+091C U+094D U+091E is actually a single "character" semantically, even though it consists of 3 Unicode code points (and this doesn't even have any surrogates). However, we should file a separate bug for that.
(In reply to James Teh [:Jamie] from comment #13)
> So to clarify, for "Hello U+1F497 world":
> Retrieving the entire string would return: "Hello U+1F497U+FEFF world"
> Retrieving (6, 7) would return U+1F497
> Retrieving (7, 8) would return U+FEFF

Yes.

> That could make things a bit ugly for braille routing, though.

How so? I mean, braille already has issues, since transcription is not 1to1 anyway. U+FEFF can be transcribed into an empty braille cell set (like other invisible zero-width characters), conversely to U+1F497 already being transcribed into several braille cells. I don't see where the problem is worse than what we already have. 

> We also have another bug here which is that the character boundary doesn't
> handle characters which deal with multiple Unicode code points. For example,
> the sequence U+091C U+094D U+091E is actually a single "character"
> semantically, even though it consists of 3 Unicode code points (and this
> doesn't even have any surrogates). However, we should file a separate bug
> for that.

Well, character-encoding-wise, they are indeed 3 characters, and string functions treat them as 3 separate characters, this is coherent, I'd say we only need to keep that coherency. Then there is the actual rendering which is a complex thing indeed, but that can be all handled between liblouis/orca/nvda, firefox does not have to be involved.

My concern is that if we try to do anything else than sticking to unicode code points, we will get a lot of troubles (double-width characters, zero-width characters, etc.) which should really matter only for rendering, and not for transmission of the text.
(In reply to Samuel Thibault from comment #14)
> > That could make things a bit ugly for braille routing, though.
> How so? I mean, braille already has issues, since transcription is not 1to1
> anyway. U+FEFF can be transcribed into an empty braille cell set (like other
> invisible zero-width characters),

That requires the translator or tables know about U+FEFF, and I don't know that they currently do. However, I guess you could just use lou_compileString (I think that's what it's called) to insert a rule at runtime, so that should work.

> > We also have another bug here which is that the character boundary doesn't
> > handle characters which deal with multiple Unicode code points.
> Well, character-encoding-wise, they are indeed 3 characters, and string
> functions treat them as 3 separate characters, this is coherent, I'd say we
> only need to keep that coherency.

I mean the character "boundary", not the offsets. The example should consume 3 offsets, yes, since it consists of 3 code points. However, it is semantically a single character and should be considered as such when cursoring, for example. So, IAccessibleText::textAtOffset(0, IA2_TEXT_BOUNDARY_CHAR) should return offsets (0, 3). I'm not sure what the ATK equivalent is, but I imagine there would be one.
(In reply to James Teh [:Jamie] from comment #13)
> That's an interesting and creative idea. :) So to clarify, for "Hello
> U+1F497 world":
> Retrieving the entire string would return: "Hello U+1F497U+FEFF world"
> Retrieving (6, 7) would return U+1F497
> Retrieving (7, 8) would return U+FEFF

Given an element with the text "Hello U+1F497 world", wouldn't a user of the at-spi interface expect that retrieving (8, 13) should return "world"? But under this scheme it will instead return " worl" because of the "fake" character that's inserted, shifting the offsets of everything that follows.
(In reply to Jonathan Kew (:jfkthame) from comment #16)
> Given an element with the text "Hello U+1F497 world", wouldn't a user of the
> at-spi interface expect that retrieving (8, 13) should return "world"?

The client can only know what to "expect" based on something they retrieved earlier. In this scheme, the client never actually gets "Hello U+1F497 world". Instead, we lie to the client and they get "Hello U+1F497U+FEFF world" when they retrieve the entire string. After that, any offsets to which the client refers will be based on this string (the one with the U+FEFF we lied about).
(In reply to James Teh [:Jamie] from comment #15)
> (In reply to Samuel Thibault from comment #14)
> > > That could make things a bit ugly for braille routing, though.
> > How so? I mean, braille already has issues, since transcription is not 1to1
> > anyway. U+FEFF can be transcribed into an empty braille cell set (like other
> > invisible zero-width characters),
> 
> That requires the translator or tables know about U+FEFF, and I don't know
> that they currently do.

Well, not a problem that can't be solved :) They really have to care anyway, since Unicode text files do often include the BOM at the beginning.

> However, I guess you could just use lou_compileString (I think that's what it's called) to insert a rule at runtime, so that should work.

Until that gets fixed in liblouis, indeed.

> > > We also have another bug here which is that the character boundary doesn't
> > > handle characters which deal with multiple Unicode code points.
> > Well, character-encoding-wise, they are indeed 3 characters, and string
> > functions treat them as 3 separate characters, this is coherent, I'd say we
> > only need to keep that coherency.
> 
> I mean the character "boundary", not the offsets. The example should consume
> 3 offsets, yes, since it consists of 3 code points. However, it is
> semantically a single character and should be considered as such when
> cursoring, for example. So, IAccessibleText::textAtOffset(0,
> IA2_TEXT_BOUNDARY_CHAR) should return offsets (0, 3). I'm not sure what the
> ATK equivalent is, but I imagine there would be one.

There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of code points:

« @ATSPI_TEXT_BOUNDARY_CHAR: An #AtspiText instance is bounded by this character only. Start and end offsets differ by one, by definition, for this value. »

So it will be coherent for at-spi (and the change I propose is only for at-spi, not ia2).
(In reply to James Teh [:Jamie] from comment #17)
> (In reply to Jonathan Kew (:jfkthame) from comment #16)
> > Given an element with the text "Hello U+1F497 world", wouldn't a user of the
> > at-spi interface expect that retrieving (8, 13) should return "world"?
> 
> The client can only know what to "expect" based on something they retrieved
> earlier. In this scheme, the client never actually gets "Hello U+1F497
> world". Instead, we lie to the client and they get "Hello U+1F497U+FEFF
> world" when they retrieve the entire string. After that, any offsets to
> which the client refers will be based on this string (the one with the
> U+FEFF we lied about).

I don't know much about these interfaces, or how they're used, but isn't there still potential for confusion/problems if the client itself has previously set the text of an element -- so it knows exactly what it "expects" to be there -- but then we map offsets differently than it expects? So, client sets text of an editable element to "Hello U+1F497 world", and then wants to set the selection to the word "world", so it sets it to (8, 13)... and misses its target.
In at-spi2 there is actually no way for clients to set the text of an element. Element modifications go through the usual input methods, with all the potential for conflicts with other input sources, and thus the screen reader already has to care for what actual text insertion events happen.
OK, clearly I don't understand the pieces involved here. Looking at https://github.com/GNOME/at-spi2-core/tree/master/xml, for example, I see an EditableText interface that looks like it can set/modify text, with methods like SetTextContents, InsertText, CutText, etc. Is that not potentially an issue?
Oh, sorry, that's right. It seems it has been too long a time since I actually looked at the whole at-spi interface, and Orca does not actually make any use of it, so to my mind it just didn't exist at all :)

The conflicting thing still applies: a screen reader should not expect the resulting text to be exactly what it set, since other kinds of input could intervene in the meanwhile.
(In reply to Samuel Thibault from comment #18)
> There is an ATSPI_TEXT_BOUNDARY_CHAR, but it sticks to the unicode notion of
> code points:

Fair enough; if that's what the spec says, that's what we should do. Still, this seems flawed. That makes it impossible for a screen reader to report the semantic character correctly when cursoring through text, since pressing right/left arrow will consider all 3 code points as just one position in a well behaved editor. There are also other cases of this; e.g. a word processor where a date field occupies a single "character" stop when cursoring.
Hello all,

I've found a real life example to test this issue.

Steps to reproduce:
1) Open Firefox without e10s (maybe the bug exists with e10s but I don't use e10s for now)
2) Go to https://blog.nightly.mozilla.org/2018/01/18/these-weeks-in-firefox-issue-30/
3) Go to the heading 2 "Project Updates"
4) Press up arrow

Result: Orca reads "New contributors (
Maybe there is a mistake  with bugzilla, it seems that Unicode characters break the end of my message.

Result: Orca reads "New contributors ( = First Patch!)"

Expected result: Orca should read "Steve Pfister fixed a bug where the SafeBrowsing warning notification bar wouldn’t go away after a page navigation, and made sure that the tab title was properly set in about:url-classifier"

Remove the star character solves the issue.


Best regards.
Unfortunately, Orca has to hack around various and sundry bugs in the AtkText implementation. If I don't get the hack quite right, things like skipping text or getting stuck can occur. I'll see if something like that is going on and try to work around this particular instance.

That said, fixing all the AtkText bugs in Gecko would be super appreciated! :)
(In reply to Joanmarie Diggs from comment #26)
> Unfortunately, Orca has to hack around various and sundry bugs in the
> AtkText implementation. If I don't get the hack quite right, things like
> skipping text or getting stuck can occur. I'll see if something like that is
> going on and try to work around this particular instance.

In the context of Hypra, Samuel are working on this bug. So I hope you'll be able to remove your workaround in the near future.

> That said, fixing all the AtkText bugs in Gecko would be super appreciated!
> :)

Right, could you point us what bugs are the most problematic ? To be honest we've difficulties to understand if those bugs have a user impact until we are ourselves impacted by the bugs. You've provided lot of good workaround to make our lives better. 
As we're overloaded we focus on most important bugs. You maintain the screen reader, we (hypra) maintain the screen magnifier (ezoom on Compiz), it's also a really huge work.

You could find the bug we plan to work on here : https://bugs.debian.org/cgi-bin/pkgreport.cgi?users=bugs@hypra.fr;tag=hypra

Best regards.
Attached patch multibyte (obsolete) — Splinter Review
Here is what it would look like: the UTF16toUTF8 converter is added an aSurrBOM parameter which defaults to false, that the atk layer can use to make it add the BOMs along non-unicode16 values.

atk can then request one more character on each side to be sure to get the paired surrogates, and remove them before pushing to at-spi.
Attachment #8947407 - Flags: review?(surkov.alexander)
Drive-by comment: I don't feel good about pushing this hack all the way into xpcom/string code; if it's necessary to do something like this, I think it should be kept local to the a11y code and not added to core string APIs.
Ok, we can probably insert the BOMs along the non-unicode16 after conversion, I'll have a look.
(In reply to Alex ARNAUD from comment #24)
> Hello all,
> 
> I've found a real life example to test this issue.
> 
> Steps to reproduce:
> 1) Open Firefox without e10s (maybe the bug exists with e10s but I don't use
> e10s for now)
> 2) Go to
> https://blog.nightly.mozilla.org/2018/01/18/these-weeks-in-firefox-issue-30/
> 3) Go to the heading 2 "Project Updates"
> 4) Press up arrow
> 
> Result: Orca reads "New contributors (

The Samuel's patch resolves the issue with this test case.

Best regards.
Comment on attachment 8947407 [details] [diff] [review]
multibyte

Review of attachment 8947407 [details] [diff] [review]:
-----------------------------------------------------------------

r? to david baron, the xpcom strings owner
Attachment #8947407 - Flags: review?(dbaron)
Comment on attachment 8947407 [details] [diff] [review]
multibyte

This isn't possible to review without code comments explaining what this new boolean parameter does.  Please either:
 * add them and request review again
 * refactor along the lines of comment 29.

I probably prefer the latter, since I think this adds overhead for everybody else using these APIs (APIs that were previously quite straightforward).
Attachment #8947407 - Flags: review?(dbaron) → review-
Comment on attachment 8947407 [details] [diff] [review]
multibyte

Samuel, will you redesgin the patch to address David's comment, right?
Attachment #8947407 - Flags: review?(surkov.alexander)
I will, yes, I'm just currently busy right now by other things :)
(In reply to Samuel Thibault from comment #35)
> I will, yes, I'm just currently busy right now by other things :)

sure, please take your time, and thank you for doing this.

just review request was bumping me daily :)
Attached patch multibyte (obsolete) — Splinter Review
Here is a reworked patch, which adds the BOMs after translating to UTF-8.
Attachment #8947407 - Attachment is obsolete: true
Attachment #8952348 - Flags: review?(surkov.alexander)
Comment on attachment 8952348 [details] [diff] [review]
multibyte

Review of attachment 8952348 [details] [diff] [review]:
-----------------------------------------------------------------

nits and styling issues

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +205,5 @@
> +
> +  return true;
> +}
> +
> +// Add a BOM after each non-unicode16 character of str

nit: it'd be nice have a comment explaining what BOM is
nit: dot in the end of the sentence

@@ +212,5 @@
> +{
> +  uint32_t n;
> +
> +  // First compute how much room we will need
> +  n = 0;

nit: uint32_t n = 0; or I guess you can use auto n = 0U;

@@ +261,5 @@
> +      // Unicode16, just copy the byte.
> +      ret.Replace(reti++, 1, str[i]);
> +    }
> +  }
> +  NS_ASSERTION(reti == n, "Bogus conversion!?");

MOZ_ASSERT

@@ +274,4 @@
>  {
>    AccessibleWrap* accWrap = GetAccessibleWrap(ATK_OBJECT(aText));
>    nsAutoString autoStr;
> +  bool aStartShifted = false, aEndShifted = false;

nit: local variables should not have 'a' prefix; it is used for arguments

@@ +282,5 @@
>        return nullptr;
>  
> +    PrepareUTF16toUTF8(&aStartOffset, &aEndOffset,
> +                       &aStartShifted, &aEndShifted,
> +                       text->IsDefunct() ? 0 :

it'd say to return early if an accessible is defunct, since we cannot guarantee any text for a defunct accessible.

@@ +304,5 @@
> +  }
> +  nsAutoCString cautoStrBOM = AddBOMs(cautoStr);
> +  if (!FinishUTF16toUTF8(cautoStrBOM, aStartShifted, aEndShifted)) {
> +    return nullptr;
> +  }

I would really prefer to wrap these by class similar to NS_ConvertUTF16toUTF8 which takes string, offsets and character count
> I would really prefer to wrap these by class similar to NS_ConvertUTF16toUTF8 which takes string, offsets and character count

Mmm, not sure about the details.

- AIUI, by "wrap these by class", you mean like nsString.h's class NS_ConvertUTF16toUTF8 which is a trivial wrapper around the AppendUTF16toUTF8 function which does the actual work of copying from one string to another, right? I.e. something like

    class NS_AddBOMs : public nsAutoCString
    {
    public:
        explicit NS_AddBOMS(const nsAString& aString)
        {
            AddBOMs(aString, *this);
        }
    }

- Should this still be put within nsMaiInterfaceText.cpp, not to be stuffed into nsString.h?

- what do you mean by "offsets and character count"? I don't see them in the NS_ConvertUTF16toUTF-8, and for AddBOMs, it just needs the string anyway, no other information is needed. For FinishUTF16toUTF8, the additional needed information beyond the string is whether the heading and/or the trailing characters should be removed, I didn't really plan to implement generic substringing.
Flags: needinfo?(surkov.alexander)
Actually, while I see how to add a wrapper for AddBOMs, I'm not sure about FinishUTF16toUTF8: what should I do on failure? Just assertions? I'm a bit afraid of depending on TextSubstring() providing expected data.
(In reply to Samuel Thibault from comment #39)
> > I would really prefer to wrap these by class similar to NS_ConvertUTF16toUTF8 which takes string, offsets and character count
> 
> Mmm, not sure about the details.
> 
> - AIUI, by "wrap these by class", you mean like nsString.h's class
> NS_ConvertUTF16toUTF8 which is a trivial wrapper around the
> AppendUTF16toUTF8 function which does the actual work of copying from one
> string to another, right? I.e. something like
> 
>     class NS_AddBOMs : public nsAutoCString
>     {
>     public:
>         explicit NS_AddBOMS(const nsAString& aString)
>         {
>             AddBOMs(aString, *this);
>         }
>     }
> 

I meant it'd be nice to wrap PrepareUTF16toUTF8/AddBOMs/FinishUTF16toUTF8 by a single call, it could be:

class UTIF16ToMultibyteUTF8 {
  bool UTIF16ToMultibyteUTF8(const gint aStartOffset, const gint aEndOffset,
                             gint aCount, nsAutoString &aStr) {
    NS_ConvertUTF16toUTF8 str(aStr);
    auto startOffset = aStartOffset; auto endOffset = aEndOffset;
    auto startShifted = ...
    PrepareUTF16toUTF8(&startOffset, &endOffset, ..)
    nsAutoCString cautoStrBOM = AddBOMs(cautoStr);
    if (!FinishUTF16toUTF8(cautoStrBOM, aStartShifted, aEndShifted)) {
    }
  }

  // Returns a string if the conversion was successful.
  char* get() {}
}

Actually I'm flexible how it could be implemented, it may be a class or namespace, the point is it'd be good to scope PrepareUTF16toUTF8/AddBOMs/FinishUTF16toUTF8, so the caller has zero chance for error.

> - Should this still be put within nsMaiInterfaceText.cpp, not to be stuffed
> into nsString.h?

David said in comment #33 that his preference would be not to add this into string API. So it could be either nsMaiInterfaceText.cpp or a new file in atk folder.

> - what do you mean by "offsets and character count"? 

just arguments of a new function/ctor
Flags: needinfo?(surkov.alexander)
Attached patch multibyte (obsolete) — Splinter Review
Oh, I see the idea now, notably it's the . Here is a worked patch. I still had to split the NS_DOMtoATK work in two pieces, since there's an interlacing between getting the HyperTextAccessible or the ProxyAccessible and its size, the preparation, the TextSubstring call, and the eventual Conversion. The addition of a class however makes it look nice indeed.
Attachment #8952348 - Attachment is obsolete: true
Attachment #8952348 - Flags: review?(surkov.alexander)
Attachment #8953378 - Flags: review?(surkov.alexander)
"it's the" -> it's the keeping of startshifted/endshifted which is useful to wrap.
Comment on attachment 8953378 [details] [diff] [review]
multibyte

Review of attachment 8953378 [details] [diff] [review]:
-----------------------------------------------------------------

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +270,5 @@
> +  }
> +  MOZ_ASSERT(desti == n);
> +}
> +
> +class NS_DOMtoATK : public nsAutoCString

I lean towards to move all of this code into a separate file: 300 new lines into 600 lines file, which are not closely related to each other. Sounds good with you?

and please could you add a comment to the class describing what it's about?

@@ +276,5 @@
> +public:
> +  void Prepare(gint *aStartOffset, gint *aEndOffset, gint count)
> +  {
> +    PrepareUTF16toUTF8(aStartOffset, aEndOffset,
> +                       &StartShifted, &EndShifted, count);

wouldn't be nicer to keep get rid of PrepareUTF16toUTF8 and move its content here?

@@ +319,5 @@
>      ConvertTexttoAsterisks(accWrap, autoStr);
>    } else if (ProxyAccessible* proxy = GetProxy(ATK_OBJECT(aText))) {
> +
> +    AtkStr.Prepare(&aStartOffset, &aEndOffset,
> +                   proxy->CharacterCount());

if you had a variable to keep character count like

uint32_t charCount = 0;
if (accWrap) {
  charCount = text->CharacterCount();
}
else {
  charCount = proxy->CharacterCount();
}

then you could move Prepare and NS_ConvertUTF16toUTF inside of AtkString by

AtkString cautoStr(aStartOffset, aEndOffset, charCount);
return (cautoStr.get()) ? g_strdup(cautoStr.get()) : nullptr;
> I lean towards to move all of this code into a separate file

Why not indeed :)

> wouldn't be nicer to keep get rid of PrepareUTF16toUTF8 and move its content here?

I was wondering it indeed. Keeping it as separate function like the two others looked simpler to grasp when reading: three functions which work together, and a class which wraps the way to call them.

> you could move Prepare and NS_ConvertUTF16toUTF inside of AtkString by

No, because of the interleaving: one has to call Prepare to get Start/EndOffset modified, then use one or the other TextString, before converting to UTF8, adding BOMs and finishing. At best we could have

HyperTextAccessible* text
ProxyAccessible* proxy
gint count;
if (accWrap) {
    text = accWrap->AsHyperText();
    count = static_cast<gint>(text->CharacterCount());
} else if (proxy = GetProxy(ATK_OBJECT(aText))) {
    count = proxy->CharacterCount();
} else {
    return nullptr;
}
AtkString cautoStr(&aStartOffset, &aEndOffset, charCount);
if (accWrap) {
    text->TextSubstring(aStartOffset, aEndOffset, autoStr);
} else {
    proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
}
if (!cautoStr.Convert(autoStr)) {
    return nullptr;
}
return g_strdup(cautoStr.get());

that doesn't look very appealing to me, compared to 

DOMtoATK cautoStr;
if (accWrap) {
    HyperTextAccessible* text = accWrap->AsHyperText();
    cautoStr->Prepare(&aStartOffset, &aEndOffset, static_cast<gint>(text->CharacterCount()));
    text->TextSubstring(aStartOffset, aEndOffset, autoStr);
} else if (ProxyAccessible* proxy = GetProxy(ATK_OBJECT(aText))) {
    cautoStr->Prepare(&aStartOffset, &aEndOffset, proxy->CharacterCount());
    proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
}
if (!cautoStr.Convert(autoStr)) {
    return nullptr;
}
return g_strdup(cautoStr.get());
(In reply to Samuel Thibault from comment #45)
> > I lean towards to move all of this code into a separate file
> 
> Why not indeed :)

good :)

> > wouldn't be nicer to keep get rid of PrepareUTF16toUTF8 and move its content here?
> 
> I was wondering it indeed. Keeping it as separate function like the two
> others looked simpler to grasp when reading: three functions which work
> together, and a class which wraps the way to call them.

if you don't need to access to those functions outside a class, then it makes sense to put them into the class, just to hide implementation details. Technically it looks pretty much same visually, just functions are prefixed by a class name. Right?

> > you could move Prepare and NS_ConvertUTF16toUTF inside of AtkString by
> 
> No, because of the interleaving: one has to call Prepare to get
> Start/EndOffset modified, then use one or the other TextString, before
> converting to UTF8, adding BOMs and finishing. At best we could have

I see

> HyperTextAccessible* text
> ProxyAccessible* proxy
> gint count;
> if (accWrap) {
>     text = accWrap->AsHyperText();
>     count = static_cast<gint>(text->CharacterCount());
> } else if (proxy = GetProxy(ATK_OBJECT(aText))) {
>     count = proxy->CharacterCount();
> } else {
>     return nullptr;
> }
> AtkString cautoStr(&aStartOffset, &aEndOffset, charCount);
> if (accWrap) {
>     text->TextSubstring(aStartOffset, aEndOffset, autoStr);
> } else {
>     proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
> }
> if (!cautoStr.Convert(autoStr)) {
>     return nullptr;
> }
> return g_strdup(cautoStr.get());
> 
> that doesn't look very appealing to me, compared to 
> 
> DOMtoATK cautoStr;
> if (accWrap) {
>     HyperTextAccessible* text = accWrap->AsHyperText();
>     cautoStr->Prepare(&aStartOffset, &aEndOffset,
> static_cast<gint>(text->CharacterCount()));
>     text->TextSubstring(aStartOffset, aEndOffset, autoStr);
> } else if (ProxyAccessible* proxy = GetProxy(ATK_OBJECT(aText))) {
>     cautoStr->Prepare(&aStartOffset, &aEndOffset, proxy->CharacterCount());
>     proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
> }
> if (!cautoStr.Convert(autoStr)) {
>     return nullptr;
> }
> return g_strdup(cautoStr.get());

String class changing offsets right before grabbing a text look a bit weird. So you could move those outside DOMtoATK class like
auto startOffset = aStartOffset > 0 ? aStartOffset - 1 : 0;
auto endOffset = aEndOffset;
and then
if (accWrap) {
  auto charCount = text->CharactedCount();
  endOffset = endOffset == -1 || endOffset == charCount ? charCount : endOffset + 1;
}
else {
  ..
}

or rename 'cautoStr' to something more generic like 'converter'? and rename Prepare method to something more explicit like AdjustOffsets().

Also could you please give more info (sorry, but I missed this part in the bug discussion) on non-unicode16 values in UTF16 string, and why we need to adjust offsets to handle those characters. Are those characters are bigger than 16bits?
Attached patch multibyte (obsolete) — Splinter Review
> it makes sense to put them into the class

Sure, that makes sense.

> String class changing offsets right before grabbing a text look a bit weird. So you could move those outside DOMtoATK class like

Well, to me _that_ would look odd :)  The whole thing works correctly because we change the offsets to grab one character before and one character after the requested selection, and the finish operation needs to know whether one more character were grabbed on either end to properly drop them, thus that information needs to be transmitted along. It's really better to stuff that knowledge inside the class, so the outside just needs to know that it has to prepare the conversion, get the text with possibly shifted offsets, and call the conversion.

> or [...] rename Prepare method to something more explicit like AdjustOffsets().

Yes, that makes sense

> non-unicode16 values in UTF16 string, and why we need to adjust offsets to handle those characters. Are those characters are bigger than 16bits?

Yes, they are bigger than 16bit, and are thus represented with two UTF-16 code units (a pair of surrogates), while in unicode they are just one character. When the selection is in the middle of a non-unicode16 character, we need to get both surrogates, not only the one that falls within the selection, otherwise we can't convert that into the unicode character.

I have attached a reworked patch.
Attachment #8953378 - Attachment is obsolete: true
Attachment #8953378 - Flags: review?(surkov.alexander)
Attachment #8953660 - Flags: review?(surkov.alexander)
Comment on attachment 8953660 [details] [diff] [review]
multibyte

Review of attachment 8953660 [details] [diff] [review]:
-----------------------------------------------------------------

another portions of comments, that's a complicated thing :)

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +164,5 @@
>      proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
>    }
>  
>    NS_ConvertUTF16toUTF8 cautoStr(autoStr);
> +  if (!AtkStr.Convert(cautoStr)) {

it looks like you have double NS_ConvertUTF16toUTF8 conversion now

::: accessible/atk/nsDOMtoATK.cpp
@@ +8,5 @@
> +#include "nsUTF8Utils.h"
> +
> +// In order to properly get non-unicode16 values, change offsets to get one more
> +// character on each end, so that ConvertUTF16toUTF8 can convert surrogates even
> +// if the originally requested offsets fall between them.

usually a method comments are in headers

@@ +32,5 @@
> +  if (StartShifted) {
> +    // PrepareUTF16toUTF8 added a heading character
> +
> +    if (Length() == 0) {
> +      // Odd, there should have been a character...

shouldn't you assert there? and everywhere where odd is used

@@ +36,5 @@
> +      // Odd, there should have been a character...
> +      return false;
> +    }
> +
> +    if ((int) Length() < UTF8traits::bytes(CharAt(0))) {

nit: static_cast<int>

@@ +82,5 @@
> +NS_DOMtoATK::AddBOMs(const nsAutoCString& aSource)
> +{
> +  uint32_t n = 0;
> +
> +  // First compute how much room we will need

nit: dot in the end

@@ +91,5 @@
> +    }
> +    n++;
> +  }
> +
> +  uint32_t desti; // Index within this

nit: initialize it

@@ +94,5 @@
> +
> +  uint32_t desti; // Index within this
> +
> +  // Add BOMs after non-unicode16 characters
> +  Truncate(n);

Truncate cannot make string longer? Did you mean SetLength()?

@@ +95,5 @@
> +  uint32_t desti; // Index within this
> +
> +  // Add BOMs after non-unicode16 characters
> +  Truncate(n);
> +  desti = 0;

yep ^ :)

@@ +129,5 @@
> +  }
> +  MOZ_ASSERT(desti == n);
> +}
> +
> +bool NS_DOMtoATK::Convert(nsAutoString aAutoStr)

nit:
bool
NS_DOMToATK::Convert

also public methods usually go before priviate methods used for their implementation

@@ +132,5 @@
> +
> +bool NS_DOMtoATK::Convert(nsAutoString aAutoStr)
> +{
> +  NS_ConvertUTF16toUTF8 cautoStr(aAutoStr);
> +  if (!(cautoStr.get())) {

nit: no () around str.get() are needed, right?

@@ +134,5 @@
> +{
> +  NS_ConvertUTF16toUTF8 cautoStr(aAutoStr);
> +  if (!(cautoStr.get())) {
> +    return false;
> +  }

since AdjustOffset and Convert has to go paired, then it'd be good to assert if the user forgets to call AdjustOffsets

::: accessible/atk/nsDOMtoATK.h
@@ +28,5 @@
> +// AdjustOffsets before peeking the DOM string, and only after call Convert to
> +// get the proper string with non-unicode16 characters expressed in just one
> +// unicode character followed by a BOM.
> +
> +class NS_DOMtoATK : public nsAutoCString

it's sort of weird to have nsAutoCString as a base class and use no constructor (because AdjustOffset has to be called after creation), and have Convert instead Assign. It would be probably nicer to remove inheritance from nsAutoCString. Convert could return a pointer to a converted string, to avoid having extra get() method.

@@ +31,5 @@
> +
> +class NS_DOMtoATK : public nsAutoCString
> +{
> +public:
> +  void AdjustOffsets(gint *aStartOffset, gint *aEndOffset, gint count);

nit: type* name

@@ +38,5 @@
> +private:
> +  void AddBOMs(const nsAutoCString& aSource);
> +  bool FinishUTF16toUTF8(void);
> +
> +  bool StartShifted;

nit members should be prefixed by 'm'
Attached patch multibyte (obsolete) — Splinter Review
> it looks like you have double NS_ConvertUTF16toUTF8 conversion now

Ah, sorry, that part of the patch was not refreshed.

> shouldn't you assert there? and everywhere where odd is used

Right. Since the string surely comes from UTF16toUTF8 it is really supposed to be alright.

Here is a reworked patch.
Attachment #8953660 - Attachment is obsolete: true
Attachment #8953660 - Flags: review?(surkov.alexander)
Attachment #8954429 - Flags: review?(surkov.alexander)
Comment on attachment 8954429 [details] [diff] [review]
multibyte

Review of attachment 8954429 [details] [diff] [review]:
-----------------------------------------------------------------

mostly nits, could you please update the patch one more time so I can do extra run?

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +164,4 @@
>      proxy->TextSubstring(aStartOffset, aEndOffset, autoStr);
>    }
>  
> +  //libspi will free it.

nit: space after //

::: accessible/atk/nsDOMtoATK.cpp
@@ +7,5 @@
> +#include "nsDOMtoATK.h"
> +#include "nsUTF8Utils.h"
> +
> +void
> +NS_DOMtoATK::AdjustOffsets(gint *aStartOffset, gint *aEndOffset,

nit: type* name

@@ +26,5 @@
> +
> +gchar*
> +NS_DOMtoATK::Convert(nsAutoString aAutoStr)
> +{
> +  MOZ_ASSERT(mAdjusted);

nit: nice to have an assertion message that offsets has to be adjusted at this point

@@ +47,5 @@
> +  if (mStartShifted) {
> +    // PrepareUTF16toUTF8 added a heading character
> +
> +    // There should be a character
> +    MOZ_ASSERT(aStr.Length() > 0);

nit: it makes sense to turn these comments into assertion messages

@@ +49,5 @@
> +
> +    // There should be a character
> +    MOZ_ASSERT(aStr.Length() > 0);
> +    // It should be complete
> +    MOZ_ASSERT (static_cast<int>(aStr.Length())

nit: extra space before (

@@ +50,5 @@
> +    // There should be a character
> +    MOZ_ASSERT(aStr.Length() > 0);
> +    // It should be complete
> +    MOZ_ASSERT (static_cast<int>(aStr.Length())
> +                >= UTF8traits::bytes(aStr.CharAt(0)));

nit: no operator on a new line, I guess it should be fine to keep all this on same line

@@ +57,5 @@
> +    aStr.Cut(0, UTF8traits::bytes(aStr.CharAt(0)));
> +  }
> +
> +  if (mEndShifted) {
> +    // PrepareUTF16toUTF8 added a trailing character

nit: pls dot in the end (here and everywhere)

@@ +59,5 @@
> +
> +  if (mEndShifted) {
> +    // PrepareUTF16toUTF8 added a trailing character
> +
> +    int len = aStr.Length();

len is used in assertsions, you could use aStr.Lenght() directly

@@ +63,5 @@
> +    int len = aStr.Length();
> +    // There should be a character
> +    MOZ_ASSERT(len > 0);
> +
> +    int trail;

nit: should be initialized

@@ +81,5 @@
> +  }
> +}
> +
> +void
> +NS_DOMtoATK::AddBOMs(nsAutoCString &aDest, const nsAutoCString& aSource)

nit: nsAutoCString& aDest

@@ +94,5 @@
> +    }
> +    n++;
> +  }
> +
> +  uint32_t desti; // Index within this

uint32_t desti = 0;
comment is confusing, it was probably left when NS_DOMtoATK was a string

@@ +105,5 @@
> +
> +    if (bytes >= 4) {
> +      // Non-unicode16 character, add a BOM after it.
> +
> +      if (bytes > (aSource.Length() - i)) {

nit: you don't need braces around aSource.Lenght() - i

@@ +107,5 @@
> +      // Non-unicode16 character, add a BOM after it.
> +
> +      if (bytes > (aSource.Length() - i)) {
> +        // We don't have the whole sequence, copy what we have
> +        for ( ; i < aSource.Length(); i++) {

nit: no space before ;

@@ +118,5 @@
> +      // Copy whole sequence.
> +      for (uint32_t j = 0; j < bytes; j++) {
> +        aDest.Replace(desti++, 1, aSource[i + j]);
> +      }
> +      i += bytes-1;

nit: space around -

@@ +123,5 @@
> +
> +      // And add a BOM after it.
> +      aDest.Replace(desti++, 1, '\xEF');
> +      aDest.Replace(desti++, 1, '\xBB');
> +      aDest.Replace(desti++, 1, '\xBF');

won't it be equivalent to aDest.Repalce(dest, 3, "\xEF\xBB\xBF");

::: accessible/atk/nsDOMtoATK.h
@@ +6,5 @@
> +
> +#include "nsString.h"
> +#include "nsMai.h"
> +
> +// ATK offsets are counted in unicode codepoints, while DOM offsets are counted

nit: usually class and method comments look this way:
/**
 * Blabla
 */

@@ +28,5 @@
> +// AdjustOffsets before peeking the DOM string, and only after call Convert to
> +// get the proper string with non-unicode16 characters expressed in just one
> +// unicode character followed by a BOM.
> +
> +class NS_DOMtoATK

nit: netscape still reminds about itself :) no NS is needed, just DOMToATK should be fine

@@ +32,5 @@
> +class NS_DOMtoATK
> +{
> +public:
> +  NS_DOMtoATK(void) {
> +    mAdjusted = false;

should be wrapped by DEBUG
Attached patch multibyte (obsolete) — Splinter Review
Here is a reworked patch.
Attachment #8954429 - Attachment is obsolete: true
Attachment #8954429 - Flags: review?(surkov.alexander)
Attachment #8954683 - Flags: review?(surkov.alexander)
Comment on attachment 8954683 [details] [diff] [review]
multibyte

Review of attachment 8954683 [details] [diff] [review]:
-----------------------------------------------------------------

more nits and more questions, and thank you for being patient about my review style :)

::: accessible/atk/DOMtoATK.cpp
@@ +8,5 @@
> +#include "nsUTF8Utils.h"
> +
> +void
> +DOMtoATK::AdjustOffsets(gint* aStartOffset, gint* aEndOffset,
> +                           gint count)

nit: wrong offset

@@ +20,5 @@
> +    (*aEndOffset)++;
> +    mEndShifted = true;
> +  }
> +
> +  mAdjusted = true;

wrap it by DEBUG

@@ +41,5 @@
> +  return g_strdup(cautoStrBOMs.get());
> +}
> +
> +void
> +DOMtoATK::FinishUTF16toUTF8(nsAutoCString &aStr)

nit: type& name

@@ +97,5 @@
> +  aDest.SetLength(n);
> +  for (uint32_t i = 0; i < aSource.Length(); i++) {
> +    uint32_t bytes = UTF8traits::bytes(aSource[i]);
> +
> +    if (bytes >= 4) {

wonder whether 4 should be a constant of UTF8traits or String, because it has to stay in sync with aSource.Length type

btw, I miss why >= 4 not > 4, since 4 bytes should a typical character, and you shouldn't add BOM, no?

@@ +101,5 @@
> +    if (bytes >= 4) {
> +      // Non-unicode16 character, add a BOM after it.
> +
> +      if (bytes > aSource.Length() - i) {
> +        // We don't have the whole sequence, copy what we have.

when it happens, could you add a more detailed comment please?

@@ +112,5 @@
> +
> +      // Copy whole sequence.
> +      for (uint32_t j = 0; j < bytes; j++) {
> +        aDest.Replace(desti++, 1, aSource[i + j]);
> +      }

it can be replaced on a single call, something like:
aDest.Replace(desti++, bytes, Substring(aSource, i, bytes)

@@ +113,5 @@
> +      // Copy whole sequence.
> +      for (uint32_t j = 0; j < bytes; j++) {
> +        aDest.Replace(desti++, 1, aSource[i + j]);
> +      }
> +      i += bytes - 1;

could you please double check this logic, i++ means 4 bytes shift, so if 'bytes' are 6, then you increment i at 2, which equivalent to 8 bytes if I don't miss anything

@@ +123,5 @@
> +      // Unicode16, just copy the byte.
> +      aDest.Replace(desti++, 1, aSource[i]);
> +    }
> +  }
> +  MOZ_ASSERT(desti == n, "incoherency between computed length"

nit: i -> I (uppercase)
Attached patch multibyte (obsolete) — Splinter Review
> > +    uint32_t bytes = UTF8traits::bytes(aSource[i]);
> > +
> > +    if (bytes >= 4) {
> 
> wonder whether 4 should be a constant of UTF8traits or String, because it has to stay in sync with aSource.Length type
> 
> btw, I miss why >= 4 not > 4, since 4 bytes should a typical character, and you shouldn't add BOM, no?

I have added a comment: 4 here is just the exact number of bytes that UTF-8 starts using for encoding non-unicode16 characters (yes, we are lucky that it's exactly that boundary: 3 bytes in utf8 can code up to U+FFFF, and U+10000 needs 4 bytes.)

> > +    if (bytes >= 4) {
> > +      // non-unicode16 character, add a bom after it.
> > +
> > +      if (bytes > asource.length() - i) {
> > +        // we don't have the whole sequence, copy what we have.
> 
> when it happens, could you add a more detailed comment please?

Ah, this is actually again defensive code that I was putting. Since the utf-8 string comes from the UTF16toUTF8 converter, this is really not supposed to happen, so I have put a MOZ_ASSERT instead.

> > +      // Copy whole sequence.
> > +      for (uint32_t j = 0; j < bytes; j++) {
> > +        aDest.Replace(desti++, 1, aSource[i + j]);
> > +      }
> > +      i += bytes - 1;
> 
> could you please double check this logic, i++ means 4 bytes shift,

No, aSource is a one-byte-one-char string. We are here just eating the bytes that we have copied into aDest, minus the one byte that the for loop will count.

Here is a reworked patch.
Attachment #8954683 - Attachment is obsolete: true
Attachment #8954683 - Flags: review?(surkov.alexander)
Attachment #8954802 - Flags: review?(surkov.alexander)
Attachment #8954802 - Attachment is patch: true
Comment on attachment 8954802 [details] [diff] [review]
multibyte

Review of attachment 8954802 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure I can testify that I 100% understand all logic behind adding a BOM. I don't have a11y-related concerns, so I'm tempting to set r+, but it'd be nice to have somebody more experienced than myself to look at this. Since the patch contains xpcom string related changes, then could you, David, take a look at conversion part as well?

::: accessible/atk/DOMtoATK.cpp
@@ +100,5 @@
> +
> +    if (bytes >= 4) {
> +      // Non-unicode16 character, add a BOM after it.
> +
> +      if (bytes > aSource.Length() - i) {

assert as you suggested?

@@ +112,5 @@
> +
> +      // Copy whole sequence.
> +      for (uint32_t j = 0; j < bytes; j++) {
> +        aDest.Replace(desti++, 1, aSource[i + j]);
> +      }

still, could be replaces by a single call?

::: accessible/atk/DOMtoATK.h
@@ +30,5 @@
> + * get the proper string with non-unicode16 characters expressed in just one
> + * unicode character followed by a BOM.
> + */
> +
> +class DOMtoATK

nit no new line between class and comment

@@ +47,5 @@
> +   * them.
> +   */
> +  void AdjustOffsets(gint* aStartOffset, gint* aEndOffset, gint count);
> +
> +  gchar* Convert(nsAutoString aAutoStr);

nit: a small comment wouldn't harm

@@ +62,5 @@
> +  void FinishUTF16toUTF8(nsAutoCString& aStr);
> +
> +  bool mAdjusted;
> +  bool mStartShifted;
> +  bool mEndShifted;

please initialize those in constructor too
Attachment #8954802 - Flags: review?(surkov.alexander)
Attachment #8954802 - Flags: review?(dbaron)
Attachment #8954802 - Flags: review+
Errr, sorry, it seems the patch didn't get refreshed actually...
Attached patch multibyte (obsolete) — Splinter Review
Here is the fixed really refreshed patch, sorry for the previous one.
Attachment #8954802 - Attachment is obsolete: true
Attachment #8954802 - Flags: review?(dbaron)
Attachment #8954825 - Flags: review?(surkov.alexander)
Attachment #8954825 - Flags: review?(dbaron)
Attached patch multibyte (obsolete) — Splinter Review
Here is also the few additional comments you made addressed.
Attachment #8954825 - Attachment is obsolete: true
Attachment #8954825 - Flags: review?(surkov.alexander)
Attachment #8954825 - Flags: review?(dbaron)
Attachment #8954827 - Flags: review?(surkov.alexander)
Attachment #8954827 - Flags: review?(dbaron)
Comment on attachment 8954827 [details] [diff] [review]
multibyte

Review of attachment 8954827 [details] [diff] [review]:
-----------------------------------------------------------------

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +244,5 @@
> +      // Heading surrogate, get the trailing surrogate and combine them.
> +      gunichar characterLow = static_cast<gunichar>(text->CharAt(aOffset + 1));
> +
> +      if (!NS_IS_LOW_SURROGATE(characterLow)) {
> +        // It should have been a trailing surrogate... Flag the error.

btw, should we assert here? or such strings exist?

::: accessible/atk/DOMtoATK.h
@@ +37,5 @@
> +#ifdef DEBUG
> +    mAdjusted = false;
> +#endif
> +    mStartShifted = false;
> +    mEndShifted = false;

it's better to put these into initialization list

@@ +51,5 @@
> +
> +  /**
> +   * After calling AdjustOffsets and obtained the increased text portion, this
> +   * can be called to convert to ATK properly (add BOMs and remove the
> +   * additional characters requested by AdjustOffsets).

maybe: something simpler:

Converts a string of accessible text into ATK gchar* string (add ...). Note, AdjustOffsets has to be called in advance.

Another option to consider. If you enclosed most of atk's getText into this class, then you would avoid mAdjusted and stuff, i.e. that could look like

getTextCB(AtkText *aText, gint aStartOff, ..)
{
  return TextToATKString(ATK_OBJECT(aText), gint* aStartOffset, gint* aEndOffset);
}

class TextToATKString
{
  TextToATKString(AtkObject* aObject, gint* aStartOffset, gint* aEndOffset) {
  }
  operator gchar*() {
    // converts to accessible text to atk string
  }
};
Attached patch multibyte (obsolete) — Splinter Review
> > +      if (!NS_IS_LOW_SURROGATE(characterLow)) {
> > +        // It should have been a trailing surrogate... Flag the error.
> 
> btw, should we assert here? or such strings exist?

In principle it shouldn't happen, but I wouldn't be surprised that somehow some web pages could manage to sneak in such cases. That's different from the UTF-8 checks which should be alright already since it's UTF16toUTF8 output.

> If you enclosed most of atk's getText into this class

Actually I was surprised that only getText needs the conversion. It could very well happen that some newer ATK method would need it too, so stuffing getText into the class would then not make sense.
Attachment #8954827 - Attachment is obsolete: true
Attachment #8954827 - Flags: review?(surkov.alexander)
Attachment #8954827 - Flags: review?(dbaron)
Attachment #8954839 - Flags: review?(surkov.alexander)
Attachment #8954839 - Flags: review?(dbaron)
Summary: atk_text_get_get() return value incorrect after multbyte character → atk_text_get_get() return value incorrect after multibyte character
(In reply to Samuel Thibault from comment #59)
> Created attachment 8954839 [details] [diff] [review]
> multibyte
> 
> > > +      if (!NS_IS_LOW_SURROGATE(characterLow)) {
> > > +        // It should have been a trailing surrogate... Flag the error.
> > 
> > btw, should we assert here? or such strings exist?
> 
> In principle it shouldn't happen, but I wouldn't be surprised that somehow
> some web pages could manage to sneak in such cases. That's different from
> the UTF-8 checks which should be alright already since it's UTF16toUTF8
> output.
> 
> > If you enclosed most of atk's getText into this class
> 
> Actually I was surprised that only getText needs the conversion. It could
> very well happen that some newer ATK method would need it too, so stuffing
> getText into the class would then not make sense.

fair enough
Attachment #8954839 - Flags: review?(surkov.alexander) → review+
Comment on attachment 8954839 [details] [diff] [review]
multibyte

I'm assuming you're asking me to review only the part in xpcom/string/.

>--- a/xpcom/string/nsUTF8Utils.h
>+++ b/xpcom/string/nsUTF8Utils.h
>@@ -54,6 +54,23 @@ public:
>   {
>     return (aChar & 0xFE) == 0xFC;
>   }

>+  static int bytes(char aChar)

Please add a comment above this saying something like:
  // return the number of bytes in a sequence beginning with aChar

>+  {
>+    if (isASCII(aChar))
>+      return 1;
>+    if (is2byte(aChar))
>+      return 2;
>+    if (is3byte(aChar))
>+      return 3;
>+    if (is4byte(aChar))
>+      return 4;
>+    if (is5byte(aChar))
>+      return 5;
>+    if (is6byte(aChar))
>+      return 6;

Please brace each of these if statements, with the style:

if (isASCII(aChar)) {
  return 1;
}

(I also wonder if there's a faster way to do this.)

>+    // Not supposed to be used for in-sequence, but return 1

Please replace this comment with MOZ_ASSERT_UNREACHABLE("should not be used for in-sequence characters");


r=dbaron on this part of the patch, with those changes
Attachment #8954839 - Flags: review?(dbaron) → review+
Attached patch multibyte (obsolete) — Splinter Review
> (I also wonder if there's a faster way to do this.)

You mean, doing it with a bit of computation? Well, strictly speaking it could be a matter of counting the number of heading 1 bits, with special casing 0 (ascii) and 1 (in-sequence). In the end the most common case (ascii) will not be faster anyway, and we'll get a very subtle-to-understand formula for non-common cases :)

> Please replace this comment with MOZ_ASSERT_UNREACHABLE("should not be used for in-sequence characters");

Right :)
Attachment #8954839 - Attachment is obsolete: true
Attachment #8954962 - Flags: review?(surkov.alexander)
Attachment #8954962 - Flags: review?(dbaron)
Comment on attachment 8954962 [details] [diff] [review]
multibyte

>+    MOZ_ASSERT_UNREACHABLE("should not be used for in-sequence characters");

You should still have a return after this.  (Probably stick to 1?)

r=dbaron on this file, with that change
Attachment #8954962 - Flags: review?(dbaron) → review+
Attached patch multibyte (obsolete) — Splinter Review
Ah, ok :)
Attachment #8954962 - Attachment is obsolete: true
Attachment #8954962 - Flags: review?(surkov.alexander)
Attachment #8954964 - Flags: review?(surkov.alexander)
Attachment #8954964 - Flags: review?(dbaron)
(In reply to David Baron :dbaron: ⌚️UTC-8 from comment #61)
> Comment on attachment 8954839 [details] [diff] [review]
> multibyte
> 
> I'm assuming you're asking me to review only the part in xpcom/string/.

David, could you please take a look at logic of DOMToATK class as well? It'd be nice to have someone to check it, who knows strings internals.
Flags: needinfo?(dbaron)
Comment on attachment 8954964 [details] [diff] [review]
multibyte

OK, additional review per comment 65:

Throughout, in comments, you should replace the term "non-unicode16"
with "non-BMP".


>+ * BOMs (Byte Order Marks, U+FEFF) normally only appear at the beginning of
>+ * unicode files, but their occurrence within text (notably after cut&paste) is
>+ * not uncommon, and are thus considered as non-text.

It's worth noting here that the BOM is also known as a "ZERO WIDTH NO-BREAK
SPACE".

>+ * The DOMtoATK class handles the conversion.  Since the selection requested
>+ * through ATK may not contain both surrogates at the ends of the selection, we
>+ * need to fetch one UTF-16 code point more on both side, and get rid of it
>+ * before returning the string to ATK.  The caller thus needs to call
>+ * AdjustOffsets before peeking the DOM string, and only after call Convert to

replace
  "before peeking the DOM string, and only after"
with
  ", then use the possibly-adjusted offsets to get the data from the DOM string, and then"

>+ * get the proper string with non-unicode16 characters expressed in just one

"expressed in" -> "expressed as".

>+ * unicode character followed by a BOM.


That said, this two-step API is confusing and seems likely to be
difficult to maintain.  If it weren't for the other callers of
ConvertTexttoAsterisks, I'd suggest that you instead:
 * replace ConvertTexttoAsterisks with a boolean indicating whether to
   do the conversion
 * replace the three-step API described above with a one-step API
   (templatized over something that can be AccessibleWrap or
   ProxyAccessible) that takes the whole thing in a single call, taking
   a boolean argument for whether to do the conversion to asterisks

(This makes me wonder, though, why you're not changing the other callers
of ConvertTexttoAsterisks?  Do they not require the same change?)

(Also, why is the "to" in ConvertTexttoAsterisks not capitalized like it
should be?)



DOMtoATK::AdjustOffsets should MOZ_ASSERT(!mAdjusted, ...) to enforce
the proper way of calling the API.

In DOMtoATK::FinishUTF16toUTF8, all 3 occurrences of "heading character"
should be "leading character".

A number of comments refer to PrepareUTF16toUTF8, which doesn't exist.
I think they should refer to AdjustOffsets.

>+    // drop first character
>+    aStr.Cut(0, UTF8traits::bytes(aStr.CharAt(0)));

This is somewhat inefficient since it shifts all of the bytes in the
character.  It would be better if:
 * the g_strdup were inside this function
 * this function maintained an offset to use for the start

However, the order of operations of AddBOMs and the stripping of the
leading and trailing characters seems backwards.  It seems like this
will behave incorrectly if you add a BOM for the first or last
character, and mStartShifted/mEndShifted is true.


>-  //copy and return, libspi will free it.

This comment should probably be put in DOMtoAtk::Convert rather than just removed.


In the first loop in DOMtoATK::AddBOMs, don't call UTF8traits::bytes
twice on the same character; put the result in a variable.

The first loop in DOMtoATK::AddBOMs also seems to compute n incorrectly
for any multi-byte character, since it only adds 1 to n rather than the
number of bytes needed.  This makes me wonder how you managed not to hit
the MOZ_ASSERT at the end of the function.  Are you testing this patch
with a debug build?

I think the code would be clearer if you renamed variables:
  i -> srci
  n -> destlength

I think it would also be clearer if you took the i++ out of the for()
expressions and just did that increment in the loop (dropping the "- 1"
from the places you increment i inside the loop).



Some of these concerns require some significant work, and I'm also not sure if I reviewed the whole thing given how much I was jumping around, so marking review-, since I'd like to review it again after these revisions.
Flags: needinfo?(dbaron)
Attachment #8954964 - Flags: review?(dbaron) → review-
Attached patch multibyte (obsolete) — Splinter Review
> If it weren't for the other callers of ConvertTexttoAsterisks, I'd suggest that you instead:  * replace ConvertTexttoAsterisks

? This patch has nothing to do with ConvertTextToAsterisks

> However, the order of operations of AddBOMs and the stripping of the
> leading and trailing characters seems backwards.  It seems like this
> will behave incorrectly if you add a BOM for the first or last
> character, and mStartShifted/mEndShifted is true.

On the contrary, it'll work fine, precisely thanks to the BOM. Say text is just one non-BMP character, thus two UTF-16 codepoints in the DOM. From the point of view of ATK we want to make it appear as that character plus the BOM (to get two unicode characters). If ATK asks for Start/End as 0/1, 
AdjustOffsets will make that 0/2, we will properly get the two UTF-16 codepoints, convert to utf-8, add the BOM, and drop the BOM. If ATK asks for Start/End as 1/2, AdjustOffsets will make that 0/2, we will properly get the two UTF-16 codepoints, convert to utf-8, add the BOM, and drop the non-BMP character.

> The first loop in DOMtoATK::AddBOMs also seems to compute n incorrectly for any multi-byte character, since it only adds 1 to n rather than the number of bytes needed.

Urgl, indeed. It was to be expected after all these rewrites of this patch, that I'd let a bug slip in and miss checking it with  DEBUG.


Here is a reworked patch.
Attachment #8954964 - Attachment is obsolete: true
Attachment #8954964 - Flags: review?(surkov.alexander)
Attachment #8954985 - Flags: review?(surkov.alexander)
Attachment #8954985 - Flags: review?(dbaron)
Ergl, I actually again failed to build with proper DEBUG being enabled of course, let me fix that......
(In reply to Samuel Thibault from comment #67)
> > If it weren't for the other callers of ConvertTexttoAsterisks, I'd suggest that you instead:  * replace ConvertTexttoAsterisks
> 
> ? This patch has nothing to do with ConvertTextToAsterisks

It absolutely does, because that's the only thing blocking a much cleaner API rather than this confusing 3-step thing.
Attached patch multibyte (obsolete) — Splinter Review
There it is.
Attachment #8954985 - Attachment is obsolete: true
Attachment #8954985 - Flags: review?(surkov.alexander)
Attachment #8954985 - Flags: review?(dbaron)
Attachment #8954991 - Flags: review?(surkov.alexander)
Attachment #8954991 - Flags: review?(dbaron)
(And I'm still curious why you need to modify getTextCB, but not getTextAfterOffsetCB, getTextAtOffsetCB, and getTextBeforeOffsetCB.)
(In reply to Samuel Thibault from comment #67)
> > However, the order of operations of AddBOMs and the stripping of the
> > leading and trailing characters seems backwards.  It seems like this
> > will behave incorrectly if you add a BOM for the first or last
> > character, and mStartShifted/mEndShifted is true.
> 
> On the contrary, it'll work fine, precisely thanks to the BOM. Say text is
> just one non-BMP character, thus two UTF-16 codepoints in the DOM. From the
> point of view of ATK we want to make it appear as that character plus the
> BOM (to get two unicode characters). If ATK asks for Start/End as 0/1, 
> AdjustOffsets will make that 0/2, we will properly get the two UTF-16
> codepoints, convert to utf-8, add the BOM, and drop the BOM. If ATK asks for
> Start/End as 1/2, AdjustOffsets will make that 0/2, we will properly get the
> two UTF-16 codepoints, convert to utf-8, add the BOM, and drop the non-BMP
> character.

OK, I guess I'd forgotten that the cases where you add the BOM and the cases where you deal with surrogates are exactly the same.

However, the above behavior still seems surprising.  You should at least have comments explaining that the intent is that:
 * if the start is between the high and low surrogates, the UTF-8 result includes a BOM from it but not the character
 * if the end is between the high and low surrogates, the UTF-8 result includes the character but *not* the BOM
 * all non-BMP characters that are fully in the string are in the UTF-8 result as character followed by BOM

It's not obvious that that's a desirable behavior, so if it's what you want, you should explain it.
> > > If it weren't for the other callers of ConvertTexttoAsterisks, I'd suggest that you instead:  * replace ConvertTexttoAsterisks
> > 
> > ? This patch has nothing to do with ConvertTextToAsterisks
> 
> It absolutely does, because that's the only thing blocking a much cleaner API rather than this confusing 3-step thing.

> (templatized over something that can be AccessibleWrap or ProxyAccessible)

Ah, now I see. Uh, I'd have to learn how to do such templatization in C++ then.

> why you're not changing the other callers of ConvertTexttoAsterisks?  Do they not require the same change?)

In the other cases, the offsets are rounded by the Accessible objects, so they can't fall between surrogate pairs and we thus don't need AdjustOffsets/Finish.

That being said, we do indeed need addBOMs for them for coherency of UTF-8 string lengths.
(about convertTexttoAsterisk)
> that's the only thing blocking a much cleaner API rather than this confusing 3-step thing

There is also the getText() call which is in between.
Attached patch multibyte (obsolete) — Splinter Review
Here is a fixed patch with addition of BOMs for the other gettext* methods.

Concerning having to use two steps, yes, I agree this is really non-usual. But as explained previously, there is no way around it because there is fundamental interleaving between having to shift the offsets before calling the Accessible method with the shifted offsets, and only after, finishing the resulting string. The class is made to keep this as just one class with two calls, instead of exposing the state of the conversion etc.

Yes, we could embed the asterisk thing, it's not really a problem (actually that's why I didn't even understand why you were mentioning it, for me it was out of the issues at stake), but there remains the getText() call. We could also embed the call to getText() into the class but that changes the meaning of the class a lot (it does not just convert) and then the getTextCB function basically only makes one call to class and nothing else, and there is no other caller (or else the class would have to learn which other getText* it should call), i.e. we'd have just moved the content of getTextCB into the class, what becomes the point of defining a class at all then?
Attachment #8954991 - Attachment is obsolete: true
Attachment #8954991 - Flags: review?(surkov.alexander)
Attachment #8954991 - Flags: review?(dbaron)
Attachment #8955198 - Flags: review?(surkov.alexander)
Attachment #8955198 - Flags: review?(dbaron)
I'd note that getCharacterAtOffsetCB looks like it can read a character past the end of the buffer if the last character in the buffer is an unpaired high surrogate.  That's a potential crash that should be fixed.

(In reply to Samuel Thibault from comment #75)
> Concerning having to use two steps, yes, I agree this is really non-usual.
> But as explained previously, there is no way around it because there is
> fundamental interleaving between having to shift the offsets before calling
> the Accessible method with the shifted offsets, and only after, finishing
> the resulting string. The class is made to keep this as just one class with
> two calls, instead of exposing the state of the conversion etc.

I think there is a way around it:  a function that's templatized over the underlying class of the accessible.  (The templatized method should just have the three things in it so that you don't end up generating the code multiple times for a large thing -- the underlying pieces that AdjustOffsets and ConvertAdjusted do can remain non-templatized.)

In other words, you can do something like:

enum class AtkStringConvertFlags : uint32_t {
  ConvertTextToAsterisks = 1 << 0,
};

MOZ_MAKE_ENUM_CLASS_BITWISE_OPERATORS(AtkStringConvertFlags)

template <class AccessibleOrProxy>
gchar* StringToATKString(AccesibleOrProxy* aAccessible, gint aStartOffset, gint aEndOffset, AtkStringConvertFlags aFlags)
{
  gint startOffset = aStartOffset, endOffset = aEndOffset;
  StringToATKStringPrivate atkStringConverter; /* was DOMtoATK */
  atkStringConverter.AdjustOffsets(&startOffset, &endOffset, gint(aAccessible->CharacterCount()));
  nsAutoString str;
  aAccessible->TextSubstring(startOffset, endOffset, str);
  return atkStringConverter.ConvertAdjusted(str, aFlags);
}

> Yes, we could embed the asterisk thing, it's not really a problem (actually
> that's why I didn't even understand why you were mentioning it, for me it
> was out of the issues at stake), but there remains the getText() call. We
> could also embed the call to getText() into the class but that changes the
> meaning of the class a lot (it does not just convert) and then the getTextCB
> function basically only makes one call to class and nothing else, and there
> is no other caller (or else the class would have to learn which other
> getText* it should call), i.e. we'd have just moved the content of getTextCB
> into the class, what becomes the point of defining a class at all then?

getTextCB still needs to handle accessibles and proxies somewhat differently.

However, the point of defining a class is to define an API that's a lot harder for the caller to mess up.  Instead of having an API that must be called using an exact sequence of three things in sequence, the API is a single function call, which is a much more reasonable thing to reason about when looking at and thinking about the calling code.
And a version with probably better naming suggestions:

template <class AccessibleOrProxy>
gchar* NewATKString(AccesibleOrProxy* aAccessible, gint aStartOffset, gint aEndOffset, AtkStringConvertFlags aFlags)
{
  gint startOffset = aStartOffset, endOffset = aEndOffset;
  ATKStringConverterPrivate converter; /* was DOMtoATK */
  converter.AdjustOffsets(&startOffset, &endOffset, gint(aAccessible->CharacterCount()));
  nsAutoString str;
  aAccessible->TextSubstring(startOffset, endOffset, str);
  return converter.ConvertAdjusted(str, aFlags);
}
> I'd note that getCharacterAtOffsetCB looks like it can read a character past the end of the buffer if the last character in the buffer is an unpaired high surrogate.  That's a potential crash that should be fixed.

No, Accessible's CharAt already returns '\0' for out-of-bound accesses (otherwise a screen reader could crash firefox).

> I think there is a way around it:  a function that's templatized over the underlying class of the accessible.

Ok, I see the idea. It's a bit more complex than that, because one doesn't necessarily want to call TextSubstring (that's only for getTextCB, getTextAtOffsetCB will need TextAtOffset, ditto for Before/After, but I think with a bit of more templating that should be workable). I'll have a look.
> However, the point of defining a class is to define an API that's a lot harder for the caller to mess up.

Sure, I agree in general.

But if we are *only* moving the *whole* content of getTextCB into a class, and that can't be used for anything else, there is no point: we'd have just moved code which can't be reused, and added complexity.
Attached patch multibyte (obsolete) — Splinter Review
Here is a reworked patch. It is reworked as you proposed, and for getCharacterAtOffsetCB too.

Hoping that we start converging to something :)
Attachment #8955198 - Attachment is obsolete: true
Attachment #8955198 - Flags: review?(surkov.alexander)
Attachment #8955198 - Flags: review?(dbaron)
Attachment #8956057 - Flags: review?(surkov.alexander)
Attachment #8956057 - Flags: review?(dbaron)
David, is it still on your review queue, right? Just pinging to make sure we don't wait each other or something. Technically I r+'ed the patch already, but I would take another run after you're good with the patch (the changes were rather significant).
Flags: needinfo?(dbaron)
So I don't understand why getTextCB needs to deal with the possibility that the boundaries might fall between surrogates, but getTextAfterOffsetCB and getTextBeforeOffsetCB (and maybe some of the other callbacks too) don't need to deal with that possibility.

At the callsites of NewATKString, you shouldn't need template syntax; the compiler should deduce the correct template parameter from the arguments provided.


I think beyond that, I think it's probably better if you don't wait on me for finishing this -- Surkov should review the new patch, and then if he thinks it's good, it can land.
Flags: needinfo?(dbaron)
(In reply to David Baron :dbaron: ⌚️UTC-8 from comment #82)
> So I don't understand why getTextCB needs to deal with the possibility that
> the boundaries might fall between surrogates, but getTextAfterOffsetCB and
> getTextBeforeOffsetCB (and maybe some of the other callbacks too) don't need
> to deal with that possibility.

That is because in the getTextCB case, it's the screen reader which decides the precise offsets. In the other getText* cases, it's the Accessible object which rounds offsets to start or end of word/sentence/line/etc.
Attached patch multibyte (obsolete) — Splinter Review
In this version I have just removed the template syntax as advise by David.
Attachment #8956057 - Attachment is obsolete: true
Attachment #8956057 - Flags: review?(surkov.alexander)
Attachment #8958751 - Flags: review?(surkov.alexander)
(In reply to Samuel Thibault from comment #83)
> (In reply to David Baron :dbaron: ⌚️UTC-8 from comment #82)
> > So I don't understand why getTextCB needs to deal with the possibility that
> > the boundaries might fall between surrogates, but getTextAfterOffsetCB and
> > getTextBeforeOffsetCB (and maybe some of the other callbacks too) don't need
> > to deal with that possibility.
> 
> That is because in the getTextCB case, it's the screen reader which decides
> the precise offsets. In the other getText* cases, it's the Accessible object
> which rounds offsets to start or end of word/sentence/line/etc.

ATK_TEXT_BOUNDARY_CHAR might be special though? Anyway, we can catch these in followups if necessary, I don't think it can create any new problems.
I was about to say that ATK_TEXT_BOUNDARY_CHAR is supposed to return exactly one UTF-8 character, and not one UTF-16 codepoint, and thus again offsets should be rounded to UTF-8 characters boundaries, and not fall between UTF-16 surrogates, but looking again at the at-spi definition: "Start and end offsets differ by one, by definition, for this value." which of course makes sense in the at-spi world, but here we'd be tempted to return two UTF-8 characters, but that won't be fine. So again we have to return separate characters: the real character and the BOM.  Actually, testing DOM's behavior, nsIAccessibleText::BOUNDARY_CHAR  does separate surrogates, so it already has the behavior of returning just one UTF-16 surrogate. The only issue is that the conversion to UTF-8 will them turn that into U+FFFD.

I would say that to avoid making things even more complex we can special-case that: the ATK_TEXT_BOUNDARY_CHAR case really is almost very much like a getCharacterAtOffset.  Otherwise, a generic version would have to make the Accessible call once only to get the resulting offsets, then shift them, then make the Accessible call again to get the text, and then fix the result. While in the ATK_TEXT_BOUNDARY_CHAR the resulting offsets are trivial to compute (-1/+0/+1) and we can just then call getTextCB as it is already.

I'd indeed rather do that in a separate patch :)
Attached patch getchar (obsolete) — Splinter Review
Here is what it looks like.
Attachment #8958897 - Flags: review?(surkov.alexander)
Comment on attachment 8958751 [details] [diff] [review]
multibyte

Review of attachment 8958751 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't look closely at strings conversion logic, it seems David was ok with it. I would use base classes in arguments, for example, nsAString instead nsAutoString, use references at strings to avoid copy, and use const where possible. Otherwise, it looks good.

Please let keep an eye on regressions, the code is complicated.

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +131,3 @@
>  {
>    // convert each char to "*" when it's "password text"
>    if (accWrap->NativeRole() == roles::PASSWORD_TEXT) {

could you please file a follow up to get rid of NativeRole() == roles::PASSWORD_TEXT by adding Accessible::IsPassword() method?

::: accessible/atk/DOMtoATK.cpp
@@ +12,5 @@
> +
> +namespace mozilla::a11y::DOMtoATK {
> +
> +void
> +AddBOMs(nsAutoCString& aDest, const nsAutoCString& aSource)

const for aSource

@@ +78,5 @@
> +#endif
> +}
> +
> +gchar*
> +ATKStringConverterPrivate::FinishUTF16toUTF8(nsAutoCString& aStr)

I guess you can use base class here nsACString&

@@ +119,5 @@
> +  return g_strdup(aStr.get() + skip);
> +}
> +
> +gchar*
> +ATKStringConverterPrivate::ConvertAdjusted(nsAutoString aAutoStr)

const nsAString&

@@ +134,5 @@
> +  return FinishUTF16toUTF8(cautoStrBOMs);
> +}
> +
> +gchar*
> +Convert(nsAutoString aAutoStr)

const nsAString& ? you should avoid copying a string

@@ +149,5 @@
> +
> +void
> +ConvertTexttoAsterisks(nsAString& aString)
> +{
> +  for (uint32_t i = 0; i < aString.Length(); i++)

nit: technically moz coding style requires {} around single if and for

::: accessible/atk/DOMtoATK.h
@@ +70,5 @@
> +  };
> +
> +  MOZ_MAKE_ENUM_CLASS_BITWISE_OPERATORS(AtkStringConvertFlags)
> +
> +  class ATKStringConverterPrivate {

I would probably name it 'Helper' instead 'Private', because this class it not really private and actually visible to everyone
Attachment #8958751 - Flags: review?(surkov.alexander) → review+
Comment on attachment 8958897 [details] [diff] [review]
getchar

Review of attachment 8958897 [details] [diff] [review]:
-----------------------------------------------------------------

it makes sense to move this patch into separate bug I'd say, this one gets too long

::: accessible/atk/nsMaiInterfaceText.cpp
@@ +160,4 @@
>    return nullptr;
>  }
>  
> +static gint getCharacterCountCB(AtkText *aText);

nit: type* name

@@ +165,5 @@
> +static gchar*
> +getCharTextAtOffset(AtkText *aText, gint aOffset,
> +                    gint *aStartOffset, gint *aEndOffset)
> +{
> +  gint end = aOffset + 1;

I wonder if it breaks magic offsets, like -1 or -2

@@ +273,4 @@
>  {
>    nsAutoString autoStr;
>    int32_t startOffset = 0, endOffset = 0;
> +  AccessibleWrap* accWrap;

it is a weird change, could you keep this line unchanged please? :)
(In reply to alexander :surkov from comment #89)
> @@ +165,5 @@
> > +static gchar*
> > +getCharTextAtOffset(AtkText *aText, gint aOffset,
> > +                    gint *aStartOffset, gint *aEndOffset)
> > +{
> > +  gint end = aOffset + 1;
> 
> I wonder if it breaks magic offsets, like -1 or -2

In at-spi, only the -1 magic offset is defined and only for getText's end offsets. Others do not have magic offsets.

> @@ +273,4 @@
> >  {
> >    nsAutoString autoStr;
> >    int32_t startOffset = 0, endOffset = 0;
> > +  AccessibleWrap* accWrap;
> 
> it is a weird change, could you keep this line unchanged please? :)

Well, sure, it's just that we then call GetAccessibleWrap for no use, when aBoundaryType is ATK_TEXT_BOUNDARY_CHAR. Oh, but the firefox coding style doesn't forbid mixing variable declaration and code, does it?
(In reply to Samuel Thibault from comment #90)
> (In reply to alexander :surkov from comment #89)
> > @@ +165,5 @@
> > > +static gchar*
> > > +getCharTextAtOffset(AtkText *aText, gint aOffset,
> > > +                    gint *aStartOffset, gint *aEndOffset)
> > > +{
> > > +  gint end = aOffset + 1;
> > 
> > I wonder if it breaks magic offsets, like -1 or -2
> 
> In at-spi, only the -1 magic offset is defined and only for getText's end
> offsets. Others do not have magic offsets.

oh, that's complicated, you may want to add a comment

> > @@ +273,4 @@
> > >  {
> > >    nsAutoString autoStr;
> > >    int32_t startOffset = 0, endOffset = 0;
> > > +  AccessibleWrap* accWrap;
> > 
> > it is a weird change, could you keep this line unchanged please? :)
> 
> Well, sure, it's just that we then call GetAccessibleWrap for no use, when
> aBoundaryType is ATK_TEXT_BOUNDARY_CHAR. Oh, but the firefox coding style
> doesn't forbid mixing variable declaration and code, does it?

it's not forbidden for sure, but I might be missing something, what's wrong with:

if (aBoundaryType == ATK_TEXT_BOUNDARY_CHAR) {
    return getCharTextAtOffset(aText, aOffset - 1, aStartOffset, aEndOffset);
}
AccessibleWrap* accWrap = GetAccessibleWrap(ATK_OBJECT(aText));
Attached patch multibyte (obsolete) — Splinter Review
Here is the reworked multibyte, only changing string types and coding style (and retested as working fine of course).
Attachment #8958751 - Attachment is obsolete: true
Attachment #8959129 - Flags: review?(surkov.alexander)
Comment on attachment 8959129 [details] [diff] [review]
multibyte

Review of attachment 8959129 [details] [diff] [review]:
-----------------------------------------------------------------

::: accessible/atk/DOMtoATK.cpp
@@ +78,5 @@
> +#endif
> +}
> +
> +gchar*
> +ATKStringConverterHelper::FinishUTF16toUTF8(nsCString& aStr)

I think it should nsACString, A stands for abstract, it works equally both for nsCString and nsAutoCString
Attachment #8959129 - Flags: review?(surkov.alexander) → review+
(In reply to alexander :surkov from comment #93)
> ::: accessible/atk/DOMtoATK.cpp
> @@ +78,5 @@
> > +#endif
> > +}
> > +
> > +gchar*
> > +ATKStringConverterHelper::FinishUTF16toUTF8(nsCString& aStr)
> 
> I think it should nsACString, A stands for abstract, it works equally both
> for nsCString and nsAutoCString

Yes, but since nsACString might be non-null-terminated, it does not provide the get() method, which we do need here.
(In reply to Samuel Thibault from comment #94)

> > > +gchar*
> > > +ATKStringConverterHelper::FinishUTF16toUTF8(nsCString& aStr)
> > 
> > I think it should nsACString, A stands for abstract, it works equally both
> > for nsCString and nsAutoCString
> 
> Yes, but since nsACString might be non-null-terminated, it does not provide
> the get() method, which we do need here.

fair enough
do you have perms to try-server the patch and then land it?
Comment on attachment 8958897 [details] [diff] [review]
getchar

obsolete because was moved to new bug 1445954
Attachment #8958897 - Attachment is obsolete: true
Attachment #8958897 - Flags: review?(surkov.alexander)
No, I don't have any particular permissions on mozilla.
(In reply to Samuel Thibault from comment #98)
> No, I don't have any particular permissions on mozilla.

you probably should request then, you have my vouch if needed.

since patched code is not covered by tests, try server shouldn't fail, thus checkin-needed may be enough to get it landed
Keywords: checkin-needed
Assignee: nobody → samuel.thibault
Could not land patch.
Flags: needinfo?(samuel.thibault)
Attached patch multibyte (obsolete) — Splinter Review
Ah, yes, the patch used to work fine over hg tip, but not with the latest changes. Here is a version which works fine over tip: the only change is the aString.ReplaceLiteral(i, 1, u"*"); line moved unchanged by the patch from ConvertTexttoAsterisks to ConvertTexttoAsterisks which needed an update for the tip.
Attachment #8959129 - Attachment is obsolete: true
Flags: needinfo?(samuel.thibault)
Attachment #8959506 - Flags: review?(surkov.alexander)
I have requested try server access in bug 1446355
Flags: needinfo?(surkov.alexander)
(In reply to Samuel Thibault from comment #102)
> I have requested try server access in bug 1446355

yep, thanks
Flags: needinfo?(surkov.alexander)
(In reply to Samuel Thibault from comment #101)
> Created attachment 8959506 [details] [diff] [review]
> multibyte
> 
> Ah, yes, the patch used to work fine over hg tip, but not with the latest
> changes. Here is a version which works fine over tip: the only change is the
> aString.ReplaceLiteral(i, 1, u"*"); line moved unchanged by the patch from
> ConvertTexttoAsterisks to ConvertTexttoAsterisks which needed an update for
> the tip.

you don't have to ask for subsequent review if the changes are minor
Attachment #8959506 - Flags: review?(surkov.alexander) → review+
> you don't have to ask for subsequent review if the changes are minor

Ok, but I guess a review+ flag is needed for checkin-needed to be taken?  Or perhaps my own review+ could be enough for that to happen?
(In reply to Samuel Thibault from comment #105)
> > you don't have to ask for subsequent review if the changes are minor
> 
> Ok, but I guess a review+ flag is needed for checkin-needed to be taken?  Or
> perhaps my own review+ could be enough for that to happen?

It shouldn't be necessary I think. It is considered that r+ is implicitly inherited from previous patch if the changes are not significant. I believe that human (not a machine) checks in a patch :) You shouldn't do r+ for your own patch, it's confusing :)
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a66043ecf949
atk: Introduce U+FEFF characters to match AT-SPI offsets with DOM offsets r=surkov
Keywords: checkin-needed
Backout by dluca@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ea694074089
Backed out changeset a66043ecf949 for build bustages
Attached patch multibyteSplinter Review
Ah, I didn't realize that syntax was recent C++. Here is a fixed patch.
Attachment #8959506 - Attachment is obsolete: true
Flags: needinfo?(samuel.thibault)
I don't know if that is supposed to test for such build failures, but linux debug & opt try

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a6ebf351eeff629110abdb59b09ac380bcb3e8b

passed.
Attachment #8959715 - Flags: checkin+
Attachment #8959715 - Flags: checkin+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6f9396e9ea0a
atk: Introduce U+FEFF characters to match AT-SPI offsets with DOM offsets. r=surkov, r=dbaron
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/6f9396e9ea0a
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Comment on attachment 8959715 [details] [diff] [review]
multibyte

Approval Request Comment

[Feature/Bug causing the regression]:
miscalculation of offsets between DOM and ATK when a string contains non-BMP characters (e.g. emojis)

[User impact if declined]:
People using screen readers get bogus text (offset by one and such) as soon as a webpage contains non-BMP characters (e.g. emojis, which are more and more common).

[Is this code covered by automated tests?]:
The whole ATK stack is not currently automatically tested.

[Has the fix been verified in Nightly?]:
Yes

[Needs manual test from QE? If yes, steps to reproduce]: 
None

[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
It is not

[Why is the change risky/not risky?]:
It touches only the ATK layer, thus only users needing the fix.

[String changes made/needed]:
None
Attachment #8959715 - Flags: approval-mozilla-beta?
(In reply to Samuel Thibault from comment #114)
> Comment on attachment 8959715 [details] [diff] [review]
> multibyte
> 
> Approval Request Comment

are you sure you want to back port it? usually regressions, crashes, perf/stability issues and serious bugs are candidates for backporting. How many users and how often the bug beats?
Well, without it, ESR users would have to live with the issue until the next ESR. This is indeed not a regression, crash, perf or stability issue, but not being able to properly read the web page text seems serious to me. All users using orca are impacted (i.e. visually impaired users), and the bug poses problem whenever a webpage contains a non-BMP character, for instance emojis, which can be seen more and more used on the web.
BTW, as mentioned in the beta request, I could verify the fix in Nightly.
Status: RESOLVED → VERIFIED
(In reply to alexander :surkov from comment #115)
> (In reply to Samuel Thibault from comment #114)
> > Comment on attachment 8959715 [details] [diff] [review]
> > multibyte
> > 
> > Approval Request Comment
> 
> are you sure you want to back port it? usually regressions, crashes,
> perf/stability issues and serious bugs are candidates for backporting. How
> many users and how often the bug beats?

The user impact is the following: when any emoji is present on a page, orca will skips the next line or paragraph. As a blind or visual-impaired relies only on their screen reader it's for me one of the most important bug we've in Firefox for GNU/Linux. 

Imagine if Firefox does a black rectangle on each page and prevents you to read content: what would you tell?

I've discovered this bug only because I use a screen magnifier with the screen reader and I can see on the screen if Orca skips text.

Best regards,
Alex.
5-byte and 6-byte UTF-8 byte sequences are defined not to exist anymore. Please don't add code that's written as if they still existed.

It seems bad to call U+FEFF a "BOM" when it's not used as a BOM but as zero-width no-break space.

Furthermore, this seems to break ATK's view of astral characters. This seems problematic. There are languages that are entirely astral, though they might not be supported by ATK yet. Han-script text, e.g. Hong Kong Cantonese, can use astral ideographs among BMP ideographs. It seems bad to break these. Additionally, it seems plausible that ATK could develop the capability to read the meaning of emoji.

Is there a reason why the index adjustment isn't done using UTF8CharEnumerator::NextChar() as in the Pango line breaker integration?

(Note: I have an in-flight rewrite of UTF8CharEnumerator::NextChar() to comply with WHATWG requirements and Unicode recommendations in bug 1402247.)
Flags: needinfo?(samuel.thibault)
(In reply to Henri Sivonen (:hsivonen) from comment #119)
> Is there a reason why the index adjustment isn't done using
> UTF8CharEnumerator::NextChar() as in the Pango line breaker integration?

Or, rather UTF8CharEnumerator::NextChar() if we get an index in number of scalar values from ATK and need to map it to UTF-16 data that we have.
Oops. One more time:
Or, rather UTF16CharEnumerator::NextChar() if we get an index in number of scalar values from ATK and need to map it to UTF-16 data that we have.
Henri, it probably makes sense to open another bug for the issues. Frankly, this one is getting too large.
> Furthermore, this seems to break ATK's view of astral characters.

What makes you think so?  It does work in my tests, there are just U+FEFF characters along them, which just get ignored by speech synthesis etc. so only the astral characters get rendered.

> it seems plausible that ATK could develop the capability to read the meaning of emoji.

Sure, screen readers already do read them, and depending on the speech synthesis and braille table they can be rendered. This patch does not change that.

> Is there a reason why the index adjustment isn't done using UTF16CharEnumerator::NextChar() as in the Pango line breaker integration?

Well, perhaps. The complexity looked way higher, and the computation cost probably expensive. Where it gets really more tricky than this is that it's not just a question of converting from UTF-16 to UTF-8. See accessible/generic/HyperTextAccessible.cpp's HyperTextAccessible::TextSubstring: it does not contain just a UTF-16 string, it is actually composed of several children (to represent e.g. hyperlinks).  HyperTextAccessible::GetChildIndexAtOffset uses a binary search within the mOffsets array to efficiently find which children contains a given UTF-16 offset.  To keep this efficiency with a number of UTF-8 scalars, we'd have to introduce another array to do binary search in it.  Basically it'd mean either replicating the whole machinery behind HyperTextAccessible::TextSubstring, using UTF-8 scalars instead of UTF-16 offsets, or to extend the existing machinery with a boolean parameter to switch between the two.

It can probably be done, it's just quite invasive in HyperTextAccessible.
Flags: needinfo?(samuel.thibault)
(as for the 5-6byte UTF-8 byte sequences, it should probably be a separate bug entry indeed, xpcom/string/nsUTF8Utils.h has a lot more to remove than what has just been introduced).
:surkov, any further thoughts on possible risks/benefits based on Samuel's followup comments?
Flags: needinfo?(surkov.alexander)
(I meant in terms of uplifting to 60 or not)
60 is a next esr and thus leaving the users without emoji for next half year is not a good thing. I don't have problems with the patch, but I'm not sure though whether Henri doesn't have other concerns.

Samuel, no regressions so far on trunk, correct?
Flags: needinfo?(surkov.alexander) → needinfo?(hsivonen)
Comment 123 looks convincing enough to me that I don't have concerns about uplifting this.
Flags: needinfo?(hsivonen)
We haven't seen any regression indeed.
(In reply to Samuel Thibault from comment #129)
> We haven't seen any regression indeed.

We're four users (whose three blinds) that using the multi-bytes version of Firefox patched by Samuel and we've encounter any regressions, only improvement. Firefox no longer skips text when I use it with Orca.

Best regards.
Comment on attachment 8959715 [details] [diff] [review]
multibyte

sounds like we can go ahead and get this in 60.0b10.  Thanks for the feedback.
Attachment #8959715 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I tried to reproduce the issue, I installed Accerciser and IPython Console but I would require some info about how to find the accessible object corresponding with the <div> element. 
Can you please help me with that so I can verify it?
Flags: needinfo?(surkov.alexander)
Attached image tree.png
Here is where I can find it on my system
(In reply to Samuel Thibault from comment #134)
> Created attachment 8979502 [details]
> tree.png
> 
> Here is where I can find it on my system

thanks, Samuel!
Flags: needinfo?(surkov.alexander)
I still cannot find the Firefox in that list from Accerciser on Ubuntu. Am I missing somethng? (I open Firefox, open Accerciser and search for Firefox in the Name box but it doesn't appear)
Flags: needinfo?(samuel.thibault)
IIRC firefox disables accessibility if no screen reader is running, so you might have to run e.g. orca before even starting firefox.
Flags: needinfo?(samuel.thibault)
I managed to follow the steps. Thanks Samuel!

I verified the issue on Nighty 62.0a1 (2018-05-22), Beta 61.0b7 and Firefox 60.

The result is the expected one so I am marking the issue as verified.
See Also: → 1498473
You need to log in before you can comment on or make changes to this bug.