Closed Bug 1260825 Opened 4 years ago Closed 4 years ago

utf16 surrogates causes graphite shaper to produce incorrect cluster information.

Categories

(Core :: Graphics: Text, defect)

45 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: firefox, Assigned: jfkthame, NeedInfo)

Details

(Whiteboard: gfx-noted)

Attachments

(7 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.116 Chrome/48.0.2564.116 Safari/537.36

Steps to reproduce:

Install attached font

<html>
<body contenteditable='true'>
&#x16f50;
<BR>
&#x16f50;&#x16f50;<BR>
&#x16f50;&#x16f50;&#x16f50;<BR>
&#x16f50;&#x16f50;&#x16f50;&#x16f50;<BR>
aaaa<BR>
&#x16f50;&#x16f0a;<BR>
&#x16f50;&#x16f0a;&#x16f6a;<BR>
</body>
</html>


Actual results:

Strange selection behaviour due to Firefox treating non ligature glyphs as ligatures dues to UTF16 surrogates.


Expected results:

A single non Ligature glyph should not be treated as a ligature just because its character is encoded as UTF16 surrogates.
Attached patch Test caseSplinter Review
Unittest showing problem with graphite shaper with UTF16 surrogates.
Test case uses font "Miao Unicode"
(For my build I had to copy gfxGraphiteShaper.h, gfxHarfBuzzShaper.h, gfxPlatformFontList.h
to obj*/dist/include)
Run test command : "./mach gtest Shaper.*"
Proof of Concept patch.
Makes unittest pass + fixes Firefox selection behaviour.
I think the problem lies in this line (268?):

    if (clusters[cIndex].baseChar + clusters[cIndex].nChars < after + 1) {
        clusters[cIndex].nChars = after + 1 - clusters[cIndex].baseChar;

with the after + 1. That needs to take into account the increased size. I think we need something along the lines of this monster:

    const int nAfter = gr_cinfo_base(gr_seg_cinfo(aSegment, gr_slot_after(slot)) + 1);
    if (clusters[cIndex].baseChar + clusters[cIndex].nChars < nAfter) {
        clusters[cIndex].nChars = nAfter - clusters[cIndex].baseChar;

A bit more of a mouthful, but hopefully fixes the problem cleanly for all sizes of a codepoint.
Eat your own dogfood Martin. OK here's the updated section of the example/clusters.c

        unsigned int nAfter = gr_slot_after(is) + 1;
        unsigned int cAfter = nAfter < numCodePoints ?
                   gr_cinfo_base(gr_seg_cinfo(seg, nAfter)) : lenstr;
        if (clusters[ci].base_char + clusters[ci].num_chars < cAfter)
            clusters[ci].num_chars = cAfter - clusters[ci].base_char;

where lenstr is the length of the string in code units (not characters, which is what numCodePoints is).

See: 2e8e41c1606633b0f02b0e6634e886c80496bc31
Thanks for comments.

I'm assuming "Eat you own dogfood" comment in the next reply - means that this isn't expected to work.

But just to confirm this: the following code doesn't fix the the unittest fail nor fix the selection in the test case example. (The code is slightly modified to fix what I assumed to be a typo - so the +1 doesn't fail at incrementing incomplete ptr type.)

    const int nAfter = gr_cinfo_base(gr_seg_cinfo(aSegment, gr_slot_after(slot))) + 1;
    if (clusters[cIndex].baseChar + clusters[cIndex].nChars < nAfter) {
        clusters[cIndex].nChars = nAfter - clusters[cIndex].baseChar;

(In reply to martin_hosken from comment #3)
> I think the problem lies in this line (268?):
> 
>     if (clusters[cIndex].baseChar + clusters[cIndex].nChars < after + 1) {
>         clusters[cIndex].nChars = after + 1 - clusters[cIndex].baseChar;
> 
> with the after + 1. That needs to take into account the increased size. I
> think we need something along the lines of this monster:
> 
>     const int nAfter = gr_cinfo_base(gr_seg_cinfo(aSegment,
> gr_slot_after(slot)) + 1);
>     if (clusters[cIndex].baseChar + clusters[cIndex].nChars < nAfter) {
>         clusters[cIndex].nChars = nAfter - clusters[cIndex].baseChar;
> 
> A bit more of a mouthful, but hopefully fixes the problem cleanly for all
> sizes of a codepoint.
Assuming this was a pattern I should follow, I produced this (inefficient proof of concept) code.

        uint32_t numCodePoints = aLength;
        for(uint32_t i = 0; i < aLength; i++)
                if (NS_IS_LOW_SURROGATE(aText[i])) numCodePoints--;
        unsigned int nAfter = gr_slot_after(slot) + 1;
        unsigned int cAfter = nAfter < numCodePoints ?
                   gr_cinfo_base(gr_seg_cinfo(aSegment, nAfter)) : aLength;
        if (clusters[cIndex].baseChar + clusters[cIndex].nChars < cAfter)
            clusters[cIndex].nChars = cAfter - clusters[cIndex].baseChar;


which replaced:

        // extend cluster if necessary to reach the glyph's "after" index
        if (clusters[cIndex].baseChar + clusters[cIndex].nChars < after + 1) {
            clusters[cIndex].nChars = after + 1 - clusters[cIndex].baseChar;


With this new code, the unittest passes and the selection is fixed in the test case example.


(In reply to martin_hosken from comment #4)
> Eat your own dogfood Martin. OK here's the updated section of the
> example/clusters.c
> 
>         unsigned int nAfter = gr_slot_after(is) + 1;
>         unsigned int cAfter = nAfter < numCodePoints ?
>                    gr_cinfo_base(gr_seg_cinfo(seg, nAfter)) : lenstr;
>         if (clusters[ci].base_char + clusters[ci].num_chars < cAfter)
>             clusters[ci].num_chars = cAfter - clusters[ci].base_char;
> 
> where lenstr is the length of the string in code units (not characters,
> which is what numCodePoints is).
> 
> See: 2e8e41c1606633b0f02b0e6634e886c80496bc31
By 'eat my own dogfood' I was meaning that the first code fragment I gave in comment #3 turned out to be faulty when I added it to the cluster.c example code in the graphite codebase, but that the code in comment #4 seems to at least not make things worse, but is taken straight from the cluster.c.

You say that that code from comment #3 didn't fix the bug. I agree. It probably didn't. But did adapted code from comment $4 without the loop scanning surrogates, work? I'm intrigued.

TIA.
I did try and send it to the try server - so you could see for your self (and my result would be verified) 

https://hg.mozilla.org/try/rev/c736647d5f10924561e3ee0589f91be3a3cbbe59

but the try server seems to be currently broken.

(In reply to martin_hosken from comment #8)
> By 'eat my own dogfood' I was meaning that the first code fragment I gave in
> comment #3 turned out to be faulty when I added it to the cluster.c example
> code in the graphite codebase, but that the code in comment #4 seems to at
> least not make things worse, but is taken straight from the cluster.c.
> 
> You say that that code from comment #3 didn't fix the bug. I agree. It
> probably didn't. But did adapted code from comment $4 without the loop
> scanning surrogates, work? I'm intrigued.

I'm guess I'm not understand you correctly. Yes the latest patch I posted based on your code example in comment #4 worked. (unittest passed and selection is test case was no longer broken) The surrogate scanning was done in the calculation of numCodePoints. (Since I couldn't find any readily available function to tell me the number of codepoints in char16_t* str)




> 
> TIA.
Whiteboard: gfx-noted
Attachment #8736755 - Flags: review?(jfkthame)
Attachment #8748553 - Attachment mime type: text/plain → text/html
I'm not sure how well I understand this yet, but can we do a simpler fix that just adds a check for surrogate pairs where needed, without adding a whole new loop over the text? AFAICS, this fixes the problematic behavior; are there other cases where issues would still arise?
Attachment #8748564 - Flags: feedback?(firefox)
(In reply to Jonathan Kew (:jfkthame) from comment #12)
> Created attachment 8748564 [details] [diff] [review]
> Alternative patch to add check for surrogate pairs
> 
> I'm not sure how well I understand this yet, but can we do a simpler fix
> that just adds a check for surrogate pairs where needed, without adding a
> whole new loop over the text? AFAICS, this fixes the problematic behavior;
> are there other cases where issues would still arise?

This seems very similar to the original proof of concept patch I submitted:
https://bug1260825.bmoattachments.org/attachment.cgi?id=8736402

So from my perspective, it's likely to fix all the issues I'm aware of.

However I understood (maybe wrongly) from martin's comment #4 that a different fix approach was required. (which is what I tried to do with:
https://bugzilla.mozilla.org/attachment.cgi?id=8736755&action=edit
)


Thanks!
Flags: needinfo?(martin_hosken)
Note that in gecko we're dealing with UTF-16, and therefore the only case where code units ≠ characters is when we have a surrogate pair; we don't need to deal with the more complex variable-length characters that you'd find in UTF-8. I think Martin's example code aims to support any of the encoding forms, but we don't actually need that, so the simpler check for surrogate pairs should be sufficient.
Status: UNCONFIRMED → NEW
Ever confirmed: true
To be clear - the patch from comment #14 solves this issue for me.

What needs to be done to get this patch into firefox?
Does the unittest from comment #1 need cleaned up and included in the patch?

Thanks.
I was hoping Martin would take another look, too, and either confirm that for our purposes this patch should be fine, or explain why we need to do something more complex.

I've just pinged him by email to see if he has any comments to offer.
Shooting from the hip. Yes the SURROGATE test would do the job just fine. I'm just wondering about variation selectors. In Graphite they are treated as full characters, so shouldn't be an issue. But always worth a 'hmm must think that through'. Will take a closer look tomorrow, if you want.
Attachment #8736755 - Flags: review?(jfkthame)
Attachment #8748564 - Flags: feedback?(firefox) → review?(jmuizelaar)
(In reply to Tom Hindle from comment #15)
> Does the unittest from comment #1 need cleaned up and included in the patch?

Given that the test requires installing a specific font on the test system, I'm not sure it's worth trying to include at this point; the patch here is simple enough that I think we can take it based on manual verification that it fixes the selection issues.

I guess it may be possible to create a reftest testcase for the selection highlighting, which could use @font-face to load the required test font...
Here's a reftest based on the strings from comment 0; with current code, it fails because we measure the <span> with the cyan background incorrectly and highlight varying widths depending on the following text. The patch here fixes this so that we correctly identify the extent of the span in all cases.
Attachment #8753789 - Flags: review?(jmuizelaar)
Assignee: nobody → jfkthame
Status: NEW → ASSIGNED
Attachment #8748564 - Flags: review?(jmuizelaar) → review+
Attachment #8753789 - Flags: review?(jmuizelaar) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/a43b0bd1d0807baf826aa2ac8adb3eaee3447e94
Bug 1260825 - Check for surrogate pairs when processing clusters in graphite-shaped text. r=jrmuizel

https://hg.mozilla.org/integration/mozilla-inbound/rev/1bcf7733815714ed9dfa5675b99c12369c5dfbd3
Bug 1260825 - Reftest for highlighting of text involving surrogate pairs with graphite shaping. r=jrmuizel
https://hg.mozilla.org/mozilla-central/rev/a43b0bd1d080
https://hg.mozilla.org/mozilla-central/rev/1bcf77338157
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
You need to log in before you can comment on or make changes to this bug.