Closed Bug 1505508 Opened 2 years ago Closed 2 years ago

Intermittently missing glyphs on Android

Categories

(Core :: Graphics: WebRender, defect, P3)

65 Branch
Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox65 --- affected

People

(Reporter: kats, Assigned: jnicol)

References

(Blocks 3 open bugs, )

Details

Attachments

(3 files)

Attached image Exhibit A
See attached screenshots. These are both from the GeckoView example app with gfx.webrender.all=true on a Sony Z3C phone. Just started the app which will load mozilla.org by default, and scroll down. In this case the letter 'w' is missing from two different headlines. This seems to happen often but not every time, not sure why.
See Also: → 1492443
I'm looking in to this
Assignee: nobody → jnicol
It seems like when the texture cache is grown, by reallocating the texture with an extra layer, the blits from the old texture to the new are failing. If you enable gfx.webrender.debug.texture-cache you can see another image appear in the overlay, but the existing ones become black. So any glyphs that were drawn previous to the cache growing are now invisible, because our cached version is just a transparent image. Newly drawn glyphs will appear, until the cache is resized again.
cc Glenn and Bobby for input.

I've done some debugging with renderdoc. It appears that glBlitFramebuffer can read from any layer, but always draws to layer 0 of the dest texture, regardless of what layer is bound. I tried glCopyTexSubImage3D instead - it uploads to the correct layer, but always reads from layer 0.

glBlitFramebuffer in to a renderbuffer, then glCopyTexSubImage3D from the renderbuffer to the new texture works, but is 2 copies.

I came across the function glCopyImageSubData when investigating this. It seems like it would be able to copy an entire array in one go, so might perform better than the existing code. It's not included in our gl bindings, and I don't know how to add it, so I haven't been able to test it yet. In any case, it seems relatively recent, so won't be supported by all devices.

So I'm left with thinking using a shader to do the copy might be the best solution?
(In reply to Jamie Nicol [:jnicol] from comment #4)
> I've done some debugging with renderdoc. It appears that glBlitFramebuffer
> can read from any layer, but always draws to layer 0 of the dest texture,
> regardless of what layer is bound. I tried glCopyTexSubImage3D instead - it
> uploads to the correct layer, but always reads from layer 0.

Huh, that's pretty weird. Have you verified that the read FBOs are correctly set up to point the right layers? Jeff, is it plausible that a driver bug would make glBlitFramebuffer force the layer of the bound read target to zero?

 
> I came across the function glCopyImageSubData when investigating this. It
> seems like it would be able to copy an entire array in one go, so might
> perform better than the existing code. It's not included in our gl bindings,
> and I don't know how to add it, so I haven't been able to test it yet. In
> any case, it seems relatively recent, so won't be supported by all devices.

Looks like that isn't supposed by OpenGL ES, so it probably wouldn't be helpful for mobile anyway.
 
> So I'm left with thinking using a shader to do the copy might be the best
> solution?

It could be, though I don't have a good sense of performance. Jeff, any sense of the relative performance of a shader versus the N blit calls we do at [1]?

[1] https://searchfox.org/mozilla-central/rev/d850d799a0009f851b5535580e0a8b4bb2c591d7/gfx/webrender/src/device/gl.rs#1570
Flags: needinfo?(jgilbert)
(In reply to Bobby Holley (:bholley) from comment #5)
> (In reply to Jamie Nicol [:jnicol] from comment #4)
> > I've done some debugging with renderdoc. It appears that glBlitFramebuffer
> > can read from any layer, but always draws to layer 0 of the dest texture,
> > regardless of what layer is bound. I tried glCopyTexSubImage3D instead - it
> > uploads to the correct layer, but always reads from layer 0.
> 
> Huh, that's pretty weird. Have you verified that the read FBOs are correctly
> set up to point the right layers? Jeff, is it plausible that a driver bug
> would make glBlitFramebuffer force the layer of the bound read target to
> zero?
> 
>  
> > I came across the function glCopyImageSubData when investigating this. It
> > seems like it would be able to copy an entire array in one go, so might
> > perform better than the existing code. It's not included in our gl bindings,
> > and I don't know how to add it, so I haven't been able to test it yet. In
> > any case, it seems relatively recent, so won't be supported by all devices.
> 
> Looks like that isn't supposed by OpenGL ES, so it probably wouldn't be
> helpful for mobile anyway.
>  
> > So I'm left with thinking using a shader to do the copy might be the best
> > solution?
> 
> It could be, though I don't have a good sense of performance. Jeff, any
> sense of the relative performance of a shader versus the N blit calls we do
> at [1]?
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> d850d799a0009f851b5535580e0a8b4bb2c591d7/gfx/webrender/src/device/gl.rs#1570

There is GL_EXT_copy_image and GL_OES_copy_image in the ES registry. So it definitely is supported as an extension on ES, it's just a question of which drivers support it or not.
(In reply to Bobby Holley (:bholley) from comment #5)
> (In reply to Jamie Nicol [:jnicol] from comment #4)
> > I've done some debugging with renderdoc. It appears that glBlitFramebuffer
> > can read from any layer, but always draws to layer 0 of the dest texture,
> > regardless of what layer is bound. I tried glCopyTexSubImage3D instead - it
> > uploads to the correct layer, but always reads from layer 0.
> 
> Huh, that's pretty weird. Have you verified that the read FBOs are correctly
> set up to point the right layers? Jeff, is it plausible that a driver bug
> would make glBlitFramebuffer force the layer of the bound read target to
> zero?
This would definitely be a driver bug. Was this on ANGLE? We know ANGLE has some issues here.
>  
> > I came across the function glCopyImageSubData when investigating this. It
> > seems like it would be able to copy an entire array in one go, so might
> > perform better than the existing code. It's not included in our gl bindings,
> > and I don't know how to add it, so I haven't been able to test it yet. In
> > any case, it seems relatively recent, so won't be supported by all devices.
> 
> Looks like that isn't supposed by OpenGL ES, so it probably wouldn't be
> helpful for mobile anyway.
>  
> > So I'm left with thinking using a shader to do the copy might be the best
> > solution?
> 
> It could be, though I don't have a good sense of performance. Jeff, any
> sense of the relative performance of a shader versus the N blit calls we do
> at [1]?
> 
> [1]
> https://searchfox.org/mozilla-central/rev/
> d850d799a0009f851b5535580e0a8b4bb2c591d7/gfx/webrender/src/device/gl.rs#1570

I expect it's always faster to do multiple blits than an MRT shader. I don't expect this to be hot code, either way.
Flags: needinfo?(jgilbert)
This is on Android. It definitely smells like a driver bug, but it affects at least Kats' Z3C and my pixel 2. Both adrenos, but several generations apart.
(In reply to Jamie Nicol [:jnicol] from comment #8)
> This is on Android. It definitely smells like a driver bug, but it affects
> at least Kats' Z3C and my pixel 2. Both adrenos, but several generations
> apart.

Do you have example code that demonstrates this, for my curiosity? This is something I expect the DEQP tests to guarantee, so it's surprising to hear!
(In reply to Jeff Gilbert [:jgilbert] from comment #9)
> (In reply to Jamie Nicol [:jnicol] from comment #8)
> > This is on Android. It definitely smells like a driver bug, but it affects
> > at least Kats' Z3C and my pixel 2. Both adrenos, but several generations
> > apart.
> 
> Do you have example code that demonstrates this, for my curiosity? This is
> something I expect the DEQP tests to guarantee, so it's surprising to hear!

Yeah, that's why I'm wondering if something else is going wrong somewhere.
Is it worth trying the adreno profiler / debugger, to see if that provides any insights to what is happening? https://developer.qualcomm.com/software/adreno-gpu-profiler
I'll put together a reduced test case tomorrow. It would be really nice if it turns out we've just forgotten to bind something somewhere, that was my original suspicion, but unfortunately I don't think that's the case. I'll try to find a non-adreno android phone to test on too.
I put together an android app based on the gles3 sdk example: https://github.com/jamienicol/blitframebuffers-test

The interesting file is RendererES3.cpp, the rest is just boilerplate.

I created several texture arrays with 3 layers. The first I manually uploaded red, green, and blue to each layer.

Then I tried to copy these across to the other textures in different ways:
* glBlitFramebuffer, like we currently do in webrender
* glCopyTexSubImage
* glBlitFramebuffer to a renderbuffer, then glCopyTexSubImage
* glCopyImageSubData

Then I draw each of these to the screen. (From bottom to top: src, blit, copytexsubimage, renderbuffer, copyimagesubdata)

It gets the same results as we're seeing with webrender (though it is of course possible I'm just making the same mistake as webrender): glBlitFramebuffer always draws to layer 0, glCopyTexSubImage always reads from layer 0.
Attached image output of test app
Priority: -- → P3
I managed to find 2 non-adreno phones: Samsung Galaxy S6 (Mali-T760), and Huawei P20 (Mali-G72). It works just fine on those, which points towards this being an adreno bug.

So, assuming it is, we need to pick a fallback that will work everywhere. The extra copy to a temporary renderbuffer would be the easiest, but a shader might be quicker. Then, when available, we could use glCopyImageSubData. It seems to work on all the devices I have on my desk, at least.
I just ran into jgilbert, he said he'd have a look at the testcase to confirm.

The advantage of the shader would be to use identical code everywhere. If we're going to take different paths depending on what's available, we might as well try to detect Adreno and just use the temporary renderbuffer in that case. But divergent codepaths aren't really the WebRender Way, so I'd probably vote for the shader.
Flags: needinfo?(jgilbert)
Definitely looks like a driver bug. This is a relatively basic capability, so it's pretty shocking that it's not working on such a common and modern driver.

I'm going to make a webgl2 test (and standalone jsfiddle) for this, and at least integrate it into the WebGL conformance tests.
Flags: needinfo?(jgilbert)
If we feel strongly against divergent codepaths, I'd like to see if I can't cajole it into doing the right thing.
I can confirm that this fails the same way via webgl2 on Adreno:
https://jsfiddle.net/nu1xdgs3/6/
Here it is with blit, copy, and blit+copy(+invalidate).

I might recommend blit+copy+invalidate over using a shader. It might increase memory bandwidth, but it should stay in the copy engines, instead of hitting the graphics cores.

Also it could conditionally decay into just blit like we want on unaffected systems, if we find we want/are ready to do that.
I understand wanting to avoid diverging code paths, but the perf hit of using a shader on all platforms because of this driver bug seems like a bad deal. Adding the extra copy is only a very minor deviation from the main code path - we could easily do it conditionally. I think in total it would still comfortably be less code than using a shader.
Avoiding divergent code paths is a goal, not a requirement, from my perspective. I think it's fine to make exceptions on a case by case basis (we do this when the subpixel blending extension is present, for example).

In this case, it sounds like it might make sense to have two code paths, if there's a clear performance overhead in the rare workaround path.
Do the Adreno drivers support glCopyImageSubData? If so, and we think this bug is limited to those drivers, then we default to glCopyImageSubData and then fall back to the blit loop we have now, which would then presumably not hit the bug.
(In reply to Bobby Holley (:bholley) from comment #24)
> Do the Adreno drivers support glCopyImageSubData? If so, and we think this
> bug is limited to those drivers, then we default to glCopyImageSubData and
> then fall back to the blit loop we have now, which would then presumably not
> hit the bug.

I think we currently support Adreno 3xx onwards, as they were the first to support GLES 3.0. Unfortunately it looks like only Adreno 4xx onwards support the Android Extension Pack, and therefore glCopyImageSubData. So that leaves the 3xx series needing this workaround.
Servo regressed yesterday to a similar behavior: servo/servo#22232 (Moto G5, GPU Renderer: Adreno 505)
(In reply to Jamie Nicol [:jnicol] from comment #25)
> (In reply to Bobby Holley (:bholley) from comment #24)
> > Do the Adreno drivers support glCopyImageSubData? If so, and we think this
> > bug is limited to those drivers, then we default to glCopyImageSubData and
> > then fall back to the blit loop we have now, which would then presumably not
> > hit the bug.
> 
> I think we currently support Adreno 3xx onwards, as they were the first to
> support GLES 3.0. Unfortunately it looks like only Adreno 4xx onwards
> support the Android Extension Pack, and therefore glCopyImageSubData. So
> that leaves the 3xx series needing this workaround.

Ok, so we'll have three code paths then? First glCopyImageSubData, then the blit loop without the intermediate surface, then with?
(Or would we just always use the intermediate surface if glCopyImageSubData wasn't supported?)
We should have a fast path that isn't CopyImageSubData. Since CopyImageSubData isn't available on some of our affected devices, I think we should just drop CopyImageSubData, since BlitFramebuffer shouldn't be much slower, and has guaranteed availability. (Since it's N<10 Blits, which is unlikely to bottleneck us)

CopyImageSubData went core in ES3.2 and GL4.3:
https://jdashg.github.io/misc/gl/search-headers.html#str=CopyImageSubData&flags=

I don't think it's supported on OSX either.

I recommend just BlitFramebuffer with and without intermediate surface, unless we find the perf is insufficient.
I will say in favour of glCopyImageSubData, that it is easily the simplest method in terms of lines of code, and most directly expresses our intent (hopefully reducing the chances of bugs). It's a shame it won't be available on Mac, or older android devices, but I think it makes sense to use it where available.

Kats, jgilbert, and I discovered today that the bug on Adreno 3xx devices is different than on 4xx and later. The blit seems to work on Jeff's jsfiddle in comment 21. Additionally, the corruption in webrender seems less extreme to me, but is definitely still there, so it might have a different cause. Furthermore, the intermediate renderbuffer workaround does not work on 3xx or 4xx.

With that in mind, I think using glCopyImageSubData where available, and falling back to the current blit loop is the best approach for now. That will leave 3xx not working, which can be investigated later (not blocking the demo).
I still don't grok the value proposition of using glCopyImageSubData. If we still need the blit-loop fallback, and the blit-loop is not a perf bottleneck, I don't understand what we gain by having an additional glCopyImageSubData path. The concrete cost is adding risk to any refactoring of the blit-loop, because the code may or may not be exercised on the developer's machine.
The value is what you suggest in comment 24 - it avoids the Adreno driver bug. The intermediate renderbuffer workaround only works on Adreno 5xx and 6xx, but not 3xx or 4xx. glCopyImageSubData works on 4xx, 5xx, and 6xx.

The bug on 3xx requires additional investigation. As does why the intermediate renderbuffer is failing on 3xx and 4xx. Once we understand those, it might make sense to drop the glCopyImageSubData path for the one true fix. But in the short term it is the simplest way to fix this for as many devices as we can.
Can we get a table of which of {blit,copy,blit+copy} works on [3456]xx? Via kats on IRC, it seems like at least one of 3xx and 4xx Just Works with blit alone or copy alone, though blit+copy fails. Is there a driver where neither blit nor blit+copy works?
|------------------+------+-----+-----+-----|
|                  | 3xx  | 4xx | 5xx | 6xx |
|------------------+------+-----+-----+-----|
| Blit             | y[1] | n   | n   | n   |
| Blit + Copy      | n    | n   | y   | y   |
| CopyImageSubData | n    | y   | y   | y   |
|------------------+------+-----+-----+-----|

[1] While the 3xx passes the jsfiddle blit test, there is still a bug with glyphs disappearing as you scroll. It might be completely unrelated to the blitting bug on 4xx+, or similar but just subtly different. When I investigated this report I used a 5xx and just assumed it was the same problem, but it looks like that might not be the case.
Ok, so that suggests we don't get any value by blit+copy, because the devices on which it would be needed also support CopyImageSubData.

So it seems like we should default to CopyImageSubData and then fall back to blit. We then need to figure out the other, presumably-different bug on 3xx.
PR adding glCopyImageSubData to gleam has been merged. Here is the PR for using it in webrender: https://github.com/servo/webrender/pull/3338.

Heads up to kats: that webrender update will need a gleam update. I'm not sure what that involves.

Once this is fixed we should get another bug on file for the 3xx bug (even though that's originally what this one was for!)
Should we close this, or leave it open for the 3xx bug?
(In reply to Jamie Nicol [:jnicol] from comment #37)
> Should we close this, or leave it open for the 3xx bug?

Given that the 3xx bug is presumed different than the one discussed in this bug, I think we should close this and open a new one with a description of what we know about the 3xx bug.
Flags: needinfo?(jnicol)
I'm also inclined to close this one since most of the discussion here is related to the one that was fixed.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
See Also: → 1513185
Filed bug 1513185 for the Adreno 3xx bug.
Flags: needinfo?(jnicol)
You need to log in before you can comment on or make changes to this bug.