Closed Bug 1555544 Opened 8 months ago Closed 8 months ago

Heavy flickering of WebGL background when webrender is on and canvas repeatedly added and removed

Categories

(Core :: Graphics: WebRender, defect, P2)

69 Branch
Unspecified
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- unaffected
firefox67 - wontfix
firefox67.0.1 - wontfix
firefox68 --- wontfix
firefox69 --- verified

People

(Reporter: karlcow, Assigned: sotaro)

References

(Blocks 1 open bug, )

Details

(Keywords: regression)

Attachments

(2 files, 2 obsolete files)

This was initially reported on https://webcompat.com/issues/31728

May trigger epilepsy? Flickers quite rapidly.

This only happens when WebRender is enabled, and only with the background with the rotating cubes.
Bottom frame is normal, top (with black background) is abnormal.

Component: Graphics → Graphics: WebRender

I don't see this with WebRender on Mac on 69

Does this show up in 67?

Flags: needinfo?(kdubost)

I don't know for 67. I was not able to reproduce on mac either. But the report in https://webcompat.com/issues/31728 was someone reporting on Windows 10 with Firefox 69 and it was reproduced by one of our tester.

Flags: needinfo?(kdubost)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #2)

Does this show up in 67?

I was able to reproduce the issue on Dell XPS 15 9550 running windows 10 on 67 with WR enabled.

It looks like this is a probably a problem with WebGL/WebRender interaction.

Assignee: nobody → sotaro.ikeda.g
Summary: Heavy flickering of background when webrender is on → Heavy flickering of WebGL background when webrender is on

I was able to reproduce the problem on my Win 10 laptop P50.

When the problem happened, graphic section of about:support had the following Failure Log. The error was returned without waiting at AcquireSync().

GP+[GFX1]: RenderDXGITextureHostOGL AcquireSync timeout, hr=0x80070057

With local build RenderDXGITextureHostOGL::EnsureLockable() also failed because of device->OpenSharedResource() failure.
https://searchfox.org/mozilla-central/source/gfx/layers/d3d11/TextureD3D11.cpp#758

SurfaceFactory creation and destruction happened very often. It might be related to the problem.

(In reply to Sotaro Ikeda [:sotaro] from comment #8)

SurfaceFactory creation and destruction happened very often. It might be related to the problem.

It happened since WebRenderCanvasData was destroyed/created very often in WebRenderCommandBuilder::BuildWebRenderCommands() by mLastCanvasDatas.Clear();
https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderCommandBuilder.cpp#1542

It did not happen during normal WebGL usage.

SharedSurface_ANGLEShareHandle is wrapped by SharedSurfaceTextureClient. In normal situation, SharedSurface_ANGLEShareHandle is not destroyed before end of usage by RenderDXGITextureHostOGL. But when SurfaceFactory is destroyed, TextureClient::CancelWaitForRecycle() is called. It stop to wait end of the resource usage by RenderDXGITextureHostOGL.
https://searchfox.org/mozilla-central/source/gfx/layers/client/CanvasClient.cpp#502
https://searchfox.org/mozilla-central/source/gfx/gl/SharedSurface.cpp#279

Hi, I managed to get a regression range for this issue, here are the results:

Last known good build:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=f8e6e95065e943a2208675cabb0eb0cdb06eab93&tochange=c7a754017e70c0cddce73fcfbc4c3ba0f1aed97b

First Known Bad build:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=3e792c898f69fabe1bc2387f8c98631d2ec7b7c7&tochange=c7a754017e70c0cddce73fcfbc4c3ba0f1aed97b

I'm not sure but I think this is the changeset that caused the issue:

f8e6e95065e943a2208675cabb0eb0cdb06eab93 Andrei Oprea — Bug 1472599 - Reset browser.startup.page preference for older profiles r=k88hudson

Confirmed that attachment 9068649 [details] addressed the problem for me.

It is not good to use TextureFlags::DEALLOCATE_CLIENT, since TextureClient de-allocation becomes synchronous.
https://searchfox.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#424
https://searchfox.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#338

Actually we do not need to call TextureClient::CancelWaitForRecycle(). Without calling it, TextureClient is destroyed automatically, when host side ends to use it. Removing the CancelWaitForRecycle() call seems like a better solution.

Attachment #9068649 - Attachment is obsolete: true

Confirmed that Attachment 9068661 [details] addressed the problem for me.

Sotaro, do you know why the problem is happening on this page and we haven't otherwise seen this issue? How common do you expect this problem to be?

Flags: needinfo?(sotaro.ikeda.g)

[Tracking Requested - why for this release]: We're not sure how widespread the problem is yet, but based on what we figure out we may want to consider disabling WebRender or doing a dot release.

Priority: P3 → P1

(In reply to Jeff Muizelaar [:jrmuizel] from comment #19)

Sotaro, do you know why the problem is happening on this page and we haven't otherwise seen this issue? How common do you expect this problem to be?

I do not know why it happened. But nsHTMLCanvasFrame seemed to be re-created by NS_NewHTMLCanvasFrame() for every WebRenderLayerManager transaction. I am not sure how the problem is common, though I did not saw such symptom other than this page.

Flags: needinfo?(sotaro.ikeda.g)

Indeed the page has the following code:

        function f() {
            document.getElementById(e).appendChild(t.domElement);
            for (var d = 0; d < s.length; d++)
                s[d].position.y += .1,
                s[d].rotation.x += .02,
                s[d].rotation.y += .02,
                s[d].rotation.z += .02;
            if (i % o == 0) {
                var l = i % (2 * o) == 0 ? 1 : -1
                  , p = new THREE.BoxGeometry(a,a,a)
                  , u = new THREE.Mesh(p,m());
                u.castShadow = !0,
                u.position.x = -1 * c(10),
                u.position.y = -40,
                u.position.z = c(40) * l,
                s.push(u),
                r.add(u)
            }
            requestAnimationFrame(f),
            t.render(r, n),
            i++
        }

This is removing and adding the canvas node every frame which is a bizarre thing to do.

Priority: P1 → P2
Summary: Heavy flickering of WebGL background when webrender is on → Heavy flickering of WebGL background when webrender is on and canvas repeatedly added and removed

(In reply to Sotaro Ikeda [:sotaro] from comment #15)

Actually we do not need to call TextureClient::CancelWaitForRecycle(). Without calling it, TextureClient is destroyed automatically, when host side ends to use it. Removing the CancelWaitForRecycle() call seems like a better solution.

If we remove CancelWaitForRecycle(), shouldn't we expect it to wait for recycle instead, instead of eagerly recycling the TextureClient? It doesn't seem to me like this should fix the issue here, so I either don't understand why the patch is changing what it does, or perhaps why we're getting flickering in the first place.

Flags: needinfo?(sotaro.ikeda.g)

perhaps why we're getting flickering in the first place.

Flickering happened when SharedSurface_ANGLEShareHandle is destroyed before RenderDXGITextureHostOGL::EnsureLockable() is called on Render thread. RenderDXGITextureHostOGL failed at device->OpenSharedResource() . In this case, SharedSurface_ANGLEShareHandle failed to render. Then black was rendered.

SharedSurface was recreated mostly for every transaction. RenderDXGITextureHostOGLs were also created and the EnsureLockable() was called. Some times rendering of RenderDXGITextureHostOGL succeeded, sometimes rendering of RenderDXGITextureHostOGL was failed. When it succeeded, background was rendered as white. When it failed, background was rendered as black. Then flickering happened.

If we remove CancelWaitForRecycle(), shouldn't we expect it to wait for recycle instead, instead of eagerly recycling the TextureClient?

Actual stop recycling is already done by SurfaceFactory::StopRecycling(). TextureClient::CancelWaitForRecycle() does different thing. The function name is not good. With it TextureClient lose refcount that is held when host side is in use. Then if TextureClient::CancelWaitForRecycle() is called, TextureClient is destroyed even when TextureHost is still in use. If it happens, the above problem could happen.

If TextureClient::CancelWaitForRecycle() is not called, the refcount is kept until the host side ends its usage. The refcount is removed by CompositorBridgeChild::NotifyNotUsed()
https://searchfox.org/mozilla-central/source/gfx/layers/ipc/CompositorBridgeChild.cpp#840

With it, SharedSurface_ANGLEShareHandle is kept alive until host side ends its usage. Then the above problem does not happen.

Flags: needinfo?(sotaro.ikeda.g)
Attachment #9068661 - Attachment description: Bug 1555544 - Remove calling TextureClient::CancelWaitForRecycle() → Bug 1555544 - Remove calling TextureClient::CancelWaitForNotifyNotUsed()
Attachment #9068947 - Attachment is obsolete: true
Pushed by sikeda@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/996a636778d0
Change function name from CancelWaitForRecycle() to CancelWaitForNotifyNotUsed() r=jgilbert
https://hg.mozilla.org/integration/autoland/rev/59a62f52fe32
Remove calling TextureClient::CancelWaitForNotifyNotUsed() r=jgilbert
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
See Also: → 1556106

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

[Tracking Requested - why for this release]: We're not sure how widespread the problem is yet, but based on what we figure out we may want to consider disabling WebRender or doing a dot release.

Tracking for 67 but do we know if this is actually a widespread problem or a problem affecting top sites?

Setting qe-verify+ keyword as I would like a manual verification that this is actually fixed on Nightly.

Thanks

Flags: qe-verify+
Flags: needinfo?(jmuizelaar)

(In reply to Pascal Chevrel:pascalc from comment #30)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

[Tracking Requested - why for this release]: We're not sure how widespread the problem is yet, but based on what we figure out we may want to consider disabling WebRender or doing a dot release.

Tracking for 67 but do we know if this is actually a widespread problem or a problem affecting top sites?

The problem seems to happen if the site is recreating the canvas on every frame. I suspect this is quite uncommon. It's probably not common enough to ship in a dot release. It is probably worth adding something to the release notes though.

Flags: needinfo?(jmuizelaar)

Hi again, This issue is verified as fixed in Firefox 69.0a1 (2019-06-03). I will mark it accordingly.

I agree with jrmuizel: This is a bug in how we handle this edgecase, but literally no one should be doing this, so we should probably let this ride the trains and engage with dev-evang to try to talk people out of doing this. Unless this is a huge site that can't migrate away from this edge-case quickly, we probably don't need a dot-release.

Marking as wontfix for 67 as this not seems to be a major problem for this release.

wontfix for 68 based on comment 33.

Hi, since this is a wont fix in 68 I will remove the Qe-verify flag.

Flags: qe-verify+
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.