Closed Bug 1762424 Opened 2 years ago Closed 2 years ago

Fix GPU process restart on Android 12 and unblock

Categories

(GeckoView Graveyard :: Sandboxing, defect, P1)

Unspecified
Android

Tracking

(firefox-esr91 unaffected, firefox99 unaffected, firefox100 disabled, firefox101 fixed)

RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- unaffected
firefox100 --- disabled
firefox101 --- fixed

People

(Reporter: jnicol, Assigned: jnicol)

References

Details

(Whiteboard: [geckoview:m101])

Attachments

(4 files)

On Android 12, we are unable to create EGL surfaces following a GPU process crash. This is because when the GPU process crashes, its EGL context doesn't get detached from the Android Surface. Subsequent attempts to create a new EGL context therefore fail because it thinks the Surface is already attached.

This was originally filed in bug 1762025, and in that bug we blocked the GPU process on Android 12. In this bug we'll find a workaround to the problem and unblock it.

Here is the relevant excerpt from the logcat:

BufferQueueProducer: [SurfaceView[org.mozilla.fenix/org.mozilla.fenix.App]#11(BLAST Consumer)11](id:29bb0000000b,api:1,p:10818,c:10683) connect: already connected (cur=1 req=1)
libEGL  : eglCreateWindowSurface: native_window_api_connect (win=0x7ab34a728440) failed (0xffffffea) (already connected to another API?)
libEGL  : eglCreateWindowSurfaceTmpl:676 error 3003 (EGL_BAD_ALLOC)
Gecko   : [GFX1-]: Failed to create EGLSurface
CompositorBridgeParent: Unable to renew compositor surface; remaining in paused state

Here is my prototype GPU process Android app: https://github.com/jamienicol/GpuProcessTest

It encounters the same error as Firefox does on Android 12. I've managed to find a solution using the SurfaceControl APIs. Next I'll see if I can port that over to Gecko.

This fix blocks re-enabling the GPU process on Android 12+.

Setting priority P1 because it looks like Jamie is actively working on this bug now.

Severity: -- → S2
Priority: -- → P1
Whiteboard: [geckoview:m101]

So this bug is caused by the fact that on Android 12, when the process which attached an EGL Surface to a Surface dies, the EGL surface does not properly get disconnected. Subsequent attempts to create a new EGL surface fail, because the Surface thinks it is already connected to an API.

This only affects android 12 (and potentially later), and only when the Surface comes from a SurfaceView, not eg when rendering in to a SurfaceTexture. This appears to be because the SurfaceView uses the BlastBufferQueue implementation under the hood, which is presumably where the bug lies.

In order to work around this bug, we can pass the SurfaceControl object associated with the SurfaceView we are rendering in to across to the GPU process (in addition to passing the Surface like it already does. We still require the normal Surface in order to query the window size, but rather than rendering in to the normal Surface the GPU process can create a child Surface from that SurfaceControl, and render in to that.

If the GPU process dies, the child Surface is correctly disconnected from the SurfaceControl, meaning we are able to successfully create a new one and resume rendering. There doesn't appear to be any negative performance implications from my testing, and adb shell dumpsys SurfaceFlinger output looks sensible - I'm confident we are still getting directly composited and not causing any additional work.

Agi, this would require adding new overloads of the API GeckoDisplay.surfaceChanged(), with SurfaceControl arguments. It would be the embedder's responsibility to call this variant to supply the SurfaceControl when rendering in to a SurfaceView and on API >= 29. (The bug technically doesn't exist until 31, but the SurfaceControl APIs were added in 29, so it's probably less confusing to state we require it from then.)

I don't think we have any way of detecting if this contract is broken, but we can try to document it as best we can. And if the embedder did not provide a SurfaceControl, and we therefore find we are unable to resume after a GPU process crash, then we can cause a parent process crash meaning they are no worse off than if the GPU process was disabled.

Does that sound reasonable?

Flags: needinfo?(agi)

That sounds fine. From what I'm reading it's similar to what Chromium does. The only concern I have is that we do support TextureView as a View backend: https://searchfox.org/mozilla-central/rev/0e93b94f4c2045c6a5f5260ee48bbf7a94a993bc/mobile/android/geckoview/src/main/java/org/mozilla/geckoview/GeckoView.java#316 and my undestanding is that TextureView cannot be used with SurfaceControl?

Could we track whether the surface can be used with SurfaceControl or not? That's what Chromium seems to be doing: https://source.chromium.org/chromium/chromium/src/+/main:content/public/android/java/src/org/chromium/content/common/SurfaceWrapper.java;l=26;drc=0ebeaa8dcad7a2f6de4e1640da4368e1d775a432

Flags: needinfo?(agi)

Correct, I don't believe TextureView/SurfaceTexture can be used with SurfaceControl. But the good news is when using a TextureView (or just a plain offscreen SurfaceTexture as we do in our junit tests) we don't encounter this bug. It's only with SurfaceView. So in the TextureView case, the embedder can use the existing GeckoDisplay.surfaceChanged() API, or pass null as the SurfaceControl argument to the new overloads I would add. And GPU process restarts will already work as expected.

Chromium appears to use SurfaceControl through the NDK API ASurfaceControl. Which you can create from an ANativeWindow with ASurfaceControl_createFromWindow(), where the ANativeWindow can be obtained from the Surface. So they only send the java Surface object to the compositor, and create the SurfaceControl on the compositor end. This means they also need to send a flag indicating whether the Surface can be used with SurfaceControl, which indeed seems to be set to true when it comes from a SurfaceView callback and false when it comes from a TextureView one: https://source.chromium.org/chromium/chromium/src/+/main:weblayer/browser/java/org/chromium/weblayer_private/ContentViewRenderView.java;l=610;drc=6752e8fb5180a7a66206b4ba947b44d137d04ad1

Currently we need to use the Java SurfaceControl API to achieve this goal, rather than the NDK one. The NDK API deals with individual AHardwareBuffers, whereas the Java one allows you to give the SurfaceControl a Surface (a queue of buffers) to manage. We render in to an EGL surface attached to a Surface, meaning the system deals with queueing and dequeing buffers from it for us, whereas I believe Chrome handles their own queue of buffers that they render in to, meaning they can pass the individual buffers to the surface control. This would certainly be something I'd like to implement in the future (we could do neat things like hand video frames and webrender content tiles directly to hardware composer), but we're not there yet. There also doesn't appear to be a method to obtain a java SurfaceControl from an NDK ASurfaceControl, or vice versa. So we must send the java SurfaceControl object.

Essentially whether our java SurfaceControl object is non-null or null would act as our canBeUsedWithSurfaceControl flag, so we don't need the additional flag. Does that make sense?

And android.view.SurfaceControl$Transaction.

In order to generate these bindings we must increase the maximum SDK
version argument we pass to SDKProcessor to 29. However, doing so
means that we now attempt to generate bindings for both Surface's
Surface(SurfaceTexture) and Surface(SurfaceControl)
constructors. These both translate in to C++ functions with identical
signatures - Surface::New(jni::Object::Param) - causing a compilation
failure.

This patch therefore also allows the stubName property to override
constructor names, and overrides the latter to
Surface::FromSurfaceControl(), thereby avoiding the conflict.

This adds new overloads to the GeckoView API
GeckoDisplay.surfaceChanged(), containing an additional SurfaceControl
argument. This must be provided when rendering in to a SurfaceView on
SDK level 29 or greater. On earlier SDK levels, or when rendering in
to a TextureView or SurfaceTexture, the existing surfaceChanged()
methods can be called, or null can be passed as the SurfaceControl
argument. SurfaceViewWrapper and GeckoView classes are updated to
handle this correctly.

When provided, the SurfaceControl is passed along with the Surface
through to the widget and, when enabled, over to the GPU process. The
compositor widget then creates a child Surface from that
SurfaceControl, and renders in to that child Surface rather than the
parent one.

This works around a bug on Android 12 where following the GPU process
dying the Surface was left in an unusable state, meaning subsequent
attempts to initialize a compositor would fail. Because the Surface is
now created by the GPU process it gets destroyed when the process
dies, therefore a new Surface can successfully be created when we
reinitialize the compositor.

Depends on D143484

In bug 1762025 we found ourselves in a situation where we are unable
to ever create an EGLSurface, meaning we were continually attempting
to initialize a compositor, encountering a NEW_SURFACE webrender
error, causing us to tear down the compositor and create a new one,
repeating the cycle indefinitely.

This leaves the user with an unusable browser. We'd be better off
simply crashing. This patch adds a flag to FallbackFromAcceleration(),
which forces a crash if we have already exhausted all of the fallback
options. When calling FallbackFromAcceleration due to a NEW_SURFACE
error we set the flag to true. It may also be worthwhile setting this
flag for more errors in the future.

Depends on D143485

With the previous patches in this series we can properly recover from
a GPU process restart, so unblock it.

Depends on D143486

(In reply to Jamie Nicol [:jnicol] from comment #5)

Essentially whether our java SurfaceControl object is non-null or null would act as our canBeUsedWithSurfaceControl flag, so we don't need the additional flag. Does that make sense?

Makes sense to me! Thanks for explaining.

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9b1523ea3a50
Generate SDK bindings for android.view.SurfaceControl. r=agi
https://hg.mozilla.org/integration/autoland/rev/a7aedf0944b5
Provide SurfaceControl to compositor and render in to child Surface. r=agi,gfx-reviewers,geckoview-reviewers,jrmuizel,owlish
https://hg.mozilla.org/integration/autoland/rev/f6c7c0ed6ebb
Force crash if unable to create Surface after trying all fallback configurations. r=gfx-reviewers,nical
https://hg.mozilla.org/integration/autoland/rev/6d695b6b9769
Unblock GPU process on Android 12. r=gfx-reviewers,nical
Regressions: 1767128
See Also: → 1768197

Moving GPU process bugs to the new GeckoView::Sandboxing component.

Component: General → Sandboxing
See Also: → 1780093
Regressions: 1781180
Product: GeckoView → GeckoView Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: