Closed Bug 1756700 Opened 2 years ago Closed 2 years ago

GPU process suspected to introduce ~128ms regression on Moto G5 to COLD VIEW nav start fenix benchmark

Categories

(GeckoView :: Sandboxing, defect, P3)

Unspecified
All

Tracking

(Performance Impact:?, firefox99 fixed)

RESOLVED FIXED
99 Branch
Performance Impact ?
Tracking Status
firefox99 --- fixed

People

(Reporter: mcomella, Assigned: jnicol)

References

Details

Attachments

(2 files)

The GPU process was enabled last week and disabled. Around the same time, we noticed a 128ms regression on Moto G5 to fenix's COLD VIEW nav start benchmark that was then reverted: https://github.com/mozilla-mobile/fenix/issues/23725 Because of that, we suspect the GPU process was the root cause. See the GitHub bug for more details.

Are there instructions anywhere for how to reproduce the test results and take a performance profile locally?

Flags: needinfo?(michael.l.comella)

To reproduce the test results, you can run the cold_view_nav_start benchmark locally with these instructions. You should try to run on a low-end device or you may not be able to notice a difference between your runs.

To take a start up profile, you can follow these instructions. Let me or :mleclair know if you have questions. You can find us in https://matrix.to/#/#perf-android-frontend:mozilla.org

Flags: needinfo?(michael.l.comella)

Thanks Michael, got both the cold_view_nav_start benchmarks and startup profiling to work well!

Okay so in an idea world what should happen is that we launch the GPU process asynchronously when gfxPlatform is initialized, then at some point in the future it finishes launching and GPUProcessHost::InitAfterConnect() gets called, then a wee bit later GPUChild::RecvInitComplete() gets called. At this point the GPU process is ready, so any future callers to EnsureGPUReady() (called by anything that needs a usable GPU process) can proceed immediately without having to call GeckoChildProcessHost::WaitUntilConnected() which we see in the profile.

But instead what's happening is that we only initialize gfxPlatform and launch the GPU process here, then we create a compositor immediately afterwards here. Creating the compositor needs the GPU process to be up and running, so it synchronously blocks in GeckoChildProcessHost::WaitUntilConnected(). Additionally, as we haven't had time to receive the async GPUChild::RecvInitComplete(), we have to synchronously call GPUChild::SendGetDeviceStatus() to get the GPUDeviceData before we can proceed.

We should try both initializing the gfxPlatform earlier, and creating the compositor later, to hopefully avoid this synchronous wait.

On windows we don't create the compositor until later so it usually avoids this issue. On Android we started deliberately creating the compositor there because of bug 1453501, so I'll need to process that bug and see whether it's still applicable and whether there's an alternative solution

Reading snorp's description in bug 1453501 comment 2, I think that bug is no longer applicable. It occurred because the layer manager (nowadays referred to as a window renderer) returned from GetLayerManager() / GetWindowRenderer() is null in RenderFrameParent::Init() (now RemoteLayerTreeOwner::Initialize()), meaning we passed a null compositor to AllocateAndConnectLayerTreeId() here. But since bug 1741156 we now initialize the compositor on demand in android's nsWindow::GetWindowRenderer(), so it will not return null.

So I believe we are safe to remove the CreateLayerManager() call from nsWindow::Create()

Severity: -- → S3
Priority: -- → P3
Performance Impact: --- → ?

We noticed a cold_view_nav_start regression on Fenix from enabling the
GPU process, and profiles showed time spent synchronously waiting for
the GPU process to launch. This occured because the compositor was
being created in nsWindow::Create, and as it requires the GPU process
to be running it had to block until launch completed. The process is
launched when the gfxPlatform is first initialized, but that was only
occuring immediately prior to creating the compositor, which did not
give it enough time to complete asynchronously.

This patch makes it so that we initialize the gfxPlatform slightly
earlier, and importantly delay creating the compositor until it is
actually required. This gives the process enough time to launch
asynchronously meaning we do not have to block.

We started deliberately creating the compositor early on Android
because of bug 1453501, to avoid a race condition where the compositor
didn't exist when RemoteLayerTreeOwner::Initialize was called, causing
us to use a basic layer manager. However, since bug 1741156 landed we
now create the compositor on-demand, meaning this is no longer a
possibility.

Delaying compositor creation can, however, uncover another race
condition. If the UICompositorControllerChild is opened on the UI
thread before the main thread is able to set its pointer to the
widget, then the java GeckoSession will never be notified that the
compositor has been opened, and composition will never be
resumed. This patch fixes this issue by setting the
UiCompositorControllerChild's widget pointer in its constructor rather
than immediately afterwards.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0946d4ce352f
Delay compositor creation on Android to allow time for GPU process to launch. r=gfx-reviewers,geckoview-reviewers,aosmond,calu

Backed out for causing geckoview crashes on Android

  • Backout link
  • Push with failures
  • Failure Log
  • Failure line: PROCESS-CRASH | org.mozilla.geckoview.test.ScreenshotTest#giantScreenshot | application crashed [@ arena_t::ArenaRunRegAlloc(arena_run_t*, arena_bin_t*)]
Flags: needinfo?(jnicol)

With the later patch in this series to delay initializing the
compositor, we started crashing in ScreenshotTest#giantScreenshot due
to webrender's window size sanity check.

This check panics early if we attempt to render an area larger than
webrender can handle (rather than panicing internally in
webrender). However, this test deliberately creates a 999999x999999
sized window, to ensure that attempting to allocate a bitmap this size
for a screenshot results in an out of memory exception.

Previously this test only succeeded because we created the compositor
early with a default size of 0x0, whereas now we create it after the
widget has its very large size. Additionally, the test completes
before we have a chance to render anything, otherwise it would indeed
have crashed.

To ensure users do not hit the panic in the wild, in bug 1653649 we
added the necessary limit to the default widget size constraints,
ensuring we never create widgets that are too large. On Android,
however, we do not use the size constraints, so this had no effect.

This patch starts applying size constraints to android widgets,
meaning we do not attempt to render too large an area, and webrender
does not panic. The test still attempts to allocate a large bitmap,
and therefore still throws an out of memory exception and passes.

Attachment #9266052 - Attachment description: Bug 1756700 - Enforce widget size constraints on Android. r?calu → Bug 1756700 - Enforce widget size constraints on Android. r?#gfx-reviewers
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f23b890ca61b
Enforce widget size constraints on Android. r=gfx-reviewers,geckoview-reviewers,calu,aosmond
https://hg.mozilla.org/integration/autoland/rev/0d79ee46fa5a
Delay compositor creation on Android to allow time for GPU process to launch. r=gfx-reviewers,geckoview-reviewers,aosmond,calu
Flags: needinfo?(jnicol)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch
Regressions: 1758217
Regressions: 1760794
Regressions: 1760880
No longer regressions: 1760880

Moving GPU process bugs to the new GeckoView::Sandboxing component.

Component: General → Sandboxing
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: