<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Aria Beingessner [:Gankra]

Reporter

Comment 2

•

5 years ago

Thanks, Sotaro, this is extremely helpful!

Comment 3

•

5 years ago

Technically "just" a performance bug now that this is hidden behind a pref from Bug 1599862..?

Blocks: wr-perf

Keywords: perf

Priority: -- → P3

Reporter

Comment 4

•

5 years ago

Performance and battery usage, yes. But there are multiple aspects to the performance of this: not only is there overhead per frame due to the additional copies, there's also a problem with 200ms-500ms of latency during scrolling sometimes. It seems that the discrete GPU can go to sleep, and then doing any kind of GPU work needs to wait for it to wake up.

Updated

•

5 years ago

Blocks: wr-mac-block

Updated

•

4 years ago

Whiteboard: wr-planning

Updated

•

4 years ago

Flags: needinfo?(bpeers)

Comment 5

•

4 years ago

Thank you Markus and Sotaro for providing a lot of context and explanations, that is very helpful!

For starters, just catching up, I have a few questions or things I want to confirm.

First, under Option A, how do we "migrate" between GPUs ? Is this something that happens in the OS, where we continue to use the same GL context, but it just magically got pointed to a new GPU for us? This would effectively be a "device reset" but hidden from the app. Or do we handle it explicitly somewhere?

Second, I'm reading this Apple page, I noticed it says:

This option lets the eGPU accelerate apps on any display connected to the Mac—including displays built in to iMac, iMac Pro, MacBook Air, and MacBook Pro

It sounds like the display and the GPU are not 1:1, maybe there is a compositor step that can present from different sources. So that could mean we don't have to worry about straddling display boundaries and/or moving between displays. Since the GL context seems to be owned by a compositor instance and each window has a compositor (even if they all point back to RenderThread::SharedGL() at the moment), it might be doable to "just" give each window its own GL. It's still not clear which one it should be though -- if you have an eGPU and it's true that it can render on the internal display, you might still want to use the eGPU.

Third, there is only a single ProgramCache and Shaders pair that lives on the RenderThread. So that looks like an implied assumption that all GL contexts are the same, or at the very least generate shaders that are compatible. That probably needs to be a shared cache+shaderset+context per GPU then?

Finally, on Windows all GL contexts point back to RenderThread's SharedGL (on Mac this is maybe not 100% guaranteed, as RenderCompositorOGL stores an mGL which it could theoretically initialize to something unique if sharedGL is null). How do we deal with Optimus-style laptops with an Intel + NVidia setup and the device selection set to "auto" ? How is Mac different, what is missing compared to the behavior on Windows?

Hopefully the above makes sense and is relevant to the conversation :| Thanks!

Flags: needinfo?(bpeers) → needinfo?(mstange)

Reporter

Comment 6

•

4 years ago

(In reply to Bert Peers [:bpeers] from comment #5)

First, under Option A, how do we "migrate" between GPUs ? Is this something that happens in the OS, where we continue to use the same GL context, but it just magically got pointed to a new GPU for us? This would effectively be a "device reset" but hidden from the app. Or do we handle it explicitly somewhere?

We handle it, in GLContextCGL::MigrateToActiveGPU(). The system would do it for us if we used one of the documented (but slow) OpenGL present paths, i.e. if we had an NSOpenGLContext attached to a window via setView, or if we used a CAOpenGLLayer. We don't use either of those; we have an offscreen context and present via IOSurface. This means that certain code in -[NSOpenGLContext update] doesn't kick in for us; this code would usually call CGLSetVirtualScreen. So we have to do it manually.

Important: MigrateToActiveGPU() is currently not called when WebRender is used! That's due to a check that was added in bug 1599862. This check is basically a workaround for an instance of the following disadvantage of "Option A":

When WebRender initializes its GL Device, it makes many decisions based on the renderer's capabilities and those decisions have a large impact

In this case, that decision is "what stride alignment should we use for pbo uploads": https://searchfox.org/mozilla-central/rev/baf1cd492406a9ac31d9ccb7a51c924c7fbb151f/gfx/wr/webrender/src/device/gl.rs#1554,1557
We make it based on the GPU that the GL device is initialized with. If we later switch to a different GPU (here: the discrete AMD GPU) with different requirements, things go bad.

Second, I'm reading this Apple page, I noticed it says:

This option lets the eGPU accelerate apps on any display connected to the Mac—including displays built in to iMac, iMac Pro, MacBook Air, and MacBook Pro

It sounds like the display and the GPU are not 1:1, maybe there is a compositor step that can present from different sources. So that could mean we don't have to worry about straddling display boundaries and/or moving between displays.

I'm not convinced. It's news to me that you can use an eGPU to present to the internal screen - that's nice. But this doesn't say that "whenever an eGPU is used for an external screen, this eGPU also displays the internal screen". It only says that there is an option to do this.

Since the GL context seems to be owned by a compositor instance and each window has a compositor (even if they all point back to RenderThread::SharedGL() at the moment), it might be doable to "just" give each window its own GL. It's still not clear which one it should be though -- if you have an eGPU and it's true that it can render on the internal display, you might still want to use the eGPU.

I agree. You would want to use the eGPU if the window is on a screen that is being displayed by the eGPU, even if that's the internal screen.

Third, there is only a single ProgramCache and Shaders pair that lives on the RenderThread. So that looks like an implied assumption that all GL contexts are the same, or at the very least generate shaders that are compatible. That probably needs to be a shared cache+shaderset+context per GPU then?

Yes. The ProgramCache and Shaders is per context. We have only a single ProgramCache and Shaders because we know that we only have a single context.

Finally, on Windows all GL contexts point back to RenderThread's SharedGL (on Mac this is maybe not 100% guaranteed, as RenderCompositorOGL stores an mGL which it could theoretically initialize to something unique if sharedGL is null).

Ah. We should probably remove that non-shared branch there. It's not expected to ever be hit; SharedGL should never return null.

How do we deal with Optimus-style laptops with an Intel + NVidia setup and the device selection set to "auto" ? How is Mac different, what is missing compared to the behavior on Windows?

I cannot answer this question, I do not know what the behavior on Windows is.

Flags: needinfo?(mstange)

Comment 7

•

4 years ago

Thanks for the follow up Markus.

It sounds like under Option B, when a window needs to change GPUs, we will still have the problem that WebRender must adjust PBO settings etc. to react. So Option A and B both share the problem of WebRender being GPU specific, and our choice of switching virtual screens versus switching (cached) contexts is orthogonal.

Thus I wonder if it would be a useful first step to avoid the lowest-common-denominator problem by recreating WebRender when we switch devices, and re-enable Option A properly? The teardown might not have to be as complete as what Sotaro explains -- we can try to recreate the WebRender inside OGLRenderer, and probably also the ProgramCache and Shaders and see how far that gets us.

In short enabling Option A requires some functionality we'll need for Option B anyway, and it improves things in the meantime, and allows users to start testing this code.

For Option B, if I understand correctly, these are the main concerns?

power management repeatedly switching between iGPU and dGPU, causing overhead. The improvement over option A would be caching of contexts and programs, but again, the WebRender cost will be there either way;
switching based on internal display (the "active" GPU) will never pick the eGPU;
Firefox might be on multiple virtual screens at the same time and should use each GPU to avoid copies.

So we'd continue re-creating a WebRender instance when the Window that owns it switches, but now,

we'd be switching based on "which GPU is driving the display that contains the majority of pixels of this Window", which we somehow need to figure out (in response to a window move/resize?);
we cache every GL context, perhaps keyed by virtual screen (which is unique per pixel format so that should work if we stick to a single pixel format, always), and programs with it, and the switch is now really just refreshing WebRender;

We may need to populate the GL context cache lazily, since eagerly enumerating screens and their GPUs into a context might wake them up / prevent them from sleeping?
Once it's been created, but not currently used by anyone, do we need to release/uncache contexts aggressively to make sure the GPU can sleep? If so, that might cancel out some out the benefits of B (= we're back to cold starts).
Likewise, what happens if the window is on an eGPU that gets removed/unmounted? Is there a callback for this? Or does it all fall under DisplayReconfigurationCallback and we need to use that to scan all our cached GPUs and see if they're still alive?

PS I realized your second link explicitly says "an external GPU cannot drive an internal display", sorry for the distraction.

Reporter

Comment 8

•

4 years ago

(In reply to Bert Peers [:bpeers] from comment #7)

It sounds like under Option B, when a window needs to change GPUs, we will still have the problem that WebRender must adjust PBO settings etc. to react. So Option A and B both share the problem of WebRender being GPU specific, and our choice of switching virtual screens versus switching (cached) contexts is orthogonal.

True, I hadn't thought of it that way. Yes, with both options we need WebRender to be able to be able to update its settings for an existing window.

Thus I wonder if it would be a useful first step to avoid the lowest-common-denominator problem by recreating WebRender when we switch devices, and re-enable Option A properly? The teardown might not have to be as complete as what Sotaro explains -- we can try to recreate the WebRender inside OGLRenderer, and probably also the ProgramCache and Shaders and see how far that gets us.

That's definitely worth a try!

In short enabling Option A requires some functionality we'll need for Option B anyway, and it improves things in the meantime, and allows users to start testing this code.

My guess is that this functionality is actually the bigger work item. Once we have it, I'd expect that going from Option A to Option B would be comparatively easy. In fact, Option A might actually be harder than Option B in the sense that, when a GPU switch occurs, you have to enumerate all renderers and migrate all of them. (I think.) Option B is more of a per-window thing.

For Option B, if I understand correctly, these are the main concerns?

I'm not sure I completely understand the question. Are you asking what the reasons are for preferring Option B over Option A?

power management repeatedly switching between iGPU and dGPU, causing overhead. The improvement over option A would be caching of contexts and programs, but again, the WebRender cost will be there either way;

Yeah, caching of programs is a nice to have but this isn't really a major reason.

switching based on internal display (the "active" GPU) will never pick the eGPU;

Firefox might be on multiple virtual screens at the same time and should use each GPU to avoid copies.

Yes, this is the main reason. Or, more blandly: if Firefox has windows on multiple virtual screens at the same time, how do you even decide which virtual screen to use for the shared GL context? (I guess the answer would be: "use the one that GetFreshContextDisplayMask() picks, just like today".)

So we'd continue re-creating a WebRender instance when the Window that owns it switches, but now,

I assume "now" means "with Option B"?

we'd be switching based on "which GPU is driving the display that contains the majority of pixels of this Window", which we somehow need to figure out (in response to a window move/resize?);

Windows cannot straddle screens on macOS. A window always maps to one GPU. And we can register notifications with macOS that will be called whenever the GPU for a window changes (with a combination of windowDidChangeScreen: and CGDisplayRegisterReconfigurationCallback).

Here's how to map a window to an OpenGL context whose virtual screen matches the GPU that drives the screen that the window is on: If we have one GL context per GPU, then each context's NSOpenGLPixelFormat has a disjoint "display mask" bitset, which can be queried from the pixel format's NSOpenGLPFAScreenMask attribute. If you have an NSWindow* window, then [window screen] gives the NSScreen* that the window is on, [[[screen deviceDescription] objectForKey:@"NSScreenNumber"] unsignedIntValue] gives the current CGDirectDisplayID for that NSScreen*, and CGDisplayIDToOpenGLDisplayMask(displayID) returns a bit in a display mask. If you look up that bit in the GL context bitmasks, you find the correct GL context for the window. See bug 1579664 comment 6 for some example values.

we cache every GL context, perhaps keyed by virtual screen (which is unique per pixel format so that should work if we stick to a single pixel format, always), and programs with it, and the switch is now really just refreshing WebRender;

Yup. I'd cache the contexts keyed by display mask.

We may need to populate the GL context cache lazily, since eagerly enumerating screens and their GPUs into a context might wake them up / prevent them from sleeping?

Yeah, let's do it lazily, there's not really a reason not to. I'd be surprised if enumerating screens would wake up the GPUs, though.

Once it's been created, but not currently used by anyone, do we need to release/uncache contexts aggressively to make sure the GPU can sleep? If so, that might cancel out some out the benefits of B (= we're back to cold starts).

Here's what I thought about this question in bug 1579664 comment 2: "When a CGDirectDisplayID becomes unused, because no window is on a screen with that ID any more, keep around the GLContext if that display ID is for the internal display (CGDisplayIsBuiltin), otherwise throw the GLContext away."

Likewise, what happens if the window is on an eGPU that gets removed/unmounted? Is there a callback for this? Or does it all fall under DisplayReconfigurationCallback and we need to use that to scan all our cached GPUs and see if they're still alive?

The latter, I think.

PS I realized your second link explicitly says "an external GPU cannot drive an internal display", sorry for the distraction.

Ah, thanks for the clarification.

Comment 9

•

4 years ago

we'd be switching based on "which GPU is driving the display that contains the majority of pixels of this Window", which we somehow need to figure out (in response to a window move/resize?);

Windows cannot straddle screens on macOS. A window always maps to one GPU. And we can register notifications with macOS that will be called whenever the GPU for a window changes (with a combination of windowDidChangeScreen: and CGDisplayRegisterReconfigurationCallback).

Ah, interesting. I was going by this page, just below Fig. 1-11: "OpenGL dynamically switches renderers when the virtual screen that contains the majority of the pixels in an OpenGL window changes. ". But the specifics don't matter if there is an OS callback that does the right thing.

My guess is that this functionality is actually the bigger work item. Once we have it, I'd expect that going from Option A to Option B would be comparatively easy. In fact, Option A might actually be harder than Option B in the sense that, when a GPU switch occurs, you have to enumerate all renderers and migrate all of them. (I think.) Option B is more of a per-window thing.

Right, and we definitely want to keep in mind where we're going with B while doing A.
For now just focusing on how to add Option A as a stepping stone, I see a few ways we could go about it:

First option, we could do all the renderer re-creation inside RenderThread, "behind the scenes". WebRenderAPI controls the lifetime of the wr::Renderer instances but they're stored in the RenderThread, accessed through a WindowID. So in theory we could recreate all these renderers without having to recreate the WebRenderAPI instances, ie. without going through AllocPWebRenderBridgeParent?

The change is invisible from the outside (= to the compositor), but a bit hacky on the inside: we need to make the exact same call to wr_window_new that lives in NewRenderer::Run, there could be other unintended side effects from trying to hot-swap renderers behind WebRenderAPI's back. Also as you mentioned maybe display lists need to be resent and such.
Plus, even though an active gpu switch may be a good fit to handle inside RenderThread, something like "Window was moved to a new virtual screen / monitor unplugged" probably not so much.

Second option, we could mark all renderers in the RenderThread as poisoned on a GPU switch, have this poison-flag propagate into the WebRenderAPI instance (?) and then the Compositor that owns the API can detect this and re-create it properly. I think this is Chromium's approach ? It is sort of introducing a concept of a "Context Reset" which the compositor can handle lazily.
Then later, Option B might be implemented by only poisoning specific contexts, eg. if an external GPU goes away, poison those contexts and wait for the compositor to spot it and rebuild.
Unplugging a monitor could then also be done by finding the context(s) and poisoning them?

Third option, a "proper" new message at the compositor level, some variant of SendReinitRenderingForDeviceReset maybe. I'm not very familiar with all this, it looks complicated so I don't know what the benefits would be, or if/why it'd be necessary.

Also, have we tried what happens if we broadcast a device reset when there is an active GPU switch? That sounds like it should "work", just probably with a lot of flicker if we destroy the compositor and all. But it could confirm that it solves our problem (by forcing a webrender re-create the hard way).
(I have a single iGPU in the 13" MBP so I can't try)

Reporter

Comment 10

•

4 years ago

•

Edited

(In reply to Bert Peers [:bpeers] from comment #9)

we'd be switching based on "which GPU is driving the display that contains the majority of pixels of this Window", which we somehow need to figure out (in response to a window move/resize?);

Windows cannot straddle screens on macOS. A window always maps to one GPU. And we can register notifications with macOS that will be called whenever the GPU for a window changes (with a combination of windowDidChangeScreen: and CGDisplayRegisterReconfigurationCallback).

Ah, interesting. I was going by this page, just below Fig. 1-11: "OpenGL dynamically switches renderers when the virtual screen that contains the majority of the pixels in an OpenGL window changes. ".

Interesting. I think at the time when this documentation was written, windows were able to straddle screens, but it got changed in a later release.

But the specifics don't matter if there is an OS callback that does the right thing.

True.

I read your description of the proposed implementations, and they all sound reasonable to me, but I don't know this code well enough to make an educated assessment. I think we need sotaro, nical, gw or kvark here, possibly all of them.

Also, have we tried what happens if we broadcast a device reset when there is an active GPU switch? That sounds like it should "work", just probably with a lot of flicker if we destroy the compositor and all. But it could confirm that it solves our problem (by forcing a webrender re-create the hard way).

I have not tried this. I don't know if the code for device resets is even fully hooked up on macOS.

(I have a single iGPU in the 13" MBP so I can't try)

I think we should get you a machine with two GPUs then, if you want to keep working on this bug.

Comment 11

•

4 years ago

Attached file Bug 1600178 - multi GPU on Mac — Details

Prototype in-place updates, where a message is sent to notify
the render backend, renderer and device that the GPU has changed.
In response, they re-initialize values that are GPU specific, like
optimal PBO stride, and wipe all resources that depend on those in
turn, to be recreated the next frame.

Some RendererOptions are GPU specific, they are split off into a
RendererGpuOptions structure. Their value was set up by
wr_window_new so this code is now shared with the gpu-change
notification.

For testing, this notification is hooked up to the "Capture" event.
In theory this means we should be able to trigger Bug 1579664 and then
"fix it" with Ctrl-Shift-3.

Comment 12

•

4 years ago

I prototyped two ideas for making webrender respond to GPU changes:

send a "GPU changed" message to webrender, similar to "memory pressure"; update parameters that are gpu-specific, and refresh resources;
recreate the wr::Renderer inside RendererOGL; the RendererOGL itself and the WebRenderAPI that "owns" it are both still the same instance.

Don't recreate, just "refresh" the existing wr::Renderer in-place (and the Device inside of it and the caches on the render backend):
https://phabricator.services.mozilla.com/D80118
It's mostly plumbing, for now I've hooked it up to Ctrl-Shift-3. Eventually inside Device (gl.rs) we get a chance to re-evaluate the (new) gl context and set new PBO parameters. If we then also wipe all existing PBOs and recreate them the next frame, then in theory that fixes Bug 1579664: switch GPUs + press Ctrl-Shift-3 = optimal PBO value.

This was slightly complicated by the fact that some of the code that calls wr_window_new fills out the RendererOptions based on the GPU. I've split those values off into a new RendererGpuOptions, and made sure they are re-evaluated and then sent along with the "Gpu Changed" notification.

If refreshing some of the PBO-dependent data requires the compositor to re-send a display list or something, it could do so after calling webrender_api->NotifyGpuChanged().

The idea is to call NotifyGpuChanged() in response to MigrateToActiveGPU.

The other experiment was to hot-swap a brand new wr::Renderer. This Renderer is stored inside a RendererOGL, which itself is "hidden" in the RenderThread, keyed by a WindowID stored from the WebRenderAPI. So in theory it should be possible to call wr_window_new again, and swap the new instance in, with nobody being the wiser.
In theory this is more robust than Option 1, because we don't have to manually adjust and refresh on the fly; it's a brand new webrender instance so it's guaranteed to be 100% about the new GL context. No risk that we missed something or get ourselves into an inconsistent state.

However there are quite a few knock-ons, the encapsulation is a bit leaky. The API channels into the renderer are stored in a few places, so they all start panicking on api_tx.send().expect() when the channel goes dead. Also, a brand new instance has no resources, so all the mKey values for blob data in a DIGroup go stale. There may be more, this is about where I stopped hacking.

So in practice it turns out that a lot of code would need to notified and handle & refresh in response. At which point I don't know if that's an improvement over simply creating a new WebRenderAPI instead. Especially since there is already some code added here (see AdoptChild) for a CompositorParent to accept a new api? (like here). Not sure what that is all about.

Thus I don't have patch for this, it's buggy & crashy , and we might get the same result but a lot more robustly by starting with a device reset and then optimizing it to be less destructive for the special case of "the GPU changed".

Reporter

Comment 13

•

4 years ago

Here's another argument for using different contexts per GPU, and not calling CGLSetVirtualScreen: It'll make it easier to add support for Metal. As far as I know, Metal doesn't have an equivalent of CGLSetVirtualScreen. Instead, there's a Metal device for each GPU.

Comment 14

•

4 years ago

Attached file Bug 1600178 - multiGPU improvements — Details

Push the GL context, surface pool, and other gpu-specific data to the
compositors. Previously they would ask for a SharedGL.

This should make it easier to pass different GL contexts to
different windows.
There is room for multiple contexts keyed by a "Gpu key" but this is not
used yet.

Two exceptions to this one-way flow:

1/ device reset on Angle: remembers the Gpu Key it was created with and
uses it to lazy-recreate the PerGpuData for it after a device reset.

2/ when there is no valid GL created for the GPU (Wayland),
RenderCompositorOGL will lazy-create one using
CreateForCompositorWidget. This breaks the notion that the RenderThread
owns and is aware of all contexts so it can respond when a GPU goes
away.
If null-GL doesn't happen on multi-GPU platforms this doesn't have to be
a showstopper (yet).

Comment 15

•

4 years ago

^-- work in progress of the other side of the problem: storing multiple contexts based on "the GPU". As a first step, there is still only one default GPU and default GL context, but it's being pushed "top down" into the compositors that need it -- instead of them lazily asking for it.

This should help centralize the lifetime (RenderThread owns everything, no more RefPtrs) and send different contexts to different destinations eventually.

RenderCompositorOGL throws a bit of a wrench into this, and device-reset also breaks the one-way-flow. But as a first step, it does work, ie. still getting WR enabled rendering on Angle, it survives a Device Reset (via the about:support button), and same on Linux (VM).

I don't plan to land this in this state, but wanted to share the WIP, hopefully the general vibe of it makes sense. Thanks.

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → bpeers

Status: NEW → ASSIGNED

Updated

•

4 years ago

No longer blocks: wr-mac-block

BugBot [:suhaib / :marco/ :calixte]

Reporter

Comment 16

•

4 years ago

I think the patch is worth taking, but in a different bug.

Let's file new bugs on the individual steps we identified for our plan here, and resolve this bug, because we do have a plan now! :)

And the short-term plan is already completed: We now support GPU switching on macOS, via CGLSetVirtualScreen (bug 1650475). WR initialization on macOS now always chooses to do all macOS driver workarounds all the time (both the 256 stride alignment, and not using texture storage), so that we can switch from a driver that doesn't need the workaround to a driver that does.

Nicolas Silva [:nical]

Updated

•

4 years ago

No longer blocks: wr-perf

Comment 17

•

3 years ago

The bug assignee didn't login in Bugzilla in the last 7 months.
:bhood, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: bpeers → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(bhood)

Bob Hood [:bhood]

Updated

•

3 years ago

Flags: needinfo?(bhood)

Bob Hood [:bhood]

Comment 18

•

3 years ago

Markus, based on your last comment, would you consider this report complete?

Flags: needinfo?(mstange.moz)