Closed Bug 1500017 Opened Last year Closed Last year

Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain

Categories

(Core :: Graphics: WebRender, enhancement, P2)

Unspecified
Windows 10
enhancement

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: sotaro, Assigned: sotaro)

References

(Blocks 5 open bugs)

Details

Attachments

(6 files, 12 obsolete files)

2.43 KB, text/html
Details
1.10 MB, image/png
Details
996.09 KB, image/png
Details
874.46 KB, image/png
Details
8.49 KB, patch
sotaro
: review+
Details | Diff | Splinter Review
20.78 KB, patch
sotaro
: review+
Details | Diff | Splinter Review
We use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL with DirectComposition on Windows if it is possible to use on Windows. It is for performance reason(Bug 1453991).

But when we use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL we face long wait at RenderCompositorANGLE::WaitForPreviousPresentQuery(). It causes performance problem like Bug 1474595 and Bug 1487564.

https://dxr.mozilla.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorANGLE.cpp#482

Triple buffer increases memory usage. But if we enough performance improvement, it could be a viable choice. WR uses more GPU than non-WR use case. Then WR usage could cause more WaitForPreviousPresentQuery() wait problem.
Blocks: 1474595, 1487564
Attached patch patch - Use triple buffer (obsolete) — Splinter Review
Assignee: nobody → sotaro.ikeda.g
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0f71a01811fa9081260f28053fa6c9e18ece7628&newProject=try&newRevision=8858c1b18de84e27f617de0f0acc4e6d692e168c&framework=1

With attachment 9018185 [details] [diff] [review], we get big performance win at glterrain, tart, tp5o_scroll and tscrollx. There are small performance decrease tsvg_static and tsvgr_opacity. In total, we get big performance improvement.
And attachment 9018185 [details] [diff] [review] addressed Bug 1474595 on my Win10 PC(P50). FPS was changed from 30fps to 60fps.
Comment on attachment 9018185 [details] [diff] [review]
patch - Use triple buffer

:jrmuizel, how do you think about usage of triple buffer?
Attachment #9018185 - Flags: feedback?(jmuizelaar)
Comment on attachment 9018185 [details] [diff] [review]
patch - Use triple buffer

Review of attachment 9018185 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't thought that much about this yet, but perhaps it's worth trying out DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT and GetFrameLatencyWaitableObject() to get more control over what's going on and ensure we're waiting for the right things. Using GetFrameLatencyWaitableObject() will also let us avoid the Sleep()/GetData() loop.
OS: Unspecified → Windows 10
GetFrameLatencyWaitableObject() and "what gecko do" seem to have a bit different idea.

-[1] Waitable handle of GetFrameLatencyWaitableObject() signals when the DXGI adapter has finished presenting a new frame.
-[2] WaitForPreviousPresentQuery() waits until the previous frames work has been completed.

[2] was originally added to gecko by Bug 1260611. It was mainly for TextureClient recycling. Since there was a case that TextureClient was recycled even when D3DTexture of TextureClient is still in use by GPU.


The following explains several DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT usage patterns.
  https://software.intel.com/en-us/articles/sample-application-for-direct3d-12-flip-model-swap-chains
From TextureClient point of view, [2] does not cause the problem. But [1] might cause the problem like Bug 1260611 when triple buffering is used. Since [1] does not care about GPU task complete.
It would be interesting to see profiles of the different scenarios to make the timing more clear.
Also do you have any thoughts on what's right to do here, Matt?
Flags: needinfo?(matt.woodrow)
I got profiles of http://arodic.github.io/p/jellyfish/ (Bug 1474595) on Win10(P50).

-[1] Default gecko
  https://perfht.ml/2PgMVUE

-[2] Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain
  https://perfht.ml/2PleaO9

-[3] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1)
  https://perfht.ml/2PgC0du

-[4] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2)
  https://perfht.ml/2Pgp9Im

-[5] Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)
  https://perfht.ml/2Pf5HMf
In [1], Renderer thread spent majority of time at RenderCompositorANGLE::WaitForPreviousPresentQuery()
In [3] and [4], Renderer thread spent majority of time for waiting for FrameLatencyWaitableObject in mozilla::wr::RenderCompositorANGLE::BeginFrame().
In [2] and [4], Renderer thread spent majority of time as Idle

Form the above, gpu task complete wait/frame wait did not happen when triple buffer was used.
From Comment 11, "Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2)" and "default gecko(double buffer + previous gpu task wait)" were similar performance. The wait mechanisms were different, but both 2 seemed to work the same way as a result.
I got profiles with attachment 9018960 [details] + "PgDn button push scrolling" on Win10(P50).

-[1] Default gecko
  https://perfht.ml/2Pg71hP

-[2] Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain
  https://perfht.ml/2Pbb6E4

-[3] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1)
  https://perfht.ml/2PgRgao

-[4] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2)
  https://perfht.ml/2P9BIFx

-[5] Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)
  https://perfht.ml/2PgJVHK
(In reply to Sotaro Ikeda [:sotaro] from comment #15)

> In [2] and [4], Renderer thread spent majority of time as Idle

Correction:
 In [2] and [5], Renderer thread spent majority of time as Idle
About Comment 18, the result was similar to  Comment 14.

gpu task complete wait/frame wait did not happen when triple buffer was used([2] [5]).
By Comment 11, "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" get best performance. And attachment 9018185 [details] [diff] [review] seems to cause at most 2 frame latency. Then it seems better that to use "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" than attachment 9018185 [details] [diff] [review].
I have some thoughts on the issue. First of all, `WaitForPreviousPresentQuery` seems to be waiting on CPU for GPU to complete the previous frame. This is certainly not what we want. We need more frames to be in flight/queued, and that's what using `FrameLatencyWaitableObject` with `MaximumFrameLatency > 1` gives us.

However, `WaitForPreviousPresentQuery` was introduced, if I read correctly, to allow TextureClient to re-use the GPU surfaces the way it does. And that part is most concerning to me. I would expect the texture client to *check* for which GPU surfaces can be reused at the time it needs, not force the wait on CPU side (unless, say, it doesn't have any spare surfaces to work on and it can't create more, so it has to wait).

So we have 2 issues:
1) controlling the queue length of frames for GPU to process, and CPU to stay ahead
2) controlling the lifetimes of external textures with regards to the work done by GPU

In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would improve 1) but break the current contract of 2) (as suspected in Comment 11).

I don't quite understand why [2] improves performance, given that `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it shouldn't matter how many frames in the swapchain queue we have).

The action plan I see here is rethinking/rewriting the signalling we do from compositor to the texture client. We need to turn the wait into a single check, and make the texture client try to use *other* (spare) surfaces instead of waiting for the one used by GPU. When this is fixed, we can play with `WaitForPreviousPresentQuery` and set up a proper backpressure strategy.
(In reply to Dzmitry Malyshau [:kvark] from comment #22)

> I don't quite understand why [2] improves performance, given that
> `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it
> shouldn't matter how many frames in the swapchain queue we have).

I was wondering the same, understanding this might prove useful. Sotaro do you have any ideas?


(In reply to Dzmitry Malyshau [:kvark] from comment #22)

> So we have 2 issues:
> 1) controlling the queue length of frames for GPU to process, and CPU to
> stay ahead

We should be able to control this independently of other changes, if we increase the length of the swap chain, and also set up our queries to block on the oldest frame, not the previous one.

Adding an extra frame increases memory usage (not just for the buffer itself, but also the shared video textures that now get released later) as well as increases latency for users. The latter part seems like it might be of concern for WebGL, but maybe it's not a big deal (since we have quite a lot of latency already).

Definitely worth having a knob to configure, and we can trial it to see if the trade-offs are worthwhile.


> 2) controlling the lifetimes of external textures with regards to the work
> done by GPU

The idea of the current code was that it only blocks when the D3D queue is full, and if we didn't block here, then we'd start the next frame and just block in the Present() call instead. Blocking here shouldn't actually make anything slower, it just gives us a nice time to send out notification that the textures are now usable again. It's also somewhat simple, since if a subsequent frame wasn't scheduled, we don't need to continually poll for the texture queries to complete.

I do agree that having a proper system for waiting for textures to be released would be preferable, I'm just not sure if it needs to be a blocker for the WR MVP.

> 
> In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of
> making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would
> improve 1) but break the current contract of 2) (as suspected in Comment 11).

I assume the second instance of `WaitForPreviousPresentQuery`  here was meant to be `FrameLatencyWaitableObject`?

Could we use it during EndFrame (where the current blocking query is) and then it would be equivalent?

Note that FrameLatencyWaitableObject is only available in newer versions of D3D11, so we couldn't use this unconditionally when we want to roll out WebRender to all D3D11 users.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #23)
> 
> > 2) controlling the lifetimes of external textures with regards to the work
> > done by GPU
> 
> The idea of the current code was that it only blocks when the D3D queue is
> full, and if we didn't block here, then we'd start the next frame and just
> block in the Present() call instead. 
> 

I confirmed when WaitForPreviousPresentQuery() was removed, wait was moved to mSwapChain->Present().
  https://perfht.ml/2EFxDos
(In reply to Matt Woodrow (:mattwoodrow) from comment #23)
> (In reply to Dzmitry Malyshau [:kvark] from comment #22)
> 
> > I don't quite understand why [2] improves performance, given that
> > `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it
> > shouldn't matter how many frames in the swapchain queue we have).
> 
> I was wondering the same, understanding this might prove useful. Sotaro do
> you have any ideas?

I checked GPU related tasks during http://arodic.github.io/p/jellyfish/ with GPUView.

From GPUView, Device Context Queue always has 2 Present token packets. It seems to be related. The Queue has 2 presentation tasks, but SwapChain has only 2 buffers. One buffer is already used as front buffer.
attachment 9019302 [details] took longer time for DMA task than attachment 9019303 [details]. Comment 25 might be related.
(In reply to Matt Woodrow (:mattwoodrow) from comment #23)
> 
> Note that FrameLatencyWaitableObject is only available in newer versions of
> D3D11, so we couldn't use this unconditionally when we want to roll out
> WebRender to all D3D11 users.

We enable DirectComposition and DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL since Win10. Then when DirectComposition is used, we could always use FrameLatencyWaitableObject.
> The idea of the current code was that it only blocks when the D3D queue is
> full, and if we didn't block here, then we'd start the next frame and just
> block in the Present() call instead. Blocking here shouldn't actually make
> anything slower, it just gives us a nice time to send out notification that
> the textures are now usable again. It's also somewhat simple, since if a
> subsequent frame wasn't scheduled, we don't need to continually poll for the
> texture queries to complete.

Looking at the code now, "RenderCompositorANGLE::EndFrame" does the wait right after presenting a frame. e.g. it presents frame number N and waits for frame number N-1 right after that. If we remove that "WaitForPreviousPresentQuery" then the actual wait will be moved to the *next* "Present()" which I assume happens much later. This should give us more parallelism between GPU and CPU, if I understand the code correctly. Thus, I do think that waiting at the end of "EndFrame()" makes stuff slower - we are shortening effectively shortening the frame queue here.

> > In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of
> > making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would
> > improve 1) but break the current contract of 2) (as suspected in Comment 11).
> 
> I assume the second instance of `WaitForPreviousPresentQuery`  here was
> meant to be `FrameLatencyWaitableObject`?

Yes, thanks for correction!

> Could we use it during EndFrame (where the current blocking query is) and
> then it would be equivalent?

I'm not yet completely clear on what "FrameLatencyWaitableObject" gives us. If there is a texture in flight used by the D3D11 command list that we want to keep unchanged, then a simple query inserted after that command list is submitted would be the only thing we need to track the lifetime of that texture. Tracking the presentation specifically (instead) doesn't give us much, although the settings to automatically control the frame latency are nice.
Attachment #9018185 - Attachment description: patch - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain → patch - Use triple buffer
See Also: → 1260611
(In reply to Sotaro Ikeda [:sotaro] from comment #21)
> By Comment 11, "Use FrameLatencyWaitableObject BufferSize(3)
> MaximumFrameLatency(2)" get best performance. And attachment 9018185 [details] [diff] [review]
> [details] [diff] [review] seems to cause at most 2 frame latency. Then it
> seems better that to use "Use FrameLatencyWaitableObject BufferSize(3)
> MaximumFrameLatency(2)" than attachment 9018185 [details] [diff] [review].

I rethink about it. I suspect that attachment 9018185 [details] [diff] [review] is stricter about frame wait than "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)". Then "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" might get better performance.
(In reply to Sotaro Ikeda [:sotaro] from comment #31)
> I rethink about it. I suspect that attachment 9018185 [details] [diff] [review]
> [review] is stricter about frame wait than "Use FrameLatencyWaitableObject
> BufferSize(3) MaximumFrameLatency(2)". Then "Use FrameLatencyWaitableObject
> BufferSize(3) MaximumFrameLatency(2)" might get better performance.

Then if we also relax frame latency in attachment 9018185 [details] [diff] [review] to 2 frames, we might also get good performance like "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)".
(In reply to Dzmitry Malyshau [:kvark] from comment #30)
>
> > Could we use it during EndFrame (where the current blocking query is) and
> > then it would be equivalent?
> 
> I'm not yet completely clear on what "FrameLatencyWaitableObject" gives us.
> If there is a texture in flight used by the D3D11 command list that we want
> to keep unchanged, then a simple query inserted after that command list is
> submitted would be the only thing we need to track the lifetime of that
> texture. Tracking the presentation specifically (instead) doesn't give us
> much, although the settings to automatically control the frame latency are
> nice.

Big difference of "FrameLatencyWaitableObject" is that it could wait "DXGI adapter has finished presenting a new frame". ID3D11Query could not wait "presenting a new frame" If Comment 32 is right, we do not need to use "FrameLatencyWaitableObject"  in this bug.
Attachment #9018185 - Flags: feedback?(jmuizelaar)
(In reply to Matt Woodrow (:mattwoodrow) from comment #23)
> (In reply to Dzmitry Malyshau [:kvark] from comment #22)
> 
> > 
> > In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of
> > making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would
> > improve 1) but break the current contract of 2) (as suspected in Comment 11).
> 
> I assume the second instance of `WaitForPreviousPresentQuery`  here was
> meant to be `FrameLatencyWaitableObject`?
> 
> Could we use it during EndFrame (where the current blocking query is) and
> then it would be equivalent?

I tested it locally. But it did not work as expected. I checked wait during EndFrame during "BufferSize(2) MaximumFrameLatency(1)". But the wait worked like MaximumFrameLatency(2) on GPUView.

docs.microsoft.com says that "For every frame it renders, the app should wait on this handle before starting any rendering operations. " Then API is not provided to be used after rendering.

 https://docs.microsoft.com/en-us/windows/desktop/api/dxgi1_3/nf-dxgi1_3-idxgiswapchain2-getframelatencywaitableobject
It seems that we need to use 3 buffer, if we want to get enough performance on middle/low end Win10 PC.

From attachment 9019303 [details] and Comment 28, frame present took more than 1 frame interval. Therefore 2 frame needs to be on-going and 1 front frame in total 3 buffers.
Compared attachment 9019302 [details] and attachment 9019635 [details]. From GPUView point of view, Use triple buffer and 2 frames gpu task latency(attachment 9019609 [details] [diff] [review]) is a lot better than Use triple buffer(attachment 9018185 [details] [diff] [review]).
By Comment 39, attachment 9019609 [details] [diff] [review] and attachment 9018534 [details] [diff] [review] have similar talos performance. But from TextureClient recycling point of view, attachment 9019609 [details] [diff] [review] is better.
Priority: -- → P2
From the followings, when double buffer is used, talos performance of "Use double buffer and 1 frames gpu task latency(current gecko)" was best performance. "1 frames gpu task latency" completed "presentation task" earlier than others. It might be related. Then, when we use double buffer, it is better to use current "1 frames gpu task latency".

- "Use double buffer and 2 frames gpu task latency" vs "Use double buffer and 1 frames gpu task latency(current gecko)": defautl gecko's performance was better 
 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e6bd21974200f979554171107dad94f16a9bfafe&newProject=try&newRevision=4572774d5b5fd0fd9647bdc2d96794d2cefb6b93&framework=1

- "Use double buffer and no wait query." vs "Use double buffer and 1 frames gpu task latency(current gecko)": defautl gecko's performance was better.
 https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e6bd21974200f979554171107dad94f16a9bfafe&newProject=try&newRevision=a38b63c3bcd5f7a72bf1b32030c6b9d6fa58b772&framework=1
Depends on: 1502789
Attachment #9018185 - Attachment is obsolete: true
Attachment #9018528 - Attachment is obsolete: true
Attachment #9018532 - Attachment is obsolete: true
Attachment #9018534 - Attachment is obsolete: true
Attachment #9019609 - Attachment is obsolete: true
AsyncImagePipelineManager does not care if TextureHost is direct binding texture. If it is direct binding texture, we need to extend to hold the texture until next WebRender rendering. Because the texture might be still be used by GPU.
Attachment #9021749 - Attachment description: patch - Fix direct binding texture recycling handling in AsyncImagePipelineManager → patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager
Fixed build failure on try.
Attachment #9021749 - Attachment is obsolete: true
Attachment #9021769 - Flags: review?(matt.woodrow)
Attachment #9021775 - Flags: review?(matt.woodrow)
Comment on attachment 9021769 [details] [diff] [review]
patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager

Review of attachment 9021769 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/wr/AsyncImagePipelineManager.cpp
@@ +670,5 @@
>    }
>  }
>  
>  void
> +AsyncImagePipelineManager::ProcessPipelineRemoved(const wr::PipelineId& aPipelineId, const uint64_t aUpdatesCount)

I realize this is based on existing code, but I much prefer referring to these as 'id' rather than 'count'.

I think ideally the existing TransactionId class would be templated so that you could do:

class PipelineIdTag {};
typedef TransactionId<PipelineIdTag> PipelineId;

Might be a nice follow-up :)

@@ +705,5 @@
> +{
> +  MOZ_ASSERT(aTextureHost);
> +
> +  if (aTextureHost->HasIntermediateBuffer()) {
> +    // If texutre is not direct binding texture, gpu already ended to use it.

gpu has already finished using it.

::: gfx/layers/wr/AsyncImagePipelineManager.h
@@ +217,5 @@
>                               wr::TransactionBuilder& aTxn);
>  
> +  // If texture is direct binding texture, keep it until it is not used by GPU.
> +  void HoldUntilNotUsedByGPU(const CompositableTextureHostRef& aTextureHost, uint64_t aUpdatesCount);
> +  void CheckIfTextureHostsNotUsedByGPU();

CheckForTextureHostsNotUsedByGPU?

@@ +270,5 @@
>    };
>    std::queue<UniquePtr<PipelineUpdates>> mUpdatesQueues;
> +
> +  // Queue to store TextureHosts that might still be used by GPU.
> +  std::queue<std::pair<uint64_t, CompositableTextureHostRef>> mTexturesWatingNotUsedByGPU;

Might be easier to reverse the naming here and call it something like mTexturesInUseByGPU.
Attachment #9021769 - Flags: review?(matt.woodrow) → review+
Comment on attachment 9021775 [details] [diff] [review]
patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain

Review of attachment 9021775 [details] [diff] [review]:
-----------------------------------------------------------------

Looks awesome!

::: gfx/webrender_bindings/RenderCompositorANGLE.cpp
@@ +498,3 @@
>      BOOL result;
> +    layers::WaitForGPUQuery(mDevice, mCtx, query, &result);
> +

Another follow-up idea, I see CreateQuery show up a bit in profiles sometime, so recycling the entries in this queue (make it a ring buffer instead of a queue?) would be nice!
Attachment #9021775 - Flags: review?(matt.woodrow) → review+
(In reply to Matt Woodrow (:mattwoodrow) from comment #49)
> Comment on attachment 9021769 [details] [diff] [review]
> patch part 1 - Fix direct binding texture recycling handling in
> AsyncImagePipelineManager
> 
> Review of attachment 9021769 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: gfx/layers/wr/AsyncImagePipelineManager.cpp
> @@ +670,5 @@
> >    }
> >  }
> >  
> >  void
> > +AsyncImagePipelineManager::ProcessPipelineRemoved(const wr::PipelineId& aPipelineId, const uint64_t aUpdatesCount)
> 
> I realize this is based on existing code, but I much prefer referring to
> these as 'id' rather than 'count'.
> 
> I think ideally the existing TransactionId class would be templated so that
> you could do:
> 
> class PipelineIdTag {};
> typedef TransactionId<PipelineIdTag> PipelineId;
> 
> Might be a nice follow-up :)

Thanks for the advice :) I will update it as a follow-up.

> 
> @@ +705,5 @@
> > +{
> > +  MOZ_ASSERT(aTextureHost);
> > +
> > +  if (aTextureHost->HasIntermediateBuffer()) {
> > +    // If texutre is not direct binding texture, gpu already ended to use it.
> 
> gpu has already finished using it.

I will update the comment.

> 
> ::: gfx/layers/wr/AsyncImagePipelineManager.h
> @@ +217,5 @@
> >                               wr::TransactionBuilder& aTxn);
> >  
> > +  // If texture is direct binding texture, keep it until it is not used by GPU.
> > +  void HoldUntilNotUsedByGPU(const CompositableTextureHostRef& aTextureHost, uint64_t aUpdatesCount);
> > +  void CheckIfTextureHostsNotUsedByGPU();
> 
> CheckForTextureHostsNotUsedByGPU?

I will update the comment.

> 
> @@ +270,5 @@
> >    };
> >    std::queue<UniquePtr<PipelineUpdates>> mUpdatesQueues;
> > +
> > +  // Queue to store TextureHosts that might still be used by GPU.
> > +  std::queue<std::pair<uint64_t, CompositableTextureHostRef>> mTexturesWatingNotUsedByGPU;
> 
> Might be easier to reverse the naming here and call it something like
> mTexturesInUseByGPU.

I will update the comment.
(In reply to Matt Woodrow (:mattwoodrow) from comment #50)
> Comment on attachment 9021775 [details] [diff] [review]
> patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL
> SwapChain
> 
> Review of attachment 9021775 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks awesome!
> 
> ::: gfx/webrender_bindings/RenderCompositorANGLE.cpp
> @@ +498,3 @@
> >      BOOL result;
> > +    layers::WaitForGPUQuery(mDevice, mCtx, query, &result);
> > +
> 
> Another follow-up idea, I see CreateQuery show up a bit in profiles
> sometime, so recycling the entries in this queue (make it a ring buffer
> instead of a queue?) would be nice!

Yea, it is a good idea for the follow-up!
Rebased.
Attachment #9021775 - Attachment is obsolete: true
Attachment #9022494 - Flags: review+
attachment 9022493 [details] [diff] [review] and attachment 9022494 [details] [diff] [review] are expected to increase memory usage by using triple buffering and fixing texture recycling for direct bounding textures.
Pushed by sikeda@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d139f68fd874
Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain r=mattwoodrow
Fix build problem on try.
Attachment #9022494 - Attachment is obsolete: true
Flags: needinfo?(sotaro.ikeda.g)
Attachment #9022526 - Flags: review+
Pushed by sikeda@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c27b21d98ce4
Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain r=mattwoodrow
https://hg.mozilla.org/mozilla-central/rev/c27b21d98ce4
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Depends on: 1505259
Big perf wins on QR! \o/

== Change summary for alert #17378 (as of Mon, 05 Nov 2018 08:23:58 GMT) ==

Improvements:

 27%  tscrollx windows10-64-qr opt e10s stylo        1.67 -> 1.21
 23%  glterrain windows10-64-qr opt e10s stylo       1.43 -> 1.10
 19%  tart windows10-64-qr opt e10s stylo            3.79 -> 3.06
  9%  tp5o_scroll windows10-64-qr opt e10s stylo     2.91 -> 2.65

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=17378
Depends on: 1532457
You need to log in before you can comment on or make changes to this bug.