Closed Bug 1500017 Opened 6 years ago Closed 6 years ago

Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla65

Tracking Flags:

Tracking

Status

firefox65

---

fixed

People

(Reporter: sotaro, Assigned: sotaro)

References

Details

Attachments

(6 files, 12 obsolete files)

patch - Use triple buffer 6 years ago Sotaro Ikeda [:sotaro] 1.08 KB, patch		Details \| Diff \| Splinter Review
patch - Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1) 6 years ago Sotaro Ikeda [:sotaro] 5.09 KB, patch		Details \| Diff \| Splinter Review
patch - Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2) 6 years ago Sotaro Ikeda [:sotaro] 5.09 KB, patch		Details \| Diff \| Splinter Review
patch - Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2) 6 years ago Sotaro Ikeda [:sotaro] 5.31 KB, patch		Details \| Diff \| Splinter Review
content for scroll testing 6 years ago Sotaro Ikeda [:sotaro] 2.43 KB, text/html		Details
GPUView diagram of default(using double buffer) gecko 6 years ago Sotaro Ikeda [:sotaro] 1.10 MB, image/png		Details
GPUView diagram of using triple buffer 6 years ago Sotaro Ikeda [:sotaro] 996.09 KB, image/png		Details
patch - Use triple buffer and 2 frames gpu task latency 6 years ago Sotaro Ikeda [:sotaro] 3.61 KB, patch		Details \| Diff \| Splinter Review
GPUView diagram of triple buffer and 2 frames gpu task latency 6 years ago Sotaro Ikeda [:sotaro] 874.46 KB, image/png		Details
patch - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 21.75 KB, patch		Details \| Diff \| Splinter Review
patch - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 21.63 KB, patch		Details \| Diff \| Splinter Review
patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager 6 years ago Sotaro Ikeda [:sotaro] 8.48 KB, patch		Details \| Diff \| Splinter Review
patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 20.75 KB, patch		Details \| Diff \| Splinter Review
patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager 6 years ago Sotaro Ikeda [:sotaro] 8.52 KB, patch	mattwoodrow : review+	Details \| Diff \| Splinter Review
patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 20.77 KB, patch	mattwoodrow : review+	Details \| Diff \| Splinter Review
patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager 6 years ago Sotaro Ikeda [:sotaro] 8.49 KB, patch	sotaro : review+	Details \| Diff \| Splinter Review
patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 20.77 KB, patch	sotaro : review+	Details \| Diff \| Splinter Review
patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain 6 years ago Sotaro Ikeda [:sotaro] 20.78 KB, patch	sotaro : review+	Details \| Diff \| Splinter Review

Sotaro Ikeda [:sotaro]

Assignee

Description

•

6 years ago

We use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL with DirectComposition on Windows if it is possible to use on Windows. It is for performance reason(Bug 1453991). But when we use DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL we face long wait at RenderCompositorANGLE::WaitForPreviousPresentQuery(). It causes performance problem like Bug 1474595 and Bug 1487564. https://dxr.mozilla.org/mozilla-central/source/gfx/webrender_bindings/RenderCompositorANGLE.cpp#482 Triple buffer increases memory usage. But if we enough performance improvement, it could be a viable choice. WR uses more GPU than non-WR use case. Then WR usage could cause more WaitForPreviousPresentQuery() wait problem.

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Blocks: 1474595, 1487564

Sotaro Ikeda [:sotaro]

Assignee

Comment 1

•

6 years ago

Attached patch patch - Use triple buffer (obsolete) — Details — Splinter Review

Assignee: nobody → sotaro.ikeda.g

Sotaro Ikeda [:sotaro]

Assignee

Comment 2

•

6 years ago

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0f71a01811fa9081260f28053fa6c9e18ece7628&newProject=try&newRevision=8858c1b18de84e27f617de0f0acc4e6d692e168c&framework=1 With attachment 9018185 [details] [diff] [review], we get big performance win at glterrain, tart, tp5o_scroll and tscrollx. There are small performance decrease tsvg_static and tsvgr_opacity. In total, we get big performance improvement.

Sotaro Ikeda [:sotaro]

Assignee

Comment 3

•

6 years ago

And attachment 9018185 [details] [diff] [review] addressed Bug 1474595 on my Win10 PC(P50). FPS was changed from 30fps to 60fps.

Sotaro Ikeda [:sotaro]

Assignee

Comment 4

•

6 years ago

Comment on attachment 9018185 [details] [diff] [review] patch - Use triple buffer :jrmuizel, how do you think about usage of triple buffer?

Attachment #9018185 - Flags: feedback?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 5

•

6 years ago

Comment on attachment 9018185 [details] [diff] [review] patch - Use triple buffer Review of attachment 9018185 [details] [diff] [review]: ----------------------------------------------------------------- I haven't thought that much about this yet, but perhaps it's worth trying out DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT and GetFrameLatencyWaitableObject() to get more control over what's going on and ensure we're waiting for the right things. Using GetFrameLatencyWaitableObject() will also let us avoid the Sleep()/GetData() loop.

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

OS: Unspecified → Windows 10

Sotaro Ikeda [:sotaro]

Assignee

Comment 6

•

6 years ago

GetFrameLatencyWaitableObject() and "what gecko do" seem to have a bit different idea. -[1] Waitable handle of GetFrameLatencyWaitableObject() signals when the DXGI adapter has finished presenting a new frame. -[2] WaitForPreviousPresentQuery() waits until the previous frames work has been completed. [2] was originally added to gecko by Bug 1260611. It was mainly for TextureClient recycling. Since there was a case that TextureClient was recycled even when D3DTexture of TextureClient is still in use by GPU. The following explains several DXGI_SWAP_CHAIN_FLAG_FRAME_LATENCY_WAITABLE_OBJECT usage patterns. https://software.intel.com/en-us/articles/sample-application-for-direct3d-12-flip-model-swap-chains

Sotaro Ikeda [:sotaro]

Assignee

Comment 7

•

6 years ago

From TextureClient point of view, [2] does not cause the problem. But [1] might cause the problem like Bug 1260611 when triple buffering is used. Since [1] does not care about GPU task complete.

Sotaro Ikeda [:sotaro]

Assignee

Comment 8

•

6 years ago

Attached patch patch - Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1) (obsolete) — Details — Splinter Review

Sotaro Ikeda [:sotaro]

Assignee

Comment 9

•

6 years ago

Attached patch patch - Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2) (obsolete) — Details — Splinter Review

Sotaro Ikeda [:sotaro]

Assignee

Comment 10

•

6 years ago

Attached patch patch - Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2) (obsolete) — Details — Splinter Review

Sotaro Ikeda [:sotaro]

Assignee

Comment 11

•

6 years ago

-(1) Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1): Performance is worse than default. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1bef6f3b4f799ff10170d8f037048d24919ea771&newProject=try&newRevision=2f4e3f51435202306f7501050a8d564d1d7dce56&framework=1 -(2) Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2): similar performance to default. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1bef6f3b4f799ff10170d8f037048d24919ea771&newProject=try&newRevision=065d8ebd9966304442d531387acb6dde4e493c38&framework=1 -(3) Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2):a lot better performance than default https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1bef6f3b4f799ff10170d8f037048d24919ea771&newProject=try&newRevision=cfc5aac499ff5bf17c3c8fe40e6408feddd97a63&framework=1 (3) has better performance than "Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain". But there is a concern about TextureClient recycling as in Comment 7. We might need to modify around recycling a bit. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e67815b5cb00a5e1fe8288c6689f8abc10c2f64b&newProject=try&newRevision=cfc5aac499ff5bf17c3c8fe40e6408feddd97a63&framework=1

Jeff Muizelaar [:jrmuizel]

Comment 12

•

6 years ago

It would be interesting to see profiles of the different scenarios to make the timing more clear.

Jeff Muizelaar [:jrmuizel]

Comment 13

•

6 years ago

Also do you have any thoughts on what's right to do here, Matt?

Flags: needinfo?(matt.woodrow)

Sotaro Ikeda [:sotaro]

Assignee

Comment 14

•

6 years ago

I got profiles of http://arodic.github.io/p/jellyfish/ (Bug 1474595) on Win10(P50). -[1] Default gecko https://perfht.ml/2PgMVUE -[2] Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain https://perfht.ml/2PleaO9 -[3] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1) https://perfht.ml/2PgC0du -[4] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2) https://perfht.ml/2Pgp9Im -[5] Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2) https://perfht.ml/2Pf5HMf

Sotaro Ikeda [:sotaro]

Assignee

Comment 15

•

6 years ago

In [1], Renderer thread spent majority of time at RenderCompositorANGLE::WaitForPreviousPresentQuery() In [3] and [4], Renderer thread spent majority of time for waiting for FrameLatencyWaitableObject in mozilla::wr::RenderCompositorANGLE::BeginFrame(). In [2] and [4], Renderer thread spent majority of time as Idle Form the above, gpu task complete wait/frame wait did not happen when triple buffer was used.

Sotaro Ikeda [:sotaro]

Assignee

Comment 16

•

6 years ago

From Comment 11, "Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2)" and "default gecko(double buffer + previous gpu task wait)" were similar performance. The wait mechanisms were different, but both 2 seemed to work the same way as a result.

Sotaro Ikeda [:sotaro]

Assignee

Comment 17

•

6 years ago

Attached file content for scroll testing — Details

Sotaro Ikeda [:sotaro]

Assignee

Comment 18

•

6 years ago

I got profiles with attachment 9018960 [details] + "PgDn button push scrolling" on Win10(P50). -[1] Default gecko https://perfht.ml/2Pg71hP -[2] Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain https://perfht.ml/2Pbb6E4 -[3] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(1) https://perfht.ml/2PgRgao -[4] Use FrameLatencyWaitableObject BufferSize(2) MaximumFrameLatency(2) https://perfht.ml/2P9BIFx -[5] Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2) https://perfht.ml/2PgJVHK

Sotaro Ikeda [:sotaro]

Assignee

Comment 19

•

6 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #15) > In [2] and [4], Renderer thread spent majority of time as Idle Correction: In [2] and [5], Renderer thread spent majority of time as Idle

Sotaro Ikeda [:sotaro]

Assignee

Comment 20

•

6 years ago

About Comment 18, the result was similar to Comment 14. gpu task complete wait/frame wait did not happen when triple buffer was used([2] [5]).

Sotaro Ikeda [:sotaro]

Assignee

Comment 21

•

6 years ago

By Comment 11, "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" get best performance. And attachment 9018185 [details] [diff] [review] seems to cause at most 2 frame latency. Then it seems better that to use "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" than attachment 9018185 [details] [diff] [review].

Dzmitry Malyshau [:kvark]

Comment 22

•

6 years ago

I have some thoughts on the issue. First of all, `WaitForPreviousPresentQuery` seems to be waiting on CPU for GPU to complete the previous frame. This is certainly not what we want. We need more frames to be in flight/queued, and that's what using `FrameLatencyWaitableObject` with `MaximumFrameLatency > 1` gives us. However, `WaitForPreviousPresentQuery` was introduced, if I read correctly, to allow TextureClient to re-use the GPU surfaces the way it does. And that part is most concerning to me. I would expect the texture client to *check* for which GPU surfaces can be reused at the time it needs, not force the wait on CPU side (unless, say, it doesn't have any spare surfaces to work on and it can't create more, so it has to wait). So we have 2 issues: 1) controlling the queue length of frames for GPU to process, and CPU to stay ahead 2) controlling the lifetimes of external textures with regards to the work done by GPU In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would improve 1) but break the current contract of 2) (as suspected in Comment 11). I don't quite understand why [2] improves performance, given that `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it shouldn't matter how many frames in the swapchain queue we have). The action plan I see here is rethinking/rewriting the signalling we do from compositor to the texture client. We need to turn the wait into a single check, and make the texture client try to use *other* (spare) surfaces instead of waiting for the one used by GPU. When this is fixed, we can play with `WaitForPreviousPresentQuery` and set up a proper backpressure strategy.

Matt Woodrow (:mattwoodrow)

Comment 23

•

6 years ago

(In reply to Dzmitry Malyshau [:kvark] from comment #22) > I don't quite understand why [2] improves performance, given that > `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it > shouldn't matter how many frames in the swapchain queue we have). I was wondering the same, understanding this might prove useful. Sotaro do you have any ideas? (In reply to Dzmitry Malyshau [:kvark] from comment #22) > So we have 2 issues: > 1) controlling the queue length of frames for GPU to process, and CPU to > stay ahead We should be able to control this independently of other changes, if we increase the length of the swap chain, and also set up our queries to block on the oldest frame, not the previous one. Adding an extra frame increases memory usage (not just for the buffer itself, but also the shared video textures that now get released later) as well as increases latency for users. The latter part seems like it might be of concern for WebGL, but maybe it's not a big deal (since we have quite a lot of latency already). Definitely worth having a knob to configure, and we can trial it to see if the trade-offs are worthwhile. > 2) controlling the lifetimes of external textures with regards to the work > done by GPU The idea of the current code was that it only blocks when the D3D queue is full, and if we didn't block here, then we'd start the next frame and just block in the Present() call instead. Blocking here shouldn't actually make anything slower, it just gives us a nice time to send out notification that the textures are now usable again. It's also somewhat simple, since if a subsequent frame wasn't scheduled, we don't need to continually poll for the texture queries to complete. I do agree that having a proper system for waiting for textures to be released would be preferable, I'm just not sure if it needs to be a blocker for the WR MVP. > > In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of > making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would > improve 1) but break the current contract of 2) (as suspected in Comment 11). I assume the second instance of `WaitForPreviousPresentQuery` here was meant to be `FrameLatencyWaitableObject`? Could we use it during EndFrame (where the current blocking query is) and then it would be equivalent? Note that FrameLatencyWaitableObject is only available in newer versions of D3D11, so we couldn't use this unconditionally when we want to roll out WebRender to all D3D11 users.

Flags: needinfo?(matt.woodrow)

Sotaro Ikeda [:sotaro]

Assignee

Comment 24

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #23) > > > 2) controlling the lifetimes of external textures with regards to the work > > done by GPU > > The idea of the current code was that it only blocks when the D3D queue is > full, and if we didn't block here, then we'd start the next frame and just > block in the Present() call instead. > I confirmed when WaitForPreviousPresentQuery() was removed, wait was moved to mSwapChain->Present(). https://perfht.ml/2EFxDos

Sotaro Ikeda [:sotaro]

Assignee

Comment 25

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #23) > (In reply to Dzmitry Malyshau [:kvark] from comment #22) > > > I don't quite understand why [2] improves performance, given that > > `WaitForPreviousPresentQuery` is still waiting on the previous frame (so it > > shouldn't matter how many frames in the swapchain queue we have). > > I was wondering the same, understanding this might prove useful. Sotaro do > you have any ideas? I checked GPU related tasks during http://arodic.github.io/p/jellyfish/ with GPUView. From GPUView, Device Context Queue always has 2 Present token packets. It seems to be related. The Queue has 2 presentation tasks, but SwapChain has only 2 buffers. One buffer is already used as front buffer.

Sotaro Ikeda [:sotaro]

Assignee

Comment 26

•

6 years ago

Attached image GPUView diagram of default(using double buffer) gecko — Details

Sotaro Ikeda [:sotaro]

Assignee

Comment 27

•

6 years ago

Attached image GPUView diagram of using triple buffer — Details

Sotaro Ikeda [:sotaro]

Assignee

Comment 28

•

6 years ago

attachment 9019302 [details] took longer time for DMA task than attachment 9019303 [details]. Comment 25 might be related.

Sotaro Ikeda [:sotaro]

Assignee

Comment 29

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #23) > > Note that FrameLatencyWaitableObject is only available in newer versions of > D3D11, so we couldn't use this unconditionally when we want to roll out > WebRender to all D3D11 users. We enable DirectComposition and DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL since Win10. Then when DirectComposition is used, we could always use FrameLatencyWaitableObject.

Dzmitry Malyshau [:kvark]

Comment 30

•

6 years ago

> The idea of the current code was that it only blocks when the D3D queue is > full, and if we didn't block here, then we'd start the next frame and just > block in the Present() call instead. Blocking here shouldn't actually make > anything slower, it just gives us a nice time to send out notification that > the textures are now usable again. It's also somewhat simple, since if a > subsequent frame wasn't scheduled, we don't need to continually poll for the > texture queries to complete. Looking at the code now, "RenderCompositorANGLE::EndFrame" does the wait right after presenting a frame. e.g. it presents frame number N and waits for frame number N-1 right after that. If we remove that "WaitForPreviousPresentQuery" then the actual wait will be moved to the *next* "Present()" which I assume happens much later. This should give us more parallelism between GPU and CPU, if I understand the code correctly. Thus, I do think that waiting at the end of "EndFrame()" makes stuff slower - we are shortening effectively shortening the frame queue here. > > In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of > > making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would > > improve 1) but break the current contract of 2) (as suspected in Comment 11). > > I assume the second instance of `WaitForPreviousPresentQuery` here was > meant to be `FrameLatencyWaitableObject`? Yes, thanks for correction! > Could we use it during EndFrame (where the current blocking query is) and > then it would be equivalent? I'm not yet completely clear on what "FrameLatencyWaitableObject" gives us. If there is a texture in flight used by the D3D11 command list that we want to keep unchanged, then a simple query inserted after that command list is submitted would be the only thing we need to track the lifetime of that texture. Tracking the presentation specifically (instead) doesn't give us much, although the settings to automatically control the frame latency are nice.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

6 years ago

Blocks: 1501378

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018185 - Attachment description: patch - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain → patch - Use triple buffer

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Comment 31

•

6 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #21) > By Comment 11, "Use FrameLatencyWaitableObject BufferSize(3) > MaximumFrameLatency(2)" get best performance. And attachment 9018185 [details] [diff] [review] > [details] [diff] [review] seems to cause at most 2 frame latency. Then it > seems better that to use "Use FrameLatencyWaitableObject BufferSize(3) > MaximumFrameLatency(2)" than attachment 9018185 [details] [diff] [review]. I rethink about it. I suspect that attachment 9018185 [details] [diff] [review] is stricter about frame wait than "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)". Then "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)" might get better performance.

Sotaro Ikeda [:sotaro]

Assignee

Comment 32

•

6 years ago

(In reply to Sotaro Ikeda [:sotaro] from comment #31) > I rethink about it. I suspect that attachment 9018185 [details] [diff] [review] > [review] is stricter about frame wait than "Use FrameLatencyWaitableObject > BufferSize(3) MaximumFrameLatency(2)". Then "Use FrameLatencyWaitableObject > BufferSize(3) MaximumFrameLatency(2)" might get better performance. Then if we also relax frame latency in attachment 9018185 [details] [diff] [review] to 2 frames, we might also get good performance like "Use FrameLatencyWaitableObject BufferSize(3) MaximumFrameLatency(2)".

Sotaro Ikeda [:sotaro]

Assignee

Comment 33

•

6 years ago

(In reply to Dzmitry Malyshau [:kvark] from comment #30) > > > Could we use it during EndFrame (where the current blocking query is) and > > then it would be equivalent? > > I'm not yet completely clear on what "FrameLatencyWaitableObject" gives us. > If there is a texture in flight used by the D3D11 command list that we want > to keep unchanged, then a simple query inserted after that command list is > submitted would be the only thing we need to track the lifetime of that > texture. Tracking the presentation specifically (instead) doesn't give us > much, although the settings to automatically control the frame latency are > nice. Big difference of "FrameLatencyWaitableObject" is that it could wait "DXGI adapter has finished presenting a new frame". ID3D11Query could not wait "presenting a new frame" If Comment 32 is right, we do not need to use "FrameLatencyWaitableObject" in this bug.

Sotaro Ikeda [:sotaro]

Assignee

Comment 34

•

6 years ago

Attached patch patch - Use triple buffer and 2 frames gpu task latency (obsolete) — Details — Splinter Review

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018185 - Flags: feedback?(jmuizelaar)

Sotaro Ikeda [:sotaro]

Assignee

Comment 35

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #23) > (In reply to Dzmitry Malyshau [:kvark] from comment #22) > > > > > In this sense, `WaitForPreviousPresentQuery` fixes 2) at the expense of > > making 1) slow. Introducing `WaitForPreviousPresentQuery` instead would > > improve 1) but break the current contract of 2) (as suspected in Comment 11). > > I assume the second instance of `WaitForPreviousPresentQuery` here was > meant to be `FrameLatencyWaitableObject`? > > Could we use it during EndFrame (where the current blocking query is) and > then it would be equivalent? I tested it locally. But it did not work as expected. I checked wait during EndFrame during "BufferSize(2) MaximumFrameLatency(1)". But the wait worked like MaximumFrameLatency(2) on GPUView. docs.microsoft.com says that "For every frame it renders, the app should wait on this handle before starting any rendering operations. " Then API is not provided to be used after rendering. https://docs.microsoft.com/en-us/windows/desktop/api/dxgi1_3/nf-dxgi1_3-idxgiswapchain2-getframelatencywaitableobject

Sotaro Ikeda [:sotaro]

Assignee

Comment 36

•

6 years ago

Attached image GPUView diagram of triple buffer and 2 frames gpu task latency — Details

Sotaro Ikeda [:sotaro]

Assignee

Comment 37

•

6 years ago

It seems that we need to use 3 buffer, if we want to get enough performance on middle/low end Win10 PC. From attachment 9019303 [details] and Comment 28, frame present took more than 1 frame interval. Therefore 2 frame needs to be on-going and 1 front frame in total 3 buffers.

Sotaro Ikeda [:sotaro]

Assignee

Comment 38

•

6 years ago

Compared attachment 9019302 [details] and attachment 9019635 [details]. From GPUView point of view, Use triple buffer and 2 frames gpu task latency(attachment 9019609 [details] [diff] [review]) is a lot better than Use triple buffer(attachment 9018185 [details] [diff] [review]).

Sotaro Ikeda [:sotaro]

Assignee

Comment 39

•

6 years ago

Compare talos performance of http://arodic.github.io/p/jellyfish/ (Bug 1474595) on Win10(P50). -(1) attachment 9019609 [details] [diff] [review] performance was a lot better performance than default gecko https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=53d7249609b3ac9c327d3dff529db9d578b02792&newProject=try&newRevision=4a8dd158300b60eb2c63925e8cdaede658cadd11&framework=1 - (2) attachment 9019609 [details] [diff] [review] and attachment 9018534 [details] [diff] [review] have similar performance. But attachment 9018534 [details] [diff] [review] performance is a bit better. attachment 9018534 [details] [diff] [review] could start rendering just after vsync start. It seems to be related. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a8dd158300b60eb2c63925e8cdaede658cadd11&newProject=try&newRevision=3b3f964566250a58e01321720db65ff04a5d2167&framework=1

Sotaro Ikeda [:sotaro]

Assignee

Comment 40

•

6 years ago

By Comment 39, attachment 9019609 [details] [diff] [review] and attachment 9018534 [details] [diff] [review] have similar talos performance. But from TextureClient recycling point of view, attachment 9019609 [details] [diff] [review] is better.

Maire Reavy [:mreavy]

Updated

•

6 years ago

Blocks: stage-wr-trains

Priority: -- → P2

Sotaro Ikeda [:sotaro]

Assignee

Comment 41

•

6 years ago

From the followings, when double buffer is used, talos performance of "Use double buffer and 1 frames gpu task latency(current gecko)" was best performance. "1 frames gpu task latency" completed "presentation task" earlier than others. It might be related. Then, when we use double buffer, it is better to use current "1 frames gpu task latency". - "Use double buffer and 2 frames gpu task latency" vs "Use double buffer and 1 frames gpu task latency(current gecko)": defautl gecko's performance was better https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e6bd21974200f979554171107dad94f16a9bfafe&newProject=try&newRevision=4572774d5b5fd0fd9647bdc2d96794d2cefb6b93&framework=1 - "Use double buffer and no wait query." vs "Use double buffer and 1 frames gpu task latency(current gecko)": defautl gecko's performance was better. https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=e6bd21974200f979554171107dad94f16a9bfafe&newProject=try&newRevision=a38b63c3bcd5f7a72bf1b32030c6b9d6fa58b772&framework=1

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Depends on: 1502789

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018185 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018528 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018532 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9018534 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9019609 - Attachment is obsolete: true

Comment hidden (obsolete)

Sotaro Ikeda [:sotaro]

Assignee

Comment 43

•

6 years ago

Attached patch patch - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain (obsolete) — Details — Splinter Review

Attachment #9021438 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Comment 44

•

6 years ago

Attached patch patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager (obsolete) — Details — Splinter Review

AsyncImagePipelineManager does not care if TextureHost is direct binding texture. If it is direct binding texture, we need to extend to hold the texture until next WebRender rendering. Because the texture might be still be used by GPU.

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9021749 - Attachment description: patch - Fix direct binding texture recycling handling in AsyncImagePipelineManager → patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager

Sotaro Ikeda [:sotaro]

Assignee

Comment 45

•

6 years ago

Attached patch patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain (obsolete) — Details — Splinter Review

Attachment #9021447 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Comment 46

•

6 years ago

Attached patch patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager (obsolete) — Details — Splinter Review

Fixed build failure on try.

Attachment #9021749 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Comment 47

•

6 years ago

Attached patch patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain (obsolete) — Details — Splinter Review

Rebased.

Attachment #9021754 - Attachment is obsolete: true

Sotaro Ikeda [:sotaro]

Assignee

Comment 48

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=209140129&revision=13a45e1030746483da4c2758991d95d857d6f952

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9021769 - Flags: review?(matt.woodrow)

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Attachment #9021775 - Flags: review?(matt.woodrow)

Matt Woodrow (:mattwoodrow)

Comment 49

•

6 years ago

Comment on attachment 9021769 [details] [diff] [review] patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager Review of attachment 9021769 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/wr/AsyncImagePipelineManager.cpp @@ +670,5 @@ > } > } > > void > +AsyncImagePipelineManager::ProcessPipelineRemoved(const wr::PipelineId& aPipelineId, const uint64_t aUpdatesCount) I realize this is based on existing code, but I much prefer referring to these as 'id' rather than 'count'. I think ideally the existing TransactionId class would be templated so that you could do: class PipelineIdTag {}; typedef TransactionId<PipelineIdTag> PipelineId; Might be a nice follow-up :) @@ +705,5 @@ > +{ > + MOZ_ASSERT(aTextureHost); > + > + if (aTextureHost->HasIntermediateBuffer()) { > + // If texutre is not direct binding texture, gpu already ended to use it. gpu has already finished using it. ::: gfx/layers/wr/AsyncImagePipelineManager.h @@ +217,5 @@ > wr::TransactionBuilder& aTxn); > > + // If texture is direct binding texture, keep it until it is not used by GPU. > + void HoldUntilNotUsedByGPU(const CompositableTextureHostRef& aTextureHost, uint64_t aUpdatesCount); > + void CheckIfTextureHostsNotUsedByGPU(); CheckForTextureHostsNotUsedByGPU? @@ +270,5 @@ > }; > std::queue<UniquePtr<PipelineUpdates>> mUpdatesQueues; > + > + // Queue to store TextureHosts that might still be used by GPU. > + std::queue<std::pair<uint64_t, CompositableTextureHostRef>> mTexturesWatingNotUsedByGPU; Might be easier to reverse the naming here and call it something like mTexturesInUseByGPU.

Attachment #9021769 - Flags: review?(matt.woodrow) → review+

Matt Woodrow (:mattwoodrow)

Comment 50

•

6 years ago

Comment on attachment 9021775 [details] [diff] [review] patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain Review of attachment 9021775 [details] [diff] [review]: ----------------------------------------------------------------- Looks awesome! ::: gfx/webrender_bindings/RenderCompositorANGLE.cpp @@ +498,3 @@ > BOOL result; > + layers::WaitForGPUQuery(mDevice, mCtx, query, &result); > + Another follow-up idea, I see CreateQuery show up a bit in profiles sometime, so recycling the entries in this queue (make it a ring buffer instead of a queue?) would be nice!

Attachment #9021775 - Flags: review?(matt.woodrow) → review+

Sotaro Ikeda [:sotaro]

Assignee

Comment 51

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #49) > Comment on attachment 9021769 [details] [diff] [review] > patch part 1 - Fix direct binding texture recycling handling in > AsyncImagePipelineManager > > Review of attachment 9021769 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: gfx/layers/wr/AsyncImagePipelineManager.cpp > @@ +670,5 @@ > > } > > } > > > > void > > +AsyncImagePipelineManager::ProcessPipelineRemoved(const wr::PipelineId& aPipelineId, const uint64_t aUpdatesCount) > > I realize this is based on existing code, but I much prefer referring to > these as 'id' rather than 'count'. > > I think ideally the existing TransactionId class would be templated so that > you could do: > > class PipelineIdTag {}; > typedef TransactionId<PipelineIdTag> PipelineId; > > Might be a nice follow-up :) Thanks for the advice :) I will update it as a follow-up. > > @@ +705,5 @@ > > +{ > > + MOZ_ASSERT(aTextureHost); > > + > > + if (aTextureHost->HasIntermediateBuffer()) { > > + // If texutre is not direct binding texture, gpu already ended to use it. > > gpu has already finished using it. I will update the comment. > > ::: gfx/layers/wr/AsyncImagePipelineManager.h > @@ +217,5 @@ > > wr::TransactionBuilder& aTxn); > > > > + // If texture is direct binding texture, keep it until it is not used by GPU. > > + void HoldUntilNotUsedByGPU(const CompositableTextureHostRef& aTextureHost, uint64_t aUpdatesCount); > > + void CheckIfTextureHostsNotUsedByGPU(); > > CheckForTextureHostsNotUsedByGPU? I will update the comment. > > @@ +270,5 @@ > > }; > > std::queue<UniquePtr<PipelineUpdates>> mUpdatesQueues; > > + > > + // Queue to store TextureHosts that might still be used by GPU. > > + std::queue<std::pair<uint64_t, CompositableTextureHostRef>> mTexturesWatingNotUsedByGPU; > > Might be easier to reverse the naming here and call it something like > mTexturesInUseByGPU. I will update the comment.

Sotaro Ikeda [:sotaro]

Assignee

Comment 52

•

6 years ago

(In reply to Matt Woodrow (:mattwoodrow) from comment #50) > Comment on attachment 9021775 [details] [diff] [review] > patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL > SwapChain > > Review of attachment 9021775 [details] [diff] [review]: > ----------------------------------------------------------------- > > Looks awesome! > > ::: gfx/webrender_bindings/RenderCompositorANGLE.cpp > @@ +498,3 @@ > > BOOL result; > > + layers::WaitForGPUQuery(mDevice, mCtx, query, &result); > > + > > Another follow-up idea, I see CreateQuery show up a bit in profiles > sometime, so recycling the entries in this queue (make it a ring buffer > instead of a queue?) would be nice! Yea, it is a good idea for the follow-up!

Sotaro Ikeda [:sotaro]

Assignee

Comment 53

•

6 years ago

Attached patch patch part 1 - Fix direct binding texture recycling handling in AsyncImagePipelineManager — Details — Splinter Review

Apply the comment.

Attachment #9021769 - Attachment is obsolete: true

Attachment #9022493 - Flags: review+

Sotaro Ikeda [:sotaro]

Assignee

Comment 54

•

6 years ago

Attached patch patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain (obsolete) — Details — Splinter Review

Rebased.

Attachment #9021775 - Attachment is obsolete: true

Attachment #9022494 - Flags: review+

Sotaro Ikeda [:sotaro]

Assignee

Comment 55

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=209743185&revision=3893e2c1248511bbbba8f1a0870a4fb24e12428a

Sotaro Ikeda [:sotaro]

Assignee

Comment 56

•

6 years ago

attachment 9022493 [details] [diff] [review] and attachment 9022494 [details] [diff] [review] are expected to increase memory usage by using triple buffering and fixing texture recycling for direct bounding textures.

Pulsebot

Comment 57

•

6 years ago

Pushed by sikeda@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d139f68fd874 Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain r=mattwoodrow

Cosmin Sabou [:CosminS]

Comment 58

•

6 years ago

Backed out for build bustages on RenderCompositorANGLE. Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Crunning%2Cpending%2Crunnable&fromchange=d139f68fd874510ece138ac2a8879ab0156072f2&tochange=8bc2bb564ad01e967566688e6544cbcdcd72fa2a&searchStr=windows%2C2012%2Cdebug%2Cbuild-win32-msvc%2Fdebug%2C%28bmsvc%29&selectedJob=209756291 Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=209756291&repo=mozilla-inbound&lineNumber=19173 Backout link: https://hg.mozilla.org/integration/mozilla-inbound/rev/8bc2bb564ad01e967566688e6544cbcdcd72fa2a

Flags: needinfo?(sotaro.ikeda.g)

Sotaro Ikeda [:sotaro]

Assignee

Comment 59

•

6 years ago

Attached patch patch part 2 - Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain — Details — Splinter Review

Fix build problem on try.

Attachment #9022494 - Attachment is obsolete: true

Flags: needinfo?(sotaro.ikeda.g)

Attachment #9022526 - Flags: review+

Sotaro Ikeda [:sotaro]

Assignee

Comment 60

•

6 years ago

The problem of Comment 58 is addressed. https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e1bdbd4b232ad8625afad8099bec885a9450900

Pulsebot

Comment 61

•

6 years ago

Pushed by sikeda@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c27b21d98ce4 Use triple buffer with DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL SwapChain r=mattwoodrow

Raul Gurzau (:RaulG)

Comment 62

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c27b21d98ce4

Status: NEW → RESOLVED

Closed: 6 years ago

status-firefox65: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla65

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Depends on: 1505259

Ionuț Goldan [:igoldan]

Comment 63

•

6 years ago

Big perf wins on QR! \o/ == Change summary for alert #17378 (as of Mon, 05 Nov 2018 08:23:58 GMT) == Improvements: 27% tscrollx windows10-64-qr opt e10s stylo 1.67 -> 1.21 23% glterrain windows10-64-qr opt e10s stylo 1.43 -> 1.10 19% tart windows10-64-qr opt e10s stylo 3.79 -> 3.06 9% tp5o_scroll windows10-64-qr opt e10s stylo 2.91 -> 2.65 For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=17378

Bobby Holley (:bholley)

Updated

•

6 years ago

Blocks: 1416645

Bobby Holley (:bholley)

Updated

•

6 years ago

Blocks: 1497627

Bobby Holley (:bholley)

Updated

•

6 years ago

Blocks: 1497625

Sotaro Ikeda [:sotaro]

Assignee

Updated

•

6 years ago

Depends on: 1532457

You need to log in before you can comment on or make changes to this bug.