Open Bug 1639336 Opened 4 years ago Updated 3 years ago

Investigate high times in target_init GPU profiling bucket on low-end Intel GPUs

Categories: Core :: Graphics: WebRender (task)
People: Reporter: gw, Unassigned
References: Blocks 2 open bugs
Whiteboard: wr-planning
Attachments: 5 files, 1 obsolete file

No description provided.
Whiteboard: wr-planning
Flags: needinfo?(bpeers)
Attached image p400_bytes_cleared.png

Glenn, how did you measure target_init?

It looks like all GPU profile timers collapse into a single paint_time_ns; are we logging the individual timers somewhere else? I'm setting up logging of the number of bytes we clear, to compare against GPU_TAG_SETUP_TARGET and see whether the bandwidth checks out. I'd almost want to keep every single timer tag as a separate field in GpuProfile, but I'm currently fighting Rust to make that happen :) Hence I'm curious what we already do around that.
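As a sketch of the kind of per-tag accumulation I mean (illustrative names only, not the actual GpuProfile fields):

```rust
use std::collections::HashMap;

// Illustrative only -- not the real GpuProfile layout. One way to keep every
// GPU timer tag separate is to accumulate nanoseconds per tag in a map
// instead of collapsing everything into a single paint_time_ns.
fn accumulate_timers(samples: &[(&'static str, u64)]) -> HashMap<&'static str, u64> {
    let mut totals: HashMap<&'static str, u64> = HashMap::new();
    for &(tag, ns) in samples {
        *totals.entry(tag).or_insert(0) += ns;
    }
    totals
}
```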

Once I have those numbers, I'll run on Gen 6.5 to see if time and bandwidth line up.
If not, I can only think of two possible scenarios (other than somehow clearing less):

  • the scissor rect has no effect and we clear the whole thing; drawing quads might be faster; or
  • using scissor rects somehow inhibits use of CMASK and maybe not using scissors at all turns out to be faster.

If you had some other ideas please let me know! Thanks.

(Edit: correction -- that's pixels cleared per frame in the chart, not bytes -- based on scissor.area. And now that I think about it, "no scissor" is probably under-reported. So just ignore the graph basically :roll-eyes: )

Flags: needinfo?(bpeers) → needinfo?(gwatson)
Assignee: nobody → bpeers

If you enable gfx.webrender.debug.profiler and gfx.webrender.debug.gpu-time-queries you should see a graph overlay at the bottom of the profiler which shows the percentage of the GPU frame time spent in each of the GPU timer blocks.

Just beware that enabling those GPU timer queries can have a very significant effect on overall GPU times / stalls, especially when running under ANGLE (but I believe the relative percentages reported for each GPU timer block are still reasonably accurate).

For the clears with scissor, I'm 99% sure that this will mean we're losing fast clear, especially on those older-gen GPUs. However, the idea is that we're scissoring to a (hopefully) much smaller dirty rect than the entire size of the tile, so it should be outweighed by the saving in GPU time not rasterizing the non-dirty part of the tile (but maybe that's not true in some cases / GPUs!).

It's also possible that the reported time is not actually doing useful clear work here - maybe it's actually some deferred work and/or GPU stalls blocked on some kind of synchronization primitive. I think it'd be useful to try and verify if the reported cost is the work of the clear or something else as well.

Flags: needinfo?(gwatson)

Another possibility is that on those older drivers we lose fast clear even if the scissor rect is effectively a no-op (i.e. the dirty rect is the entire tile). That would be an easy fix if that's the case and it occurs often.

We might want to consider testing things like - only setting the scissor / dirty rect if it's < 50% of the tile area, for example, if it turns out that losing the fast clear is a really significant cost.
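A minimal sketch of that heuristic (hypothetical helper, not actual WebRender code): only scissor to the dirty rect when it is small relative to the tile, so a full-tile clear can stay on the driver's fast-clear path.

```rust
// Hypothetical heuristic, not actual WebRender code: only apply a scissor for
// the dirty rect when it covers less than some fraction of the tile, so a
// (nearly) full-tile clear keeps the driver's fast-clear path.
fn should_scissor(dirty_area: u32, tile_area: u32, max_fraction: f32) -> bool {
    (dirty_area as f32) < (tile_area as f32) * max_fraction
}
```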

Some of the older Intel GPUs also seem to only support fast clear if the color is a specific value, which might be relevant here (a good way to see things like this is browsing the Intel open-source Linux GPU driver source :) ).

Attached image hd4600.png

Thanks for the pointers Glenn! I was able to add the logging code that I needed to plot the GPU time with the number of pixels cleared.

With the usual caveats (skew, the timer not measuring just the clear, bugs in my spreadsheet), I measure 0.9 pixels/ns of clearing speed on an HD4600. So 3.6 GB/sec of clearing at an average 4 bytes per pixel (RGBA8, no depth, no stencil), or 5.46 GB/sec at 6 bytes per pixel (RGBA8+D16). I got a bit lazy there reporting pixels instead of bytes, but the absolute numbers are not super important. I just needed a baseline.

For comparison, the 970M on the same laptop gets 7.78 pixels/ns (or 31.1 GB/sec resp. 46.67 GB/sec). About 8 times faster.
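To make the unit conversion explicit (decimal GB; small discrepancies against the quoted figures are just rounding of the measured pixels/ns rate):

```rust
// A clear rate in pixels/ns times bytes per pixel gives bytes/ns, which is
// numerically the same as GB/sec (decimal gigabytes).
fn clear_bandwidth_gb_per_s(pixels_per_ns: f64, bytes_per_pixel: f64) -> f64 {
    pixels_per_ns * bytes_per_pixel
}
```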

Now that I don't have to eyeball changes, I'll start experimenting a bit :) Cheers.

Attached file 970_gpu_profile.log (obsolete) —

The 970M for comparison.

Attached image 970M.png

The 970M for comparison, second try @_@

Attachment #9150646 - Attachment is obsolete: true
Attached image no_scissor_hack.png

Unfortunately, pretty much all the clears are targeting a NativeSurface backend, which is atlased. So even in cases where we're asking for a full 1024x512 clear, the rect gets shifted by the native surface's offset to some arbitrary position. For example, this is typical:

draw_picture_cache_target
      map |rect| Rect(1024×512 at (0,0)) content_origin (0,0)
      NativeSurface offset (1,515)
      clear_target(1) color Some([0.9647059, 0.9647059, 0.9647059, 1.0])
                      depth Some(1.0)
      scissor rect Some(Rect(1024×512 at (1,515)))

I guess we have a 1 pixel border around each atlased chunk in the NativeSurface backend storage.
Further, the offsets seem to be all over the place. I'm not sure if this is fragmentation -- or maybe my logging is broken; I'm even seeing negative offsets:

NativeSurface offset (55,-160)
NativeSurface offset (59,-160)
NativeSurface offset (1,1)
NativeSurface offset (1,-184)
NativeSurface offset (1,330)

Either way, asking for a full 1024x512 seems to be the exception so it might not matter. Most clears would be, say, 942x221, owing to the dirty rect optimization I assume.
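The translation itself is simple; a sketch of what the log above shows (field names hypothetical, not the actual WebRender types):

```rust
// Illustrative (names hypothetical): the tile-local clear rect is translated
// by the native surface's offset inside the atlas, which is why a "full"
// 1024x512 clear at (0,0) shows up scissored at an arbitrary position.
#[derive(Debug, PartialEq)]
struct Rect { x: i32, y: i32, w: i32, h: i32 }

fn atlas_scissor(local: Rect, offset: (i32, i32)) -> Rect {
    Rect { x: local.x + offset.0, y: local.y + offset.1, w: local.w, h: local.h }
}
```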

And to add to the confusion, I removed the scissor rect for this specific clear (draw_picture_cache_target) expecting to see glitches all over the place, and, uh, it looks fine? :| It really shouldn't? :)
It did double the clear perf on Intel, but that's not useful (unless there is some special case for NativeSurfaces that makes this work, and we can in fact do this?).

It's also possible I haven't had enough coffee yet.

Followed the plumbing: no-scissor-rect seems to work because we pass picture_target.dirty_rect into compositor.bind, into RenderCompositorANGLE::Bind, into CreateEGLSurfaceForCompositionSurface, into BeginDraw.

So I speculate that BeginDraw guarantees that drawing out of bounds will have no effect, i.e. it acts like a glViewport that does clip glClear (unlike the real glViewport).

That would explain why the scissor is unnecessary.

If the scissor rect completely contains the rect passed to BeginDraw, and that guarantees the scissor is a no-op, and we know for sure that the NativeSurface is backed by a compositor providing this guarantee (i.e. composition is enabled and we're using ANGLE), then we could consider omitting the scissor rect on platforms where that is faster (or at least not slower) :^)
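As a sketch of that condition (all names hypothetical, not the actual WebRender/ANGLE API):

```rust
// Skip the scissor only when the compositor already clips drawing to the
// BeginDraw rect AND the scissor wouldn't clip anything further, i.e. the
// scissor rect fully contains the BeginDraw rect.
#[derive(Clone, Copy)]
struct Rect { x: i32, y: i32, w: i32, h: i32 }

fn contains(outer: Rect, inner: Rect) -> bool {
    outer.x <= inner.x
        && outer.y <= inner.y
        && outer.x + outer.w >= inner.x + inner.w
        && outer.y + outer.h >= inner.y + inner.h
}

struct CompositorCaps { clips_to_begin_draw_rect: bool }

fn can_omit_scissor(scissor: Rect, begin_draw: Rect, caps: &CompositorCaps) -> bool {
    caps.clips_to_begin_draw_rect && contains(scissor, begin_draw)
}
```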

I have a patch that uses prefs + CompositorCapabilities to do the above, seems to kick in and seems to give a speed up, I'll test a bit more tomorrow to double check the numbers and make sure there's no regressions on any of the platforms I can test on.

Attached image scissor_all.png

Measured on all 5 systems, and that's all well and good, but I'm less and less convinced this optimization is legit -- I couldn't find anything in ANGLE that would guarantee correct behaviour (like silently recovering the original BeginDraw rect and using it for ClearView).

A few bits from stepping-through-Angle:

  • it will already cancel out unnecessary scissors (that encompass entire FBO/RTV);
  • Skylake has this "old drivers need to Clear twice" workaround -- might skew measurements for the worse on that specific platform;
  • Angle supports ClearView to scissor an RTV -- it only issues a quad for the DSV so we might still get a fast clear there.

Some other ideas from Glenn/Jeff on legitimate optimizations:

  • Switch to D16 depth;
  • Clear depth only once and partition its z-range so tiles can re-use depth without needing a clear;
  • Check that the clear rect isn't larger than it needs to be (it's driven by the dirty rect -- does that behave well under scrolling?).
Blocks: wr-gpu-time
No longer blocks: wr-perf

The bug assignee didn't login in Bugzilla in the last 7 months.
:bhood, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: bpeers → nobody
Flags: needinfo?(bhood)
Flags: needinfo?(bhood)