Closed Bug 1599502 Opened 5 years ago Closed 4 years ago

A long time is spent in glBufferData during draw_instanced_batch (Intel)

Categories

(Core :: Graphics: WebRender, defect, P3)

All
macOS
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: mstange, Unassigned)

References

(Blocks 4 open bugs)

Details

(Keywords: perf)

In this profile of a 1080p VP9 60FPS video playing, we spend 13% of the non-idle time in glBufferData -> gleAcquireBufferData -> gleGetFreeOrphanNode: https://perfht.ml/2XQPazL

It's not clear to me which path from draw_instanced_batch to buffer_data_untyped is being taken here because there's a lot of inlining going on.

One idea I was thinking about earlier today is to create bigger buffers when uploading. That would make the driver do less allocation and internal buffer renaming, hopefully avoiding the associated slowness.

Here is what we do today:

FrameBuilder:
  - make vectors of instance data, one per batch
Renderer:
  - upload texture data
  - for each target
    - for each batch
      - create a buffer with the instance data for *this batch*
      - draw

Instead, we could do the following:

FrameBuilder:
  - make vectors of instance data, one per *type of* batch
  - an actual batch would then just contain the range of that instance vector
Renderer:
  - upload texture data
  - upload all batch data
  - for each target
    - for each batch
      - bind the relevant buffer (that is already on GPU)
      - draw with specified base instance

Aside from having less driver work for managing the buffers (tracking, renaming, allocating), this approach also has a benefit of reducing our heap allocations. It also plays better with the Szeged fork.
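
To make the proposed layout concrete, here is a minimal CPU-side sketch, not WebRender's actual code. The names `BatchKind`, `Instance`, `Batch`, and `FrameInstances` are hypothetical stand-ins, and the GL upload and draw calls are deliberately left out. It only shows how batches can become ranges into one shared instance vector per batch type:

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical batch types; the real renderer has many more.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum BatchKind {
    Image,
    Text,
    Border,
}

/// Hypothetical per-instance payload standing in for the real instance struct.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Instance(pub [i32; 4]);

/// A batch no longer owns its instances. It only names a range
/// inside the shared instance vector for its kind.
#[derive(Clone, Debug, PartialEq)]
pub struct Batch {
    pub kind: BatchKind,
    pub range: Range<usize>,
}

/// Frame-level storage: one instance vector per batch type.
#[derive(Default)]
pub struct FrameInstances {
    per_kind: HashMap<BatchKind, Vec<Instance>>,
    pub batches: Vec<Batch>,
}

impl FrameInstances {
    /// Append the instances to the shared vector for `kind` and
    /// record the batch as a range into that vector.
    pub fn push_batch(&mut self, kind: BatchKind, instances: &[Instance]) {
        let data = self.per_kind.entry(kind).or_default();
        let start = data.len();
        data.extend_from_slice(instances);
        self.batches.push(Batch { kind, range: start..data.len() });
    }

    /// Number of buffer uploads needed per frame under this scheme:
    /// one per batch type, instead of one per batch.
    pub fn upload_count(&self) -> usize {
        self.per_kind.len()
    }

    /// The (kind, base instance, instance count) triple each draw would use.
    pub fn draw_calls(&self) -> Vec<(BatchKind, usize, usize)> {
        self.batches
            .iter()
            .map(|b| (b.kind, b.range.start, b.range.len()))
            .collect()
    }

    /// Instances covered by a batch, looked up through its range.
    pub fn instances(&self, batch: &Batch) -> &[Instance] {
        &self.per_kind[&batch.kind][batch.range.clone()]
    }
}
```

With three batches of two types pushed in the order Image, Text, Image, this produces two uploads rather than three, and the second Image batch draws with a base instance equal to the length of the first. In a real renderer each per-kind vector would be uploaded once before the target loop, and the range start would be passed as the base instance of the instanced draw.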

Blocks: wr-73
Keywords: perf
Priority: -- → P3

dropping wr-73 on the assumption that this is mac-specific

No longer blocks: wr-73
Blocks: wr-intel
Summary: A long time is spent in glBufferData during draw_instanced_batch on my Intel GPU → A long time is spent in glBufferData during draw_instanced_batch on my Intel GPU on Mac

We are seeing issues with that on non-mac platforms as well.

Summary: A long time is spent in glBufferData during draw_instanced_batch on my Intel GPU on Mac → A long time is spent in glBufferData during draw_instanced_batch (Intel)

@Markus: Could you check if this is still occurring?

Flags: needinfo?(mstange.moz)

https://phabricator.services.mozilla.com/D102333 is implementing the instance data consolidation, which reduces the number of PBOs we create for the instance data. The last try push with artifacts is https://treeherder.mozilla.org/jobs?repo=try&revision=c8330a8863a258f68e3b77c3aba8917007e41653 . Where is this reproducible, exactly? If I can't find a good repro case, I'd have to ask one of you guys to test an artifact from this build.

Flags: needinfo?(nical.bugzilla)

(In reply to Kris Taeleman (:ktaeleman) from comment #5)

> @Markus: Could you check if this is still occurring?

I haven't noticed it recently... but I also don't currently get driver symbols in my profiles (bug 1683758), and it's an issue that gets worse over time as we accumulate orphaned PBOs. I'm not sure how to reproduce it. The 1080p video case from comment 0 no longer reproduces it on macOS because those videos are now handled in the native compositor.

Flags: needinfo?(mstange.moz)

I looked at a profile of the Element web client just scrolling back and forth on mac with Intel 550. The CPU time in draw_instanced_batch is about 1%-1.5% of the total, while the number of draw calls is within 50-100. This doesn't reproduce the issue, so I'm unable to optimize it here.

Edit: after playing with the profile some more, I see the total of 6% time in drawing batches.

From a quick profile of scrolling Element on Linux + nvidia with proprietary drivers I see ~14% of the renderer frame time spent in glBufferData under draw_instanced_batch. Note that the total render time was pretty good, so even if 14% is a somewhat significant portion, it's a portion of something small, and I'm not overly worried.

No longer blocks: wr-perf-p1
Flags: needinfo?(nical.bugzilla)

Nicola, are these numbers with or without the change I linked? If the time is already OK and doesn't need fixing, any reason to keep the issue open?

Flags: needinfo?(nical.bugzilla)

These numbers are without the change (just an official nightly build). I just checked on the kangax compatibility table which has loads of primitives and is a bit heavier on the renderer thread (around 16ms Renderer::update on average) https://kangax.github.io/compat-table/es6/

On Linux+Intel, the time spent in glBufferData is 17% of frame building vs 40% spent in glDrawElementsInstanced.
I'm giving this number because I have a Linux box handy, but some Intel+Windows numbers would probably be more useful (Linux tends to do better on average for renderer times).

I think that things used to be worse, and picture caching has helped a lot by papering over driver overhead in common cases. I wouldn't say that this is very high priority, but if you think there is significant room for improvement and the fruit is hanging low enough, we can keep it open.

Flags: needinfo?(nical.bugzilla)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED