Bug 995805 (Open) - Opened 10 years ago, Updated 2 years ago

Add a GL buffer (geometry) cache.

Categories

(Core :: Graphics, defect)

x86
macOS

People

(Reporter: gal, Unassigned)

References

Details

Attachments

(1 file, 1 obsolete file)

Bug 980364 bounced on Mac 10.6 because glGenBuffers/glDeleteBuffers are slow (and potentially also glBufferData). We have seen this before with mobile GL drivers as well (Vivante!). When rendering layer trees we tend to upload the same geometries over and over. The right approach is to cache those VBOs and re-use them across frames.
Attached patch Add a GL buffer/geometry cache. (obsolete) — Splinter Review
Comment on attachment 8405917 [details] [diff] [review]
Add a GL buffer/geometry cache.

If you like the approach I will add this to context similar to the other helpers (GLContext->BufferCache()->Bind(target, data, size)).
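
For concreteness, a rough sketch of what such a Bind(target, data, size) helper could look like. The class layout and names below are illustrative assumptions, not the attached patch, and the nsExpirationTracker-based eviction is left out:

// Illustrative sketch only; not the attached patch.
#include <GLES2/gl2.h>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

class BufferCache {
  struct Entry {
    GLuint buffer;              // cached GL buffer object
    std::vector<uint8_t> data;  // deep copy of the uploaded contents
  };
  std::vector<Entry> mEntries;

public:
  // Bind a buffer whose contents equal |data|, reusing a cached VBO on a
  // match, otherwise creating and uploading a new one.
  void Bind(GLenum target, const void* data, size_t size) {
    for (Entry& e : mEntries) {
      if (e.data.size() == size && memcmp(e.data.data(), data, size) == 0) {
        glBindBuffer(target, e.buffer);  // cache hit: no glBufferData needed
        return;
      }
    }
    Entry e;
    glGenBuffers(1, &e.buffer);
    glBindBuffer(target, e.buffer);
    glBufferData(target, size, data, GL_STATIC_DRAW);
    e.data.assign(static_cast<const uint8_t*>(data),
                  static_cast<const uint8_t*>(data) + size);
    mEntries.push_back(std::move(e));
  }
};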
Attachment #8405917 - Flags: feedback?(bgirard)
Blocks: 980364
Attachment #8405917 - Flags: feedback?(bgirard) → feedback?(bjacob)
Attached patch patch — Splinter Review
Assignee: nobody → gal
Attachment #8405917 - Attachment is obsolete: true
Attachment #8405917 - Flags: feedback?(bjacob)
Attachment #8406302 - Flags: feedback?(bjacob)
Comment on attachment 8406302 [details] [diff] [review]
patch

Review of attachment 8406302 [details] [diff] [review]:
-----------------------------------------------------------------

Hm, neither a feedback+ nor a feedback-. This is complicated.

This buffer cache is caching deep copies of the buffer contents, and stores them for 1 second, registering an expiration tracker to remove them from cache.

This means that under high stress, we could be storing a bunch of buffer copies, with a bunch of expiration trackers on top. That could mean increased transient memory usage.

There is also the speed overhead. Buffers are memcpy'd when entered into the cache, and memcmp'd before being reused from the cache. How expensive could that be?

I don't have proof that these overheads are too large, but I don't know that they're not too large, either. I could ask for some analysis to be done on that, but I feel rather that it may be worth asking if another approach to caching might have inherently less overhead?
Attachment #8406302 - Flags: feedback?(bjacob)
(In reply to Benoit Jacob [:bjacob] from comment #4)
> Comment on attachment 8406302 [details] [diff] [review]
> patch
> 
> Review of attachment 8406302 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Hm, neither a feedback+ nor a feedback-. This is complicated.
> 
> This buffer cache is caching deep copies of the buffer contents, and stores
> them for 1 second, registering an expiration tracker to remove them from
> cache.
> 
> This means that under high stress, we could be storing a bunch of buffer
> copies, with a bunch of expiration trackers on top. That could mean
> increased transient memory usage.

Keep the sizes in mind here. Most geometries are 

2 triangles = 6 vertices, times 2 coords, times 2 (vertex coords and tex coords), times 4 bytes per float = 96 bytes

With some additional link pointers etc., let's call it 128 bytes.

My casual test set has around 4 layers on average. Only a fairly small subset of them goes through this path, since for most we can use the uniform quad, and cache hit rates are very high (most layers don't change geometry; this will be even more true once we kill buffer rotation).

As a result the memory difference here is basically 0.
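
Spelled out (assuming 4-byte floats), the arithmetic behind those numbers is:

#include <cstddef>

// 2 triangles = 6 vertices, each carrying a 2D position and a 2D texcoord.
constexpr std::size_t kVertices      = 6;
constexpr std::size_t kFloatsPerVert = 2 /* position */ + 2 /* texcoord */;
constexpr std::size_t kQuadBytes     = kVertices * kFloatsPerVert * sizeof(float);
static_assert(kQuadBytes == 96, "one cached quad is 96 bytes of payload");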

> 
> There is also the speed overhead. Buffers are memcpy'd when entered into the
> cache, and memcmp'd before being reused from the cache. How expensive could
> that be?

In comparison to a buffer allocation in the GL driver the overhead is trivial. We are talking about shoveling 4 cache lines around in main memory. Some GL drivers maintain buffers fairly efficiently, but others suck at it horribly and really expect that you infrequently upload large geometries (the typical game use case). The two drivers I have source access to both use mmap for buffers internally, which explains the sucky performance. Due to the way most GPU pipelines work, re-using these buffers in the driver is a really bad idea as well, since it almost always causes a pipeline stall when trying to access a submitted buffer.

Executive summary: On some drivers (Mac 10.6, Vivante) it is critical to minimize buffer allocation and upload. Other drivers don't care as much, but it's always a win.

> 
> I don't have proof that these overheads are too large, but I don't know that
> they're not too large, either. I could ask for some analysis to be done on
> that, but I feel rather that it may be worth asking if another approach to
> caching might have inherently less overhead?

We have seen that getting this wrong is expensive (2.3% performance loss on 10.6 for tscroll) when we backed out my previous patch. If this has 0 overhead or is a win, can we land the patch?
Quick note on memory: On desktop systems with dedicated VRAM there is no way we could ever upload enough geometry for the cache to matter. On mobile, nsExpirationTracker is hooked up to low-memory notifications, and since exhausting VRAM automatically exhausts main memory, that triggers a low-memory notification which purges the cache.
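
As a stand-in illustration of that purge policy (the real patch hooks Gecko's nsExpirationTracker, which also listens for memory-pressure notifications), the behavior amounts to something like:

#include <GLES2/gl2.h>
#include <chrono>
#include <vector>

struct CachedVBO {
  GLuint buffer;
  std::chrono::steady_clock::time_point lastUsed;
};

// Drop every cached VBO that has been idle for ~1 second, or everything on a
// low-memory notification.
void PurgeCachedVBOs(std::vector<CachedVBO>& entries, bool lowMemory) {
  const auto now = std::chrono::steady_clock::now();
  for (auto it = entries.begin(); it != entries.end();) {
    if (lowMemory || now - it->lastUsed > std::chrono::seconds(1)) {
      glDeleteBuffers(1, &it->buffer);
      it = entries.erase(it);
    } else {
      ++it;
    }
  }
}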
(In reply to Andreas Gal :gal from comment #5)
> Keep the sizes in mind here. Most geometries are 
> 
> 2 triangles = 6 vertices, times 2 coords, times 2 (vertex coords and tex
> coords), times 4 bytes per float = 96 bytes

That's actually a very good point --- "Keep the sizes in mind here".

If most geometries are a single quad (2 triangles) then having separate VBOs for each such simple geometry is wrong in the first place, and we should instead have a single "standard quad" VBO and only adjust uniform values. This approach would even scale up easily to up to a few dozen quads, if needed (see http://webglstats.com/, MAX_VERTEX_UNIFORM_VECTORS seems to be never less than 64 or 128 and hardly ever below 250). In this way, we could cover the majority of cases, while aligning our behavior with what 3D games typically do (always a plus) and the remaining cases would be few enough that their performance wouldn't matter so much.
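
A hedged sketch of that idea, with illustrative names (not the actual compositor code): one shared unit-quad VBO, with the per-layer rectangle fed in as a uniform rather than as new geometry:

#include <GLES2/gl2.h>

// One shared unit quad (two triangles), uploaded once.
static const GLfloat kUnitQuad[] = {
  0, 0,  1, 0,  0, 1,   // triangle 1
  0, 1,  1, 0,  1, 1,   // triangle 2
};

// Vertex shader: place the unit quad using a uLayerRect uniform (x, y, w, h),
// so drawing a different layer only changes a uniform, never the VBO.
static const char* kVertexShader =
  "uniform vec4 uLayerRect;                              \n"
  "attribute vec2 aCoord;                                \n"
  "void main() {                                         \n"
  "  vec2 pos = uLayerRect.xy + aCoord * uLayerRect.zw;  \n"
  "  gl_Position = vec4(pos, 0.0, 1.0);                  \n"
  "}                                                     \n";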


(In reply to Andreas Gal :gal from comment #5)
> Executive summary: On some drivers (Mac 10.6, Vivante) it is critical to
> minimize buffer allocation and upload. Other drivers don't care as much, but
> it's always a win.
>
> [ ... SNIP ... ]
> 
> We have seen that getting this wrong is expensive (2.3% performance loss on
> 10.6 for tscroll) when we backed out my previous patch. If this has 0
> overhead or is a win, can we land the patch?

I like Talos and I like the fact that 10.6 slaves have caught some GL performance regressions that would likely also have affected some mobile GPUs, but I'm not confident that they will catch every regression that would affect any mobile GPU --- we still need to be very careful.
There are a bunch of different options that we can take when uploading geometry. The simplest is to allocate buffers, set the data, draw. I'm not surprised that this isn't great due to the allocations. Reusing one buffer for every draw also seems like it could be bad. However, this still leaves us with a number of options:

1. Special case a small number of quads and use uniforms to set the coordinates (this is, intuitively and according to driver vendors, the sanest way to mutate geometry). It can also be used on top of any of the other methods and might make anything more sophisticated unneeded.

2. Have a pool of vertex buffers of particular sizes and upload to them in FIFO order. (This should remove the allocation problem and reduce the mutate-data-in-the-pipeline problem.)

3. Take a deferred (or pre-pass, depending on how you look at it) approach where we accumulate all of the geometry that we'll need into a single vertex buffer and do a single upload (see the sketch after this comment).

4. Use a single vertex buffer but use it as a FIFO geometry queue, updating sub-ranges of it.

5. Cache vertex buffers by value on the gpu.

The choice between these isn't obvious to me. I believe Direct2D uses something like option 3 or 4. Personally, I have a preference for something that has deterministic performance characteristics, which the caching approach does not have unless we avoid caching VBOs across frames. Either way, the right thing to do here is to get performance numbers for the different approaches across different devices and choose based on that. Further, without a decent way of streaming geometry to the GPU, any kind of GL-based 2D drawing is going to be very difficult to make performant and memory efficient.
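
To make option 3 concrete, here is a rough sketch (illustrative names, a single vertex layout assumed) of accumulating a frame's geometry and uploading it with one glBufferData call:

#include <GLES2/gl2.h>
#include <vector>

struct GeometryBatch {
  std::vector<GLfloat> vertices;   // accumulated vertex data for the frame
  std::vector<GLint>   firsts;     // first vertex of each draw
  std::vector<GLsizei> counts;     // vertex count of each draw

  // Append one piece of geometry and remember where it starts.
  void Add(const GLfloat* data, GLsizei vertexCount, GLsizei floatsPerVertex) {
    firsts.push_back(GLint(vertices.size() / floatsPerVertex));
    counts.push_back(vertexCount);
    vertices.insert(vertices.end(), data, data + vertexCount * floatsPerVertex);
  }

  // One upload for the whole frame, then draw each sub-range.
  void UploadAndDraw(GLuint vbo) {
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertices.size() * sizeof(GLfloat),
                 vertices.data(), GL_STREAM_DRAW);
    for (size_t i = 0; i < firsts.size(); ++i) {
      // Per-draw state (program, textures, uniforms, attrib pointers) omitted.
      glDrawArrays(GL_TRIANGLES, firsts[i], counts[i]);
    }
  }
};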
(In reply to Benoit Jacob [:bjacob] from comment #7)
> (In reply to Andreas Gal :gal from comment #5)
> > Keep the sizes in mind here. Most geometries are 
> > 
> > 2 triangles = 6 vertices, times 2 coords, times 2 (vertex coords and tex
> > coords), times 4 bytes per float = 96 bytes
> 
> That's actually a very good point --- "Keep the sizes in mind here".
> 
> If most geometries are a single quad (2 triangles) then having separate VBOs
> for each such simple geometry is wrong in the first place, and we should
> instead have a single "standard quad" VBO and only adjust uniform values.

We are doing this already. Look for mQuadVBO. It's right now being done all over the place, and my patch queue is unifying that into one place. In certain rare cases we can't use mQuadVBO, and this path covers those cases only.

> This approach would even scale up easily to up to a few dozen quads, if
> needed (see http://webglstats.com/, MAX_VERTEX_UNIFORM_VECTORS seems to be
> never less than 64 or 128 and hardly ever below 250). In this way, we could
> cover the majority of cases, while aligning our behavior with what 3D games
> typically do (always a plus) and the remaining cases would be few enough
> that their performance wouldn't matter so much.

Mobile drivers don't like the uniform trick much. I have seen recompiles with Vivante and Adreno for uniform changes, especially if you accidentally hit certain uniform values (1.0, 0.0) that simplify matrix multiplications. This gives us uneven performance depending on uniform values. That's pretty terrible.

> 
> 
> (In reply to Andreas Gal :gal from comment #5)
> > Executive summary: On some drivers (Mac 10.6, Vivante) it is critical to
> > minimize buffer allocation and upload. Other drivers don't care as much,
> > but it's always a win.
> >
> > [ ... SNIP ... ]
> > 
> > We have seen that getting this wrong is expensive (2.3% performance loss on
> > 10.6 for tscroll) when we backed out my previous patch. If this has 0
> > overhead or is a win, can we land the patch?
> 
> I like Talos and I like the fact that 10.6 slaves have caught some GL
> performance regressions that would likely also have affected some mobile
> GPUs, but I'm not confident that they will catch every regression that would
> affect any mobile GPU --- we still need to be very careful.

So you are suggesting being careful to the point of simply not doing anything? Our current geometry setup code is a cluster. We have literally 11 different places where we upload geometry, and most of them use client-side rendering, which is horribly slow on many drivers. You are blocking fixes for known bugs based on hypothetical considerations of potential bugs.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #8)
> There are a bunch of different options that we can take when uploading
> geometry. The simplest is to allocate buffers, set the data, draw. I'm not
> surprised that this isn't great due to the allocations. Reusing one buffer
> for every draw also seems like it could be bad. However, this still leaves us
> with a number of options:
> 
> 1. Special case a small number of quads and use uniforms to set the
> coordinates (this is, intuitively and according to driver vendors, the
> sanest way to mutate geometry). It can also be used on top of any of the
> other methods and might make anything more sophisticated unneeded.
> 
> 2. Have a pool of vertex buffers of particular sizes and upload to them in
> FIFO order. (This should remove the allocation problem and reduce the
> mutate-data-in-the-pipeline problem.)
> 
> 3. Take a deferred (or pre-pass depending on how you look at it) approach
> where we accumulate all of the geometry that we'll need into a single vertex
> buffer and do a single upload.
> 
> 4. Use a single vertex buffer but use it as a FIFO geometry queue, updating
> sub-ranges of it.
> 
> 5. Cache vertex buffers by value on the gpu.
> 
> The choice between these isn't obvious to me. I believe Direct2D uses
> something like option 3 or 4. Personally I have a preference for something
> that has deterministic performance characteristics, which is something the
> caching approach does not have unless we avoid caching vbos across frames.
> Either way, the right thing to do here is get performance numbers for the
> different approaches across different devices and choose based on that.
> Further, without a decent way of streaming geometry to the GPU any kind of
> GL based 2D drawing is going to be very difficult to make performant and
> memory efficient.

See above. I have seen bad behavior for 1). If possible, I would like to remove the mQuadVBO business. I am not 100% sure we can do that though.

2) doesn't solve the problem that twiddling with uniforms is expensive. We mostly need 1, 2, 3, or 4 quads, so we could just load those into one buffer and then change the element size of the draw call; this might be better than 1) but maybe not ideal. We will still need the patch I uploaded for complex geometries (I am doing this work here to enable mesh transforms).

3) I really like this approach. If people prefer this I am happy to rewrite the patch but I would like to get approval for the approach ahead of time. I don't want to get blocked by hypothetical fears after writing the patch.

4) Sub-range updates are almost always crazy expensive because the driver has to read back GPU memory. Every GL book I have read discourages it and recommends nulling the whole buffer and re-uploading the whole thing (see the orphaning sketch after this comment).

5) is what I am proposing here.

I like 3). If that doesn't end up in an endless debate once the patch is up I am happy to rewrite for that. What do you want to see happen?
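
As a reference for the "null the buffer and re-upload" pattern mentioned in 4), a minimal buffer-orphaning sketch (assumed helper name, not code from the patch):

#include <GLES2/gl2.h>

// Refill an existing VBO without stalling on a buffer the GPU may still be
// reading: orphan the old storage, then upload the whole contents again.
void RefillStreamingVBO(GLuint vbo, const void* data, GLsizeiptr size) {
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  glBufferData(GL_ARRAY_BUFFER, size, nullptr, GL_STREAM_DRAW);  // orphan
  glBufferSubData(GL_ARRAY_BUFFER, 0, size, data);               // full refill
}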

The bug assignee didn't log in to Bugzilla in the last 7 months.
:bhood, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: gal → nobody
Flags: needinfo?(bhood)
Flags: needinfo?(bhood)
Severity: normal → S3