Bug 984783 - Opened 10 years ago, Closed 10 years ago

Compare different WebGL element array cache invalidation strategies

Categories

(Core :: Graphics: CanvasWebGL, defect)


Tracking


RESOLVED FIXED
mozilla32

People

(Reporter: bjacob, Assigned: bjacob)

Attachments

(5 files, 1 obsolete file)

The WebGL WG is starting to take drawElements optimization seriously, and NVIDIA is contributing a microbenchmark to the official repo:

https://github.com/KhronosGroup/WebGL/pull/501

One of the tests (in the original version of the pull request, which is again the version being considered for merging now) does two far-apart, one-element updates. This is the worst-case scenario for our current invalidation logic, which tracks a single invalidated interval and grows it to contain all invalidated intervals.

A simple fix to avoid this worst-case scenario while retaining the benefits of deferred updates in the realistic cases where it is beneficial (adjacent/overlapping updates) is to immediately flush the existing invalidated interval when it is not adjacent to (or overlapping with) the new invalidated interval.
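
In code, the fix amounts to something like the following (a minimal standalone sketch with invented names, not the actual patch):

#include <algorithm>
#include <cstddef>

// Sketch of the strategy: we track at most one invalidated interval,
// [mFirstInvalidated, mLastInvalidated], at the cost of two size_t and a bool.
struct InvalidationTracker {
  size_t mFirstInvalidated = 0;
  size_t mLastInvalidated = 0;
  bool mHasInvalidated = false;

  void Invalidate(size_t first, size_t last) {
    if (mHasInvalidated &&
        (first > mLastInvalidated + 1 || last + 1 < mFirstInvalidated)) {
      // The new interval neither overlaps nor is adjacent to the tracked one:
      // flush the pending interval now instead of growing it to cover both.
      FlushInvalidatedRegion();
      mHasInvalidated = false;
    }
    if (!mHasInvalidated) {
      mFirstInvalidated = first;
      mLastInvalidated = last;
      mHasInvalidated = true;
    } else {
      // Adjacent or overlapping: keep deferring, just grow the interval.
      mFirstInvalidated = std::min(mFirstInvalidated, first);
      mLastInvalidated = std::max(mLastInvalidated, last);
    }
  }

  void FlushInvalidatedRegion() {
    // Perform the deferred tree update over the tracked interval here.
  }
};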

On the benchmark being submitted by NVIDIA, this optimization takes us from:

> Test #1 (minimal updates to cause complete revalidation in Chrome and Firefox as of March 2014) ran in 727 ms
> Indices referenced by draw calls processed: 137552 / ms

to:

> Test #1 (minimal updates to cause complete revalidation in Chrome and Firefox as of March 2014) ran in 70 ms
> Indices referenced by draw calls processed: 1428572 / ms

In other words, this is a 10x speedup in this case.

I still don't know whether this case is relevant to the real world, but I've lost that battle in the conversation on the pull request, and this is only a three-line patch plus comments.

The patch also adds comments elsewhere in this function to document existing logic.
Attachment #8392753 - Flags: review?(jgilbert)
Note that the compiled unit test passes locally.
I would really just rather have this solved properly with an invalidation interval tree. Is there a reason we should take the complication from this patch in the short term, before working on what should be the final solution?
Yes: thinking more about it, I am unsure that we will ever want to implement the interval tree solution.

Note that we are not even very far from not needing any invalidation tracking at all: it might be acceptable to just perform tree updates immediately every time. After all, this is not so bad: a tree update of N elements is O(N log N), while the corresponding buffer upload is already O(N) and already puts that data in CPU caches.

And if we hadn't tried to be smart with invalidation tracking in the first place, we would not have had any performance problem with the present benchmark!

Now, I have thought a bit about that, and I think that we actually want to have some invalidation tracking so we can defer some updates. The question is how much. The present compromise transforms the former worst-case scenario (small, far-away updates) into an easy case. The new worst-case scenario is not bad at all anymore. What would it be, even? How much room is there left for an interval tree implementation to perform much better?

Thinking about an interval tree implementation (and starting to prototype one), I realized that it has some significant costs. If an element array buffer is used with multiple element index types (e.g. uint8/16/32), then we have a separate tree for each type, and we need one of the following:
 - either separate invalidation tracking for each tree (which is what the current code does); if that tracking were itself implemented with interval trees, they could grow big, and the resulting memory overhead concern would be multiplied by 3 for 3 trees;
 - or a single invalidation tracking shared by all types, but then we would need to update all trees at once whenever we update any tree, causing a 3x speed overhead for 3 trees.

We mitigate these concerns with the current approach by having a very simple invalidation tracking that costs very little memory (two size_t and a bool) and is thus cheap to have separately for each tree; AFAICS it already leaves little room for improvement.
(In reply to Benoit Jacob [:bjacob] from comment #3)
> Yes: thinking more about it, I am unsure that we will ever want to implement
> the interval tree solution.
> 
> Note that we are not even very far from not needing any invalidation
> tracking at all: it might be acceptable to just perform tree updates
> immediately everytime. After all, this is not so bad: a tree update of N
> elements is O(N log N)

Actually scrap that, it's even better: a tree update of N elements in an element array of overall length L is

   O(max(N, log L))

Indeed, for large enough N it's dominated by N ops at leaf level, then N/2 at the next level, etc., summing to at most 2N, while for N=1 we have O(number of levels) = O(log L).
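To see the bound concretely, here is a hypothetical op-count sketch (not the actual cache code):

#include <cstddef>
#include <cstdio>

// Count the node recomputations done by a bottom-up update of leaves
// [first, last] in a tree with numLevels levels. The touched range roughly
// halves per level, so the total is N + N/2 + N/4 + ... <= 2N, yet all
// levels are visited, giving O(max(N, log L)).
size_t CountTreeUpdateOps(size_t first, size_t last, size_t numLevels) {
  size_t ops = 0;
  for (size_t level = 0; level < numLevels; ++level) {
    ops += last - first + 1; // nodes recomputed at this level
    first /= 2;              // parent index of `first`
    last /= 2;               // parent index of `last`
  }
  return ops;
}

int main() {
  // N = 1 update in a 20-level tree: cost is the number of levels.
  printf("%zu\n", CountTreeUpdateOps(1000, 1000, 20)); // prints 20
  // N = 4096 contiguous updates: cost stays ~2N, not N * levels.
  printf("%zu\n", CountTreeUpdateOps(0, 4095, 20));    // prints 8198, i.e. ~2N
  return 0;
}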
I think I now remember why we bothered with invalidation tracking in the first place: in the case where a single element array buffer contains parts of different types (uint8/16/32), each bufferSubData invalidates all the trees, but each drawElements consumes only one. So we wanted to defer tree updates to avoid unnecessary tree updates.

Thinking more about this: this is probably not a good, realistic, sane usage pattern to optimize for, since different element types require separate drawElements calls, and then it doesn't cost much to also call bindBuffer to switch buffers. In other words, I doubt that there exists a good reason to want to stick multiple element types in a single element array buffer. We have to support this, but I doubt that we have to worry about this performance-wise or memory-usage-wise. Maybe we should generate a JS warning when this happens, as this is inherently hard to optimize for.

That means that my above objections against an interval tree can be ignored. But it remains true that there isn't a lot to gain, AFAICS, by implementing an interval tree.

I am still very tempted to remove the invalidation tracking completely and just do immediate tree updates. Again, that will be inefficient in the case of the presence of multiple trees, but that will probably not matter for the above reason.
I agree that we probably shouldn't optimize for interleaved/adjacent types in index buffers.

I think in the short term, we should do immediate validation on buffer updates.
I'll give our long-term approach some thought. It should be easy to figure out what would be optimal here.
So I've taken a look at doing immediate tree updates, and that revealed a probably very common update pattern that does benefit greatly from just naive invalidation tracking and deferred updates, as done currently on m-c or with the patch attached here:

Suppose that the update pattern is

bufferData(null, big_size)   // allocate buffer, zero-initialized
bufferSubData(some_small_array_1, offset_1)  // start filling in data...
bufferSubData(some_small_array_2, offset_2)
bufferSubData(some_small_array_3, offset_3)

Then this benefits greatly from the naive invalidation tracking that we have, or the one that the patch here gives us. Note that it does not even matter whether the 3 subData calls in the above example are actually adjacent, since they all overlap the already-invalidated interval, which the bufferData call set to be the entire buffer!
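
Concretely, with the InvalidationTracker sketched in an earlier comment, the tracked interval evolves like this (offsets and lengths hypothetical):

InvalidationTracker tracker;
tracker.Invalidate(0, big_size - 1);              // bufferData: whole buffer invalid
tracker.Invalidate(offset_1, offset_1 + n1 - 1);  // inside tracked interval: no flush
tracker.Invalidate(offset_2, offset_2 + n2 - 1);  // still overlapping: no flush
tracker.Invalidate(offset_3, offset_3 + n3 - 1);  // still overlapping: no flush
// One deferred tree update covers all four calls at the next drawElements.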

So at this point I'm back to wanting to land the patch submitted for review here: it does make sense to optimize specifically for adjacent/overlapping updates.
Found this while staring at code to see how to implement immediate tree updates.

This memset is useless: if we're here, we've already invalidated the entire tree just above, so it does not matter what mTreeData contains.
Attachment #8393817 - Flags: review?(jgilbert)
Lies, damn lies and benchmarks.

Maybe I'm not following along properly but I don't think we should be optimizing for synthetic benchmarks.

What are the real world patterns for drawElements()? Static geom is buffered once then drawn multiple times, so that's not an issue. It's going to be "streaming" geom, for example particles, in which case the usage is to treat the buffer as a ring buffer.
Totally agree on the importance of benchmarks being rooted in real-world use cases; see the conversation on https://github.com/KhronosGroup/WebGL/pull/501 . And it's true that the case of static geom alone already covers the vast majority of real-world use cases. But there are some "streaming geom" use cases out there in the wild, which were half of the motivation for the WebGLElementArrayCache in the first place (the other half was drawElements calls consuming only part of the buffer); unfortunately, at the moment we don't seem to know exactly what these are. There have been complaints about browser drawElements validation performance on the WebGL public mailing list recently; it would be nice to have more detail about what exactly the real-world usage patterns were, and Ken already asked for that: https://www.khronos.org/webgl/public-mailing-list/archives/1403/msg00017.html
(In reply to Benoit Jacob [:bjacob] from comment #10)
> Totally agree on the importance of benchmarks being rooted in real-world use
> cases, see the conversation on
> https://github.com/KhronosGroup/WebGL/pull/501 . And it's true that just the
> case of static geom already covers the vast majority of real-world use
> cases. But there are some "streaming geom" use cases out there in the wild,
> which was half of the motivation why we bothered with the
> WebGLElementArrayCache in the first place (the other half was drawElements
> calls consuming only part of the buffer); unfortunately at the moment we
> don't seem to know exactly what these are. There's been complaints about
> browser drawElements validation performance on the webgl public mailing list
> recently, it would be nice to have more detail about what exactly the
> real-world usage patterns were, and Ken already asked for that:
> https://www.khronos.org/webgl/public-mailing-list/archives/1403/msg00017.html

Would this code be simpler if we tore out the delayed-validation code, and unconditionally did validation on buffer(Sub)Data? Would this really be any worse than having at-most-one delayed validation?
Flags: needinfo?(bjacob)
Attachment #8393817 - Flags: review?(jgilbert) → review+
Hey, I actually wanted to replace this patch to drop the invalidation tracking, as we had discussed! I believe that's a better idea, in the interest of predictable performance characteristics and an even distribution of overhead proportionate to the amount of useful work. Sorry I haven't gotten back to this bug yet. Feel free, too.
Flags: needinfo?(bjacob)
Can we collect some "science" on what's "best"?
Good idea :) Note that the work here was triggered by a benchmark. In addition to it, another possible benchmark is to measure the time taken by TestWebGLElementArrayCache in an opt build.

Better still would be to have some real application whose performance would be impacted by these details... but I don't know of any. (Most applications seem to rarely update index arrays and/or consume them all at once in a single drawElements call.)

The dearth of benchmarks is also why in comment 12 I just wanted to remove the invalidation tracking, so at least we'd get simpler code with simpler performance characteristics. Simplification, at least, should be valuable by itself :-)
Attachment #8392753 - Flags: review?(jgilbert)
As discussed above, let's just remove all the invalidation tracking (sketched below):
 - This gives simpler performance characteristics and removes the bad worst-case scenario discussed here.
 - This gives simpler source code.
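
In code, the removal boils down to roughly this shape (a simplified standalone sketch with invented names, not the actual patch):

#include <cstddef>
#include <cstring>

// With no invalidation state, bufferSubData updates the trees eagerly, so
// drawElements validation never has to catch up first.
class Cache {
  unsigned char* mBytes = nullptr;
  bool UpdateTrees(size_t firstByte, size_t lastByte); // immediate tree update

public:
  bool BufferSubData(size_t pos, const void* data, size_t updateLen) {
    std::memcpy(mBytes + pos, data, updateLen);
    // No deferred interval to track anymore: update the trees right away.
    return UpdateTrees(pos, pos + updateLen - 1);
  }
};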

I've microbenchmarked 4 different variants, on 2 benchmarks:

The benchmarks are:
 1. time TestWebGLElementArrayCache
 2. https://www.khronos.org/registry/webgl/sdk/tests/extra/webgl-drawelements-validation.html

The different variants are:
 1. Current mozilla-central
 2. Remove all invalidation tracking (the new patch here; also removes the useless memset)
 3. Flush immediately non-adjacent updates (the previous patch here)
 4. Like 3 plus the patch to remove the useless memset.

My config: optimized non-DEBUG build on Linux x86-64 on a Sandy Bridge Core i7, built with Clang 3.3.



Results summary:

Removing the invalidation tracking does make TestWebGLElementArrayCache take about 7% or 8% longer. But that doesn't worry me a lot: TestWebGLElementArrayCache is a standalone executable that does nothing but exercise this code, so throughput differences are far more visible there than in real-world WebGL workloads. The online microbenchmark (2.) sees no significant throughput difference.

So given that removing the invalidation tracking makes the code simpler, removes the worst-case scenario, and generally gives simpler performance characteristics, I'm in favor of it.


***


Detailed results:



Current Mozilla-central:
************************

obj-firefox-release/content/canvas/compiledtest/TestWebGLElementArrayCache: all 6733768 tests passed
real 0.96
user 0.95
sys 0.01

1. Time spent on full buffer updates: 1906 ms

Indices uploaded and referenced by draw calls processed: 52466 / ms

2. Time spent on validating single index updates while range referenced also changes on every draw call: 0 ms

Indices referenced by draw calls handled: Infinity / ms

3. Time spent on validating single index updates at each end of the buffer (worst case for Firefox implementation as of March 2014, not reflective of real world performance): 638 ms

Indices referenced by draw calls handled: 156740 / ms


With invalidation tracking removed:
***********************************

obj-firefox-release/content/canvas/compiledtest/TestWebGLElementArrayCache: all 6733768 tests passed
real 1.07
user 1.06
sys 0.00

1. Time spent on full buffer updates: 1875 ms

Indices uploaded and referenced by draw calls processed: 53333 / ms

2. Time spent on validating single index updates while range referenced also changes on every draw call: 0 ms

Indices referenced by draw calls handled: Infinity / ms

3. Time spent on validating single index updates at each end of the buffer (worst case for Firefox implementation as of March 2014, not reflective of real world performance): 1 ms

Indices referenced by draw calls handled: 100000050 / ms


With flush-on-non-adjacent-updates patch:
*****************************************

obj-firefox-release/content/canvas/compiledtest/TestWebGLElementArrayCache: all 6733768 tests passed
real 0.99
user 0.98
sys 0.01

1. Time spent on full buffer updates: 1891 ms

Indices uploaded and referenced by draw calls processed: 52882 / ms

2. Time spent on validating single index updates while range referenced also changes on every draw call: 1 ms

Indices referenced by draw calls handled: 100000050 / ms

3. Time spent on validating single index updates at each end of the buffer (worst case for Firefox implementation as of March 2014, not reflective of real world performance): 0 ms

Indices referenced by draw calls handled: Infinity / ms


With flush-on-non-adjacent-updates patch and remove-useless-memset patch:
*************************************************************************

obj-firefox-release/content/canvas/compiledtest/TestWebGLElementArrayCache: all 6733768 tests passed
real 0.99
user 0.97
sys 0.01


1. Time spent on full buffer updates: 1891 ms

Indices uploaded and referenced by draw calls processed: 52882 / ms

2. Time spent on validating single index updates while range referenced also changes on every draw call: 0 ms

Indices referenced by draw calls handled: Infinity / ms

3. Time spent on validating single index updates at each end of the buffer (worst case for Firefox implementation as of March 2014, not reflective of real world performance): 2 ms

Indices referenced by draw calls handled: 50000025 / ms
Attachment #8417102 - Flags: review?(jgilbert)
> Removing the invalidation tracking does make TestWebGLElementArrayCache take about 7% or 8% longer.

Actually it's 12%. I was comparing to the previous patch, which was already 5% slower than mozilla-central.
(In reply to Benoit Jacob [:bjacob] from comment #14)
> Good idea :)

I was thinking of recording the calls from something like 'Epic Citadel'. But that demo appears to have disappeared from the Epic website.
Yes, a real use case would be far better. The problem is that it's hard to find real use cases where this matters a lot. What sparked the recent renewed interest in this topic, IIUC, is that application developers complained about some browsers' drawElements performance; I asked them to share their real testcases but I don't remember getting answers.
Summary: WebGL element array cache optimization: flush immediately the old invalidation on new non-adjacent invalidation → Compare different WebGL element array cache invalidation strategies
... but to be clear, while I totally agree that we need careful real benchmarking before _adding_ new optimizations (and I'm sorry that an optimization was previously added here without such a careful process), I think the bar should be much lower for _removing_ bad optimizations.

I think that the above measurements are enough to justify taking my newer patch under review here, which removes the invalidation tracking. It gives us simpler code and simpler performance characteristics, with no bad worst case. All we lose is 10% throughput on a compiled C++ test that makes such throughput differences dramatically more marked than anything we've seen on WebGL pages.
Mr. President, we need a decision.
Flags: needinfo?(jgilbert)
Comment on attachment 8417102 [details] [diff] [review]
Remove all the invalidation tracking

Review of attachment 8417102 [details] [diff] [review]:
-----------------------------------------------------------------

I'm sad there's not more code we can axe here.

::: content/canvas/src/WebGLElementArrayCache.cpp
@@ +128,1 @@
>   * all existing trees. So if trees for types uint8_t, uint16_t and uint32_t have ever been constructed for this buffer,

Can we warn when we create more than one of these per buffer?

@@ +128,3 @@
>   * all existing trees. So if trees for types uint8_t, uint16_t and uint32_t have ever been constructed for this buffer,
> + * every subsequent update will have to update all trees even if one of the types is never
> + * used again. That's inefficient, but content should put indices of different types in the

s/should/should not/, I hope.
Attachment #8417102 - Flags: review?(jgilbert) → review+
(In reply to Jeff Gilbert [:jgilbert] from comment #21)
> ::: content/canvas/src/WebGLElementArrayCache.cpp
> @@ +128,1 @@
> >   * all existing trees. So if trees for types uint8_t, uint16_t and uint32_t have ever been constructed for this buffer,
> 
> Can we warn when we create more than one of these per buffer?

Yes. WebGLElementArrayCache is not the right place to generate warnings; but it could have a getter to report the number of different index types in use, that the WebGL implementation could use.
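
Something along these lines, where the member names are invented for illustration (not landed code):

// Hypothetical getter: the WebGL implementation could call this after a
// buffer update and emit a console warning when a buffer mixes index types.
size_t WebGLElementArrayCache::NumIndexTypesInUse() const {
  return (mUint8Tree ? 1 : 0) + (mUint16Tree ? 1 : 0) + (mUint32Tree ? 1 : 0);
}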

> 
> @@ +128,3 @@
> >   * all existing trees. So if trees for types uint8_t, uint16_t and uint32_t have ever been constructed for this buffer,
> > + * every subsequent update will have to update all trees even if one of the types is never
> > + * used again. That's inefficient, but content should put indices of different types in the
> 
> s/should/should not/, I hope.

Indeed :)
Flags: needinfo?(jgilbert)
Filed follow-up bug 1008310 to implement these much-needed WebGL performance warnings. Not taking it myself --- do feel free.
This further simplifies the code (zero net change in line count, but ResizeToParentSize is gone, merged into Update), and fixes the memory bug (which reproduces very well under Valgrind).

As I merged ResizeToParentSize into Update, and resizing can fail as we use fallible allocation there, I had to change Update and a few other functions to return bool.

A side benefit: when a BufferData call caused a tree resize, the tree used to be updated twice (once under ResizeToParentSize and once again under BufferSubData). That redundancy is fixed by this patch.
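
Roughly the shape of the change, as a simplified standalone sketch (std::vector standing in for the fallible array; not the landed patch):

#include <cstddef>
#include <new>
#include <vector>

// Merging ResizeToParentSize() into Update() means Update() can now fail,
// since the resize uses fallible allocation, so it returns bool and callers
// must propagate failure.
class Tree {
  std::vector<unsigned char> mTreeData;
  size_t mSupportedParentByteSize = 0;

public:
  bool Update(size_t parentByteSize, size_t firstByte, size_t lastByte) {
    if (mSupportedParentByteSize != parentByteSize) {
      // Formerly a separate ResizeToParentSize() call; the real allocation is
      // fallible, while std::vector throws, so we translate.
      try {
        mTreeData.resize(parentByteSize);
      } catch (const std::bad_alloc&) {
        return false; // out of memory; the caller must check this
      }
      mSupportedParentByteSize = parentByteSize;
      // Resizing invalidates the whole tree, so widen the update. Doing the
      // resize inside Update() is also what removes the former redundant
      // double update on size-changing BufferData calls.
      firstByte = 0;
      lastByte = parentByteSize ? parentByteSize - 1 : 0;
    }
    // ... recompute the affected tree nodes over [firstByte, lastByte] ...
    (void)firstByte;
    (void)lastByte;
    return true;
  }
};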
Attachment #8420328 - Flags: review?(jgilbert)
Comment on attachment 8420328 [details] [diff] [review]
fixups to make Valgrind/Asan happy, and also, avoid redundant tree updates on BufferData calls changing buffer size

Review of attachment 8420328 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/canvas/src/WebGLElementArrayCache.cpp
@@ +156,2 @@
>    {
> +    Update(0, mParent.ByteSize() - 1);

Should we assert this?
Attachment #8420328 - Flags: review?(jgilbert) → review+
Oh. This seems like an excellent catch.
https://hg.mozilla.org/integration/mozilla-inbound/rev/d2242d05c93c

Note that I added more assertions and handling of size-0 cases, and made the unit test exercise size 0 as well, instead of only sizes >= 1 as it used to.
Ha, I just realized what you actually meant above in your comment on Update(). I also found another place where an error was not correctly propagated, and finally found a name for mParentByteSize that says what it really is: mSupportedParentByteSize. That is, the buffer byte size that the tree actually supports; it may differ from the buffer's actual byte size, in which case the tree is useless.

New patch with this commit message:

* Correctly propagate Update() failure in the WebGLElementArrayCacheTree
* Correctly propagate BufferSubData failure in BufferData
* Rename mParentByteSize to mSupportedParentByteSize
Attachment #8420554 - Flags: review?(jgilbert)
The patch that landed failed cpp tests with
cppunittests TEST-UNEXPECTED-FAIL | TestWebGLElementArrayCache | test failed with return code -6
It'll probably get backed out sometime soonish.
Backed out for Cpp test failures.  The ASAN log may prove interesting:


21:54:19     INFO -  =================================================================
21:54:19     INFO -  ==2584==ERROR: AddressSanitizer: attempting double-free on 0x602000039930 in thread T0:
21:54:20     INFO -      #0 0x47225b in realloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:95
21:54:20     INFO -      #1 0x48eb27 in BufferData /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/../src/WebGLElementArrayCache.cpp:453
21:54:20     INFO -      #2 0x48eb27 in main /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/TestWebGLElementArrayCache.cpp:179
21:54:20     INFO -      #3 0x7f6987f3a76c (/lib/x86_64-linux-gnu/libc.so.6+0x2176c)
21:54:20     INFO -      #4 0x4899cc in _start (/builds/slave/test/build/tests/cppunittests/TestWebGLElementArrayCache+0x4899cc)
21:54:20     INFO -  0x602000039930 is located 0 bytes inside of 2-byte region [0x602000039930,0x602000039932)
21:54:20     INFO -  freed by thread T0 here:
21:54:20     INFO -      #0 0x47225b in realloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:95
21:54:20     INFO -      #1 0x48eb27 in BufferData /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/../src/WebGLElementArrayCache.cpp:453
21:54:20     INFO -      #2 0x48eb27 in main /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/TestWebGLElementArrayCache.cpp:179
21:54:20     INFO -      #3 0x7f6987f3a76c (/lib/x86_64-linux-gnu/libc.so.6+0x2176c)
21:54:20     INFO -  previously allocated by thread T0 here:
21:54:20     INFO -      #0 0x47225b in realloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:95
21:54:20     INFO -      #1 0x48eb27 in BufferData /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/../src/WebGLElementArrayCache.cpp:453
21:54:20     INFO -      #2 0x48eb27 in main /builds/slave/m-in-l64-asan-0000000000000000/build/content/canvas/compiledtest/TestWebGLElementArrayCache.cpp:179
21:54:20     INFO -      #3 0x7f6987f3a76c (/lib/x86_64-linux-gnu/libc.so.6+0x2176c)
21:54:20     INFO -  SUMMARY: AddressSanitizer: double-free /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:95 realloc
21:54:20     INFO -  ==2584==ABORTING
21:54:20     INFO -  cppunittests TEST-UNEXPECTED-FAIL | TestWebGLElementArrayCache | test failed with return code 1

https://hg.mozilla.org/integration/mozilla-inbound/rev/ec7113f437f3
Hah. So here's my story; it might be interesting.

I tested locally with valgrind, which didn't report any error, so I thought I was safe.

It turns out I was forgetting to pass --soname-synonyms=somalloc=NONE, which is needed for Valgrind to play nicely with jemalloc. Passing it makes Valgrind catch this bug. Looking now.
So what was happening is that my changeset, as mentioned above, made the unit test also exercise BufferData resizing buffers down to size 0, which caused realloc to behave like free(), which made the next realloc a use-after-free (reported by ASan as an attempted double-free).
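
For the record, the pitfall boils down to this (standalone sketch, not the actual test code):

#include <cstdlib>

int main() {
  void* buf = malloc(16);
  // C allows realloc(ptr, 0) to free the block and return either a null
  // pointer or a unique pointer; either way, `buf` is dead afterwards.
  void* shrunk = realloc(buf, 0); // may behave exactly like free(buf)
  // The cache effectively kept using the old pointer after a BufferData call
  // resized the buffer to 0, so its next realloc() on that pointer was a
  // use-after-free, which ASan reports as an attempted double-free.
  free(shrunk); // only `shrunk` may be passed on, and free(NULL) is a no-op
  return 0;
}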

Easy fix; will land if this is green:

https://tbpl.mozilla.org/?tree=Try&rev=a58781736901
Rebased. See comment 31.
Attachment #8420554 - Attachment is obsolete: true
Attachment #8420554 - Flags: review?(jgilbert)
Attachment #8420589 - Flags: review?(jgilbert)
Comment on attachment 8420589 [details] [diff] [review]
more webglelementarraycache fixes

I have yet more nits to fix around there. I'll file a separate bug.
Attachment #8420589 - Flags: review?(jgilbert)
https://hg.mozilla.org/mozilla-central/rev/24ff9dbc3f20
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32