Efficient drawElements validation on subarray and/or dynamically updated array

RESOLVED FIXED in mozilla18

Status

Product: Core
Component: Canvas: WebGL
Status: RESOLVED FIXED
Reported: 6 years ago
Last modified: 4 years ago
People

(Reporter: bjacob, Assigned: bjacob)

Tracking

(Blocks: 1 bug)

Version: unspecified
Target Milestone: mozilla18
Bug Flags: in-testsuite +
Firefox Tracking Flags: (Not tracked)

Details

(Whiteboard: webgl-next [games:p1])

Attachments

(3 attachments, 12 obsolete attachments)

3.61 KB, application/gzip
21.36 KB, text/plain
45.47 KB, patch (jgilbert: review+)
drawElements is hard to optimize because it has to validate that the indices in the element array buffer are all in range.

In bug 569431 we did a simple optimization: we cache the maximum index in the whole element array buffer. This works great in STATIC_DRAW use cases, when the drawElements calls always consume the whole element array buffer.

Fortunately, that's the most common use case, but there are other use cases where our current caching doesn't help:
 - in DYNAMIC_DRAW use cases, i.e. when buffer updates are frequent
 - when drawing using only part of a larger element array buffer.

The one case where we can't do anything (except, in the future, relying on ARB_robustness instead of checking ourselves) is when the whole buffer gets replaced before every drawElements call. But in other cases, e.g. when only parts of the element array buffer get updated, or when the problem is that drawElements consumes only a subarray, we should be able to do something.

Chromium has an optimization whereby they cache in a hash map the maximum value for each subarray. I'm not convinced, because:
 - the memory usage here is potentially quadratic in the buffer size
 - it's hard to resolve the conflict between memory usage and usefulness of the cache
 - partial buffer updates are quite destructive to existing cache entries

I'd rather see us use a new 'smart array' data structure that makes the "maximum value in arbitrary sub-array" problem inherently cheap to solve. I think that a starting point (there might exist something much smarter) is just a binary tree, where each node tells whether the maximum value is in the left or right half. So for example, with an element array like:

  1     5     3     2

The binary tree would look like

    right      left
       \       /
        \     /
         \   /
          \ /
         left

Problems yet to be solved:
 - given such a tree, how to most efficiently compute the maximum in an arbitrary sub-array? What's the complexity? (The trivial solution of reducing an arbitrary sub-array to a union of sub-arrays aligned to nice tree boundaries has log^2 complexity, which is not great.)
 - when a subarray gets updated, how to efficiently update the tree?
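
The sub-array maximum question is a range-maximum query, which a tree storing the max value at each node can answer in log (not log^2) complexity by walking the two boundary paths up toward the root. A minimal self-contained sketch of that idea (illustrative code, not the eventual patch; MaxTree and its members are made-up names):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// 1-based heap layout: node i's children are 2i and 2i+1; leaves occupy
// [numLeaves, 2*numLeaves). numLeaves is rounded up to a power of two.
struct MaxTree {
  size_t numLeaves;
  std::vector<uint32_t> node;  // node[0] is unused

  explicit MaxTree(const std::vector<uint32_t>& data) {
    numLeaves = 1;
    while (numLeaves < data.size())
      numLeaves *= 2;
    node.assign(2 * numLeaves, 0);
    for (size_t i = 0; i < data.size(); ++i)
      node[numLeaves + i] = data[i];
    for (size_t i = numLeaves - 1; i >= 1; --i)
      node[i] = std::max(node[2 * i], node[2 * i + 1]);
  }

  // Maximum over elements [first, last], inclusive. O(log n): walk the two
  // boundary leaves toward the root, absorbing whole sibling subtrees that
  // lie strictly inside the range.
  uint32_t Max(size_t first, size_t last) const {
    size_t lo = numLeaves + first;
    size_t hi = numLeaves + last;
    uint32_t m = std::max(node[lo], node[hi]);
    while (lo / 2 != hi / 2) {
      if (lo % 2 == 0) m = std::max(m, node[lo + 1]);  // right sibling is inside
      if (hi % 2 == 1) m = std::max(m, node[hi - 1]);  // left sibling is inside
      lo /= 2;
      hi /= 2;
    }
    return m;
  }
};
```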
(Assignee)

Comment 1

6 years ago
Possible variant that might work better than what is proposed in comment 0:

Instead of only storing boolean left/right data in each tree node, we could store pairs containing that bool as well as the corresponding max value, to transform log complexity into constant complexity, so the tree in the example in comment 0 becomes:

 (5,right)   (3, left)
       \       /
        \     /
         \   /
          \ /
       (5, left)

The average complexity of finding the max in an arbitrary sub-array would then be much smaller, because we typically would quickly find that the max value is not to be found near the boundaries of the sub-array, especially for large sub-arrays.
(Assignee)

Updated

6 years ago
Whiteboard: webgl-next
(Assignee)

Updated

5 years ago
Blocks: 761461

Comment 2

5 years ago
The case of partial buffer updates is very important in BananaBread, which uploads lots of temp data each frame, and that data can look basically random. We have temp buffers that we constantly upload into, and using a large temp buffer is very slow (so we use as small a buffer as we can in each call).

To optimize partial buffer updates, the first thing that comes to mind is to split the buffer recursively into segments, so a buffer of 8 elements would look like

******** max of entire buffer
aaaabbbb max of first half, max of second half
ccddeeff max of first, second, third and fourth quarter

So we store *, a, b, c, d, e, f, a total of 7 values.

When using the entire buffer, * is used. If a range that falls in the first half is used, a is used, and so forth. When a partial update occurs, the relevant maximums are updated: for example, if index 3 is updated, then we update d and, if relevant, a and * in a hierarchical manner.

This basically means updating the cached maximums is logarithmic for partial updates. There is a constant factor though, so partial updates that update almost the entire buffer are slower than before.

Using the maximums is straightforward: you pick the one that covers the relevant range for you.
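
The hierarchical update scheme described above can be sketched with the usual 1-based heap layout (node i has children 2i and 2i+1, leaves start at index numLeaves); RangeMaxCache and its members are made-up names, not the patch's API:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RangeMaxCache {
  size_t numLeaves;            // rounded up to a power of two
  std::vector<uint32_t> node;  // 1-based heap layout; node[0] is unused

  explicit RangeMaxCache(size_t elementCount) {
    numLeaves = 1;
    while (numLeaves < elementCount)
      numLeaves *= 2;
    node.assign(2 * numLeaves, 0);
  }

  // Partial update: write |value| at |index| and repair the O(log n)
  // ancestors whose cached maximum may have changed.
  void Update(size_t index, uint32_t value) {
    size_t i = numLeaves + index;
    node[i] = value;
    for (i /= 2; i >= 1; i /= 2)
      node[i] = std::max(node[2 * i], node[2 * i + 1]);
  }

  uint32_t GlobalMax() const { return node[1]; }  // this is the '*' value
};
```

Note that each ancestor is recomputed from both of its children, so the cached maximums stay exact even when an update lowers a value.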
I am tempted to suggest that we start to rely on ARB_robustness for this where we can, along with whitelists of card types that give the correct known value.  AFAIK, Direct3D 10+ guarantees to return zero for any resource that is accessed out of bounds, so we should be able to start with ARB_robustness supported and DX10 support.

I'm pretty sure that the eventual goal is to define out of bounds access to return zero, so if we can get consensus on that right now (if we don't have it already), we can start using robustness as it currently exists...
(Assignee)

Comment 4

5 years ago
Alon: Comment 2 sounds like exactly what I tried to express in Comment 1, thanks for helping think it out. Also, knowing that BananaBread has a concrete use case with both partial updates and random accesses is really important.

Vladimir: it would be great to rely on ARB_robustness for this when it's available, and thereby reward driver vendors who fully implement it. Unfortunately, the ARB_robustness spec, http://www.opengl.org/registry/specs/ARB/robustness.txt , says:

    Add to Section 2.8 "Vertex Arrays" before 2.8.1 "Drawing Commands"

    Robust buffer access is enabled by creating a context with robust
    access enabled through the window system binding APIs. When enabled,
    indices within the vertex array that lie outside the arrays defined
    for enabled attributes result in undefined values for the
    corresponding attributes, but cannot result in application failure.

So in my understanding, we can't rely on this for WebGL drawElements, even if we relax the requirement that drawElements must generate INVALID_OPERATION on out-of-bounds indices (section 6.4 in WebGL spec) as we could still be allowing the application to access illegal data --- the only guarantee that we get, as far as I can see, is that we don't crash. Right?
Ah crap, I forgot that we spec'd INVALID_OPERATION.  No, the assertion is that on DX10-class hardware, out of bounds access will always return 0 since that's what DX requires.  I think the robustness spec relaxes that because it can be implemented on DX9 and other platforms where there is no requirement that it's zero.

INVALID_OPERATION is going to be tricky to deal with.  I actually don't have any good suggestions here, other than maybe a webgl extension that lets authors explicitly turn on the zero-index behaviour.  It wouldn't be on by default, so that kind of sucks.  Unfortunately, codifying the zero-index behaviour will cause a significant slowdown for apps that do hit it and don't have robustness, but I'm not sure that we should care, especially if we can print a warning/error in debug mode... just some ideas.
(Assignee)

Comment 6

5 years ago
Yes, we could consider that. OTOH, if it works out as well as hoped, the optimization being discussed here could make this almost a non-issue speed-wise and be comparable to the current situation memory-wise (where we have to cache element array buffers).
(Assignee)

Comment 7

5 years ago
Hm, though, there will remain the extreme case where the element array buffer gets completely (or in a very large part) replaced in between every drawElement call, in which case we can't do any optimization and ARB_robustness-style extensions are the only solution.
(Assignee)

Comment 8

5 years ago
Alon: regarding comment 1: the additional difficulty is that we must compute the maximum value in the sub-array exactly, we can't round it up, otherwise we would reject valid drawElement calls. So even if the sub-array falls entirely in the first half, we can't just use the 'a' value.

Here's where I currently am. Here's an example array to work with:

2 3 7 4 1 0 5 6

We're going to build a tree as in Comment 1 so each node has a (max_value, left_or_right) pair. We write it short-hand with L for left and R for right.

So the tree is:

2    3     7    4      1    0    5    6
 \  /       \  /        \  /      \  /
  3R         7L          1L        6R
     \__  __/              \__  __/
        7R                    6R
          \________  ________/
                   7L

Leaving aside concerns about memory usage and redundancy, what storage layout should we use to represent that tree? We should try to store it in a way that makes it easy to traverse and such that the best possible memory-locality is achieved (so we get more cache hits). One possible storage layout that gives us that, is to just read the above tree left-to-right:

2, 3R, 3, 7R, 7, 7L, 4, 7L, 1, 1L, 0, 6R, 5, 6R, 6

This is particularly convenient if we use indices starting at 1 (just store a dummy entry at index 0 in the array):

index | entry
------+-------
1     | 2
2     | 3R
3     | 3
4     | 7R
5     | 7
6     | 7L
7     | 4
8     | 7L
9     | 1
10    | 1L
11    | 0
12    | 6R
13    | 5
14    | 6R
15    | 6

This storage is optimally memory-local and the tree nodes are located at easy-to-compute locations (note that tree nodes closer to the root are at indices that are multiples of higher powers of 2).

Comment 9

5 years ago
(In reply to Benoit Jacob [:bjacob] from comment #7)
> Hm, though, there will remain the extreme case where the element array
> buffer gets completely (or in a very large part) replaced in between every
> drawElement call, in which case we can't do any optimization and
> ARB_robustness-style extensions are the only solution.

Yes, but that case isn't the most worrying one IMO. The worst problem now
is that modifying one byte of a big buffer takes time proportional to
that buffer. Whereas modifying the entire buffer or almost all of it
will at worst suffer from a slowdown of 2x (the slowdown in the previous
case is proportionally worse).

(In reply to Benoit Jacob [:bjacob] from comment #8)
> Alon: regarding comment 1: the additional difficulty is that we must compute
> the maximum value in the sub-array exactly, we can't round it up, otherwise
> we would reject valid drawElement calls. So even if the sub-array falls
> entirely in the first half, we can't just use the 'a' value.

It is quick to look up an upper bound for the relevant area.
But that is likely the common case, the upper bound will be fine
and you can proceed.

If the upper bound is not enough to know that the values in
the relevant area are valid, we do need to look at additional
areas. But that should be faster than reading individual
values since you can read large chunks, I am guessing this is
logarithmic.
(In reply to Alon Zakai (:azakai) from comment #9)
> Yes, but that case isn't the most worrying one IMO. The worst problem now
> is that modifying one byte of a big buffer takes time proportional to
> that buffer. Whereas modifying the entire buffer or almost all of it
> will at worst suffer from a slowdown of 2x (the slowdown in the previous
> case is proportionally worse).

Good point!

> It is quick to look up an upper bound for the relevant area.
> But that is likely the common case, the upper bound will be fine
> and you can proceed.

Ah yes, that's likely the case for small enough sub-arrays.

> 
> If the upper bound is not enough to know that the values in
> the relevant area are valid, we do need to look at additional
> areas. But that should be faster than reading individual
> values since you can read large chunks, I am guessing this is
> logarithmic.

Yes, I understand. I think we're on the same page there, we just need to get to coding it.

Somehow I thought that the tree would need to store booleans telling whether the max value was on the left or right half, but now I agree that this doesn't seem to be needed. That was the only difference AFAICS between your proposal and mine and I think you are right on this (storing max values is enough).
Cool.

Btw, let me know if you don't have time for this, I might be able to find time to write it.
Created attachment 630333 [details]
self-contained proof-of-concept impl

Still have to clean that up and hook it into WebGLBuffer. This also has a unit test which needs to be made a real compiled unit test (and we need to enable compiled unit tests in some tests subdir of content/canvas).
Got any perf numbers for your impl?
I suggest not populating this data structure if it isn't used. If only the entire buffer is used, there is no need to calculate all the maximums. The maximums can be calculated the first time we actually need to calculate the maximum for a subset of the buffer (and from that point, we would update the maximums on updates). That way if the entire buffer is constantly updated, we remain as fast as we are now.
Vlad: all I know is that the unit test runs 50% faster with this algorithm, but it's not intended to be a benchmark, and the code isn't optimized, so I don't have real performance numbers yet.

Alon: all great points!
Created attachment 630695 [details]
self-contained proof-of-concept impl, unit-test and benchmark
Created attachment 630700 [details]
benchmark results

So here are the benchmark results, on a buffer of size 1048576. They vary considerably depending on parameters:
 - update_size: the number of indices updated by each bufferSubData call
 - draw_size: the number of indices consumed by each drawElements call
 - draws_between_updates: the number of drawElements calls between two consecutive bufferSubData calls

Basically, the speedup is often large (over 100x speedup in favorable cases) but can be negligible, or even a slowdown, in unfavorable cases:
 - very small drawElements calls, draw_size < 16  (but this won't be efficient anyways).
 - very large buffer updates (I need to investigate this, my update_tree function needs optimization. At least, it's not cache-friendly at the moment).
I'd also be curious about the absolute amount of time taken -- e.g. the total amount of time in us/ms taken by doing the index checking, if that's not too hard to print.  Though I think I can just take your self-contained code and see for myself :)
(Assignee)

Updated

5 years ago
Attachment #630333 - Attachment is obsolete: true
Easy to do, these times are the |naive| and |smart| variables in the benchmark() function.
Created attachment 631020 [details]
self-contained proof-of-concept impl, unit-test and benchmark (v2)

This new version:
 - is cleaner (more or less ready to be integrated)
 - uses naive level-by-level tree storage instead of the fractal storage which I thought would be faster (turned out to be a bad idea).
 - has some optimizations, in particular bufferSubData only invalidates the cache, so that we don't regress the performance of doing many bufferSubData calls (not covered by this benchmark)
Created attachment 631021 [details]
benchmark results (v2)

Better results than the previous version, and gives absolute times.
Attachment #630695 - Attachment is obsolete: true
Attachment #630700 - Attachment is obsolete: true
Created attachment 631154 [details]
self-contained proof-of-concept impl, unit-test and benchmark (v3)

This gets further perf improvements by changing the approach: we don't need to compute the maximum, we just need to verify that it's not greater than a certain value. This allows us to apply the optimizations that Alon mentioned, when we can quickly verify that no matter what the maximum in the sub-array is, it's not greater than a certain value.

Also, this simulates the overhead of WebGL calls by adding 1 microsecond to the timing of each simulated WebGL call (which is still optimistic) so we get more realistic numbers (closer to 1x obviously, in both directions).
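
The "verify, don't compute" idea can be sketched as a recursive check over the max-tree that exits as soon as a subtree's cached maximum is already small enough; ElementRangeValidator and its members are made-up names, not the actual implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ElementRangeValidator {
  size_t numLeaves;
  std::vector<uint32_t> node;  // 1-based heap layout of cached maxima

  explicit ElementRangeValidator(const std::vector<uint32_t>& data) {
    numLeaves = 1;
    while (numLeaves < data.size())
      numLeaves *= 2;
    node.assign(2 * numLeaves, 0);
    for (size_t i = 0; i < data.size(); ++i)
      node[numLeaves + i] = data[i];
    for (size_t i = numLeaves - 1; i >= 1; --i)
      node[i] = std::max(node[2 * i], node[2 * i + 1]);
  }

  // True if no element in [first, last] exceeds maxAllowed. We never need the
  // exact maximum: as soon as a subtree's cached max is <= maxAllowed, that
  // whole subtree is known valid and we stop descending into it.
  bool Validate(size_t first, size_t last, uint32_t maxAllowed) const {
    return Check(1, 0, numLeaves - 1, first, last, maxAllowed);
  }

 private:
  bool Check(size_t i, size_t lo, size_t hi,
             size_t first, size_t last, uint32_t maxAllowed) const {
    if (last < lo || hi < first)
      return true;  // this subtree is disjoint from the queried range
    if (node[i] <= maxAllowed)
      return true;  // early exit: the whole subtree is OK
    if (lo == hi)
      return false;  // offending leaf found
    size_t mid = lo + (hi - lo) / 2;
    return Check(2 * i, lo, mid, first, last, maxAllowed) &&
           Check(2 * i + 1, mid + 1, hi, first, last, maxAllowed);
  }
};
```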
Created attachment 631155 [details]
benchmark results (v3)

The "loose" numbers simulate the situation where the whole element array buffer is OK for the vertex attribs, so the validation can exit early. The "tight" numbers simulate the situation where the validation has to be careful as the vertex attribs are just exactly big enough for the sub-element-array.

Again: a 1e-6 second constant overhead has been added to each simulated WebGL call, whence the less dramatic numbers.
Attachment #631020 - Attachment is obsolete: true
Attachment #631021 - Attachment is obsolete: true
Created attachment 632998 [details]
self-contained proof-of-concept impl, unit-test and benchmark (v4)

New version: adapted to match the actual WebGL API: the buffer itself is untyped; the type is only specified in the drawElements call.

This requires storing two trees, so an idea was needed to avoid wasting too much space.

Ehsan's idea: cut the tree a few levels below the top. Indeed, most of the memory usage is in the top levels of the tree, and these are the least useful.

Experimentation shows that a good level to make this cut is 3 levels below the top i.e. the tree leaves represent 2^3 == 8 buffer entries each.

Another nontrivial part was to decide how lazy to be with tree updates.

A fully up-to-date tree is required to make the important optimization that we avoid the computation altogether when the global buffer maximum is itself low enough. So it seemed like a better idea not to try being lazy in drawElements. So drawElements will always update the tree if it has been invalidated, even if the invalidated part isn't the part consumed by drawElements. The idea is that in real-world cases doing partial updates and partial draws, the updated buffer parts will be consumed by the draws anyway (otherwise the application has a performance bug).

The part where it gets subtle is when there are lots of consecutive partial buffer updates. When should we be lazy with tree updates? In the case of lots of overlapping or at least adjacent buffer updates, it seems like a good idea to be lazy, to group tree updates together. But in the case of random, far-apart, small buffer updates, this can be a bad idea: if I invalidate the range 0..20 and then invalidate the range 1000..1020, and I can only represent a single range in my invalidation mechanism, then I will have to invalidate the range 0..1020, which is much larger than my 2 individual updated ranges. So in this case, the right approach when we see the second update (1000..1020) is to perform the update of the already invalidated part (0..20) and go on with 1000..1020 invalidated. See the invalidate_bytes function.

The benchmark would need to be updated to cover this.
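
The invalidation policy described above (merge nearby updates lazily, flush before tracking a far-apart one) can be sketched as follows; PendingInvalidation and its members are hypothetical names standing in for the actual invalidate_bytes logic:

```cpp
#include <algorithm>
#include <cstddef>

// A single pending invalidated [start, end) byte range is kept. A new update
// either grows it (cheap, when the ranges overlap or are adjacent) or forces
// the pending range to be flushed first, so we never inflate the invalidated
// span to cover two small far-apart updates.
struct PendingInvalidation {
  bool hasPending = false;
  size_t start = 0, end = 0;  // half-open byte range

  // FlushFn is called to bring the previously invalidated range up to date.
  template <typename FlushFn>
  void Invalidate(size_t newStart, size_t newEnd, FlushFn flush) {
    if (!hasPending) {
      hasPending = true;
      start = newStart;
      end = newEnd;
      return;
    }
    if (newStart <= end && start <= newEnd) {
      // Overlapping or adjacent: merge and stay lazy.
      start = std::min(start, newStart);
      end = std::max(end, newEnd);
      return;
    }
    // Far apart: flush the old range, then track only the new one.
    flush(start, end);
    start = newStart;
    end = newEnd;
  }
};
```

With a single pending range, this rule keeps tree updates grouped for adjacent bufferSubData calls while bounding the work done for random, far-apart ones.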
Attachment #631154 - Attachment is obsolete: true
Created attachment 633000 [details]
benchmark results (v4)
Attachment #631155 - Attachment is obsolete: true
Whiteboard: webgl-next → webgl-next [games:p1]
Created attachment 646947 [details] [diff] [review]
element array validation

See the long comment in WebGLElementArrayCache.h.
Attachment #646947 - Flags: review?(jgilbert)
Rather, in WebGLElementArrayCache.cpp.
Created attachment 646949 [details] [diff] [review]
element array validation

removed useless header export
Attachment #646947 - Attachment is obsolete: true
Attachment #646947 - Flags: review?(jgilbert)
Attachment #646949 - Flags: review?(jgilbert)
(Assignee)

Updated

5 years ago
Summary: Smart optimization for drawElements on subarray and/or dynamically updated array → Efficient drawElements validation on subarray and/or dynamically updated array
Created attachment 648105 [details] [diff] [review]
element array validation

Fixed makefiles to build in more than just the mozconfig that I had been using so far. Also, the EXPORTS was actually needed because that header is included by the exported WebGLContext.h.
Attachment #646949 - Attachment is obsolete: true
Attachment #646949 - Flags: review?(jgilbert)
Attachment #648105 - Flags: review?(jgilbert)
Comment on attachment 648105 [details] [diff] [review]
element array validation

Review of attachment 648105 [details] [diff] [review]:
-----------------------------------------------------------------

You have a bunch of non_camel_case identifiers in the unit test, as you mentioned on IRC, as well.

A bunch of nits, and some actual concerns, so r-. Largely fine though.

::: content/canvas/src/WebGLContext.h
@@ +1497,5 @@
>          LinkedListElement<WebGLBuffer>::remove(); // remove from mContext->mBuffers
>      }
>  
>      size_t SizeOfIncludingThis(nsMallocSizeOfFun aMallocSizeOf) const {
> +        return aMallocSizeOf(this); //  + size of cache

Is the size of the cache included here? It looks like this would only include the size of the autoPtr.

::: content/canvas/src/WebGLContextGL.cpp
@@ +512,5 @@
>      }
>  
>      boundBuffer->SetByteLength(size);
> +    boundBuffer->ElementArrayCacheBufferData(nsnull, size);
> +    // on OOM,  return ErrorOutOfMemory("bufferData: out of memory");

Does this comment mean that this is thrown somewhere inside of ElementArrayCacheBufferData?

@@ +1850,5 @@
>        return;
>  
> +    int32_t maxAllowedIndex = maxAllowedCount >= 1
> +                              ? maxAllowedCount - 1
> +                              : 0;

I would think |NS_MAX(maxAllowedCount-1, 0)| would be cleaner.

::: content/canvas/src/WebGLElementArrayCache.cpp
@@ +153,5 @@
> +    return mTreeData[1];
> +  }
> +
> +  // returns the index of the parent node; if treeIndex=1 (the root node),
> +  // the return value if 0.

s/if/is

Also consider asserting that treeIndex is non-zero.

@@ +187,5 @@
> +    return treeIndex + distance;
> +  }
> +
> +  // will clamp automatically if element is out of bounds
> +  size_t LeafForElement(size_t element) {

Is it useful to have this behavior instead of asserting?

@@ +196,5 @@
> +  size_t LeafForByte(size_t byte) {
> +    return LeafForElement(byte / sizeof(T));
> +  }
> +
> +  size_t TreeIndexForLeaf(size_t leaf) {

Add a comment that explains what this does, and why it's just |leaf + mNumLeaves|.

@@ +201,5 @@
> +    return leaf + mNumLeaves;
> +  }
> +
> +  static size_t LastElementUnderSameLeaf(size_t element) {
> +    size_t mask = sElementsPerLeaf - 1;

Since we calculate this mask in (at least) three places, really consider adding |sElementPerLeafMask| or similar. Better source locality would make it more clear that sElementsPerLeaf is always PoT, also.

@@ +264,5 @@
> +  }
> +
> +  void ResizeForDataBytes(size_t byteSize)
> +  {
> +    size_t numberOfElements = mParent.ByteSize() / sizeof(T);

Did you mean |byteSize|? Otherwise |byteSize| appears unused here.

@@ +266,5 @@
> +  void ResizeForDataBytes(size_t byteSize)
> +  {
> +    size_t numberOfElements = mParent.ByteSize() / sizeof(T);
> +    size_t numberOfElementsRoundedUp = NextMultipleOfElementsPerLeaf(numberOfElements);
> +    size_t requiredNumLeaves = numberOfElementsRoundedUp / sElementsPerLeaf;

If reqNumLeaves is all you care about, why not use this fairly common meme:
// x' = (x + (n-1)) / n;
// For n=4:
// x=0: (0 + (4-1)) / 4 = (0+3)/4 = 3/4 = 0
// x=1: (1 + 3) / 4 = 4/4 = 1
// x=4: (4 + 3) / 4 = 7/4 = 1
// x=5: (5 + 3) / 4 = 8/4 = 2
reqNumLeaves = (numElements + (sElemsPerLeaf - 1)) / sElemsPerLeaf;

@@ +270,5 @@
> +    size_t requiredNumLeaves = numberOfElementsRoundedUp / sElementsPerLeaf;
> +
> +    mNumLeaves = 1;
> +    while (mNumLeaves < requiredNumLeaves)
> +      mNumLeaves *= 2;

Simple as this is, I'd rather call a NextPowerOfTwo util function for this. If it doesn't already have one, this would probably make a good addition to MFBT.

@@ +272,5 @@
> +    mNumLeaves = 1;
> +    while (mNumLeaves < requiredNumLeaves)
> +      mNumLeaves *= 2;
> +
> +    mTreeData.SetLength(2 * mNumLeaves);

I know it's way up in the huge comment at the top, but it would be good to leave a tiny comment here to reiterate that the structure size is 2*numLeaves, with a 'see explanation above for why', or something.

@@ +334,5 @@
> +
> +  // Step #1: initialize the tree leaves from plain buffer data.
> +  // That is, each tree leaf must be set to the max of the |sElementsPerLeaf| corresponding
> +  // buffer entries.
> +  {

Why is there a conditionless scope?

Oh, I imagine this is to not leak this scope's variables into the code below. Generally this would mean we would want this as a separate function, though.

I'm fine leaving this, but add a comment about why we're adding this scope here.

@@ +346,5 @@
> +        m = NS_MAX(m, mParent.Element<T>(srcIndex));
> +      }
> +      mTreeData[treeIndex] = m;
> +      treeIndex++;
> +    }

It seems this would be much clearer as something like:
for (cur = firstInvalidLeaf; cur <= lastInvalidLeaf; cur++) {
SetLeafMax(cur, GetMaxBufferElem(cur * elemPerLeaf, elemPerLeaf));
}

There are a number of 'fooIndex' vars, for which it's not immediately clear what they are indexes of/into.

@@ +351,5 @@
> +  }
> +
> +  // Step #2: propagate the values up the tree. This is simply a matter of walking up
> +  // the tree and setting each node to the max of its two children.
> +  while (true) {

Any chance we could reformulate this into a recursive function?

@@ +449,5 @@
> +  size_t lastElement = firstElement + countElements - 1;
> +
> +  tree->Update();
> +
> +  // fast exist path when the global maximum for the whole element array buffer

s/fast exist/fast-exit

::: content/canvas/src/WebGLElementArrayCache.h
@@ +66,5 @@
> +  friend class WebGLElementArrayCacheTree;
> +  template<typename T>
> +  friend struct TreeForType;
> +
> +  void *mUntypedData;

Prefer |void* foo| to |void *foo|.

@@ +69,5 @@
> +
> +  void *mUntypedData;
> +  size_t mByteSize;
> +  WebGLElementArrayCacheTree<uint8_t> *mUint8Tree;
> +  WebGLElementArrayCacheTree<uint16_t> *mUint16Tree;

|type *foo| conflicts with the style above, and I would prefer |type* foo| regardless.

::: content/canvas/test/compiled/TestWebGLElementArrayCache.cpp
@@ +30,5 @@
> +    VerifyImplFunction((condition), __FILE__, __LINE__)
> +
> +void MakeRandomVector(nsTArray<uint8_t>& a, size_t size) {
> +  a.SetLength(size);
> +  const size_t bits_to_ignore_in_rand = 16;

Why so many? Why do we ignore any bits at all? RAND_MAX is only guaranteed to be >= (2^15 - 1).

@@ +31,5 @@
> +
> +void MakeRandomVector(nsTArray<uint8_t>& a, size_t size) {
> +  a.SetLength(size);
> +  const size_t bits_to_ignore_in_rand = 16;
> +  MOZ_ASSERT((unsigned int)(RAND_MAX) >> (8 + bits_to_ignore_in_rand));

Could this be made into a static assert?

@@ +44,5 @@
> +  return result;
> +}
> +
> +template<typename T>
> +void CheckValidateOneType(WebGLElementArrayCache& b, size_t first_byte, size_t count_bytes)

Why is it called 'b'?

@@ +47,5 @@
> +template<typename T>
> +void CheckValidateOneType(WebGLElementArrayCache& b, size_t first_byte, size_t count_bytes)
> +{
> +  size_t first = first_byte / sizeof(T);
> +  size_t count = count_bytes / sizeof(T);

You might assert that count_bytes % sizeof(T) is 0.

@@ +53,5 @@
> +  GLenum type = sizeof(T) == 1
> +                ? LOCAL_GL_UNSIGNED_BYTE
> +                : LOCAL_GL_UNSIGNED_SHORT;
> +
> +  T max = 0;

Consider using size_t here, and cast down to T below.

@@ +83,5 @@
> +
> +  for (int maxBufferSize = 1; maxBufferSize <= 4096; maxBufferSize *= 2) {
> +    int repeat = std::min(maxBufferSize, 20);
> +    for (int i = 0; i < repeat; i++) {
> +    size_t size = RandomInteger<size_t>(1, maxBufferSize);

Bad indenting following for(){.

@@ +108,5 @@
> +              CheckValidate(b, offset, subsize);
> +            }
> +          }
> +        }
> +      }

Can you mark one or two of these scope-endings with a comment saying which part it is that's done?
Attachment #648105 - Flags: review?(jgilbert) → review-
Actually, it would be great to add one or two tests to the unit test using known good/bad data. [1,2,3,4,5,6,7,8] should fail for max=4 from 0:5, but succeed for from 0:4, etc.
Created attachment 659362 [details] [diff] [review]
element array validation

Followed most of your recommendations; see next comment.
Attachment #648105 - Attachment is obsolete: true
Attachment #659362 - Flags: review?(jgilbert)
Comment on attachment 659362 [details] [diff] [review]
element array validation

moment, patch not quite ready
Attachment #659362 - Flags: review?(jgilbert)
Created attachment 659374 [details] [diff] [review]
element array validation
Attachment #659362 - Attachment is obsolete: true
Attachment #659374 - Flags: review?(jgilbert)
(In reply to Jeff Gilbert [:jgilbert] from comment #30)
> Comment on attachment 648105 [details] [diff] [review]
> element array validation
> 
> Review of attachment 648105 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> You have a bunch of non_camel_case identifiers in the unit test, as you
> mentioned on IRC, as well.
> 
> A bunch of nits, and some actual concerns, so r-. Largely fine though.
> 
> ::: content/canvas/src/WebGLContext.h
> @@ +1497,5 @@
> >          LinkedListElement<WebGLBuffer>::remove(); // remove from mContext->mBuffers
> >      }
> >  
> >      size_t SizeOfIncludingThis(nsMallocSizeOfFun aMallocSizeOf) const {
> > +        return aMallocSizeOf(this); //  + size of cache
> 
> Is the size of the cache included here? It looks like this would only
> include the size of the autoPtr.

Good catch.

> 
> ::: content/canvas/src/WebGLContextGL.cpp
> @@ +512,5 @@
> >      }
> >  
> >      boundBuffer->SetByteLength(size);
> > +    boundBuffer->ElementArrayCacheBufferData(nsnull, size);
> > +    // on OOM,  return ErrorOutOfMemory("bufferData: out of memory");
> 
> Does this comment mean that this is thrown somewhere inside of
> ElementArrayCacheBufferData?

Good catch too. Made this func return bool.

> 
> @@ +1850,5 @@
> >        return;
> >  
> > +    int32_t maxAllowedIndex = maxAllowedCount >= 1
> > +                              ? maxAllowedCount - 1
> > +                              : 0;
> 
> I would think |NS_MAX(maxAllowedCount-1, 0)| would be cleaner.

Done.

> 
> ::: content/canvas/src/WebGLElementArrayCache.cpp
> @@ +153,5 @@
> > +    return mTreeData[1];
> > +  }
> > +
> > +  // returns the index of the parent node; if treeIndex=1 (the root node),
> > +  // the return value if 0.
> 
> s/if/is
> 
> Also consider asserting that treeIndex is non-zero.

Done.

> 
> @@ +187,5 @@
> > +    return treeIndex + distance;
> > +  }
> > +
> > +  // will clamp automatically if element is out of bounds
> > +  size_t LeafForElement(size_t element) {
> 
> Is it useful to have this behavior instead of asserting?

Was bad design. Fixed in the caller. Now asserting.

> 
> @@ +196,5 @@
> > +  size_t LeafForByte(size_t byte) {
> > +    return LeafForElement(byte / sizeof(T));
> > +  }
> > +
> > +  size_t TreeIndexForLeaf(size_t leaf) {
> 
> Add a comment that explains what this does, and why it's just |leaf +
> mNumLeaves|.

Done.

> 
> @@ +201,5 @@
> > +    return leaf + mNumLeaves;
> > +  }
> > +
> > +  static size_t LastElementUnderSameLeaf(size_t element) {
> > +    size_t mask = sElementsPerLeaf - 1;
> 
> Since we calculate this mask in (at least) three places, really consider
> adding |sElementPerLeafMask| or similar. Better source locality would make
> it more clear that sElementsPerLeaf is always PoT, also.

Done.

> 
> @@ +264,5 @@
> > +  }
> > +
> > +  void ResizeForDataBytes(size_t byteSize)
> > +  {
> > +    size_t numberOfElements = mParent.ByteSize() / sizeof(T);
> 
> Did you mean |byteSize|? Otherwise |byteSize| appears unused here.

Good catch. The function is renamed to ResizeToParentSize and no longer takes that argument.

> 
> @@ +266,5 @@
> > +  void ResizeForDataBytes(size_t byteSize)
> > +  {
> > +    size_t numberOfElements = mParent.ByteSize() / sizeof(T);
> > +    size_t numberOfElementsRoundedUp = NextMultipleOfElementsPerLeaf(numberOfElements);
> > +    size_t requiredNumLeaves = numberOfElementsRoundedUp / sElementsPerLeaf;
> 
> If reqNumLeaves is all you care about, why not use this fairly common meme:
> // x' = (x + (n-1)) / n;
> // For n=4:
> // x=0: (0 + (4-1)) / 4 = (0+3)/4 = 3/4 = 0
> // x=1: (1 + 3) / 4 = 4/4 = 1
> // x=4: (4 + 3) / 4 = 7/4 = 1
> // x=5: (5 + 3) / 4 = 8/4 = 2
> reqNumLeaves = (numElements + (sElemsPerLeaf - 1)) / sElemsPerLeaf;

Done
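
The round-up-division idiom from the comment above, as a standalone sketch (the function name is illustrative):

```cpp
#include <cassert>
#include <cstddef>

// Ceiling division: the smallest k such that k * n >= x.
// Adding (n - 1) before dividing rounds any nonzero remainder up.
size_t CeilDiv(size_t x, size_t n) {
  return (x + n - 1) / n;
}
```

This reproduces the reviewer's worked values for n=4: x=0 gives 0, x=1 gives 1, x=4 gives 1, x=5 gives 2.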

> 
> @@ +270,5 @@
> > +    size_t requiredNumLeaves = numberOfElementsRoundedUp / sElementsPerLeaf;
> > +
> > +    mNumLeaves = 1;
> > +    while (mNumLeaves < requiredNumLeaves)
> > +      mNumLeaves *= 2;
> 
> Simple as this is, I'd rather call a NextPowerOfTwo util function for this.

Done
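
A minimal sketch of what such a NextPowerOfTwo helper could look like, factoring out the loop in the patch (illustrative; not the MFBT code):

```cpp
#include <cassert>
#include <cstddef>

// Smallest power of two that is >= x. For x == 0 this returns 1, matching
// the patch's behavior of always having at least one leaf.
size_t NextPowerOfTwo(size_t x) {
  size_t result = 1;
  while (result < x)
    result *= 2;
  return result;
}
```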

> If it doesn't already have one, this would probably make a good addition to
> MFBT.

Agree.

> 
> @@ +272,5 @@
> > +    mNumLeaves = 1;
> > +    while (mNumLeaves < requiredNumLeaves)
> > +      mNumLeaves *= 2;
> > +
> > +    mTreeData.SetLength(2 * mNumLeaves);
> 
> I know it's way up in the huge comment at the top, but it would be good to
> leave a tiny comment here to reiterate that the structure size is
> 2*numLeaves, with a 'see explanation above for why', or something.

Done

> 
> @@ +334,5 @@
> > +
> > +  // Step #1: initialize the tree leaves from plain buffer data.
> > +  // That is, each tree leaf must be set to the max of the |sElementsPerLeaf| corresponding
> > +  // buffer entries.
> > +  {
> 
> Why is there a conditionless scope?
> 
> Oh, I imagine this is to not leak this scope's variables into the code
> below. Generally this would mean we would want this as a separate function,
> though.
> 
> I'm fine leaving this, but add a comment about why we're adding this scope
> here.

Done

> 
> @@ +346,5 @@
> > +        m = NS_MAX(m, mParent.Element<T>(srcIndex));
> > +      }
> > +      mTreeData[treeIndex] = m;
> > +      treeIndex++;
> > +    }
> 
> It seems this would be much clearer as something like:
> for (cur = firstInvalidLeaf; cur <= lastInvalidLeaf; cur++) {
> SetLeafMax(cur, GetMaxBufferElem(cur * elemPerLeaf, elemPerLeaf));
> }
> 
> There are a number of 'fooIndex' vars, for which it's not immediately clear
> what they are indexes of/into.

Added comments.

> 
> @@ +351,5 @@
> > +  }
> > +
> > +  // Step #2: propagate the values up the tree. This is simply a matter of walking up
> > +  // the tree and setting each node to the max of its two children.
> > +  while (true) {
> 
> Any chance we could reformulate this into a recursive function?

Maybe, I don't know and don't see enough potential benefits.

> 
> @@ +449,5 @@
> > +  size_t lastElement = firstElement + countElements - 1;
> > +
> > +  tree->Update();
> > +
> > +  // fast exist path when the global maximum for the whole element array buffer
> 
> s/fast exist/fast-exit

I was trying to engage a conversation on existentialism, you insensitive clod.

> 
> ::: content/canvas/src/WebGLElementArrayCache.h
> @@ +66,5 @@
> > +  friend class WebGLElementArrayCacheTree;
> > +  template<typename T>
> > +  friend struct TreeForType;
> > +
> > +  void *mUntypedData;
> 
> Prefer |void* foo| to |void *foo|.

Done

> 
> @@ +69,5 @@
> > +
> > +  void *mUntypedData;
> > +  size_t mByteSize;
> > +  WebGLElementArrayCacheTree<uint8_t> *mUint8Tree;
> > +  WebGLElementArrayCacheTree<uint16_t> *mUint16Tree;
> 
> |type *foo| conflicts with the style above, and I would prefer |type* foo|
> regardless.

Done

> 
> ::: content/canvas/test/compiled/TestWebGLElementArrayCache.cpp
> @@ +30,5 @@
> > +    VerifyImplFunction((condition), __FILE__, __LINE__)
> > +
> > +void MakeRandomVector(nsTArray<uint8_t>& a, size_t size) {
> > +  a.SetLength(size);
> > +  const size_t bits_to_ignore_in_rand = 16;
> 
> Why so many? Why do we ignore any bits at all? RAND_MAX is only guaranteed
> to be >(2^15 - 1).

Only the most significant bits of rand() are typically decently random. This is guarded by an assertion now.

> 
> @@ +31,5 @@
> > +
> > +void MakeRandomVector(nsTArray<uint8_t>& a, size_t size) {
> > +  a.SetLength(size);
> > +  const size_t bits_to_ignore_in_rand = 16;
> > +  MOZ_ASSERT((unsigned int)(RAND_MAX) >> (8 + bits_to_ignore_in_rand));
> 
> Could this be made into a static assert?

Done

> 
> @@ +44,5 @@
> > +  return result;
> > +}
> > +
> > +template<typename T>
> > +void CheckValidateOneType(WebGLElementArrayCache& b, size_t first_byte, size_t count_bytes)
> 
> Why is it called 'b'.

Because of previous iterations. Fixed.


> 
> @@ +47,5 @@
> > +template<typename T>
> > +void CheckValidateOneType(WebGLElementArrayCache& b, size_t first_byte, size_t count_bytes)
> > +{
> > +  size_t first = first_byte / sizeof(T);
> > +  size_t count = count_bytes / sizeof(T);
> 
> You might assert that count_bytes % sizeof(T) is 0.

Done

> 
> @@ +53,5 @@
> > +  GLenum type = sizeof(T) == 1
> > +                ? LOCAL_GL_UNSIGNED_BYTE
> > +                : LOCAL_GL_UNSIGNED_SHORT;
> > +
> > +  T max = 0;
> 
> Consider using size_t here, and cast down to T below.

I didn't understand that part.

> 
> @@ +83,5 @@
> > +
> > +  for (int maxBufferSize = 1; maxBufferSize <= 4096; maxBufferSize *= 2) {
> > +    int repeat = std::min(maxBufferSize, 20);
> > +    for (int i = 0; i < repeat; i++) {
> > +    size_t size = RandomInteger<size_t>(1, maxBufferSize);
> 
> Bad indenting following for(){.

Fixed.

> 
> @@ +108,5 @@
> > +              CheckValidate(b, offset, subsize);
> > +            }
> > +          }
> > +        }
> > +      }
> 
> Can you mark one or two of these scope-endings with a comment saying which
> part it is that's done?

Done

(In reply to Benoit Jacob [:bjacob] from comment #35)
> (In reply to Jeff Gilbert [:jgilbert] from comment #30)
> > Comment on attachment 648105 [details] [diff] [review]
> > element array validation
> > 
> > Review of attachment 648105 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > @@ +351,5 @@
> > > +  }
> > > +
> > > +  // Step #2: propagate the values up the tree. This is simply a matter of walking up
> > > +  // the tree and setting each node to the max of its two children.
> > > +  while (true) {
> > 
> > Any chance we could reformulate this into a recursive function?
> 
> Maybe, I don't know and don't see enough potential benefits.
Recursive operations and trees tend to go hand-in-hand in my mind, and I like how each recursive step tends to be very concise. However, this is probably fine.
> > 
> > @@ +53,5 @@
> > > +  GLenum type = sizeof(T) == 1
> > > +                ? LOCAL_GL_UNSIGNED_BYTE
> > > +                : LOCAL_GL_UNSIGNED_SHORT;
> > > +
> > > +  T max = 0;
> > 
> > Consider using size_t here, and cast down to T below.
> 
> I didn't understand that part.
Yeah, me neither. I think I was wrong though, this seems fine.

(In reply to Jeff Gilbert [:jgilbert] from comment #36)
> (In reply to Benoit Jacob [:bjacob] from comment #35)
> > (In reply to Jeff Gilbert [:jgilbert] from comment #30)
> > > Comment on attachment 648105 [details] [diff] [review]
> > > element array validation
> > > 
> > > Review of attachment 648105 [details] [diff] [review]:
> > > -----------------------------------------------------------------
> > > 
> > > @@ +351,5 @@
> > > > +  }
> > > > +
> > > > +  // Step #2: propagate the values up the tree. This is simply a matter of walking up
> > > > +  // the tree and setting each node to the max of its two children.
> > > > +  while (true) {
> > > 
> > > Any chance we could reformulate this into a recursive function?
> > 
> > Maybe, I don't know and don't see enough potential benefits.
> Recursive operations and trees tend to go hand-in-hand in my mind, and I
> like how each recursive step tends to be very concise. However, this is
> probably fine.

I see your point. A counter-point though is that these trees are typically "linked" trees and here I use a special kind of "compact" tree. Using loops instead of recursion is sort of the code counterpart to this data-structure difference.
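
To make the loops-vs-recursion trade-off concrete, here is a sketch of Step #2 (bottom-up max propagation) on such a compact tree, assuming the root-at-index-1 layout where the children of node i are at 2i and 2i+1 and the leaves occupy [numLeaves, 2*numLeaves). This is illustrative, not the landed code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Propagate leaf maxima up a compact binary tree stored in a flat array of
// size 2*numLeaves. Iterating internal nodes in decreasing index order
// guarantees both children of node i (at 2i and 2i+1 > i) are already final.
void PropagateMaxUpTree(std::vector<uint16_t>& tree, size_t numLeaves) {
  for (size_t i = numLeaves - 1; i >= 1; i--)
    tree[i] = std::max(tree[2 * i], tree[2 * i + 1]);
}
```

No recursion and no per-node pointers: the single decreasing loop is the code counterpart of the compact array layout.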
Anything else before I can get my r+?

Comment on attachment 659374 [details] [diff] [review]
element array validation

Review of attachment 659374 [details] [diff] [review]:
-----------------------------------------------------------------

This was the issue I mentioned in IRC.

::: content/canvas/src/WebGLContextGL.cpp
@@ +554,5 @@
>      }
>  
>      boundBuffer->SetByteLength(data->Length());
> +    boundBuffer->ElementArrayCacheBufferData(data->Data(), data->Length());
> +    // on OOM,      return ErrorOutOfMemory("bufferData: out of memory");

It looks like this doesn't actually do the return? [1]

@@ +589,5 @@
>      }
>  
>      boundBuffer->SetByteLength(data.Length());
> +    boundBuffer->ElementArrayCacheBufferData(data.Data(), data.Length());
> +    // on OOM,    return ErrorOutOfMemory("bufferData: out of memory");

[1] again

::: content/canvas/test/compiled/TestWebGLElementArrayCache.cpp
@@ +159,5 @@
> +          } // validateCalls
> +        } // bufferSubDataCalls
> +      } // j
> +    } // i
> +  } // maxBufferSize

These comments work surprisingly well.
Attachment #659374 - Flags: review?(jgilbert) → review-
Created attachment 661930 [details] [diff] [review]
element array validation

Oh right. This version should have it fixed. Still figuring out TEST_DIRS vs TOOLS_DIRS for landing --- looks like Android wants TOOLS_DIRS while Linux wants TEST_DIRS. Fun.
Attachment #659374 - Attachment is obsolete: true
Attachment #661930 - Flags: review?(jgilbert)
Attachment #661930 - Flags: review?(jgilbert) → review+
http://hg.mozilla.org/integration/mozilla-inbound/rev/ff0e65d0ee17
Updated 5 years ago:
Assignee: nobody → bjacob
Target Milestone: --- → mozilla18
https://hg.mozilla.org/mozilla-central/rev/ff0e65d0ee17
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Depends on: 825028

Updated 4 years ago:
Blocks: 710398