Write microbenchmarks for tree traversal algorithms

RESOLVED FIXED in Firefox 50

Status

()

RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: botond, Assigned: kevin.m.wern, Mentored)

Tracking

Trunk
mozilla50
Points:
---

Firefox Tracking Flags

(firefox48 affected, firefox50 fixed)

Details

(Whiteboard: gfx-noted)

Attachments

(3 attachments)

In bug 1226919 and its dependencies, we are writing generic tree traveral algorithms, and replacing manual loops over certain trees, such as the Layer tree, with calls to these algorithms.

I don't expect this transformation to have any performance impact, but it would be good to verify that this is the case. Now that we have a micro-benchmarking framework (added in bug 1256408), that would be a good way to verify this.
Whiteboard: gfx-noted
(Assignee)

Updated

3 years ago
Assignee: nobody → kevin.m.wern
Mentor: botond
(Assignee)

Comment 1

3 years ago
Hey, Botond!

I'm taking a look over the patch you linked to, and am starting work on this bug this weekend.  I'm thinking that I should extend TestTreeTraversal to include benchmarks for a node being traversed by normal recursion, and then the corresponding solution using our new functions.  In addition, I've been thinking two other things:
- Are there functions in gfx/apz code that would make good 'practical' tests?
- Other than the fact that we're calling lambda functions, is there any potential for overhead?  I can't think of any way that the specific objects being captured or calls being made would expose any performance difference (other than magnifying any differences because more work is done in each call).

Any other information you think would be helpful is appreciated, as well.
Flags: needinfo?(botond)
(In reply to Kevin Wern from comment #1)

Thanks for working on this!

> I'm thinking that I should extend TestTreeTraversal to
> include benchmarks for a node being traversed by normal recursion, and then
> the corresponding solution using our new functions.  

That sounds like the right idea.

> - Are there functions in gfx/apz code that would make good 'practical' tests?

Nothing in particular comes to mind. It might make sense to have a test case that does a search of the tree, and one that traverses the entire tree and performs some computation (e.g. say the leaves start off storing numbers, the algorithm could compute, for each node, the sum of the numbers of its children; or, if you want to be particularly realistic, you could use regions instead of numbers, and take the union of the regions).

> - Other than the fact that we're calling lambda functions, is there any
> potential for overhead?  I can't think of any way that the specific objects
> being captured or calls being made would expose any performance difference
> (other than magnifying any differences because more work is done in each
> call).

A couple of ideas:

  - Capturing a lot of variables, including some by reference, might make it
    harder for the compiler to inline the call to the lambda, so it might be
    worth testing that.

  - Using a separate stack for keeping track of state as we traverse the tree, 
    as we do in some places (e.g. ApplyAsyncContentTransformToTree), might be 
    more expensive than before, when the recursion just made use of the 
    program's stack for this purpose. It's probably worth testing this too.

> Any other information you think would be helpful is appreciated, as well.

Feel free to build the tree structure you're testing on algorithmically, e.g. loop to add 20 levels of depth and things like that.
Flags: needinfo?(botond)
(Assignee)

Comment 3

3 years ago
Created attachment 8761412 [details]
Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms

Review commit: https://reviewboard.mozilla.org/r/58428/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/58428/
Attachment #8761412 - Flags: review?(botond)
(Assignee)

Comment 4

3 years ago
Created attachment 8761414 [details]
Output from tests, including benchmarks
(Assignee)

Comment 5

3 years ago
I tried to use regions rather than numbers for ForEachNode operations. If you feel these cases are too complicated, please let me know.

Additionally, is 26 variables enough of a test for your first point?  I couldn't find anything regarding compiler implementation that gave a concrete number.

I attached the output of one test, and will also attach 3 additional trials I ran for an overview of how everything seems to be stacking up.
(Assignee)

Comment 6

3 years ago
Created attachment 8761440 [details]
Overview of test cases, tab-separated
Comment on attachment 8761412 [details]
Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms

https://reviewboard.mozilla.org/r/58428/#review55530

Thanks, Kevin, for these thorough benchmarks!

A few comments:

::: gfx/tests/gtest/TestTreeTraversal.cpp:14
(Diff revision 1)
> +#include <queue>
> +
> +#define PERFORMANCE_TREE_DEPTH 20
> +#define PERFORMANCE_TREE_CHILD_COUNT 2
> +#define PERFORMANCE_TREE_LEAF_COUNT 1048576 // 2 ** 20
> +#define PERFORMANCE_REGION_XWRAP 100000

These can be constant integers instead of defines.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1252
(Diff revision 1)
> +{
> +  int depth = PERFORMANCE_TREE_DEPTH;
> +  int childrenCount = PERFORMANCE_TREE_CHILD_COUNT;
> +  RefPtr<SearchTestNodeForward> needleNode;
> +  RefPtr<SearchTestNodeForward> root = CreateBenchmarkTree<SearchTestNodeForward>(depth, childrenCount,
> +      [&needleNode](SearchTestNodeForward* aNode) {

It's often convenient to use a lambda as an action, but of course it doesn't have to be a lambda - it can be a regular function or function object.

In this case, as we re-use this particular action in many different tests, it makes sense to factor it out into a function object, which is then re-used by the various tests.

This applies to several lambdas that we use over and over again in these tests.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1254
(Diff revision 1)
> +  int childrenCount = PERFORMANCE_TREE_CHILD_COUNT;
> +  RefPtr<SearchTestNodeForward> needleNode;
> +  RefPtr<SearchTestNodeForward> root = CreateBenchmarkTree<SearchTestNodeForward>(depth, childrenCount,
> +      [&needleNode](SearchTestNodeForward* aNode) {
> +        aNode->SetType(SearchNodeType::Hay);
> +        if (aNode->GetFirstChild() == nullptr) {

Perhaps we could add an IsLeaf() method to TestNodeReverse and TestNodeForward, for better readability?

::: gfx/tests/gtest/TestTreeTraversal.cpp:1296
(Diff revision 1)
> +static RefPtr<Node> DepthFirstSearchCaptureVariablesForwardRecursive(RefPtr<Node> aNode,
> +    int a, int b, int c, int d, int e, int f,
> +    int g, int h, int i, int j, int k, int l,
> +    int m, int& n, int& o, int& p, int& q, int& r,
> +    int& s, int& t, int& u, int& v, int& w, int& x,
> +    int& y, int& z)

These variables may just be optimized out if they're not actually used. Perhaps we could modify the search to actually use them? For example, we could have the nodes store an integer value, and search for the one whose value is the sum of these variables.

This applies to both the plain and algorithm versions.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1498
(Diff revision 1)
> +// each ancestor's region as the union of its child regions.
> +template <typename Node>
> +static void ForEachNodePostOrderForwardRecursive(RefPtr<Node> aNode)
> +{
> +  if (aNode->GetFirstChild()) {
> +    aNode->SetRegion(new nsRegion());

We're leaking these regions. It would be better to have the nodes store them by value.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1506
(Diff revision 1)
> +        node = node->GetNextSibling()) {
> +      ForEachNodePostOrderForwardRecursive(node);
> +      nsRegion* currentRegion = aNode->GetRegion();
> +      nsRegion* childRegion = node->GetRegion();
> +      nsRegion* newRegion = new nsRegion(currentRegion->OrWith(*childRegion));
> +      aNode->SetRegion(newRegion);

currentRegion->OrWith() modifies currentRegion, so it's sufficient to do:

  currentRegion->OrWith(\*childRegion);
  
instead of creating and setting a new region.

Applies to the algorithm version too.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1526
(Diff revision 1)
> +          int y = squareCount / xWrap;
> +          aNode->SetRegion(new nsRegion(nsRect(x, y, 1, 1)));
> +          squareCount++;
> +        }
> +      });
> +  ForEachNodePostOrderForwardRecursive(root);

Do we want to make some assertion about the root node's region?

::: gfx/tests/gtest/TestTreeTraversal.cpp:1572
(Diff revision 1)
> +// Starting with a tree whose root has a rectangular region of size
> +// PERFORMANCE_TREE_LEAF_COUNT x 1, for each node, split the region into
> +// PERFORMANCE_TREE_CHILD_COUNT separate regions of equal width and assign to
> +// each child left-to-right. In the end, every node's region should equal the
> +// sum of its childrens' regions, and each level of depth's regions should sum
> +// to the root's region.

Nice :)

::: gfx/tests/gtest/TestTreeTraversal.cpp:1631
(Diff revision 1)
> +            nChildren++;
> +          }
> +          nsRect bounds = aNode->GetRegion()->GetBounds();
> +          int childWidth = bounds.width / nChildren;
> +          int x = bounds.x;
> +          for (RefPtr<ForEachTestNodeForward> node = aNode->GetFirstChild();

I note that here, we're looping over the children of each node during the action, and then the algorithm will loop over the children again to traverse them, so we're doing more looping overall than the "plain" version.

This is not necessarily a bad thing - if people are going to be introducing extra loops like this when refactoring their code to use the algorithms, then doing so in the benchmark is totally fair - I'm just pointing it out.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1652
(Diff revision 1)
> +// a starting width of PERFORMANCE_TREE_LEAF_COUNT, and an empty tree, create a
> +// tree with the same conditions as
> +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions
> +// of the current width, starting from the min x and min y coordinates. For
> +// each level of depth, decrease the current width by a factor of
> +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions.

s/maintiain/maintain

::: gfx/tests/gtest/TestTreeTraversal.cpp:1653
(Diff revision 1)
> +// tree with the same conditions as
> +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions
> +// of the current width, starting from the min x and min y coordinates. For
> +// each level of depth, decrease the current width by a factor of
> +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions.
> +// Use the stack to track the poriton of each region still available to assign

s/poriton/portion

::: gfx/tests/gtest/TestTreeTraversal.cpp:1654
(Diff revision 1)
> +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions
> +// of the current width, starting from the min x and min y coordinates. For
> +// each level of depth, decrease the current width by a factor of
> +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions.
> +// Use the stack to track the poriton of each region still available to assign
> +// to children, which determines the aformentioned min x and min y coordinates.

s/aformentioned/aforementioned

::: gfx/tests/gtest/TestTreeTraversal.cpp:1669
(Diff revision 1)
> +  aRectangleWidth /= aChildrenCount;
> +
> +  for (RefPtr<Node> node = aNode->GetFirstChild();
> +      node != nullptr;
> +      node = node->GetNextSibling()) {
> +    ForEachNodeForwardStackRecursive(node, aRectangleWidth, aRegion, aChildrenCount);

Did you mean to pass |newRegion| instead of |aRegion|? Otherwise why compute |newRegion| at all?

::: gfx/tests/gtest/TestTreeTraversal.cpp:1670
(Diff revision 1)
> +
> +  for (RefPtr<Node> node = aNode->GetFirstChild();
> +      node != nullptr;
> +      node = node->GetNextSibling()) {
> +    ForEachNodeForwardStackRecursive(node, aRectangleWidth, aRegion, aChildrenCount);
> +    newRegion = newRegion.SubOut(*(node->GetRegion()));

Like OrWith(), SubOut() modifies the region it's called on, so it's sufficient to just do:

  newRegion.SubOut(\*(node->GetRegion()));

::: gfx/tests/gtest/TestTreeTraversal.cpp:1702
(Diff revision 1)
> +static void TreeTraversal_ForwardForEachNodeStackPerformance()
> +{
> +  int depth = PERFORMANCE_TREE_DEPTH;
> +  int childrenCount = PERFORMANCE_TREE_CHILD_COUNT;
> +  int rectangleWidth = PERFORMANCE_TREE_LEAF_COUNT;
> +  stack<nsRegion*> regionStack;

Store the regions in the stack by value, so we don't leak them.

::: gfx/tests/gtest/TestTreeTraversal.cpp:1714
(Diff revision 1)
> +        nsRegion* parentRegion = regionStack.top();
> +        nsRect parentRect = parentRegion->GetBounds();
> +        nsRect newRect(parentRect.x, parentRect.y, rectangleWidth, 1);
> +        nsRegion newRegion(newRect);
> +        aNode->SetRegion(new nsRegion(newRegion));
> +        regionStack.top() = new nsRegion(regionStack.top()->SubOut(newRegion));

Just:

  regionStack.top()->SubOut(newRegion);

::: gfx/tests/gtest/TestTreeTraversal.cpp:1830
(Diff revision 1)
> +  RefPtr<SearchTestNodeReverse> needleNode;
> +  RefPtr<SearchTestNodeReverse> root = CreateBenchmarkTree<SearchTestNodeReverse>(depth, childrenCount,
> +      [&needleNode](SearchTestNodeReverse* aNode) {
> +        aNode->SetType(SearchNodeType::Hay);
> +        if (aNode->GetLastChild() == nullptr) {
> +          needleNode = aNode;

Why not |if (!needleNode)| as in the previous test?

::: gfx/tests/gtest/TestTreeTraversal.cpp:1955
(Diff revision 1)
> +  int childrenCount = PERFORMANCE_TREE_CHILD_COUNT;
> +  RefPtr<SearchTestNodeReverse> needleNode;
> +  RefPtr<SearchTestNodeReverse> root = CreateBenchmarkTree<SearchTestNodeReverse>(depth, childrenCount,
> +      [&needleNode](SearchTestNodeReverse* aNode) {
> +        aNode->SetType(SearchNodeType::Hay);
> +        if (aNode->GetLastChild() == nullptr) {

Similarly, should this be |if (!needleNode)|?
Attachment #8761412 - Flags: review?(botond)
(In reply to Kevin Wern from comment #6)
> Created attachment 8761440 [details]
> Overview of test cases, tab-separated

Interesting results! If I'm interpreting them correctly, they are showing that:

  - For forward searches, the traversal algorithms actually outperform
    hand-coded loops in all tests except for the one with the separate
    stack.

  - For reverse searches, the results are more mixed, with the algorithms
    outperforming hand-coded loops in some tests but not others.

  - For searches with a separate stack, the hand-coded loop noticeably
    outperforms the algorithm.

The difference between forward and reverse searches is a bit surprising to me; it may, however, just be due to noise in the results.

The performance hit with a separate stack is expected, as we're maintaining a separate, heap-allocated data structure. We should probably be careful that we don't have such traversals in critical code paths (I don't believe we currently do).
(Assignee)

Comment 9

3 years ago
Ok!  A few of the comments were just mistakes I made repurposing previous code or search/replacing--I'll do another pass for that once I'm done with this round of changes. Just one question:

(In reply to Botond Ballo [:botond] from comment #7)
> It's often convenient to use a lambda as an action, but of course it doesn't
> have to be a lambda - it can be a regular function or function object.
> 
> In this case, as we re-use this particular action in many different tests,
> it makes sense to factor it out into a function object, which is then
> re-used by the various tests.
> 
> This applies to several lambdas that we use over and over again in these
> tests.

Can normal functions capture external variables for state?  Otherwise, I'd try doing something creative, but I think a huge advantage here is that we don't have to reuse code for things like creating argument structs or supporting different argument lists in CreateBenchmarkTreeRecursive.
(In reply to Kevin Wern from comment #9)
> (In reply to Botond Ballo [:botond] from comment #7)
> > It's often convenient to use a lambda as an action, but of course it doesn't
> > have to be a lambda - it can be a regular function or function object.
> > 
> > In this case, as we re-use this particular action in many different tests,
> > it makes sense to factor it out into a function object, which is then
> > re-used by the various tests.
> > 
> > This applies to several lambdas that we use over and over again in these
> > tests.
> 
> Can normal functions capture external variables for state?  Otherwise, I'd
> try doing something creative, but I think a huge advantage here is that we
> don't have to reuse code for things like creating argument structs or
> supporting different argument lists in CreateBenchmarkTreeRecursive.

Plain functions can't, but function objects can. Example:

With lambdas:
  
  void CallingFunction() {
    Type var;
    ...
    Foo(..., [&var](int arg){ /* use var and arg */ });
  }

With function objects:

  struct Action {
    Type& var;
    void operator()(int arg) {
      /* use var and arg */
    }
  };

  void CallingFunction() {
    Type var;
    ...
    Foo(..., Action{var});
  }

Note the use of C++11 uniform initialization to avoid having to write a constructor for Action.
(Assignee)

Comment 11

3 years ago
Comment on attachment 8761412 [details]
Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58428/diff/1-2/
Attachment #8761412 - Flags: review?(botond)
Kevin, apologies for the long delay here. I've been travelling for the past couple of weeks. I hope to review your updated patch early next week.
Comment on attachment 8761412 [details]
Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms

https://reviewboard.mozilla.org/r/58428/#review58186

Thanks! Just one small comment remaining:

::: gfx/tests/gtest/TestTreeTraversal.cpp:1721
(Diff revision 2)
> +
> +  for (RefPtr<Node> node = aNode->GetFirstChild();
> +      node != nullptr;
> +      node = node->GetNextSibling()) {
> +    ForEachNodeForwardStackRecursive(node, aRectangleWidth, newRegion, aChildrenCount);
> +    newRegion = newRegion.SubOut(node->GetRegion());

Just
  
  newRegion.SubOut(node->GetRegion());
  
is sufficient.
Attachment #8761412 - Flags: review?(botond) → review+
https://reviewboard.mozilla.org/r/58428/#review58188

::: gfx/tests/gtest/TestTreeTraversal.cpp:2171
(Diff revision 2)
> +
> +  for (RefPtr<Node> node = aNode->GetLastChild();
> +      node != nullptr;
> +      node = node->GetPrevSibling()) {
> +    ForEachNodeReverseStackRecursive(node, aRectangleWidth, newRegion, aChildrenCount);
> +    newRegion = newRegion.SubOut(node->GetRegion());

Here too.
I made the two adjustments I suggested, and pushed the patch to Try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=27a894d5224b
(In reply to Botond Ballo [:botond] from comment #15)
> I made the two adjustments I suggested, and pushed the patch to Try:
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=27a894d5224b

The Try push has some static analysis failures of this form:

  TestTreeTraversal.cpp:49:1: error: Refcounted record 'TestNodeReverse<SearchNodeType>' has multiple mRefCnt members

The problem is that the derived classes (TestNodeReverse and TestNodeForward) and the base class (TestNodeBase) both have NS_INLINE_DECL_REFCOUNTING. It's sufficient for the base class to have it.
(Assignee)

Comment 17

3 years ago
Comment on attachment 8761412 [details]
Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58428/diff/2-3/
(Assignee)

Comment 18

3 years ago
Sorry for the delay, just made those changes now.  Try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=578935e372f2

Comment 20

3 years ago
Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9740c3f2b075
create microbenchmarks for TreeTraversal.h algorithms r=botond
^ Autolanded.

Comment 22

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9740c3f2b075
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
You need to log in before you can comment on or make changes to this bug.