Closed Bug 1426044 Opened 8 years ago Closed 7 years ago

Optimize RetainedDisplayListBuilder::MergeDisplayLists, use of hashtables seems overkill

Categories

(Core :: Web Painting, enhancement, P2)

enhancement

Tracking

()

RESOLVED DUPLICATE of bug 1439809
Tracking Status
firefox59 --- affected

People

(Reporter: mozbugz, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

Currently, RDL merging uses hashtables to find items from one display list in another one. The creation of, and search in, these hashtables is visible in profiling runs. Some quick printf-debugging showed me that in most cases, display lists only contain a few items, and most lists match 1:1 (with some skipping sometimes; didn't notice any reordering), a full hashtable search seems wasteful. Instead, just going through the list, starting at index 0 or (previous search + 1), should take care of most real-world cases much more efficiently. The worst case with this would be for items that are not in the searched list, for which we would need to go through the whole list. This could probably be optimized by looking at the display items linked with the associated frame (nsIFrame::DisplayItems -- but we need to be careful there, as we don't know which lists these items belong to). The attached dl-test.html may be used to measure such cases -- we can accept some performance impact from a new algorithm, but if it looks much worse, we may need to keep the old algorithm around, to be used for such big lists. I will try to gather more profiling data first, then design and optimize an algorithm that should handle most cases, ensuring that worst-case scenarios don't blow up. Feedback & help welcome.
Priority: -- → P2
Comment on attachment 8941250 [details] Bug 1426044 - Move list lookup into separate class, lazily construct hashtable - https://reviewboard.mozilla.org/r/211520/#review217328 ::: layout/painting/RetainedDisplayListBuilder.cpp:321 (Diff revision 1) > + template<typename F> > + explicit ListLookup(nsDisplayList* aList, F&& aItemAction) > + : mList(aList) > + , mTable(aList->Count() ? Table(Undecided()) : Table(Empty())) > + { > + for (nsDisplayItem* i = aList->GetBottom(); i != nullptr; If we're running this in the constructor (not doing it lazily), it might be worth just making the caller doing it manually rather than making ListLookup code more complicated. Unless of course there is more code in a latter patch that reuses this functionality. ::: layout/painting/RetainedDisplayListBuilder.cpp:329 (Diff revision 1) > + } > + } > + > + nsDisplayItem* GetSame(nsDisplayItem* aItem) > + { > + struct MatcherForGet Bit unfortunate how much boilerplate these matcher classes add, it would be nice if there was a version of Variant::match that just took N lambdas.
Attachment #8941250 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8941251 [details] Bug 1426044 - Merging expects old and new list to mostly match already - https://reviewboard.mozilla.org/r/211522/#review217332
Attachment #8941251 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8941250 [details] Bug 1426044 - Move list lookup into separate class, lazily construct hashtable - https://reviewboard.mozilla.org/r/211520/#review217334 ::: layout/painting/RetainedDisplayListBuilder.cpp:341 (Diff revision 1) > + // Empty list, nothing to get. > + return nullptr; > + } > + nsDisplayItem* match(Undecided&) > + { > + mSelf.CreateHashtable(); Maybe assert that the variant type is now a hashtable after this return? Or just a comment to that effect, so it's super clear that this won't recurse into itself.
Comment on attachment 8941250 [details] Bug 1426044 - Move list lookup into separate class, lazily construct hashtable - https://reviewboard.mozilla.org/r/211520/#review217328 > If we're running this in the constructor (not doing it lazily), it might be worth just making the caller doing it manually rather than making ListLookup code more complicated. > > Unless of course there is more code in a latter patch that reuses this functionality. Hopefully the 3rd patch will make it clearer why I do this (using the list visitation to do something else that's relevant to the lookup). > Bit unfortunate how much boilerplate these matcher classes add, it would be nice if there was a version of Variant::match that just took N lambdas. Agreed! :-) Except in this case, it would mean *each* lambda would have to carry its own capture, instead of having one shared capture in the struct.
Comment on attachment 8941252 [details] Bug 1426044 - Use flat array when merging small lists - https://reviewboard.mozilla.org/r/211524/#review217338 ::: layout/painting/RetainedDisplayListBuilder.cpp:374 (Diff revision 1) > + // finding the old item at the same relative location, which should now > + // be at the bottom (previous ones having been consumed). > + nsDisplayItem* bottomItem = mSelf.mList->GetBottom(); > + if (!bottomItem || IsSameItem(mItem, bottomItem)) { > + return bottomItem; > + } Looks like we have these same 4 lines in all match impls (except for the empty). Can we just run this before calling match() and have it once? ::: layout/painting/RetainedDisplayListBuilder.cpp:460 (Diff revision 1) > // Yet-undecided state, may create a table if later needed. > struct Undecided > { > }; > > - // Hash table (frame, key) -> item. > + struct ItemInfo Maybe consider moving this struct, the constants, and the using Array= bits outside of this class, so that the code for ListLookup stays smaller and more concise. ::: layout/painting/RetainedDisplayListBuilder.cpp:481 (Diff revision 1) > + return mFrame == aItem->Frame() && > + mPerFrameKey == aItem->GetPerFrameKey(); > + } > + }; > + // Inline array storage, created eagerly on the first list reading. > + static constexpr size_t scInlineArraySize = I think it might be simpler to pick a single number for all platforms where we're confident it's a win (10?), and not get too worried about trying to maximize it completely. Either way, please add a comment detailing how/why the numbers were chosen, so if we ever get more data on the behaviour of this, we can evaluate it against the data that lead to the current choices.
Attachment #8941252 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8941253 [details] Bug 1426044 - Restart linear search from after previously-found item - https://reviewboard.mozilla.org/r/211526/#review217344 ::: layout/painting/RetainedDisplayListBuilder.cpp:400 (Diff revision 1) > - } > + } > + if (iter == start) { > - return nullptr; > + return nullptr; > - } > + } > + } > + } Breakout out of the array container and doing pointer math iteration always makes me feel a bit uneasy. Can't we do something like: MOZ_ASSERT(mNextStartIndex < len); uint32_t start = mNextStartIndex; do { const ItemInfo* elem = aArray.mArray[mNextStartIndex]; mNextStartIndex++; if (mNextStartIndex == len) { mNextStartIndex = 0; } if (elem == mItem) { return elem->mItem; } } while (mNextStartIndex != start) ::: layout/painting/RetainedDisplayListBuilder.cpp:438 (Diff revision 1) > } > - void match(Array& aArray) { aArray.RemoveElement(mItem); } > + void match(Array& aArray) > + { > + if (aArray.mArray.RemoveElement(mItem)) { > + if (aArray.mNextStartIndex != 0) { > + --aArray.mNextStartIndex; Shouldn't we only decrement if the removed element was before the index? Or are we just assuming it's cheaper to always decrement, and take the +1 iteration hit on the next search?
How much do these patches improve performance?
Comment on attachment 8941252 [details] Bug 1426044 - Use flat array when merging small lists - https://reviewboard.mozilla.org/r/211524/#review217338 > Looks like we have these same 4 lines in all match impls (except for the empty). > > Can we just run this before calling match() and have it once? Agreed, I'll pull the common code out. And even in the 'Empty' case, having that extra test will be compensated by not running the Variant.match() code at all. > Maybe consider moving this struct, the constants, and the using Array= bits outside of this class, so that the code for ListLookup stays smaller and more concise. I always try to encapsulate structs to avoid polluting the surrounding namespace, and to prevent unexpected outside use. I'll see if I can cleanly separate them. > I think it might be simpler to pick a single number for all platforms where we're confident it's a win (10?), and not get too worried about trying to maximize it completely. > > Either way, please add a comment detailing how/why the numbers were chosen, so if we ever get more data on the behaviour of this, we can evaluate it against the data that lead to the current choices. Based on my benchmarks, picking single numbers would harm some platforms (e.g., inline>10 shows a perf cliff on Linux, but inline<24 wastes perf on Mac.) So I would really prefer to keep these separate numbers at least for now. I'll add a comment in the code referring to this bug, and add my benchmark details to the bug.
Comment on attachment 8941253 [details] Bug 1426044 - Restart linear search from after previously-found item - https://reviewboard.mozilla.org/r/211526/#review217344 > Breakout out of the array container and doing pointer math iteration always makes me feel a bit uneasy. > > Can't we do something like: > > > MOZ_ASSERT(mNextStartIndex < len); > uint32_t start = mNextStartIndex; > do { > const ItemInfo* elem = aArray.mArray[mNextStartIndex]; > > mNextStartIndex++; > if (mNextStartIndex == len) { > mNextStartIndex = 0; > } > > if (elem == mItem) { > return elem->mItem; > } > } while (mNextStartIndex != start) Yeah, clearer and safer, thank you for the suggestion. > Shouldn't we only decrement if the removed element was before the index? > > Or are we just assuming it's cheaper to always decrement, and take the +1 iteration hit on the next search? My first pointer-based attempt was too complicated, so yes I went for easier & cheap-enough. But I found nsTArray::IndexOf, much easier and allows me to conditionally move mNextStartIndex when appropriate.
Comment on attachment 8941253 [details] Bug 1426044 - Restart linear search from after previously-found item - https://reviewboard.mozilla.org/r/211526/#review217690
Attachment #8941253 - Flags: review?(matt.woodrow) → review+
Benchmarking code used to compare AutoTArray<ItemInfo, N> vs nsDataHashtable<DisplayItemHashEntry, nsDisplayItem*> (i.e., as used in this bug's main patches). Run with `./mach gtest DisplayListBenchmark*`. Numbers are microseconds, lower is better.
Results from running the benchmark code on Try, taken from 'opt' runs on different platforms. (win64 numbers missing, but a local run showed me they were similar to win32 numbers.)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #11) > How much do these patches improve performance? I haven't forgotten you! I've been trying to measure performance improvements, but so far I'm afraid I'm not seeing them. How I've tested so far: I time the calls to MergeDisplayLists in RetainedDisplayListBuilder.cpp, gathering simple stats, and run RefTests on Try, hoping to get stable-enough numbers (because tests are working with the same pages, in the same environments). E.g., without my patches: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d30dc880bd7be1c4331583771eda2bca84af58e With patches: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ccf7eee62e05e8d4b14387caff76e6b1453a6f7 Unfortunately, results so far are pointing at similar or slightly worse results with the patches. For example, 'R' tests on Mac, without patches: average=15.1381394 min=0.8744 q1=3.544 med=6.3114 q3=13.8602 max=3121.8202 (count=26699) with patches: average=15.013578 min=1.0092 q1=4.0178 med=7.1412 q3=12.3158 max=3130.7498 (count=26801) (Numbers are microseconds, lower is better.) So the average looks a little bit better with patches, however the lower quartiles are worse, when I would have hoped these patches would have helped the most. And if the numbers are correct, 1% improvements may not justify the extra complexity brought by these patches. It's possible I'm not measuring things properly, or not measuring the right things -- suggestions? -- but I'm seeing this trend in lots of tests I've conducted over the past few days and across platforms. Maybe it's good news, as I've "proven" that the current code is already optimal? I'll try to better understand what could be wrong with my patches. In the meantime, I've found a couple of other actual improvements nearby, I'll open new bugs for them soon.
See Also: → 1439809
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: