Bug 1110530 (opened 10 years ago, closed 10 years ago)

GiBs of cumulative heap allocations under nsRegion::Or() when viewing Treeherder

Component: Core :: Layout
Type: defect
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla37
Reporter: n.nethercote
Assignee: jrmuizel
Attachments: 2 files
I did cumulative heap profiling with DMD while viewing several Treeherder tabs, which is a good stress test for layout stuff.

I see GiBs of allocations under nsRegion::Or().

> Cumulative {
>   ~340,938 blocks in heap block record 1 of 18,221
>   ~1,974,381,515 bytes (~1,568,500,507 requested / ~405,881,008 slop)
>   Individual block sizes: 12,288 x 29,986; 8,192 x 81,198; 4,096 x 118,803; ~4,093 x 110,951
>   22.81% of the heap (22.81% cumulative)
>   Allocated at {
>     #01: alloc_data (/home/njn/moz/mi4/co64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:214)
>     #02: pixman_rect_alloc (/home/njn/moz/mi4/co64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:465)
>     #03: pixman_op (/home/njn/moz/mi4/co64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:828)
>     #04: _moz_pixman_region32_union (/home/njn/moz/mi4/co64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:1440)
>     #05: _moz_pixman_region32_union_rect (/home/njn/moz/mi4/co64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:1368)
>     #06: nsRegion::Or(nsRegion const&, nsRect const&) (/home/njn/moz/mi4/o64dmd/layout/mathml/../../dist/include/nsRegion.h:140)
>     #07: nsIntRegion::Or(nsIntRegion const&, nsIntRect const&) (/home/njn/moz/mi4/o64dmd/layout/svg/../../dist/include/nsRegion.h:532)
>     #08: mozilla::PaintedLayerData::Accumulate(mozilla::ContainerState*, nsDisplayItem*, nsIntRegion const&, nsIntRect const&, nsIntRect const&, mozilla::DisplayItemClip const&) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/FrameLayerBuilder.cpp:2356)
>     #09: mozilla::ContainerState::ProcessDisplayItems(nsDisplayList*) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/FrameLayerBuilder.cpp:3077)
>     #10: mozilla::FrameLayerBuilder::BuildContainerLayerFor(nsDisplayListBuilder*, mozilla::layers::LayerManager*, nsIFrame*, nsDisplayItem*, nsDisplayList*, mozilla::ContainerLayerParameters const&, mozilla::gfx::Matrix4x4 const*, unsigned int) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/FrameLayerBuilder.cpp:4037)
>     #11: nsDisplayList::PaintRoot(nsDisplayListBuilder*, nsRenderingContext*, unsigned int) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/nsDisplayList.cpp:1467)
>     #12: nsLayoutUtils::PaintFrame(nsRenderingContext*, nsIFrame*, nsRegion const&, unsigned int, unsigned int) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/nsLayoutUtils.cpp:3129)
>     #13: PresShell::Paint(nsView*, nsRegion const&, unsigned int) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/nsPresShell.cpp:6343)
>     #14: nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) (/home/njn/moz/mi4/o64dmd/view/../../view/nsViewManager.cpp:443)
>     #15: nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) (/home/njn/moz/mi4/o64dmd/view/../../view/nsViewManager.cpp:380)
>     #16: nsRefreshDriver::Tick(long, mozilla::TimeStamp) (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/nsRefreshDriver.cpp:1358)
>     #17: mozilla::RefreshDriverTimer::Tick() (/home/njn/moz/mi4/o64dmd/layout/base/../../../layout/base/nsRefreshDriver.cpp:160)
>     #18: nsTimerImpl::Fire() (/home/njn/moz/mi4/co64dmd/xpcom/threads/../../../xpcom/threads/nsTimerImpl.cpp:637)
>     #19: nsTimerEvent::Run() (/home/njn/moz/mi4/co64dmd/xpcom/threads/../../../xpcom/threads/nsTimerImpl.cpp:719)
>     #20: nsThread::ProcessNextEvent(bool, bool*) (/home/njn/moz/mi4/co64dmd/xpcom/threads/../../../xpcom/threads/nsThread.cpp:830)
>     #21: NS_ProcessNextEvent(nsIThread*, bool) (/home/njn/moz/mi4/xpcom/glue/nsThreadUtils.cpp:265)
>     #22: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (/home/njn/moz/mi4/o64dmd/ipc/glue/../../../ipc/glue/MessagePump.cpp:99)
>     #23: MessageLoop::Run() (/home/njn/moz/mi4/co64dmd/ipc/chromium/../../../ipc/chromium/src/base/message_loop.cc:201)
>   }
> }

Note that the count of ~340,938 blocks is probably a significant underestimate
due to sampling, because lots of these allocations are small.

I did some ad hoc instrumentation of the cases where pixman_op() calls
pixman_rect_alloc(). From a different run, here are the most common frequencies:

> 5415811 counts:
> (  1)  2765520 (51.1%, 51.1%): alloc: 1, 1 --> 2 (48 bytes)
> (  2)   861066 (15.9%, 67.0%): alloc: 2, 1 --> 4 (80 bytes)
> (  3)   334603 ( 6.2%, 73.1%): alloc: 4, 1 --> 8 (144 bytes)
> (  4)   270705 ( 5.0%, 78.1%): alloc: 3, 1 --> 6 (112 bytes)
> (  5)    41636 ( 0.8%, 78.9%): alloc: 5, 1 --> 10 (176 bytes)
> (  6)    35948 ( 0.7%, 79.6%): alloc: 1, 2 --> 4 (80 bytes)
> (  7)    30061 ( 0.6%, 80.1%): alloc: 11, 1 --> 22 (368 bytes)
> (  8)    29318 ( 0.5%, 80.7%): alloc: 7, 1 --> 14 (240 bytes)
> (  9)    28716 ( 0.5%, 81.2%): alloc: 19, 1 --> 38 (624 bytes)
> ( 10)    26825 ( 0.5%, 81.7%): alloc: 9, 1 --> 18 (304 bytes)
> ( 11)    26640 ( 0.5%, 82.2%): alloc: 6, 1 --> 12 (208 bytes)
> ( 12)    23776 ( 0.4%, 82.6%): alloc: 15, 1 --> 30 (496 bytes)
> ( 13)    22651 ( 0.4%, 83.0%): alloc: 13, 1 --> 26 (432 bytes)
> ( 14)    19350 ( 0.4%, 83.4%): alloc: 17, 1 --> 34 (560 bytes)
> ( 15)    13720 ( 0.3%, 83.7%): alloc: 8, 1 --> 16 (272 bytes)
> ( 16)    13417 ( 0.2%, 83.9%): alloc: 25, 1 --> 50 (816 bytes)
> ( 17)    12150 ( 0.2%, 84.1%): alloc: 1, 3 --> 6 (112 bytes)
> ( 18)    11147 ( 0.2%, 84.3%): alloc: 29, 1 --> 58 (944 bytes)
> ( 19)    10806 ( 0.2%, 84.5%): alloc: 23, 1 --> 46 (752 bytes)
> ( 20)    10727 ( 0.2%, 84.7%): alloc: 21, 1 --> 42 (688 bytes)
> ( 21)    10545 ( 0.2%, 84.9%): alloc: 18, 1 --> 36 (592 bytes)
> ( 22)    10374 ( 0.2%, 85.1%): alloc: 10, 1 --> 20 (336 bytes)
> ( 23)    10153 ( 0.2%, 85.3%): alloc: 213, 1 --> 426 (6832 bytes)
> ( 24)     9583 ( 0.2%, 85.5%): alloc: 1, 4 --> 8 (144 bytes)
> ( 25)     9563 ( 0.2%, 85.7%): alloc: 27, 1 --> 54 (880 bytes)
> ( 26)     9114 ( 0.2%, 85.8%): alloc: 217, 1 --> 434 (6960 bytes)
> ( 27)     8814 ( 0.2%, 86.0%): alloc: 16, 1 --> 32 (528 bytes)
> ( 28)     8327 ( 0.2%, 86.1%): alloc: 31, 1 --> 62 (1008 bytes)
> ( 29)     8203 ( 0.2%, 86.3%): alloc: 33, 1 --> 66 (1072 bytes)
> ( 30)     8005 ( 0.1%, 86.4%): alloc: 215, 1 --> 430 (6896 bytes)

The first line says that 5.4 million allocations were done.

The second line says that for 51.1% of those calls both input regions had one
rectangle, the result was sized for two, and the allocation for the result was
48 bytes.

The third line says that for 15.9% of those calls the first region had two
rectangles, the second had one, and the result was sized for four, which cost
80 bytes. And so on.
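
(For reference, a minimal sketch of the kind of ad hoc counting used here --
hypothetical, since the actual instrumentation patch wasn't posted. One line
is printed at the point where pixman_op() calls pixman_rect_alloc(), and the
lines are then aggregated and sorted by frequency externally. All four
parameter names are made up for the sketch.)

  #include <cstdio>

  // numRects1/numRects2: box counts of the two input regions.
  // newSize: pre-computed box capacity of the result.
  // allocBytes: the size passed to malloc.
  static void LogRectAlloc(int numRects1, int numRects2, int newSize,
                           unsigned long allocBytes)
  {
      std::fprintf(stderr, "alloc: %d, %d --> %d (%lu bytes)\n",
                   numRects1, numRects2, newSize, allocBytes);
  }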

The vast majority of cases involved at least one trivial region (i.e. a
region with a single rectangle), and more than half involved two trivial
regions.

Here's the same data, but with each line weighted by the allocation size:

> 3455668241 counts:
> (  1) 132744960 ( 3.8%,  3.8%): alloc: 1, 1 --> 2 (48 bytes)
> (  2) 69365296 ( 2.0%,  5.8%): alloc: 213, 1 --> 426 (6832 bytes)
> (  3) 68885280 ( 2.0%,  7.8%): alloc: 2, 1 --> 4 (80 bytes)
> (  4) 63433440 ( 1.8%,  9.7%): alloc: 217, 1 --> 434 (6960 bytes)
> (  5) 55202480 ( 1.6%, 11.3%): alloc: 215, 1 --> 430 (6896 bytes)
> (  6) 53954688 ( 1.6%, 12.8%): alloc: 223, 1 --> 446 (7152 bytes)
> (  7) 53382400 ( 1.5%, 14.4%): alloc: 219, 1 --> 438 (7024 bytes)
> (  8) 48182832 ( 1.4%, 15.8%): alloc: 4, 1 --> 8 (144 bytes)
> (  9) 44983120 ( 1.3%, 17.1%): alloc: 227, 1 --> 454 (7280 bytes)
> ( 10) 42826960 ( 1.2%, 18.3%): alloc: 225, 1 --> 450 (7216 bytes)
> ( 11) 40841040 ( 1.2%, 19.5%): alloc: 224, 1 --> 448 (7184 bytes)
> ( 12) 38347488 ( 1.1%, 20.6%): alloc: 211, 1 --> 422 (6768 bytes)
> ( 13) 36496800 ( 1.1%, 21.7%): alloc: 205, 1 --> 410 (6576 bytes)
> ( 14) 35151440 ( 1.0%, 22.7%): alloc: 222, 1 --> 444 (7120 bytes)
> ( 15) 33723648 ( 1.0%, 23.7%): alloc: 229, 1 --> 458 (7344 bytes)
> ( 16) 33543744 ( 1.0%, 24.6%): alloc: 226, 1 --> 452 (7248 bytes)
> ( 17) 33367056 ( 1.0%, 25.6%): alloc: 198, 1 --> 396 (6352 bytes)
> ( 18) 32548096 ( 0.9%, 26.5%): alloc: 221, 1 --> 442 (7088 bytes)
> ( 19) 31314096 ( 0.9%, 27.4%): alloc: 238, 1 --> 476 (7632 bytes)
> ( 20) 30666528 ( 0.9%, 28.3%): alloc: 228, 1 --> 456 (7312 bytes)
> ( 21) 30318960 ( 0.9%, 29.2%): alloc: 3, 1 --> 6 (112 bytes)
> ( 22) 29604864 ( 0.9%, 30.1%): alloc: 209, 1 --> 418 (6704 bytes)
> ( 23) 29549520 ( 0.9%, 30.9%): alloc: 214, 1 --> 428 (6864 bytes)
> ( 24) 28934160 ( 0.8%, 31.8%): alloc: 232, 1 --> 464 (7440 bytes)
> ( 25) 28670512 ( 0.8%, 32.6%): alloc: 230, 1 --> 460 (7376 bytes)
> ( 26) 28249504 ( 0.8%, 33.4%): alloc: 239, 1 --> 478 (7664 bytes)
> ( 27) 28005264 ( 0.8%, 34.2%): alloc: 220, 1 --> 440 (7056 bytes)
> ( 28) 25973072 ( 0.8%, 35.0%): alloc: 216, 1 --> 432 (6928 bytes)
> ( 29) 25902880 ( 0.7%, 35.7%): alloc: 242, 1 --> 484 (7760 bytes)
> ( 30) 25822016 ( 0.7%, 36.5%): alloc: 236, 1 --> 472 (7568 bytes)

That's 3.46 GB (about 3.2 GiB) of allocations!

The number of rectangles pre-allocated for the result is computed like this:

> max(numRects1, numRects2) * 2

I don't understand the region-combining operations in detail, but this seems
pessimistic. I wonder if we can allocate smaller sizes, and sometimes avoid
allocating at all (e.g. if we know the result will be trivial).
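
(A sketch of that heuristic, with hypothetical names -- not the actual pixman
source:)

  #include <algorithm>

  // Size the result buffer to twice the larger of the two inputs'
  // box counts.
  static int PrecomputeResultSize(int numRects1, int numRects2)
  {
      return std::max(numRects1, numRects2) * 2;
  }

  // E.g. unioning two trivial (1-box) regions pre-allocates room for two
  // boxes, even when the result collapses back to a single rectangle.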
Here are some more measurements. This time I added the number of rectangles
we actually ended up with in the result, in square brackets.

> 785992 counts:
> (  1)   176759 (22.5%, 22.5%): final: 1, 1 --> 2 [1] (48 bytes)
> (  2)    65387 ( 8.3%, 30.8%): final: 1, 1 --> 2 [2] (48 bytes)
> (  3)    55482 ( 7.1%, 37.9%): final: 4, 1 --> 8 [4] (144 bytes)
> (  4)    54464 ( 6.9%, 44.8%): final: 2, 1 --> 4 [0] (80 bytes)
> (  5)    35276 ( 4.5%, 49.3%): final: 3, 1 --> 6 [3] (112 bytes)
> (  6)    32790 ( 4.2%, 53.5%): final: 2, 1 --> 4 [2] (80 bytes)
> (  7)    24674 ( 3.1%, 56.6%): final: 2, 1 --> 4 [1] (80 bytes)
> (  8)    15314 ( 1.9%, 58.5%): final: 1, 2 --> 4 [1] (80 bytes)
> (  9)    13779 ( 1.8%, 60.3%): final: 6, 1 --> 12 [6] (208 bytes)
> ( 10)    12176 ( 1.5%, 61.8%): final: 3, 1 --> 6 [1] (112 bytes)
> ( 11)    10224 ( 1.3%, 63.1%): final: 1, 1 --> 2 [0] (48 bytes)
> ( 12)    10166 ( 1.3%, 64.4%): final: 4, 1 --> 8 [1] (144 bytes)
> ( 13)     8448 ( 1.1%, 65.5%): final: 1, 3 --> 6 [1] (112 bytes)
> ( 14)     7992 ( 1.0%, 66.5%): final: 7, 1 --> 14 [7] (240 bytes)
> ( 15)     7198 ( 0.9%, 67.4%): final: 5, 1 --> 10 [5] (176 bytes)
> ( 16)     6780 ( 0.9%, 68.3%): final: 5, 1 --> 10 [1] (176 bytes)
> ( 17)     6389 ( 0.8%, 69.1%): final: 9, 1 --> 18 [1] (304 bytes)
> ( 18)     5849 ( 0.7%, 69.9%): final: 3, 3 --> 6 [0] (112 bytes)
> ( 19)     5779 ( 0.7%, 70.6%): final: 9, 1 --> 18 [9] (304 bytes)
> ( 20)     4817 ( 0.6%, 71.2%): final: 11, 1 --> 22 [11] (368 bytes)
> ( 21)     4394 ( 0.6%, 71.8%): final: 1, 9 --> 18 [1] (304 bytes)
> ( 22)     4144 ( 0.5%, 72.3%): final: 1, 5 --> 10 [1] (176 bytes)
> ( 23)     3716 ( 0.5%, 72.8%): final: 8, 1 --> 16 [8] (272 bytes)
> ( 24)     3618 ( 0.5%, 73.2%): final: 1, 4 --> 8 [1] (144 bytes)
> ( 25)     3612 ( 0.5%, 73.7%): final: 1, 1 --> 2 [3] (48 bytes)
> ( 26)     3556 ( 0.5%, 74.1%): final: 17, 1 --> 34 [17] (560 bytes)
> ( 27)     3542 ( 0.5%, 74.6%): final: 15, 1 --> 30 [15] (496 bytes)
> ( 28)     3094 ( 0.4%, 75.0%): final: 20, 1 --> 40 [20] (656 bytes)
> ( 29)     2981 ( 0.4%, 75.4%): final: 13, 1 --> 26 [13] (432 bytes)
> ( 30)     2875 ( 0.4%, 75.7%): final: 19, 1 --> 38 [19] (624 bytes)

So the majority of the time when we combine two trivial regions and allocate,
we end up with a non-trivial region. But sometimes we end up with a trivial
region. So that seems tricky.

But when we combine a trivial region with a non-trivial one, the resulting
size looks like it's always equal to the size of the non-trivial input. So
perhaps we can do better there? It seems the initial result size can be an
underestimate -- e.g. look at line (25) -- so in the non-1,1 case perhaps we
should use max(numRects1, numRects2), i.e. omit the doubling?
Or if we could avoid doing so many nsRegion::Or() calls in the first place, that'd be even better :)
> So the majority of the time when we combine two trivial regions and allocate,
> we end up with a non-trivial region. But sometimes we end up with a trivial
> region.

Sorry, I switched those two cases. |1,1 --> [1]| is more common than |1,1 --> [2]|. |1,1 --> [0]| also happens, as does |1,1 --> [3]|!
I tried removing the doubling from the result-size pre-computation. Things
appeared to work OK. Here are the numbers. The lines marked "!!!" are the
cases where the pre-computation underestimates the final result count (again
shown in square brackets).

> 2696156 counts:
> (  1)   939179 (34.8%, 34.8%): final: 1, 1 --> 1 [1] (32 bytes)
> (  2)   322206 (12.0%, 46.8%): final: 1, 1 --> 1 [2] (32 bytes) !!!
> (  3)   300965 (11.2%, 57.9%): final: 2, 1 --> 2 [0] (48 bytes)
> (  4)   161216 ( 6.0%, 63.9%): final: 4, 1 --> 4 [4] (80 bytes)
> (  5)    88461 ( 3.3%, 67.2%): final: 3, 1 --> 3 [3] (64 bytes)
> (  6)    73121 ( 2.7%, 69.9%): final: 2, 1 --> 2 [1] (48 bytes)
> (  7)    53959 ( 2.0%, 71.9%): final: 2, 1 --> 2 [2] (48 bytes)
> (  8)    43591 ( 1.6%, 73.5%): final: 1, 2 --> 2 [1] (48 bytes)
> (  9)    19423 ( 0.7%, 74.3%): final: 1, 1 --> 1 [0] (32 bytes)
> ( 10)    19245 ( 0.7%, 75.0%): final: 3, 1 --> 3 [1] (64 bytes)
> ( 11)    19187 ( 0.7%, 75.7%): final: 5, 1 --> 5 [5] (96 bytes)
> ( 12)    16253 ( 0.6%, 76.3%): final: 1, 3 --> 3 [1] (64 bytes)
> ( 13)    15936 ( 0.6%, 76.9%): final: 6, 1 --> 6 [6] (112 bytes)
> ( 14)    15667 ( 0.6%, 77.5%): final: 10, 1 --> 10 [10] (176 bytes)
> ( 15)    14749 ( 0.5%, 78.0%): final: 9, 1 --> 9 [9] (160 bytes)
> ( 16)    13977 ( 0.5%, 78.5%): final: 7, 1 --> 7 [7] (128 bytes)
> ( 17)    12882 ( 0.5%, 79.0%): final: 4, 1 --> 4 [1] (80 bytes)
> ( 18)    11466 ( 0.4%, 79.4%): final: 13, 1 --> 13 [13] (224 bytes)
> ( 19)    11286 ( 0.4%, 79.8%): final: 15, 1 --> 15 [15] (256 bytes)
> ( 20)    10992 ( 0.4%, 80.3%): final: 11, 1 --> 11 [11] (192 bytes)
> ( 21)    10339 ( 0.4%, 80.6%): final: 19, 1 --> 19 [19] (320 bytes)
> ( 22)     9949 ( 0.4%, 81.0%): final: 17, 1 --> 17 [17] (288 bytes)
> ( 23)     8215 ( 0.3%, 81.3%): final: 1, 4 --> 4 [1] (80 bytes)
> ( 24)     7957 ( 0.3%, 81.6%): final: 1, 1 --> 1 [3] (32 bytes) !!!
> ( 25)     7279 ( 0.3%, 81.9%): final: 29, 1 --> 29 [29] (480 bytes)
> ( 26)     7022 ( 0.3%, 82.1%): final: 6, 1 --> 6 [1] (112 bytes)
> ( 27)     6886 ( 0.3%, 82.4%): final: 21, 1 --> 21 [21] (352 bytes)
> ( 28)     6115 ( 0.2%, 82.6%): final: 25, 1 --> 25 [25] (416 bytes)
> ( 29)     6033 ( 0.2%, 82.8%): final: 217, 1 --> 217 [217] (3488 bytes)
> ( 30)     5573 ( 0.2%, 83.0%): final: 31, 1 --> 31 [31] (512 bytes)
> ( 31)     5501 ( 0.2%, 83.3%): final: 23, 1 --> 23 [23] (384 bytes)
> ( 32)     5355 ( 0.2%, 83.5%): final: 27, 1 --> 27 [27] (448 bytes)
> ( 33)     5342 ( 0.2%, 83.6%): final: 7, 1 --> 7 [1] (128 bytes)
> ( 34)     5275 ( 0.2%, 83.8%): final: 4, 1 --> 4 [5] (80 bytes) !!!
> ( 35)     5228 ( 0.2%, 84.0%): final: 33, 1 --> 33 [33] (544 bytes)

Line (2) shows the only common case of underestimation. I'm assuming the
algorithm manages to correctly grow the buffer as needed in that case.
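
(If so, the guard presumably looks something like this grow-on-append
pattern -- a hypothetical sketch, not the actual pixman code:)

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct Box { int x1, y1, x2, y2; };

  // Before appending `needed` more boxes to the result, grow the buffer
  // if the pre-computed capacity turned out to be too small.
  static void EnsureRoom(std::vector<Box>& boxes, std::size_t needed)
  {
      if (boxes.size() + needed > boxes.capacity()) {
          boxes.reserve(std::max(boxes.capacity() * 2,
                                 boxes.size() + needed));
      }
  }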
If this first Or accounts for the bulk of the allocations, this should remove them completely. I expect there might be more than just this one, though.
Attachment #8535322 - Flags: review?(roc)
Comment on attachment 8535322 [details] [diff] [review]
Use a rectangle instead of a region for computing the layer bounds

Review of attachment 8535322 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch.
Attachment #8535322 - Flags: review?(roc) → review+
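
(The idea of the patch, as a minimal sketch with hypothetical names,
simplified from the attachment description: accumulate the layer bounds in a
single nsIntRect, so each display item costs a rectangle union instead of a
pixman region union plus a heap allocation.)

  nsIntRect itemDrawRect = /* the item's drawing bounds */ nsIntRect(0, 0, 100, 100);

  // Before (simplified): a region union per display item, which goes
  // through pixman and may heap-allocate each time.
  //   nsIntRegion bounds;
  //   bounds.Or(bounds, itemDrawRect);

  // After (simplified): a plain bounding-rectangle union; no allocation.
  nsIntRect bounds;
  bounds.UnionRect(bounds, itemDrawRect);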
Looks like the patch helps a bit, but not a lot.

Before:

> 694023 counts:
> (  1)   188923 (27.2%, 27.2%): final: 1, 1 --> 2 [1] (48 bytes)
> (  2)    65567 ( 9.4%, 36.7%): final: 1, 1 --> 2 [2] (48 bytes)
> (  3)    58486 ( 8.4%, 45.1%): final: 2, 1 --> 4 [0] (80 bytes)
> (  4)    50483 ( 7.3%, 52.4%): final: 4, 1 --> 8 [4] (144 bytes)
> (  5)    37731 ( 5.4%, 57.8%): final: 2, 1 --> 4 [2] (80 bytes)
> (  6)    22565 ( 3.3%, 61.1%): final: 2, 1 --> 4 [1] (80 bytes)
> (  7)    17330 ( 2.5%, 63.6%): final: 3, 1 --> 6 [3] (112 bytes)
> (  8)    14639 ( 2.1%, 65.7%): final: 1, 2 --> 4 [1] (80 bytes)
> (  9)     9890 ( 1.4%, 67.1%): final: 1, 1 --> 2 [0] (48 bytes)
> ( 10)     6676 ( 1.0%, 68.1%): final: 5, 1 --> 10 [1] (176 bytes)
> ( 11)     6192 ( 0.9%, 68.9%): final: 4, 1 --> 8 [1] (144 bytes)
> ( 12)     5716 ( 0.8%, 69.8%): final: 5, 1 --> 10 [5] (176 bytes)
> ( 13)     4467 ( 0.6%, 70.4%): final: 15, 1 --> 30 [15] (496 bytes)
> ( 14)     4441 ( 0.6%, 71.1%): final: 13, 1 --> 26 [13] (432 bytes)
> ( 15)     4223 ( 0.6%, 71.7%): final: 9, 1 --> 18 [9] (304 bytes)

After:
 
> 578222 counts:
> (  1)   181940 (31.5%, 31.5%): final: 1, 1 --> 2 [1] (48 bytes)
> (  2)    63264 (10.9%, 42.4%): final: 1, 1 --> 2 [2] (48 bytes)
> (  3)    59153 (10.2%, 52.6%): final: 2, 1 --> 4 [0] (80 bytes)
> (  4)    39473 ( 6.8%, 59.5%): final: 4, 1 --> 8 [4] (144 bytes)
> (  5)    28889 ( 5.0%, 64.5%): final: 2, 1 --> 4 [2] (80 bytes)
> (  6)    27559 ( 4.8%, 69.2%): final: 2, 1 --> 4 [1] (80 bytes)
> (  7)    18165 ( 3.1%, 72.4%): final: 1, 2 --> 4 [1] (80 bytes)
> (  8)    17460 ( 3.0%, 75.4%): final: 3, 1 --> 6 [3] (112 bytes)
> (  9)    11496 ( 2.0%, 77.4%): final: 1, 1 --> 2 [0] (48 bytes)
> ( 10)     7521 ( 1.3%, 78.7%): final: 4, 1 --> 8 [1] (144 bytes)
> ( 11)     5489 ( 0.9%, 79.6%): final: 1, 3 --> 6 [1] (112 bytes)
> ( 12)     4767 ( 0.8%, 80.4%): final: 1, 4 --> 8 [1] (144 bytes)
> ( 13)     4193 ( 0.7%, 81.2%): final: 6, 1 --> 12 [1] (208 bytes)
> ( 14)     4140 ( 0.7%, 81.9%): final: 5, 1 --> 10 [1] (176 bytes)
> ( 15)     3874 ( 0.7%, 82.6%): final: 10, 1 --> 20 [10] (336 bytes)

And I don't know how noisy these measurements are, so even the small win may be
illusory.
(In reply to Nicholas Nethercote [:njn] from comment #7)
> Looks like the patch helps a bit, but not a lot.
> [...]
> And I don't know how noisy these measurements are, so even the small win
> may be illusory.

Is there a way to narrow down which call to nsIntRegion::Or is responsible for the majority of the allocations?
Flags: needinfo?(n.nethercote)
> Is there a way to narrow down which call to nsIntRegion::Or is responsible
> for the majority of the allocations?

The stack trace in comment 0 indicates where most of the problem lies. And
you've actually fixed much of it... I forgot to weight each line in comment 8
by its size. If I do that, and tweak the instrumentation so that it only
considers |union| calls, I get this:

Before:

> 608942272 counts:
> (  1) 11084816 ( 1.8%,  1.8%): union-alloc: 245, 1 --> 490 (7856 bytes)
> (  2) 10486896 ( 1.7%,  3.5%): union-alloc: 241, 1 --> 482 (7728 bytes)
> (  3) 10347024 ( 1.7%,  5.2%): union-alloc: 1, 1 --> 2 (48 bytes)
> (  4)  9507600 ( 1.6%,  6.8%): union-alloc: 4, 1 --> 8 (144 bytes)
> (  5)  9342144 ( 1.5%,  8.3%): union-alloc: 220, 1 --> 440 (7056 bytes)
> (  6)  9066800 ( 1.5%,  9.8%): union-alloc: 237, 1 --> 474 (7600 bytes)
> (  7)  8933408 ( 1.5%, 11.3%): union-alloc: 225, 1 --> 450 (7216 bytes)
> (  8)  8833104 ( 1.5%, 12.7%): union-alloc: 190, 1 --> 380 (6096 bytes)
> (  9)  8094160 ( 1.3%, 14.1%): union-alloc: 207, 1 --> 414 (6640 bytes)
> ( 10)  7931680 ( 1.3%, 15.4%): union-alloc: 222, 1 --> 444 (7120 bytes)
> ( 11)  7635040 ( 1.3%, 16.6%): union-alloc: 200, 1 --> 400 (6416 bytes)
> ( 12)  7490560 ( 1.2%, 17.9%): union-alloc: 192, 1 --> 384 (6160 bytes)
> ( 13)  7431776 ( 1.2%, 19.1%): union-alloc: 236, 1 --> 472 (7568 bytes)
> ( 14)  7046160 ( 1.2%, 20.2%): union-alloc: 93, 1 --> 186 (2992 bytes)
> ( 15)  6887904 ( 1.1%, 21.4%): union-alloc: 228, 1 --> 456 (7312 bytes)

After:

> 46061424 counts:
> (  1) 12260400 (26.6%, 26.6%): union-alloc: 1, 1 --> 2 (48 bytes)
> (  2)  6860160 (14.9%, 41.5%): union-alloc: 4, 1 --> 8 (144 bytes)
> (  3)  4691456 (10.2%, 51.7%): union-alloc: 3, 1 --> 6 (112 bytes)
> (  4)  3285840 ( 7.1%, 58.8%): union-alloc: 2, 1 --> 4 (80 bytes)
> (  5)  1784880 ( 3.9%, 62.7%): union-alloc: 18, 1 --> 36 (592 bytes)
> (  6)  1715376 ( 3.7%, 66.4%): union-alloc: 19, 1 --> 38 (624 bytes)
> (  7)  1650528 ( 3.6%, 70.0%): union-alloc: 16, 1 --> 32 (528 bytes)
> (  8)  1602288 ( 3.5%, 73.5%): union-alloc: 13, 1 --> 26 (432 bytes)
> (  9)  1389408 ( 3.0%, 76.5%): union-alloc: 20, 1 --> 40 (656 bytes)
> ( 10)  1265792 ( 2.7%, 79.3%): union-alloc: 15, 1 --> 30 (496 bytes)
> ( 11)  1218928 ( 2.6%, 81.9%): union-alloc: 14, 1 --> 28 (464 bytes)
> ( 12)  1215760 ( 2.6%, 84.5%): union-alloc: 17, 1 --> 34 (560 bytes)
> ( 13)  1204800 ( 2.6%, 87.2%): union-alloc: 7, 1 --> 14 (240 bytes)
> ( 14)  1007584 ( 2.2%, 89.3%): union-alloc: 11, 1 --> 22 (368 bytes)
> ( 15)   878640 ( 1.9%, 91.3%): union-alloc: 10, 1 --> 20 (336 bytes)

That's a 13.2x reduction in cumulative allocation size. The number of
allocations only drops by 1.26x, so it looks like this callsite must be
responsible for all the larger allocations. (And looking at the individual
lines it's clear this is the case.)

I also re-ran DMD and I'm seeing a similar reduction there. So you've actually
fixed most of the problem. Thanks! And sorry for my misleading measurements in
comment 8.
Flags: needinfo?(n.nethercote)
Thanks, Jeff! I'll dig a little more next week and see if I can identify other places where the smaller allocations might be coming from.
Assignee: nobody → jmuizelaar
(In reply to Nicholas Nethercote [:njn] from comment #11)
> Thanks, Jeff! I'll dig a little more next week and see if I can identify
> other places where the smaller allocations might be coming from.

I did another run of Treeherder under DMD, with your patch applied, this time with sampling off, which gives details about every heap allocation.

We still have a *lot* of small allocations:

> Cumulative {
>   1,552,122 blocks in heap block record 2 of 8,348
>   271,205,008 bytes (238,975,424 requested / 32,229,584 slop)
>   Individual block sizes: 4,096 x 58; 2,048 x 7,821; 1,024 x 68,846; 496 x 13,144; 464 x 12,681; 432 x 17,113; 400 x 25,061; 368 x 18,743; 336 x 39,172; 304 x 31,250; 272 x 33,955; 240 x 66,841; 208 x 20,867; 176 x 45,576; 144 x 146,758; 112 x 114,653; 80 x 336,197; 48 x 553,386
>   7.05% of the heap (14.41% cumulative)
>   Allocated at {
>     #01: alloc_data (/home/njn/moz/mi4/o64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:214)
>     #02: pixman_rect_alloc (/home/njn/moz/mi4/o64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:465)
>     #03: pixman_op (/home/njn/moz/mi4/o64dmd/gfx/cairo/libpixman/src/../../../../../gfx/cairo/libpixman/src/pixman-region.c:828)
>   }
> }

1,552,122 allocations -- that's more than 10% of all heap allocations from this
run (going by count, not by size).

There's more detailed data in the attached file. AccumulateRectDifference(),
ProcessRemovedDisplayItems(), ProcessDisplayItems(), and
RecomputeVisibilityForItems() all feature prominently.
(In reply to Nicholas Nethercote [:njn] from comment #12)
> Created attachment 8536283 [details]
> 1,552,122 allocations -- that's more than 10% of all heap allocations from
> this run (going by count, not by size).
> 
> There's more detailed data in the attached file. AccumulateRectDifference(),
> ProcessRemovedDisplayItems(), ProcessDisplayItems(), and
> RecomputeVisibilityForItems() all feature prominently.

So AccumulateRectDifference calls Xor(A, B), which pixman doesn't currently support natively and which is emulated as (A - B) U (B - A). If we actually need this kind of thing to be well implemented, it shouldn't be too hard to add support to pixman. I've filed bug 1111432 for that work.
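
(For reference, a sketch of that emulation in nsRegion-style API --
hypothetical code, not the actual implementation:)

  // Xor(A, B) emulated as (A - B) U (B - A): two region subtractions
  // plus a union, i.e. three pixman operations, each of which may
  // allocate.
  nsIntRegion XorEmulated(const nsIntRegion& a, const nsIntRegion& b)
  {
      nsIntRegion aMinusB, bMinusA, result;
      aMinusB.Sub(a, b);            // A - B
      bMinusA.Sub(b, a);            // B - A
      result.Or(aMinusB, bMinusA);  // (A - B) U (B - A)
      return result;
  }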
https://hg.mozilla.org/mozilla-central/rev/c306922867aa
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37