This work complements the space usage profiling work in bug 551477.

DHAT: a Dynamic Heap Analysis Tool

There is a lot of potentially useful information about heap usage which cannot be obtained merely by looking at the contents of the heap and at the allocation points. This bug provides a Valgrind tool which not only tracks the allocated blocks, but also inspects every memory access to find which block, if any, it falls within.

The following data is collected and presented per allocation point (allocation stack):

* total allocation and maximum liveness
* average block lifetime (# instructions between allocation and freeing)
* average number of reads and writes to each byte in the block ("access ratios")
* average of the longest interval of non-access to a block, also measured in instructions (not yet implemented)

Using these stats it is possible to identify allocation points with the following characteristics:

* potential process-lifetime leaks: blocks allocated by the point just accumulate, and are freed only at the end of the run
* excessive turnover: points which chew through a lot of heap, even if it is not held on to for very long
* excessively transient: points which allocate very short-lived blocks
* useless or underused allocations: blocks which are allocated but not completely filled in, or are filled in but not subsequently read
* blocks which see extended periods of inactivity -- could these perhaps be allocated later or freed sooner?
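To make the first pattern above concrete, here is a toy sketch (block count and size are invented for illustration, not taken from Fx code): blocks accumulate for the whole run and are freed only at exit, so max_live stays close to tot_alloc and the average age at death approaches the total instruction count.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a near-process-lifetime leak pattern.
   The names and sizes here are invented for illustration. */
#define NBLOCKS 326
#define BLKSZ   14375          /* 326 * 14375 = ~4.6MB total */

static char *table[NBLOCKS];

/* Blocks accumulate steadily with no intervening frees, so the
   peak liveness ends up close to the total allocation. */
static size_t accumulate(void)
{
    size_t tot_alloc = 0;
    for (int i = 0; i < NBLOCKS; i++) {
        table[i] = malloc(BLKSZ);
        memset(table[i], 0, BLKSZ);
        tot_alloc += BLKSZ;
    }
    return tot_alloc;
}

/* Everything dies at once at the very end of the run, giving a
   large average age at death relative to the process lifetime. */
static void free_all_at_exit(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        free(table[i]);
}
```

A run of such a program under the tool would show a profile like the first example output below: no leak in the traditional sense, but blocks that stay alive for most of the run.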
Created attachment 455944 [details] [diff] [review]
Patch against Valgrind trunk r11207

Should work on any Valgrind-supported target. It should be safe to drop in against a standard trunk build, although it will disable all your other tools. To undo this, fiddle with TOOLS and EXP_TOOLS in Makefile.am (after patching) in the obvious way.

To run:

  valgrind --tool=exp-dhat --num-callers=4

You need --num-callers=4 to avoid being swamped by output data.

There are many rough edges. Currently the goal is to see whether it produces data which aids in understanding the space usage of Fx.

The output is printed twice: first in decreasing order of maximum liveness, then in decreasing order of total allocation. This avoids having to write Perl output analysis/display scripts for the time being.
Here's some example output from a startup and quit of Fx with 7 tabs.

This is characteristic of a near-process-lifetime leak. I'm not saying it is one, only that it makes the right kind of noises. 326 blocks were allocated and 326 freed, so there is no leak in the traditional sense. However, the peak liveness (4556k) is close to the total allocation (4686k), so the blocks were mostly allocated continuously, with few (at most 9 == 326 - 317) intervening frees. And the average age at death was 12.5G instrs, out of a process lifetime of 16.5Gi, so these blocks stay alive for about 3/4 of the entire run.

======== SUMMARY STATISTICS ========
...
guest_insns:  16,492,812,468
...
------ 5 of 200 ------
max_live:    4,556,875 in 317 blocks
tot_alloc:   4,686,250 in 326 blocks
deaths:      326, at avg age 12,542,256,531
acc-ratios:  18.69 rd, 4.04 wr  (87,625,585 b-read, 18,947,485 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D6B187: JS_ArenaAllocate (jsutil.h:196)
   by 0x6DF4866: js::PropertyTree::newScopeProperty(JSContext*, bool) (jspropertytree.cpp:165)

Here's an allocation point with two interesting features. First, it turns over more heap than any other in this run (271.6M out of a total of 1672M). Secondly, the average age at death is very small (3779 instructions). Hence this stack allocates a large number of very short-lived blocks (and underuses them -- see the next example).

======== SUMMARY STATISTICS ========
...
tot_alloc:   1,672,007,415 in 2,071,938 blocks
...
------ 1 of 200 ------
max_live:    12,287 in 1 blocks
tot_alloc:   271,665,570 in 22,110 blocks
deaths:      22,110, at avg age 3,779
acc-ratios:  0.05 rd, 0.05 wr  (15,068,202 b-read, 15,125,203 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D6B0D4: JS_ArenaAllocate (jsutil.h:196)
   by 0x6DFB6E3: js_ExecuteRegExp (jsregexp.cpp:4858)

Here's an allocation point which is apparently associated with useless allocations and useless writes. It allocates 4.7M in total, writes to only 83% of those bytes ("0.83 wr"), and doesn't read any of them ("0.00 rd"). So -- assuming this isn't a bug in the profiler -- these blocks are slackers. (Come to think of it, so are the blocks in the example just above.) OTOH, their lifetime is very short (31.9ki), so perhaps it doesn't matter.

------ 34 of 200 ------
max_live:    2,688 in 7 blocks
tot_alloc:   4,674,816 in 12,174 blocks
deaths:      12,174, at avg age 31,894
acc-ratios:  0.00 rd, 0.83 wr  (40 b-read, 3,895,720 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D8AC99: JSCodeGenerator::init() (jsutil.h:196)
   by 0x6D90C22: js_EmitTree (jsemit.cpp:4383)

... whereas, by contrast, these blocks are better used. Each byte of them is written on average 6.51 times and read 21.92 times before deallocation:

------ 159 of 200 ------
max_live:    73,472 in 287 blocks
tot_alloc:   85,248 in 333 blocks
deaths:      333, at avg age 11,051,783,369
acc-ratios:  21.92 rd, 6.51 wr  (1,868,805 b-read, 555,055 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7F24: moz_xmalloc_tagged (mozalloc.cpp:130)
   by 0x5681585: NewRequestAndEntry(nsIURI*, imgRequest**, imgCacheEntry**) (imgRequest.h:89)
   by 0x5685B0F: imgLoader::LoadImage(nsIURI*, nsIURI*, nsIURI*, nsILoadGroup*, imgIDecoderObserver*, nsISupports*, unsigned int, nsISupports*, imgIRequest*, nsIChannelPolicy*, imgIRequest**) (imgLoader.cpp:1584)
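The excessive-turnover pattern in the first record above can be mimicked with a toy loop (a sketch I've made up, not the actual jsregexp code): each iteration allocates a block, touches almost none of it, and frees it immediately, yielding a huge tot_alloc, a max_live of only one block, a tiny average age at death, and very low access ratios.

```c
#include <stdlib.h>

/* Hypothetical sketch of the excessive-turnover / underuse pattern:
   many short-lived blocks, each freed almost immediately after
   allocation, with almost none of their bytes ever accessed.
   With iters = 22,110 and blksz = 12,287 this reproduces the
   tot_alloc figure (271,665,570) from the record above. */
static long churn(int iters, size_t blksz)
{
    long tot_alloc = 0;
    for (int i = 0; i < iters; i++) {
        char *p = malloc(blksz);
        p[0] = 1;                /* underused: 1 of blksz bytes written */
        tot_alloc += (long)blksz;
        free(p);                 /* dead after only a few instructions */
    }
    return tot_alloc;           /* never more than one block live at once */
}
```

Because only one block is ever live at a time, max_live stays at a single block size no matter how large tot_alloc grows.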
Good stuff! I filed bug 576726 on a site that causes us to use a ton of memory; dbaron's done analysis there using trace-malloc. It would be interesting to see what conclusions you could draw from the more detailed info emitted here, using that site as a testcase.
Is it worth using the Haskell "lag, drag, void and use" terminology?
> Is it worth using the Haskell "lag, drag, void and use" terminology?

I might get to that. But there's a problem: to detect (e.g.) drag blocks, we need to be able to tell when the last "useful" memory access to a block happens and when the destruction process begins. I don't know how to do that for C++ -- how can we tell whether accesses to a block are "useful" or occur merely because we've just entered the destructor? So I don't want to use the LDVU terminology if I can't offer the same semantics. Hence the plan to measure the "longest period of non-access" as a kind of statistical kludge that might offer the ability to identify probably-lag or probably-drag blocks. I should point out that this is the least well worked out part of my plan so far. All suggestions for good heuristics to identify probably-lag/drag blocks gratefully received.

For non-ex-Haskell folks:

* "lag" is when a block (object) is allocated and then not used for a long time.
* "drag" is when a block is deallocated a long time after its last use. In such cases, later allocation or earlier deallocation respectively would reduce the amount of useless space in the heap.
* "void" is a block which is allocated, never used, and then deallocated. Yes, we have those in Fx (not many, though).
* "use" is the only useful state. It refers to a block which has been allocated and is now in use.

Hence all blocks have a lifecycle in one of two forms:

  allocate .. lag .. use .. drag .. deallocate
or
  allocate ........ void ........ deallocate

It would be nice to detect and minimise lag and drag states and get rid of void states completely. Note that the XPCOM cycle collector actually increases the drag component of the heap, often by several megabytes, compared to eager deletion of cycles as soon as they become unreferenced.
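The "longest period of non-access" kludge described above could be computed per block roughly as follows. This is a sketch under my own assumptions (the function name and representation are invented, not DHAT internals): given the allocation time, the instruction-count timestamps of the block's accesses, and the free time, a large gap at the start suggests probably-lag, a large gap at the end suggests probably-drag, and a gap spanning the whole lifetime means a void block.

```c
#include <stddef.h>

/* Hypothetical helper: compute the longest interval of non-access to
   a block, in instructions.  'acc' holds the timestamps of the
   block's n accesses, in increasing order.  With n == 0 the result
   is the whole lifetime, i.e. a void block. */
static long longest_no_access(long alloc_at, const long *acc, size_t n,
                              long free_at)
{
    long prev = alloc_at, best = 0;
    for (size_t i = 0; i < n; i++) {
        if (acc[i] - prev > best)      /* gap before this access */
            best = acc[i] - prev;
        prev = acc[i];
    }
    if (free_at - prev > best)         /* gap from last access to free */
        best = free_at - prev;
    return best;
}
```

Averaging this per allocation point, as proposed above, would give a statistic that is cheap to maintain and needs no knowledge of destructor semantics, at the cost of not distinguishing a mid-lifetime idle stretch from true lag or drag.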
DHAT has been present in Valgrind for ages now. Nothing more to do here.