DHAT: Experimental heap-profiling tool for Valgrind

RESOLVED WORKSFORME

Status

()

Core
General
RESOLVED WORKSFORME
8 years ago
2 years ago

People

(Reporter: jseward, Assigned: jseward)

Tracking

Other Branch
x86
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

8 years ago
This work complements space usage profiling work in bug 551477.

DHAT: a Dynamic Heap Analysis Tool

There is a lot of potentially useful information about heap usage
which cannot be obtained merely by looking at the contents of the heap
and at the allocation points.  This bug provides a Valgrind tool which
tracks not only the allocated blocks, but inspects every memory access
to find which block, if any, it is to.  The following data is
collected and presented per allocation point (allocation stack):

* total allocation and max liveness

* average block lifetime (# instructions between allocation and
  freeing)

* average number of reads and writes to each byte in the block
  ("access ratios")

* average of longest interval of non-access to a block, also 
  measured in instructions (not yet implemented)

Using these stats it is possible to identify allocation points with
the following characteristics:

* potential process-lifetime leaks (blocks allocated by the point just
  accumulate, and are freed only at the end of the run)

* excessive turnover: points which chew through a lot of heap, even if
  it is not held onto for very long

* excessively transient: points which allocate very short lived blocks

* useless or underused allocations: blocks which are allocated but not
  completely filled in, or are filled in but not subsequently read.

* blocks which see extended periods of inactivity.  Could these
  perhaps be allocated later or freed sooner?
(Assignee)

Comment 1

8 years ago
Created attachment 455944 [details] [diff] [review]
Patch against Valgrind trunk r11207

Should work on any Valgrind-supported target.

Should be safe to drop in against a standard trunk build, although it
will disable all your other tools.  To undo this, fiddle with TOOLS
and EXP_TOOLS in Makefile.am (after patching) in the obvious way.

To run: valgrind --tool=exp-dhat --num-callers=4

You need --num-callers=4 to avoid being swamped by output data.

There are many rough edges.  Currently the goal is to see if it
produces data which aids in understanding space usage of Fx.

The output is printed twice, first by decreasing order of maximum
liveness, then by decreasing order of total allocation.  This is so as
to avoid having to write perl output analysis/display scripts for the
time being.
(Assignee)

Comment 2

8 years ago
Here's some example outputs from a startup and quit of Fx with 7 tabs.



This is characteristic of a near-process lifetime leak.  I'm not
saying it is one, only that it makes the right kind of noises.  326
blocks were allocated and 326 freed, so there is no leak in the
traditional sense.  However, the peak liveness (4556k) is close to the
total allocation (4686k), so the blocks were mostly allocated
continuously with few (at most 9 == 326 - 217) intervening frees.

And the average age at death as 12.5G instrs, out of a process
lifetime of 16.5Gi, so these blocks are staying alive about 3/4 of the
entire run.

======== SUMMARY STATISTICS ========
...
guest_insns:  16,492,812,468
...
------ 5 of 200 ------
max_live:    4,556,875 in 317 blocks
tot_alloc:   4,686,250 in 326 blocks
deaths:      326, at avg age 12,542,256,531
acc-ratios:  18.69 rd, 4.04 wr  (87,625,585 b-read, 18,947,485 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D6B187: JS_ArenaAllocate (jsutil.h:196)
   by 0x6DF4866: js::PropertyTree::newScopeProperty(JSContext*, bool)
                 (jspropertytree.cpp:165)



Here's an allocation point with two interesting features.
First, it turns over more heap than any other, in this run (271.6M out
of a total of 1672M).  Secondly, the average age at death is very
small (3779 instructions).  Hence this stack allocates a large number
of very short lived blocks, (and underuses them -- see next example).

======== SUMMARY STATISTICS ========
...
tot_alloc:    1,672,007,415 in 2,071,938 blocks
...
------ 1 of 200 ------
max_live:    12,287 in 1 blocks
tot_alloc:   271,665,570 in 22,110 blocks
deaths:      22,110, at avg age 3,779
acc-ratios:  0.05 rd, 0.05 wr  (15,068,202 b-read, 15,125,203 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D6B0D4: JS_ArenaAllocate (jsutil.h:196)
   by 0x6DFB6E3: js_ExecuteRegExp (jsregexp.cpp:4858)



Here's an allocation point which is apparently associated with useless
allocations and useless writes.  It allocates 4.7M in total, writes to
only 83% of those bytes, ("0.83 wr") and doesn't read any of them
("0.00 rd").  So -- assuming this isn't a bug in the profiler -- these
blocks are slackers.  (Come to think of it, so are the blocks in the
example just above.)  OTOH, their lifetime is very short (31.9ki) so
perhaps it doesn't matter.

------ 34 of 200 ------
max_live:    2,688 in 7 blocks
tot_alloc:   4,674,816 in 12,174 blocks
deaths:      12,174, at avg age 31,894
acc-ratios:  0.00 rd, 0.83 wr  (40 b-read, 3,895,720 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7E9F: moz_malloc (mozalloc.cpp:162)
   by 0x6D8AC99: JSCodeGenerator::init() (jsutil.h:196)
   by 0x6D90C22: js_EmitTree (jsemit.cpp:4383)



.. whereas, by contrast, these blocks are better used.  Each byte of
them is written on average 6.51 times and read 21.92 times before
deallocation:

------ 159 of 200 ------
max_live:    73,472 in 287 blocks
tot_alloc:   85,248 in 333 blocks
deaths:      333, at avg age 11,051,783,369
acc-ratios:  21.92 rd, 6.51 wr  (1,868,805 b-read, 555,055 b-written)
   at 0x4C275B8: malloc (vg_replace_malloc.c:236)
   by 0x72D7F24: moz_xmalloc_tagged (mozalloc.cpp:130)
   by 0x5681585: NewRequestAndEntry(nsIURI*, imgRequest**, imgCacheEntry**)
                 (imgRequest.h:89)
   by 0x5685B0F: imgLoader::LoadImage(nsIURI*, nsIURI*, nsIURI*, nsILoadGroup*,
                 imgIDecoderObserver*, nsISupports*, unsigned int, nsISupports*,
                 imgIRequest*, nsIChannelPolicy*, imgIRequest**)
                 (imgLoader.cpp:1584)
Good stuff!  I filed bug 576726 on a site that causes us to use a ton of memory; dbaron's done analysis there using trace-malloc, would be interesting to see what conclusions you could draw from the more detailed info emitted here as a testcase.
Is it worth using the Haskell "lag, drag, void and use" terminology?
(Assignee)

Comment 5

8 years ago
> Is it worth using the Haskell "lag, drag, void and use" terminology?

I might get to that.  But there's a problem in that (eg) to detect
drag blocks, we need to be able to tell when the last "useful" memory
access to a block is and when the destruction process begins.  I don't
know how to do that for C++ -- how to tell whether accesses to a block
are "useful" or because we've just entered the destructor?  So I don't
want to use the LDVU terminology if I can't offer the same semantics.

Hence the plan to measure "longest period of non-access" as a kind of
statistical kludge that might offer the ability to identify
probably-lag or probably-drag blocks.  I should point out that this is
the least-well worked out part my plan so far.  All suggestions for
good heuristics to identify probably-lag/drag blocks gratefully
received.

for non-ex-Haskell-folks ...

 "lag" is when a block (object) is allocated and then not used for
 a long time.

 "drag" is when a block is deallocated a long time after its last
 use.

In such cases, later allocation or earlier deallocation respectively
would reduce the amount of useless space in the heap.

 "void" is a block which is allocated, never used, and then
 deallocated.  Yes, we have those in Fx (not many, though).

 "use" is the only useful state.  It refers to a block which has
 been allocated and is now in use.

Hence all blocks have a lifecycle in one of two forms:

  allocate .. lag .. use .. drag .. deallocate      or
  allocate ........ void .......... deallocate

It would be nice to detect and minimise lag and drag states and get
rid of void states completely.

Note that the XPCOM cycle collector actually increases the drag
component of the heap, often by several megabytes, compared to eager
deletion of cycles as soon as they become unreferenced.
DHAT has been present in Valgrind for ages now. Nothing more to do here.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.