Closed Bug 551477 Opened 14 years ago Closed 10 years ago

Add a built-in space profiler

Categories: Core :: General, defect
Platform: x86 Linux
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX

People: (Reporter: jseward, Unassigned)

References: (Depends on 1 open bug)

Attachments: (4 files, 6 obsolete files)

The following is a placeholder bug following a discussion on irc with
jorendorff, bz, ted, w/ apologies in advance if I paraphrase it wrong.

It would be nice to have some kind of space profiler facility built
into Fx.  It appears there is concern about the difficulty in
determining what Fx is using memory for, especially in more complex
use scenarios (many open tabs, many extensions, many different web
domains displayed).

Initial discussion assumed that the traversal problem was somehow
solvable.  Specifically, we would be able to traverse the entire JS
graph (easy) and also the entire C++ object graph (conservative scan?).

FTR, I am inclined to think of the traversal problem in a pretty
abstract sense: we have a combined C++/JS object graph, plus some set
of roots into the graph.  We will have a traversal mechanism which
allows us to visit all nodes in said graph, plus a query mechanism
that can ask each node something about itself (what kind are you?  who
put you here?), although it might be that from C++ objects we can get
no such data.

The question of what kind of information should or could be collected
proved less clear cut.  I think Jason was after this:

* for each (web) domain, total amount of storage attributed solely to
  that domain

* for each extension, total amount of storage attributed solely to
  that extension

I think I have a more general proposal for this, which I'll put in a
following comment, so as to avoid cluttering this initial statement.

There may be other metrics which are interesting.  Two of the most
basic are "what are you" (what kind of object) and "who put you here"
(what call stack allocated you?)

One other question is: how much can we do without having to add
profiling metadata word(s) to each object?
(In reply to comment #0)

> * for each (web) domain, total amount of storage attributed solely to
>   that domain
>
> * for each extension, total amount of storage attributed solely to
>   that extension

Here's a strawman generalisation, since for sure we don't want to be
hardwiring concepts like "domain X" or "extension Y" into the
profiler.

We have a graph G and a set of roots R, so that each 'r' in R is a
pointer to some node 'nd' in G.

Imagine we also have a completely arbitrary set T of tags.  Each
member of T characterises a root in some (arbitrary) way.  Each root
then has some subset of T associated with it.

For example, T could be
   {"domain is bbc.co.uk", 
    "extension is FlashBlock",
    "tab number is 57"}

So then we can ask the profiler: how much graph is reachable from
roots which have the tag "domain is bbc.co.uk" ?  Or from roots which
have the tag "extension is FlashBlock"?  Or from roots which have both
of those tags?

Is that a useful generalisation?  I suspect so.  Is it implementable?
I don't know.
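A minimal sketch of the tag-query idea, with invented names (Node, Root, and bytesReachableWithTag are illustrative, not any real API): each root carries an arbitrary subset of the tag set T, and a query sums the sizes of all nodes reachable from roots carrying a given tag.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// One node in the object graph G.
struct Node {
    std::size_t size;            // bytes attributed to this node
    std::vector<Node*> out;      // outgoing edges
};

// One root 'r' in R, with its associated subset of the tag set T.
struct Root {
    Node* target;                // the node 'nd' this root points at
    std::set<std::string> tags;
};

static void mark(Node* n, std::set<Node*>& seen) {
    if (!n || !seen.insert(n).second) return;   // already visited
    for (Node* m : n->out) mark(m, seen);
}

// Total bytes reachable from roots that carry 'tag'.
std::size_t bytesReachableWithTag(const std::vector<Root>& roots,
                                  const std::string& tag) {
    std::set<Node*> seen;
    for (const Root& r : roots)
        if (r.tags.count(tag)) mark(r.target, seen);
    std::size_t total = 0;
    for (Node* n : seen) total += n->size;
    return total;
}
```

A multi-tag query ("both of those tags") would just tighten the root filter; the traversal is unchanged.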
FWIW, I think it would be fine to do the traversal only on request, as long as we keep enough information around to perform that at any time. I suspect we'd want to be able to gather this information and display it when a user visits about:memory.

Maybe it makes sense to just have the underlying API provide the data you just proposed, and then the front-end UI can figure out a way to display it that makes sense?
(In reply to comment #2)
> Maybe it makes sense to just have the underlying API provide the data you just
> proposed, and then the front-end UI can figure out a way to display it that
> makes sense?

Yep, that'd be the right approach; I suspect a lot of tools would want to use this data in different ways.  Note that exposing a simple query interface would also be perfectly ok, though, since getting the data could well be expensive.  But a query could also just be 'give me all the data'.
It seems ok to have an expensive query, as long as we document that the call is expensive. If a user opens about:memory because their Firefox is already screwed, it's probably not a big deal if it takes a little bit to run.
Thinking a bit about the top level implementation choices and their
space requirements and intrusiveness.  Seems like there's at least
three different things we can ask of an object, at heap-census time:

(1) What are you? (== what kind of object?)

    Presumably we can find this out by examining the object itself.
    Does not require any extra storage.


(2) Who put you here? (== what point in the program allocated you?)

    This requires having extra storage associated with the object,
    and this is needed for the entire lifetime of the object.
    Usually done by having an extra word in each object, containing
    some kind of allocation-point tag.


(3) Why are you still alive? (== who's pointing at you, and/or
    from which root(s) are you reachable?)

    Requires constant extra storage per object (perhaps a root-set
    tag?) but that storage is only required transiently at
    census-time, not for the entire lifetime of the object.


(1) and (3) might be acceptable.  (2) is massively intrusive because
we'd have to change the layout of every object (or do some nasty hack
like having a shadow heap containing the allocation-point data), and
we'd also have to mess with every allocation point, to insert a tag
into the object/shadow-heap.

Note that (3) is a generalisation of the tags idea in Comment #1.
Gregor has been working on (2).
(In reply to comment #6)
Is there a tracking bug, or some other place I can learn about
what he's doing?
(In reply to comment #7)
> (In reply to comment #6)
> Is there a tracking bug, or some other place I can learn about
> what he's doing?

There is bug 547327.
My approach is to monitor objects that are created within a loop and try to optimize the allocation size for further objects that are created at the same program location.
Julian, is there anything I can do to help move this along? Do you need your car washed? Meals cooked? Heaven and earth moved? Is there anybody you want rubbed out? Andreas maybe? Andreas, we need this feature, you're going to have to take one from the team.
I meant "for the team", but either way.
Ted pointed me to this bug from here:
http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/f2008793142b9abb/49e00f7d41b70ce1?hl=en&lnk=gst&q=memory+profiling+firebug#49e00f7d41b70ce1


One of the most requested features for Firebug is support for web page memory profiling. Recently we tried to implement a prototype (based on the memory API included in the Jetpack prototype). You can see more here:
http://getfirebug.com/wiki/index.php/Firebug_Memory_Profiler

The first question: what is the relation between the existing Jetpack APIs and this bug? Will Jetpack be used as the base for the solution?

---

So far, there is no way to get precise size (raw bytes) information for
e.g. strings, which would be a nice addition to the size info already
provided.  Some other more structured size info (related to images,
HTML elements, etc.) would also be useful.

The prototype we have done for Firebug showed us that one of the most useful things (besides the actual amount of memory consumed) is the list of referents (comment #5 - point #3). We managed to get such a list (using the Jetpack APIs) since they provide a way to access the entire graph of JS objects in the current Firefox runtime.

Note that we needed to filter the graph in order to get JS objects only for the current page (which is how Firebug works, always focusing on the current page), so having some kind of query interface (as mentioned in comment #3) would help.

I also agree with Vladimir that various tools would want to present the data to the user in different ways. For example, in the case of Firebug we also want to integrate the memory info with existing panels.

The second question: as I mentioned, using the Jetpack APIs it's relatively easy to get the entire graph of meta-data, where each node has its own unique ID. The problem is that there is no way to get a JS object's ID and/or to get an existing JS object from a provided ID (I know that JSD provides an API to get these IDs for script objects: jsdIScript.tag), which complicates the "JS object <-> meta-data" relation.

In Firebug it would be extremely helpful to show the meta-data in the context of real objects on the page (e.g. in the Watch panel, DOM panel, etc.).

Could such methods (like: getJSObject(id), getJSObjectID(obj) ) also be part of the memory API?

Honza
> The first question: what is the relation between the existing Jetpack APIs and
> this bug? Will the Jetpack be used as a base for the solution?

I had a look at the Firebug profiler links.  It looks like a nice
presentation and query environment.  I think we're approaching the
same problem from opposite ends, though, and I don't think I can
answer your questions at present.  Here's what I'm looking into.

My goal atm is low level enumeration of the graph of objects, with the
short-term aim of maximising coverage.  I use "object" in the loosest
sense here, to mean any lump of memory to which anyone has a pointer
or some kind of reference.  So I am aiming to cover the JS heap, the
C++ heap, and not just stuff under control of the cycle collector, but
ad-hoc malloc'd memory too.

Given the great diversity of objects that covers, I am thinking about
the idea of building a "shadow graph" -- a throwaway structure which
is (mostly) constructed when we want a memory census, analysed, and
discarded.

The shadow graph is composed of shadow nodes, each of which is:

  start-address  length  arbitrary-description  set-of-out-pointers

plus a set of pointers which serve as roots.  At this stage the
arbitrary-description can be minimal or nothing.
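As a rough illustration (field and type names invented here, not taken from any actual patch), a shadow node and the surrounding graph might look like:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One shadow node: start-address, length, arbitrary-description,
// set-of-out-pointers.  The description can start out empty and grow
// more detailed as queries demand.
struct ShadowNode {
    std::uintptr_t start;                  // start-address of the block
    std::size_t length;                    // length in bytes
    std::string description;               // arbitrary-description (may be empty)
    std::vector<std::uintptr_t> outPtrs;   // set-of-out-pointers
};

// The throwaway structure built at census time and then discarded.
struct ShadowGraph {
    std::vector<ShadowNode> nodes;
    std::vector<std::uintptr_t> roots;     // pointers which serve as roots
};
```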

The shadow graph can be incomplete or inaccurate.  Basically we forage
around for information to build it with, then hand it off to an
analyser.  The analyser is tolerant of missing pointers, pointers to
the middle of nodes, pointers to static objects, apparent
inconsistencies like nested or overlapping blocks, etc.

The graph would be constructed, for JS objects, by a hacked version of
JS_DumpHeap.  For C++ objects I am thinking of adding hooks to
memory/mozalloc/mozalloc.h which observe all mallocs/news and
frees/deletes.  At analysis time any such blocks would be scanned to
find pointers to other blocks.

This has various advantages:

* it gives coverage of C++ and JS objects

* without having to get involved with the Cycle Collector

* all the analysis code can be in one place, rather than scattered
  around

* the analysis code does not need to be aware of the layout of any of
  the objects

* the only code that does need to be aware of object layouts (and be
  scattered around) are "abstractor methods", you might say, which
  take (eg) a JSSquirrellyThing* and give back one or more shadow
  nodes that represent it

* we can incrementally increase the level of detail in the
  arbitrary-description field as required to support more detailed
  queries.

* it extends naturally to anything else we write an "abstractor
  method" for, eg, mmap'd blocks

* initially the description field can be empty.  Even like that we get
  basic stats on the numbers and sizes of allocated objects, plus
  reachability information.

* the shadow graph doesn't need to be constructed all at the same
  time, iow we don't necessarily need to stop the entire world to do
  it.  We can do analyses of incomplete or inconsistent graphs if need
  be.

* as soon as the shadow graph is built, the system can continue.  We
  don't have to wait around for expensive reachability or whatever
  analyses to complete.  That can be done by a separate thread.  Or
  even by a different process, offline.

* when profiling is inactive it's free in space terms and almost free
  in time terms.  "Almost" because it would still require an extra
  conditional branch on the mozalloc malloc/new functions to decide
  not to collect the allocated address/lengths.

Anyway.  If anyone thinks this is the wrong thing to build, or simply
won't work, please yell asap.
(In reply to comment #12)
> Anyway.  If anyone thinks this is the wrong thing to build, or simply
> won't work, please yell asap.

If it weren't you saying it, I would say, "you're vastly underestimating the complexity of what you're proposing."  But since it *is* you, I'll say, "It sounds too good to be true."
I don't understand what you mean by "abstractor methods", and what they would do.
(In reply to comment #14)
> I don't understand what you mean by "abstractor methods",

I mean a function which takes an object and returns its essential
details for inclusion into the shadow graph.  For example, given a
pointer "p" to one of these

   struct Foo  {  int a;  Blah* b;  char c;  D* d;  }

the abstractor function must return at a minimum these 3 items:

   p   sizeof(*p)   {p->b, p->d}

that is, the address range of the object and the set of pointers found
within it.  That is the minimum info needed to build the shadow graph.

The point about intercepting moz_xmalloc et al is that we can extract
the above info automatically, for malloc'd blocks.  This is just as
well since it would obviously be totally infeasible to manually
annotate every C++ constructor in the tree.

When a malloc-style allocation is made we note the returned pointer
and the requested size.  That means for a 2 words/block overhead we
get a list of all the blocks in the C++ heap.  Then at analysis time
we can scan each block to find any pointers to other blocks it might
hold.  This is really just conservative C/C++ garbage collection
stuff.
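A sketch of that conservative-scan step, assuming the allocator hook has already recorded each live block's address and size (BlockMap, pointsIntoHeap and scanBlock are invented names; as with any conservative scan, non-pointer words can produce false positives):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Recorded live blocks: start address -> length, as noted at malloc time.
using BlockMap = std::map<std::uintptr_t, std::size_t>;

// True iff 'value' points into any recorded block (interior pointers count).
bool pointsIntoHeap(const BlockMap& blocks, std::uintptr_t value) {
    auto it = blocks.upper_bound(value);   // first block starting after value
    if (it == blocks.begin()) return false;
    --it;                                  // greatest start <= value
    return value >= it->first && value < it->first + it->second;
}

// Scan one block word-by-word, collecting candidate out-pointers.
std::vector<std::uintptr_t> scanBlock(const BlockMap& blocks,
                                      const void* start, std::size_t len) {
    std::vector<std::uintptr_t> out;
    const std::uintptr_t* words = static_cast<const std::uintptr_t*>(start);
    for (std::size_t i = 0; i < len / sizeof(std::uintptr_t); i++)
        if (pointsIntoHeap(blocks, words[i]))
            out.push_back(words[i]);
    return out;
}
```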

I just verified that I can stick a hook in moz_xmalloc et al (thanks
cjones!) and see all mallocs and frees going by.  So this part of the
plan at least looks like it might work.

For JS objects and other stuff in the JS heap I will need to write
custom abstractor functions.  But that's not such a big deal because
there's only a limited set of object formats to consider.
Julian, how much of a problem would tagged pointers be for this scheme?  We use them in some places in the DOM, and I've been trying to decide whether we should stop doing so...  Does it matter for this setup one way or another?
(In reply to comment #16)

I don't think that would be a problem -- those bits can just be masked
off.  It weakens a bit the heuristic used to decide whether something
could be a pointer to a block, because we can no longer reject
misaligned values as can't-possibly-be-a-pointer.  But I don't think
that's a particularly safe heuristic anyway, since it would assume
that nobody in the entire system uses tagged pointers.  Which I don't
believe.

So the short answer is: it makes no difference.
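The masking step is tiny; a sketch, assuming the common scheme of tags living in the low bits of a word-aligned pointer (the mask width here is illustrative):

```cpp
#include <cstdint>

// Strip low tag bits before asking whether a word could point into a
// known block.
inline std::uintptr_t stripTagBits(std::uintptr_t p) {
    return p & ~static_cast<std::uintptr_t>(sizeof(void*) - 1);
}
```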
As a devil's advocate type question, what is the advantage of this approach versus forking an inert copy of the program or performing a non-fatal core-dump and then probing that static memory snapshot using a python-enabled gdb?

Assuming debug symbols are available, gdb can introspect objects and tell when a pointer in memory amounts to an nsCOMPtr/nsRefPtr versus a presumed invariant-controlled backlink and otherwise get a lot of structure knowledge for free.

The 'self-introspection' goal could be satisfied by taking a page from roc's chronicle playbook and having the python/gdb driver expose an HTTP loop that consumes and returns JSON.  One might even argue that it's cleaner because the patient is never performing analytical surgery on itself.

In any event, this is very exciting work you are undertaking and I look forward to the results!  You may be interested in trace-malloc's leak-soup logic that does some of the conservative pointer analysis stuff already, apparently:
http://mxr.mozilla.org/mozilla-central/source/tools/trace-malloc/
gdb, what's that? I'm a Windows-based JS developer who wants to know where all my memory is going!

/be
The use case I proposed when I badgered Julian into filing this bug was "I'm a normal user, and my Firefox is using a lot of memory. How can I find out what's using all that memory?" Ideally we will provide an API that things like about:memory can use which could give a rough idea of how much memory is entrained by each extension and web site. We need to give users tools so that instead of saying "Firefox uses too much memory" they can say "AdBlock uses too much memory when I visit cnn.com". (Then, ideally, the parties responsible can fix their code!)
(In reply to comment #20)
> [...] they can say "AdBlock uses too much memory when I visit cnn.com".

Yes, I completely agree that's where we want to get to eventually and
yes it will no doubt require an API to query the collected data.  But
first we need the low-level info-collecting machinery in place, since
for sure I don't see how we can answer any interesting question
without having at a minimum a graph, roots and reachability analysis.
Also .. I'm reluctant to build a profiler which only has partial
coverage (eg, JS heap only), hence the thinking about the C++ heap at
this early stage.
One thing I'd like to ask the assembled wisdom here is, when should
a heap census be performed?  My current plan is:

* immediately after the cycle collector has run, and
* immediately after JSGC has run

From sticking printfs in various places it appears that a run of CC
necessarily causes a run of JSGC (yes?) but not vice versa.  So I'm
inclined to hack up some logic which does a census after every CC run
and after every JSGC run that isn't bracketed by a CC run.

Better suggestions / your-analysis-is-bogus style comments welcomed.
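The proposed scheduling rule can be sketched in a few lines (hook and flag names invented for illustration): census after every CC run, and after a JSGC run only when that GC wasn't bracketed by a CC run.

```cpp
// Invented notification hooks; real code would wire these into the
// cycle collector and JS GC callbacks.
static bool gInsideCC = false;
static int gCensusCount = 0;

static void doCensus() { gCensusCount++; }   // stub for the heap census

void onCycleCollectStart() { gInsideCC = true; }
void onCycleCollectEnd()   { gInsideCC = false; doCensus(); }
void onJSGCEnd()           { if (!gInsideCC) doCensus(); }
```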
Right, I'm not suggesting you need to implement all that, just the underlying bits! It was more in response to Andrew's "why not just use gdb + python" question.
I raised the gdb solution because it seemed like a way to avoid a slippery slope on the C++ analysis front and because forking seems like the fastest way to get the system running again.  Specifically, implementing the abstractor methods seems like manually reproducing the work that debug information already provides, and using gdb is much less risk-prone than rolling your own DWARF parser.  (Of course, gdb-python or dehydra/treehydra could be used to automatically generate those if the need arose.)

It obviously is not the ideal long-term solution for a web developer focused tool.  It might be reasonable for a mozilla developer tool, especially as it could be viable for crash-dump analysis.  Mainly, I was thinking it could be reasonable as a simpler-to-develop prototype given that I believe Jim Blandy already has the JS side of the house covered for gdb-python ( http://hg.mozilla.org/users/jblandy_mozilla.com/archer-mozilla/ ).  The fundamental assumption that the python-gdb way is simpler might be very wrong on my part...

fwiw, I believe gdb can run on windows, but I am unclear as to whether it understands the MS compiler's debugging symbols format, which would be the most important thing.

In any event, an in-tree solution is even more exciting and useful than an external mechanism with dependencies; I eagerly await the fruits!
I have working a first cut of machinery that creates a combined
C++/JS shadow graph, as per the proposal in comment #12.  Will clean it up
and post a wip patch later this week.  This is still far from being
useful.

For a Fx idling and displaying news.bbc.co.uk, there are about 88k
nodes in the graph, of which 84k are C++ and 4k are JS.
Depends on: 563305
What this patch demonstrates:

* creation of a combined C++/JS shadow graph

* done periodically, at the end of some JS garbage collections

* thread-safe-ly, more or less, and without deadlocking the system

* reasonable performance (resulting browser is quite usable)

* without the profiler's own dynamic memory allocation skewing the
  results


Example output: all C/C++ heap allocations via mozalloc are tracked,
and running stats of live blocks maintained.  Periodically a 1-liner
summary is printed:

HEAPPROF 20747:  CH:  1605639/1444361/170M total adds/dels/alloc,
                      161278/14595k live blocks/bytes

Distribution of ages at death is also computed but not displayed.


After some but not all JS GCs, a full profile is initiated.  A
shadow graph is built, cleaned up, and some simple stats on nodes
and edges computed:

HEAPPROF 20747:  PROFILE #8: BEGIN creating shadow graph
HEAPPROF 20747:  profile #8: analysis begin
HEAPPROF 20747:    raw: 161363 CH blocks, 4200 JS blocks, 207 dups
HEAPPROF 20747:    initial sanity checks ok
HEAPPROF 20747:    refinement done, 434094 mapped, 807262 unmapped
HEAPPROF 20747:      objects/bytes: 4003/46752 JS, 161353/16426453 CH
HEAPPROF 20747:      edges: 279 JS->JS, 225 JS->CH, 5647 CH->JS, 315904 CH->CH
HEAPPROF 20747:      outdegree: 504 from JS, 321551 from CH, 322055 total
HEAPPROF 20747:      indegree:  5926   to JS, 316129   to CH, 322055 total
HEAPPROF 20747       18183 objects of 165356 total unreferenced from within the SG
HEAPPROF 20747:  PROFILE #8: END


What next:

As it stands, this is more or less useless.  It mechanically extracts
low level reachability, but knows nothing about each node.

One direction is to try to answer "what kind of objects are in the
heap?" questions.  This could be provided by the JS world without
difficulty.  For the C++ heap we might be reduced to guessing object
types by looking for pointers to vtables within the objects.

Another direction is to answer "who put these objects in the heap?"
questions.  This would require tagging each object with its allocation
point.  For the C++ heap this could possibly be done by getting
stack traces from Breakpad.
(In reply to comment #26)
> Another direction is to answer "who put these objects in the heap?"
> questions.  This would require tagging each object with its allocation
> point.  For the C++ heap this could possibly be done by getting
> stack traces from Breakpad.

How about causality tracking?  For example, when a DOM click event gets generated it would generate a 'frame' that says "DOM click calling foo.js/func bar()/line 53" and uses that as the tag.  When that event-triggered JS initiates a setTimeout it creates a new frame to use when the setTimeout gets dispatched and links to the currently active (event-origin) frame.  In order to avoid creating the world's longest linked list, a setTimeout from inside a setTimeout callback would reference the original progenitor instead and boost a counter to reflect that it's the result of a chained setTimeout.  It seems like they could just be JS objects and the normal garbage collector would take care of it?

The same mechanism could potentially be exposed to C++ consumers with default logic being provided for the event loop/native nsITimers/other low-hanging fruit using the breakpad (or external address->symbol resolution).  The startup timeline does some of this and I did a bit more for my recently started and regrettably currently triaged/abandoned Thunderbird memory work.  I even went so far as to push state on all XPConnect C++->JS crossings, although I also was interested in the execution and its timing as well as the memory costs.
FYI, Breakpad can only produce stack traces after-the-fact. (We don't ship the debug symbols with the app, which it needs to walk the stack.)
We could expose the symbols via a web service (I don't think the symbol server protocol is fine-grained enough for this) so that the client could request a bunch of symbol/frame factoids and do the walking, if we wanted to.  We have the information, and can make all sorts of tradeoffs about how to get it to the user, including just pulling them all down the first time the user wants to walk some stack.
Sure, we can do anything (it's just software, right?) I'm just pointing out that as currently implemented, Breakpad isn't really a good fit for the task.
(In reply to comment #29)

For sure we can't do anything much interesting with the C++ heap
without having symbols or unwind data.  Given those two I think we
could answer what's-in-the-heap and who-put-it-there questions, which
would be a good start.  For x86-{linux,darwin} with
-fno-omit-frame-pointer (the default) we might be able to get away
without unwind data, but pretty much all other platforms require it.

I like the pre-canned factoid idea for a couple of reasons:

* With careful design of the factoids, they can be platform
  independent.  Almost all of the hassle about unwinding the stack and
  producing traces is in reading the debuginfo.  So this scheme
  minimises the extra complexity compiled in to the browser, putting
  it instead in the factoid server.

* At least for ELF/Dwarf3, the frame unwind info is stored in a form
  which is compact but not suitable for fast unwinding, so the
  Dwarf3->Factoid server could reformat it to be more suitable for
  fast unwinding.(*)

There's lots of stuff on the web about Microsoft symbol servers, but I
don't see anything for other platforms.  Does there exist
cross-platform versions of the technology?

(*) partially evaluate the Dwarf3 CFI for each address range,
producing, for each range, a canned summary of how to compute the
integer registers of the calling frame from the values in this frame.
Didn't Vlad do some factoid-like work already?

/be
I accidentally opened a dupe on this topic in order to point out http://www.eclipse.org/mat/ , which from my perspective offers a sort of high-point of the UI you might want.

(Note in particular the pleasantly informative choice of showing dominator trees)
(In reply to comment #26)
> For the C++ heap we might be reduced to guessing object
> types by looking for pointers to vtables within the objects.

I suspect you can get pretty far with this sort of thing. There are limits of course, but I think you can get a fair bit out of the existing XPCOM infrastructure. Could dig in an allocation looking for a pointer to nsIClassInfo (https://developer.mozilla.org/en/Using_nsIClassInfo), and/or could tailor a small non-vtbl mixin class for decorating other non-XPCOM / non-virtual types with a pointer to a tiny amount of metadata (particularly: "your module and type name, an offset to a member pointer that's your owner, and/or offset to a member pointer to an object from which you inherit a sense of site-origin").

If you can infer an ownership-tree out of the graph dominators, a few unidentified intermediate nodes along a chain leading to, say, self-identifying image data, DOM nodes or JS objects probably isn't a big hazard to comprehension. I'd focus on this before worrying about the "who allocated you" side of the equation, and building a stack-crawling system. Just getting a characterization of the contents of a really bad heap would be a good start.

The nice thing about bundling this in the standard build is we'll get a whole lot more sampling from the field, cases where sore spots are really standing out, not just "developer sandbox testcases".
(In reply to comment #35)

I tried for a while to do a quick hack to get ELF symbol information
into the process, so as to see if any info can be scraped out of heap
blocks.  But got nowhere; dlsym on Linux won't say anything about
private data symbols, which most vtables are.  Plus it's unportable
and no use for field runs.  I'm sure there's a benefit to be had from
reading debug info, but it seems like it'll take serious integration
with breakpad (in some way) to make it workable and cross-platform.

> but I think you can get a fair bit out of the existing XPCOM
> infrastructure. Could dig in an allocation looking for a pointer to
> nsIClassInfo (https://developer.mozilla.org/en/Using_nsIClassInfo),

At profile time I am merely faced with a collection of blocks that
malloc/new have noted.  I don't know which of those I can safely run
QueryInterface on.

What would be nice is for the compiler to automatically pass a
uniquely identifying text tag to each allocation.  I can't find a way
to do that in the general case.  But for classes that use XPCOM, we're
in luck.  Every participating class has NS_DECL_ISUPPORTS,
NS_DECL_CYCLE_COLLECTING_ISUPPORTS or NS_DECL_ISUPPORTS_INHERITED in
its declaration.  By adding to those macros the following

public:                                                                       \
  static void* operator new (size_t size) {                                   \
      return moz_malloc_tagged(size, (char*)__PRETTY_FUNCTION__);             \
  }                                                                           \
  static void operator delete (void *p) { moz_free(p); }                      \

it is possible to make a class-specific override of new/delete, which
passes __PRETTY_FUNCTION__ onwards, and that string contains the class
name.  Wiring all that together makes it possible to generate stats like
this (edited to make it less wide):

objects/bytes: 118343/12905257 in CH, 29901/2965424 in CH-tagged
  [  0] objects/bytes:   2000 / 480000    sStandardURL
  [  1] objects/bytes:   4251 / 408096    CSSStyleRuleImpl
  [  2] objects/bytes:   3165 / 278520    nsTextNode
  [  3] objects/bytes:   2369 / 265328    XPCWrappedNative
  [  4] objects/bytes:    876 / 147168    nsEventListenerManager
  [  5] objects/bytes:   1533 / 147168    nsXULElement
  [  6] objects/bytes:   2962 / 118480    AtomImpl
  [  7] objects/bytes:    673 / 91528     nsHTMLAnchorElement
  [  8] objects/bytes:    237 / 45504     nsLocalFile
  [  9] objects/bytes:    156 / 41184     imgRequest
  (many lines deleted)

This says: in the C++ heap there are currently 118343 blocks
comprising 12905257 bytes.  Of these, 29901 blocks (2965424 bytes)
have a tag.  Followed by a per-class breakdown for the tagged blocks,
most voluminous first.  The full profile has 447 lines, so at least
447 classes have been covered.

What this also shows is that only about 1/4 of the blocks are
identified.  Since it covers (almost) all XPCOM classes, this gives a
quick upper bound on what coverage we can get from XPCOM.

It would be nice to expand the coverage.  I am wondering if adding a
single line to other allocation points would not be a bad tradeoff.
It could be done incrementally.
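A standalone, compilable sketch of the class-specific operator-new trick above (malloc_tagged and the recording are stubs standing in for moz_malloc_tagged and the real block table; note __PRETTY_FUNCTION__ is a GCC/Clang extension, and its expansion inside the in-class operator new contains the class name, which is what makes the per-class breakdown possible):

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Stub: the real patch would record (ptr, size, tag) per live block.
static std::string gLastTag;

void* malloc_tagged(std::size_t size, const char* tag) {
    gLastTag = tag;
    return std::malloc(size);
}

// Class-specific override of new/delete, as the NS_DECL_ISUPPORTS
// macros would gain in the proposal above.
#define DECL_TAGGED_NEW                                               \
  public:                                                             \
    static void* operator new(std::size_t size) {                     \
        return malloc_tagged(size, __PRETTY_FUNCTION__);              \
    }                                                                 \
    static void operator delete(void* p) { std::free(p); }

// Hypothetical class standing in for an XPCOM-style type.
class nsDemoThing {
    DECL_TAGGED_NEW
    int mValue = 0;
};
```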
Julian, I hate to suggest this, but I think the biggest benefit here would be to get this working under Win32 initially; there at least I know that the pdb files have all the info we need to resolve vtable pointers, and the dbghelp library makes doing that straightforward.

I went off on a bit of a tangent yesterday and wrote a quick tool to walk the entire Win32 heap (win32, not jemalloc), dumping the first 4 bytes of each allocation for later vtable analysis.  It returns all that to js, and then dbghelp or other tools (map files also come to mind -- we could make this completely cross platform if we standardized on one map file format) can be used via ctypes/etc. to translate to symbol names.  I'll get that patch up and cc you.
(In reply to comment #37)
> Julian, I hate to suggest this, but I think the biggest benefit here would be
> to get this working under Win32 initially; there at least I know that the pdb
> files have all the info we need to resolve vtable pointers, and the dbghelp
> library makes doing that straightforward.

I think this is a really bad idea. This should be a tool in *every build* of the program. Not "some developer builds, sometimes, when they happen to have set things up right". You should always be able to take a browser -- when it's up at 1gb resident after a week of browsing on a normal desktop and you don't know why -- and just ask it "why?"

The problem we face here is that in "our testing", when we do it on talos or under controlled conditions, we have these lovely graphs that say we don't use much memory and we always release it and everything is lovely. And then users in the field keep posting back screenshots from week-long sessions with a few dozen tabs and a modest set of extensions saying "no, you don't, it's really out of control and my system is paging again and it's firefox's fault". Then we turn around and point at our pretty graphs and tell them they must be imagining things because it's well-behaved when we tested it.

We need to close that gap, and that means gathering better *field data*.
I think we're saying the same thing -- I'm pointing out that 90% of our users are on win32, so getting that data would be a lot more useful than gathering linux data, and that there's a standardized way to obtain all the symbol data that we'd need via the symbol server, plus a simple library that we can use to query that data.  All stuff that makes it possible to actually gather field data :-)

We could also look into teaching breakpad how to spit out text map files so that we have a standard "symbol data" format without having to write platform specific code to get it as well.  But right now, as far as I know, win32 is the only platform where we have infrastructure to say "give me symbols for release/nightly build with signature abcd".
(In reply to comment #39)
> I think we're saying the same thing -- I'm pointing out that 90% of our users
> are on win32, so getting that data would be a lot more useful than gathering
> linux data,

I understand what you're saying, and I'm not disagreeing.  However,
I'm not trying to gather linux data.  I'm trying to see how far it's
possible to get without any external assistance, a la Graydon's
comments.

I'm inclining more towards a source-oriented approach -- what can we
do to get the compiler to add this info automatically?  For one thing,
vtable scraping is going to fail to tell us anything for straight
malloc'd blocks, or for classes in which no vtable is required.
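To illustrate that limitation, here's a minimal sketch (class names invented, assuming a typical ABI that stores the vptr in the first pointer-sized word): a vtable scraper reads the first word of a block, which is only meaningful for polymorphic classes; for plain structs or raw malloc'd buffers it is just data.

```cpp
#include <cstdint>
#include <cstring>

struct HasVtable { virtual ~HasVtable() {} int x = 0; };  // vptr stored first (typical ABI)
struct NoVtable  { std::uintptr_t first = 0; };           // no vptr; first word is plain data

// What a vtable scraper does: read the first pointer-sized word of a block.
inline std::uintptr_t first_word(const void* obj) {
    std::uintptr_t w;
    std::memcpy(&w, obj, sizeof w);
    return w;
}
```

For `HasVtable` the word is a (non-null) vtable pointer that symbol data can resolve to a class name; for `NoVtable` it is whatever the first member happens to hold.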
(In reply to comment #36)

> What this also shows is that only about 1/4 of the blocks are
> identified.  Since it covers (almost) all XPCOM classes, this gives a
> quick upper bound on what coverage we can get from XPCOM.

25% coverage of the heap is pretty useless.  I wondered how many
other allocation points would have to be hand annotated to
improve that significantly.  Problem is there are thousands of
allocation points (constructors, mostly).  On the assumption that
allocation probably follows a 90/10 rule, I bootstrapped the
process by hacking the profiler to record the top 5 frames of
each allocation, commoned those up, and looked at the sources
corresponding to the 5-frame groups that allocated the most.
This tells me where to put hand annotations so as to maximise
coverage of heap contents.
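The commoning-up step can be sketched roughly as follows (a hypothetical reconstruction, not the actual hack): treat the top 5 return addresses as a key, accumulate block and byte counts per key, and sort by bytes to find the allocation sites worth annotating first.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using StackPrefix = std::array<std::uintptr_t, 5>;  // top 5 frames of an allocation

struct SiteStats { std::size_t blocks = 0, bytes = 0; };

// Group (prefix, size) records by their 5-frame prefix, biggest byte totals first.
std::vector<std::pair<StackPrefix, SiteStats>>
rank_sites(const std::vector<std::pair<StackPrefix, std::size_t>>& allocs) {
    std::map<StackPrefix, SiteStats> groups;
    for (const auto& a : allocs) {
        groups[a.first].blocks += 1;
        groups[a.first].bytes  += a.second;
    }
    std::vector<std::pair<StackPrefix, SiteStats>> out(groups.begin(), groups.end());
    std::sort(out.begin(), out.end(), [](const auto& x, const auto& y) {
        return x.second.bytes > y.second.bytes;
    });
    return out;
}
```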

An afternoon of doing this increased coverage to 80% of heap
bytes, sometimes more, via 35 one-liner annotations, mostly
like so

  MOZALLOC_PROF_TAG(this);

in constructors, since

  MOZALLOC_EXPORT void moz_prof_tag(void* ptr, const char* tag);
  #define MOZALLOC_PROF_TAG(__ptr)          \
      moz_prof_tag((__ptr), __PRETTY_FUNCTION__)

A sample profile is attached (see next comment), for Fx with 7 tabs
open.  It shows coverage of 74% of bytes and 87% of objects for a set
of 177,885 live blocks that have been allocated through mozalloc.

The lines marked "static void* C::operator new..." are
allocations tagged automatically by the trick in Comment #36.
All others are from hand-applied annotations.
Attached file heap profile as per Comment #41 (obsolete) —
I support the standalone-from-all-side-files, works-in-product-builds, source-based approach. When in doubt, use brute force. If we have to mangle our source a bit, it could be worthwhile compared to the old manglings we put up with all over for dubious benefit.

/be
Julian, how does your addition of operator new stuff work for classes that already have NS_DECL_AND_IMPL_ZEROING_OPERATOR_NEW?
This is cool, though I suspect people will immediately say "what's the stack trace for static void* nsCSSCompressedDataBlock::operator new(size_t, size_t)?" and you'll end up reinventing Massif's call tree structure.  But that's ok, if it can be done within the browser so that anyone can use it that would be a big win.

Minor comments about the profile output format:  
- Commas in big numbers (eg. 123,456) make them much easier to read;
- Please mark each entry with the percentage of the total that it represents.
(In reply to comment #44)
> Julian, how does your addition of operator new stuff work for classes that
> already have NS_DECL_AND_IMPL_ZEROING_OPERATOR_NEW?

Yes, an excellent question.  Basically it doesn't.  If a class
overrides new and delete itself then adding a second override in
NS_DECL_ISUPPORTS causes the compiler to complain.

What I _really_ did was to rename NS_DECL_ISUPPORTS to
NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE (possibly a bad name :-) and to then
do this

  #define NEW_OVERRIDES \
     public:                                                         \
     static void* operator new (size_t size) {                       \
        return moz_malloc_tagged(size, (char*)__PRETTY_FUNCTION__);  \
     }                                                               \
     static void operator delete (void *p) { moz_free(p); }          \

  #define NS_DECL_ISUPPORTS \
     NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE \
     NEW_OVERRIDES

(ditto with NS_DECL_CYCLE_COLLECTING_ISUPPORTS and
NS_DECL_ISUPPORTS_INHERITED)

Then, everywhere where g++ complained about duplicate overrides,
changed NS_DECL_ISUPPORTS to NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE (viz,
back to what it was originally).

From a bit of greppery, there are about 1700 uses of
NS_DECL_ISUPPORTS_*, of which 23 had to be _NO_NEW_OVERRIDE'd.  So not
a big hit.
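A self-contained toy version of the trick, for illustration (function and class names are invented; __PRETTY_FUNCTION__ is a GCC/Clang extension, which matches the g++ context here):

```cpp
#include <cstdlib>
#include <string>

// Stand-in for moz_malloc_tagged: just remember the last tag we saw.
static std::string g_last_tag;
static void* toy_malloc_tagged(std::size_t size, const char* tag) {
    g_last_tag = tag;
    return std::malloc(size);
}

// The override block that gets stuffed into NS_DECL_ISUPPORTS.
#define NEW_OVERRIDES \
    public: \
    static void* operator new(std::size_t size) { \
        return toy_malloc_tagged(size, __PRETTY_FUNCTION__); \
    } \
    static void operator delete(void* p) { std::free(p); }

class nsToyWidget {  // hypothetical class
    NEW_OVERRIDES
};
```

Every `new nsToyWidget` now tags its block with a string like "static void* nsToyWidget::operator new(size_t)", which is exactly the form of the tags visible in the attached profile.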

That said there was some complexity I didn't figure out yet.  Even
after the compiler accepted everything, in a few cases possibly to do
with multiple inheritance, I was seeing uses of uninitialised values
somehow to do with the automatic overriding, leading to segfaults, and
I had to back off to the _NO_NEW_OVERRIDE versions for those classes.
Possibly about 6 cases.
(Stuff I forgot to say in Comment #46)

... consequently, classes that are marked _NO_NEW_OVERRIDE have to
be hand-annotated using MOZALLOC_PROF_TAG as per Comment #41.

MOZALLOC_PROF_TAG is a second way to get tag information into the
profiler.  It facilitates setting the tag on a block after the block
has been allocated.  The ideal of course is to set the tag at
allocation time, a la moz_malloc_tagged in Comment #46.
(In reply to comment #42)
> Created an attachment (id=449124) [details]
> heap profile as per Comment #41

As a light diversion from the "what should we build?" question: in
this profile, no less than 37% of the objects in the heap (66.6k out
of a total of 177k objects) have to do with CSS.  That seems a lot.
Is that expected?


HEAPPROF 16701:    objects/bytes: 177885/18446274 in CH, 
                                  155018/13662436 in CH-tagged

RANK  OBJECTS/BYTES    ALLOCATED BY

  0   9131 / 1827388   static void* nsCSSCompressedDataBlock::operator
                       new(size_t, size_t)
  1  19084 / 1374048   nsCSSSelector::nsCSSSelector()
  4   8525 / 818400    static void* CSSStyleRuleImpl::operator new(size_t)
  9   8491 / 407568    nsCSSDeclaration::nsCSSDeclaration()
 12  10306 / 247344    nsCSSSelectorList::nsCSSSelectorList()
 17   6484 / 155616    nsCSSValueList::nsCSSValueList()
 27   1747 / 69880     nsCSSValuePairList::nsCSSValuePairList()
 29   1017 / 50568     nsCSSValue::URL::URL(nsIURI*, nsStringBuffer*, nsIURI*,
                       nsIPrincipal*)
 35   1156 / 36992     nsPseudoClassList::nsPseudoClassList(nsIAtom*,
                       nsCSSPseudoClasses::Type)
 59    354 / 16992     static void* nsDOMCSSAttributeDeclaration::operator
                       new(size_t)
 61    114 / 15504     static void* nsCSSStyleSheet::operator new(size_t)
103     57 / 4560      static void* CSSNameSpaceRuleImpl::operator new(size_t)
104     81 / 4536      static void* nsCSSRuleProcessor::operator new(size_t)
126     90 / 2880      static void* CSSImportantRule::operator new(size_t)
That seems like a large proportion, true; but you're also looking at a profile that (if I read correctly) only relates to ~18mb resident? So like .. a very small snapshot. Possibly quite un-representative of the program later on in its life. I assume this is just from a prefix of startup?

Can you get a profile once it's been browsing for a while and, say, passes the 150mb resident mark? Might need to up the numbers in your static array.
> Is that expected?

It could be, depending on what's being measured.  You mentioned 7 tabs.  What were they?  Was gmail involved, say?
(In reply to comment #49)
> That seems like a large proportion, true; but you're also looking at a profile
> that (if I read correctly) only relates to ~18mb resident?

Yes .. I was also suspicious about that.  By running a profiled build
on Massif, which intercepts all mallocs and frees, it's possible to
see what proportion of heap allocations are routed through mozalloc
and hence are intercepted.  The news isn't good.

For a startup with 4 tabs, intercepting mozalloc shows the heap volume
peaking at around 29.4M.  Massif puts that figure at a more credible
65.0M, meaning more than half the heap allocation bypasses mozalloc and
so is invisible to the built-in profiler.  I'm not sure what to do
about this.  It would be possible to force (eg) jsengine to route
allocations through mozalloc, but that would create a new intermodule
dependence.  And there's a significant amount of heap allocated by
libX11.so.6.3.0 and libxcb.so.1.1.0 which we can't reroute at the
source level.

One ungood kludge would be to write a shared object redefining malloc,
free, etc and LD_PRELOAD it into the process.  That of course only
works on Linux and perhaps MacOSX.
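For the record, that kludge would look roughly like this (a glibc-specific sketch, not a real patch: __libc_malloc/__libc_free sidestep the usual dlsym-recursion problem, and the counter is deliberately not thread-safe):

```cpp
// heapspy.cpp -- build: g++ -shared -fPIC -o libheapspy.so heapspy.cpp
// use:   LD_PRELOAD=./libheapspy.so firefox
#include <cstddef>

// glibc's real allocator entry points
extern "C" void* __libc_malloc(std::size_t n);
extern "C" void  __libc_free(void* p);

static std::size_t g_live_blocks = 0;  // crude census: count of live blocks

extern "C" void* malloc(std::size_t n) {
    ++g_live_blocks;
    return __libc_malloc(n);
}

extern "C" void free(void* p) {
    if (p)
        --g_live_blocks;
    __libc_free(p);
}

std::size_t heapspy_live_blocks() { return g_live_blocks; }
```

(A real version would also wrap calloc, realloc, memalign et al., or blocks will slip past the census.)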

Anyway, some stats from the above run.

I should add, all these figures exclude the overhead of the built-in
profiler itself (about 18M), to make the comparison fair.

============

Built-in intercepts in mozalloc give a peak heap use of 29.4M.

Massif says 65.0M.

where did the rest go?  Well, approximately:

  12,737,468 0x704B744: moz_calloc (mozalloc.cpp:137)
   8,484,176 0x704B81B: moz_malloc_tagged (mozalloc.cpp:105)
   6,564,989 0x704B87D: moz_xmalloc (mozalloc.cpp:91)

  p 12737468  + 8484176 + 6564989
  $1 = 27786633

So 27.8 MB was noted by Massif as going via mozalloc; that ties
in at least approximately with the 29.4M figure above.

For the rest:

11,023,595 in 735 places, all below massif's threshold (01.00%)

PR_Malloc calls malloc directly:
   4,754,242  PR_Malloc (prmem.c:467)

js_malloc calls malloc directly:
    3,832,065  js_NewScript (jsutil.h:193)
    3,478,750  JS_ArenaAllocate (jsutil.h:193)
    2,269,696  JS_DHashAllocTable (jsutil.h:193)
    1,401,504  JSScope::create(JSContext*, JSObjectOps const*, JSClass*,
                                JSObject*, unsigned int) (jsutil.h:193)
      904,564  js_NewStringCopyN (jsutil.h:193)

system libraries presumably call malloc directly:
    3,142,400  XGetImage (in /usr/lib/libX11.so.6.3.0)
    1,571,264  ??? (in /usr/lib/libxcb.so.1.1.0)

extensions/spellcheck/hunspell/src/hashmgr.cpp calls malloc directly:
    2,685,946  HashMgr::add_word(char const*, int, int, unsigned short*,
                                 int, char const*, bool) (hashmgr.cpp:188)
    ## Hmm, does this mean we're allocating 2.7MB merely to initialise
    ## a spell checker that didn't get used in this run?

XPT_ArenaMalloc calls calloc directly:
    1,126,648  XPT_ArenaMalloc (xpt_arena.c:221)

sqlite3 calls malloc directly:
    1,038,456  sqlite3MemMalloc (sqlite3.c:12975)
(In reply to comment #50)
> > Is that expected?
> 
> It could be, depending on what's being measured.  You mentioned 7 tabs.  What
> were they?  Was gmail involved, say?

No gmail.  I don't recall exactly, but I think the tabs were

http://lwn.net/Articles
http://www.youtube.com
http://en.wikipedia.org/wiki/Main_Page
and the other 4 were articles accessible from
http://www.spiegel.de/international
bug 559263 has some changes to make us wrap malloc more thoroughly with mozalloc. I'm not sure if those apply on all platforms or just Android. cjones would be the one to talk to about mozalloc hijinks, in any case.
The other major one is the JS GC chunks, though that may already go through the right place on Linux.  That's better now that Vlad has landed the hooks, though.
> No gmail.  I don't recall exactly, but I think the tabs were

OK.  We'd have to take a look at their source to see whether the numbers make sense, but they're at least plausible...
(In reply to comment #51)
> It would be possible to force (eg) jsengine to route
> allocations through mozalloc, but that would create a new intermodule
> dependence.  And there's a significant amount of heap allocated by
> libX11.so.6.3.0 and libxcb.so.1.1.0 which we can't reroute at the
> source level.
> 
> One ungood kludge would be to write a shared object redefining malloc,
> free, etc and LD_PRELOAD it into the process.  That of course only
> works on Linux and perhaps MacOSX.
> 

So, two notes --- first, mozalloc has no dependencies by design, so using it in js/src would be a relatively simple matter of getting jseng-folk consent and shuffling around sources.  This would work on all platforms.  Second, we already use linker tricks on linux and windows (!!) to replace system malloc with jemalloc, so we could re-use those hacks to point malloc at a version with your instrumentation.  This would take a bit of work.  Mac would be left out.

(In reply to comment #53)
> bug 559263 has some changes to make us wrap malloc more thoroughly with
> mozalloc. I'm not sure if those apply on all platforms or just Android. cjones
> would be the one to talk to about mozalloc hijinks, in any case.

The android code uses GNU ld's --wrap, so linux only (for all intents and purposes).
(In reply to comment #56)
> shuffling around sources.  This would work on all platforms.  Second, we
> already use linker tricks on linux and windows (!!) to replace system malloc
> with jemalloc, so we could re-use those hacks to point malloc at a version with
> your instrumentation.

I tried with --enable-wrap-malloc, but got into various build system/linkage
difficulties, and gave up.  Sticking with the source route for now.
(In reply to comment #51)

> Built-in intercepts in mozalloc give a peak heap use of 29.4M.
> Massif says 65.0M.

I routed js_malloc, XPT_ArenaMalloc and the allocator in db/sqlite3
through mozalloc.  This significantly improves coverage.  For a
startup of Fx with about 100ish tabs, Massif gives a peak C++ heap
allocation size of 342.9MB, and the mozalloc interceptor sees 252.0MB
of it (perhaps a bit more).  That's 73.5%.  Most of the rest escapes
via PR_Malloc, and it might be possible to reroute a couple of its
high-roller callers to mozalloc, so a figure of 80+% is possible, I'd
say.

That's sufficiently encouraging that I'd like to turn my collection
of experimental hacks into a proper patch.  However, I have one
structural problem unsolved.  The interceptor (which is in mozalloc)
really needs a lock/unlock facility.  Right now I'm kludging it by
calling pthread_mutex_{lock,unlock} directly; obviously that won't
work on Windows.

I need to use the locking facilities in NSPR instead, but mozalloc
doesn't depend on NSPR.  I could I guess copy JSThinLock into
mozalloc, so as to keep mozalloc standalone, but that's not ideal.
Any suggestions?
jslock.{cpp,h} depend on NSPR -- a thin lock can fatten into something requiring a PRLock.

If you just need a mutex, but also must stand alone, best to use raw Windows API and ifdef. Shouldn't be too bad.

/be
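Following that suggestion, a dependency-free lock could be ifdef'd roughly like this (a sketch, not the eventual patch): raw Win32 CRITICAL_SECTION on Windows, pthreads everywhere else, so mozalloc stays standalone.

```cpp
// Minimal standalone mutex for mozalloc: no NSPR dependency.
#ifdef _WIN32
#include <windows.h>
class ProfLock {
    CRITICAL_SECTION cs_;
public:
    ProfLock()  { InitializeCriticalSection(&cs_); }
    ~ProfLock() { DeleteCriticalSection(&cs_); }
    void lock()   { EnterCriticalSection(&cs_); }
    void unlock() { LeaveCriticalSection(&cs_); }
};
#else
#include <pthread.h>
class ProfLock {
    pthread_mutex_t mu_ = PTHREAD_MUTEX_INITIALIZER;
public:
    void lock()   { pthread_mutex_lock(&mu_); }
    void unlock() { pthread_mutex_unlock(&mu_); }
};
#endif
```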
Standalone C++ heap profiler, patch against m-c 46349:4aadb903f941
(Mon Jun 28 09:46:49 2010 -0700).

Provides who-allocated-what-in-the-C++-heap analysis without external
assistance.  As per comment #51, covers about 80% of the C++ heap
volume.  Of that 80%, about 95% of the volume is tagged by allocation
point, using the schemes in comment #41 and comment #36.

In a quick experiment with 6 tabs, it is tracking about 42MB of
blocks, whereas "about:memory" lists about 12.5MB at that point.

Linux only for the time being.  Working on extending it to MacOS and
Windows.  Should be jemalloc-agnostic, although that is untested.

HOW TO USE:

Profiling is enabled/disabled at Fx startup, by default disabled.
To enable, start with env var MOZALLOC_HEAPPROF set to anything.

When running, it monitors mallocs/frees (etc).  Every 20000 such
calls, it prints a summary line showing the current total volumes:

  1456357/1183643/945M total allocs/frees/totalloc, 272714/40802k live blocks/bytes

Periodically the (tracked part of the) heap is sampled and a census
printed, one line per allocation point ("tag"), showing the number of
blocks and bytes billed to each tag.  A census is performed when the
maximum volume has expanded by more than 4% (over 50MB) or by 2MB
(under 50MB).  This means that you are guaranteed to have a census
within 4% of peak residency, and it will be the most recently
displayed one.
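The census trigger described above reduces to a small predicate; a sketch under the stated thresholds (names invented):

```cpp
#include <cstddef>

// Take a new census when the heap high-water mark has grown by more than
// 4% (above 50MB) or by 2MB (at or below 50MB) since the last census.
inline bool census_due(std::size_t peak_now, std::size_t peak_at_last_census) {
    const std::size_t MB = 1024 * 1024;
    std::size_t growth_needed = (peak_at_last_census > 50 * MB)
                                    ? peak_at_last_census / 25   // 4%
                                    : 2 * MB;
    return peak_now >= peak_at_last_census + growth_needed;
}
```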

There is some space overhead (probably less than 10% of the allocated
heap) when profiling is enabled, and some extra processing per
allocation.  Browser is still usable though.  The extra space overhead
is not shown in the generated profiles.  When disabled, the overheads
should be very close to zero (1 extra conditional branch per malloc
and per free).
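That disabled-path cost (one load plus one conditional branch per alloc/free) comes from gating everything on a flag read once from the environment at startup; roughly like this (names invented, a sketch of the shape rather than the patch itself):

```cpp
#include <cstdlib>

static bool g_heapprof_on = false;   // read once, before any allocation tracking

static void record_alloc(void*, std::size_t) { /* tag/track the block */ }

void heapprof_init() {
    // Enabled iff MOZALLOC_HEAPPROF is set to anything at startup.
    g_heapprof_on = (std::getenv("MOZALLOC_HEAPPROF") != nullptr);
}

void* moz_malloc(std::size_t n) {
    void* p = std::malloc(n);
    if (g_heapprof_on)               // the single extra branch when disabled
        record_alloc(p, n);
    return p;
}
```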
Attachment #445563 - Attachment is obsolete: true
For a startup/quit of the browser, with 6 tabs.
Attachment #449124 - Attachment is obsolete: true
Version that works on both Linux and MacOSX, at least for 10.5.8,
32-bit build with gcc-4.2.  Also, rebase to m-c 46969:d9253109fd88
(Thu Jul 01 14:40:49 2010 +0200).
Attachment #454639 - Attachment is obsolete: true
Now with win32 support (at least, works on WinXP, MSVC9,
debug and release builds).  Patch relative to m-c 46969:d9253109fd88
Attachment #455470 - Attachment is obsolete: true
This now works on the biggest 3 targets.  I'd like to move it towards
land-ability.  At this point, some level of feedback/review seems to
me necessary.

Patch provides functionality as described in comment #60.  However,
there are various kinds of rough edges:

(from the user's aspect)

* how should profiling be enabled?  atm env var MOZALLOC_HEAPPROF
  is tested at startup; if set, profiling is enabled.  This doesn't
  strike me as good.  I thought about using a pref, but really we
  want the profiler to be enabled or disabled right from process
  start.

* where should output go?  currently it's sent to stderr
  (sending it to stdout breaks jsengine building).

(from the Fx code base aspects)

* Patch routes much more malloc/free traffic through mozalloc
  than previously.  This means these modules acquire a new
  dependency on mozalloc:

  * js
  * db/sqlite3

  I'm sure this totals standalone builds of js (haven't tried).
  Even with those routed through mozalloc, we only capture about
  80% of heap allocation, so potentially more modules should be
  routed through mozalloc.  [IMO, it no longer makes sense to have
  mozalloc separate from NSPR; they should be combined.]

* When not profiling, there is an extra overhead of 1 memory
  reference (read) and 1 conditional branch per alloc and free.
  IMO this is insignificant in that (according to my own measurements)
  the behaviour of Fx in big runs is on the order of 1 malloc and
  1 free per 50000 instructions.

* As per 3rd para in comment #60, supposing we wanted to add more
  allocation tags to cover the 5% of the heap that is currently
  tagged as UNKNOWN (we know an allocation is happening, but we 
  don't know who is doing it).  Finding the point in the sources to
  annotate by hand requires some auxiliary mechanism to get a
  stack trace.  I have been using a hacked valgrind.

  Given that the profiler is supposed to be standalone, I can't see
  how the dependence on some external tool to "bootstrap" it can be
  avoided.  Just mentioning for completeness.
(from the Fx code base aspects in comment #64) += 

* the scheme described in comment #36 provides automatic tagging of
  allocations made by all nsCOMPtr classes.  It does this by stuffing
  a per-class override of new and delete into the NS_DECL_*ISUPPORTS
  macros.  Since there are a couple thousand of such classes, there
  will be some cost in terms of object code size.  I'll measure.
* shows profiling results in "about:heap"
* rebased to m-c 48706:a4d86c3a3494 (Mon Aug 02 01:28:10 2010 -0700)

WIP.  Not re-tested on OSX or Win.  Appears to have stability problems
(occasional heap block overwrites) that weren't there before.
Attachment #464408 - Attachment is obsolete: true
Not pretty, but possibly useful.  Output is in two parts:

* the most recent census (detailed breakdown of contents).  Since
  a census is only made when the heap high-water mark rises, this
  is always a picture of the heap at or close to the max for the run
  so far.

* periodic one-line summaries, to give some feel for changes over time.
Revised version of comment 67 patch, with about 180 allocation sites
in js/src/*.{cpp,h} (that is, calls to js_malloc/_calloc/_realloc)
hand-annotated with allocation tags.  This gives much better
resolution into jsengine's allocations.  Prior to this change the
profiler would say (eg) that lots of stuff was allocated by
js_realloc, which isn't helpful -- see screenshot in comment #68.  Now
it can say who is making those calls.

Still requires refinement to handle allocated-on-behalf-of scenarios.
Eg, saying that 4.5% of the heap is allocated by PL_DHashAllocTable is
pretty useless.  Am thinking about tagging each block with 2 strings
rather than one -- primary and secondary tags.  So then it would be
able to say, effectively: "X% of the heap was allocated by
PL_DHashAllocTable, on behalf of nsFoo::bar" (where nsFoo::bar is
using PL_DHashAllocTable to create/maintain a hash table).
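The two-tag idea could be sketched like this (hypothetical types; as noted, the real change doesn't exist yet):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <utility>

// Bill each block to a (primary, secondary) pair: who called the allocator,
// and on whose behalf, e.g. ("PL_DHashAllocTable", "nsFoo::bar").
struct TagPair {
    std::string primary, secondary;
    bool operator<(const TagPair& o) const {
        return std::tie(primary, secondary) < std::tie(o.primary, o.secondary);
    }
};

class TwoTagCensus {
    std::map<TagPair, std::size_t> bytes_;
public:
    void bill(std::string prim, std::string sec, std::size_t n) {
        bytes_[TagPair{std::move(prim), std::move(sec)}] += n;
    }
    std::size_t billed(const std::string& prim, const std::string& sec) const {
        auto it = bytes_.find(TagPair{prim, sec});
        return it == bytes_.end() ? 0 : it->second;
    }
};
```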

Still has stability problems.  I believe this is the same problem
as documented in bug 587610, and is unrelated to the profiler per se.
Attachment #464410 - Attachment is obsolete: true
jseward: given the presence of about:memory and DMD, I think we can safely close this bug. Please reopen if you disagree.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX