Closed Bug 551477 Opened 15 years ago Closed 11 years ago

Add a built-in space profiler

Categories: Core :: General, defect
Platform: x86 Linux
Importance: Not set, normal
Status: RESOLVED WONTFIX
People: Reporter: jseward; Assignee: Unassigned
References: Depends on 1 open bug
Attachments: 4 files, 6 obsolete files

The following is a placeholder bug following a discussion on irc with jorendorff, bz, ted, w/ apologies in advance if I paraphrase it wrong.

It would be nice to have some kind of space profiler facility built into Fx. It appears there is concern about the difficulty in determining what Fx is using memory for, especially in more complex use scenarios (many open tabs, many extensions, many different web domains displayed).

Initial discussion assumed that the traversal problem was somehow solvable. Specifically, we would be able to traverse the entire JS graph (easy) and also the entire C++ object graph (conservative scan?)

FTR, I am inclined to think of the traversal problem in a pretty abstract sense: we have a combined C++/JS object graph, plus some set of roots into the graph. We will have a traversal mechanism which allows us to visit all nodes in said graph, plus a query mechanism that can ask each node something about itself (what kind are you? who put you here?), although it might be that from C++ objects we can get no such data.

The question of what kind of information should or could be collected proved less clear cut. I think Jason was after this:

* for each (web) domain, total amount of storage attributed solely to that domain

* for each extension, total amount of storage attributed solely to that extension

I think I have a more general proposal for this, which I'll put in a following comment, so as to avoid cluttering this initial statement.

There may be other metrics which are interesting. Two of the most basic are "what are you" (what kind of object?) and "who put you here" (what call stack allocated you?)

One other question is: how much can we do without having to add profiling metadata word(s) to each object?
(In reply to comment #0)
> * for each (web) domain, total amount of storage attributed solely to
>   that domain
>
> * for each extension, total amount of storage attributed solely to
>   that extension

Here's a strawman generalisation, since for sure we don't want to be hardwiring concepts like "domain X" or "extension Y" into the profiler.

We have a graph G and a set of roots R, so that each 'r' in R is a pointer to some node 'nd' in G. Imagine we also have a completely arbitrary set T of tags. Each member of T characterises a root in some (arbitrary) way. Each root then has some subset of T associated with it. For example, T could be

   {"domain is bbc.co.uk", "extension is FlashBlock", "tab number is 57"}

So then we can ask the profiler: how much graph is reachable from roots which have the tag "domain is bbc.co.uk"? Or from roots which have the tag "extension is FlashBlock"? Or from roots which have both of those tags?

Is that a useful generalisation? I suspect so. Is it implementable? I don't know.
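To make the proposal concrete, here is a minimal sketch of the tag-filtered reachability query. Everything here (the Node/Root types, the function name) is hypothetical and not from any actual patch; it just shows the shape of the computation: pick the roots carrying all the requested tags, then sum bytes over the transitive closure.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical shadow-graph types; names are illustrative only.
struct Node {
    size_t bytes;
    std::vector<Node*> out;   // edges to other nodes
};

struct Root {
    Node* target;
    std::set<std::string> tags;  // e.g. {"domain is bbc.co.uk", "tab number is 57"}
};

// Sum of bytes reachable from roots carrying all of the 'wanted' tags.
static size_t BytesReachableWithTags(const std::vector<Root>& roots,
                                     const std::set<std::string>& wanted) {
    std::set<Node*> seen;
    std::vector<Node*> work;
    for (const Root& r : roots) {
        bool match = true;
        for (const std::string& t : wanted)
            if (!r.tags.count(t)) { match = false; break; }
        if (match && seen.insert(r.target).second)
            work.push_back(r.target);
    }
    size_t total = 0;
    while (!work.empty()) {
        Node* n = work.back(); work.pop_back();
        total += n->bytes;
        for (Node* m : n->out)
            if (seen.insert(m).second)
                work.push_back(m);
    }
    return total;
}
```

Note that this charges shared nodes to every matching query, which is one of the open questions: storage reachable from two differently-tagged roots gets counted under both tags.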
FWIW, I think it would be fine to do the traversal only on request, as long as we keep enough information around to perform that at any time. I suspect we'd want to be able to gather this information and display it when a user visits about:memory. Maybe it makes sense to just have the underlying API provide the data you just proposed, and then the front-end UI can figure out a way to display it that makes sense?
(In reply to comment #2)
> Maybe it makes sense to just have the underlying API provide the data you
> just proposed, and then the front-end UI can figure out a way to display it
> that makes sense?

Yep, that'd be the right approach; I suspect a lot of tools would want to use this data in different ways. Note that exposing a simple query interface would also be perfectly ok, though, since getting the data could well be expensive. But a query could also just be 'give me all the data'.
It seems ok to have an expensive query, as long as we document that the call is expensive. If a user opens about:memory because their Firefox is already screwed, it's probably not a big deal if it takes a little bit to run.
Thinking a bit about the top level implementation choices and their space requirements and intrusiveness. Seems like there's at least three different things we can ask of an object, at heap-census time:

(1) What are you? (== what kind of object?) Presumably we can find this out by examining the object itself. Does not require any extra storage.

(2) Who put you here? (== what point in the program allocated you?) This requires having extra storage associated with the object, and this is needed for the entire lifetime of the object. Usually done by having an extra word in each object, containing some kind of allocation-point tag.

(3) Why are you still alive? (== who's pointing at you, and/or from which root(s) are you reachable?) Requires constant extra storage per object (perhaps a root-set tag?) but that storage is only required transiently at census-time, not for the entire lifetime of the object.

(1) and (3) might be acceptable. (2) is massively intrusive because we'd have to change the layout of every object (or do some nasty hack like having a shadow heap containing the allocation-point data), and we'd also have to mess with every allocation point, to insert a tag into the object/shadow-heap.

Note that (3) is a generalisation of the tags idea in Comment #1.
Gregor has been working on (2).
(In reply to comment #6) Is there a tracking bug, or some other place I can learn about what he's doing?
(In reply to comment #7)
> (In reply to comment #6)
> Is there a tracking bug, or some other place I can learn about
> what he's doing?

There is bug 547327. My approach is to monitor objects that are created within a loop and try to optimize the allocation size for further objects that are created at the same program location.
Julian, is there anything I can do to help move this along? Do you need your car washed? Meals cooked? Heaven and earth moved? Is there anybody you want rubbed out? Andreas maybe? Andreas, we need this feature, you're going to have to take one from the team.
I meant "for the team", but either way.
Ted pointed me to this bug from here:
http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/f2008793142b9abb/49e00f7d41b70ce1?hl=en&lnk=gst&q=memory+profiling+firebug#49e00f7d41b70ce1

One of the most requested features for Firebug is support for web page memory profiling. Recently we have tried to implement a prototype (based on the memory API included in the Jetpack prototype). You can see more here:
http://getfirebug.com/wiki/index.php/Firebug_Memory_Profiler

The first question: what is the relation between the existing Jetpack APIs and this bug? Will Jetpack be used as a base for the solution?

---

So far, there is no way to get precise size information (raw bytes) for e.g. strings, which would be a nice addition to the size info already provided. Some other more structured size info (related to images, HTML elements, etc.) would also be useful.

The prototype we have done for Firebug showed us that one of the most useful things (besides the actual amount of memory consumed) is the list of referents (comment #5 - point #3). We have managed to get such a list (using Jetpack APIs) since it provides a way to access the entire graph of JS objects in the current Firefox runtime. Note that we needed to filter the graph in order to get JS objects only for the current page (which is how Firebug works, always focusing on the current page), so having some kind of query interface (as mentioned in comment #3) would help.

Also, I agree with Vladimir that various tools would want to present the data to the user in different ways. For example, in the case of Firebug we also want to integrate the memory info with existing panels.

The second question. As I mentioned, using the Jetpack APIs it's relatively easy to get the entire graph of meta-data, where each node has its unique ID. The problem is that there is no way to get a JS object's ID and/or to get an existing JS object according to the provided ID (I know that JSD provides an API to get these IDs for script objects: jsdIScript.tag), which complicates the "JS object <-> meta-data" relation. In Firebug it would be extremely helpful to show the meta-data in the context of real objects on the page (e.g. in the Watch panel, DOM panel, etc.)

Could such methods (like getJSObject(id), getJSObjectID(obj)) also be part of the memory API?

Honza
> The first question: what is the relation between the existing Jetpack APIs
> and this bug? Will the Jetpack be used as a base for the solution?

I had a look at the Firebug profiler links. It looks like a nice presentation and query environment. I think we're approaching the same problem from opposite ends, though, and I don't think I can answer your questions at present. Here's what I'm looking into.

My goal atm is low level enumeration of the graph of objects, with the short-term aim of maximising coverage. I use "object" in the loosest sense here, to mean any lump of memory to which anyone has a pointer or some kind of reference. So I am aiming to cover the JS heap, the C++ heap, and not just stuff under control of the cycle collector, but ad-hoc malloc'd memory too.

Given the great diversity of objects that covers, I am thinking about the idea of building a "shadow graph" -- a throwaway structure which is (mostly) constructed when we want a memory census, analysed, and discarded. The shadow graph is composed of shadow nodes, each of which is:

   start-address
   length
   arbitrary-description
   set-of-out-pointers

plus a set of pointers which serve as roots. At this stage the arbitrary-description can be minimal or nothing.

The shadow graph can be incomplete or inaccurate. Basically we forage around for information to build it with, then hand it off to an analyser. The analyser is tolerant of missing pointers, pointers to the middle of nodes, pointers to static objects, apparent inconsistencies like nested or overlapping blocks, etc.

The graph would be constructed, for JS objects, by a hacked version of JS_DumpHeap. For C++ objects I am thinking to add hooks to memory/mozalloc/mozalloc.h which observe all mallocs/news and frees/deletes. At analysis time any such blocks would be scanned to find pointers to other blocks.
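As a reading aid, the shadow node just described might look like the following. This is a sketch under my own naming assumptions, not code from the work in progress; the interesting part is resolving an arbitrary word to the block containing it, tolerating interior pointers as the analyser must.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Minimal sketch of a shadow-graph node; field names are illustrative.
struct ShadowNode {
    uintptr_t start = 0;             // start address of the block
    size_t length = 0;               // block length in bytes
    std::string description;         // may be empty at first
    std::vector<uintptr_t> out;      // candidate pointers found in the block
};

// The graph is a map from start address to node, plus a root set.
struct ShadowGraph {
    std::map<uintptr_t, ShadowNode> nodes;   // sorted by start address
    std::vector<uintptr_t> roots;

    // Resolve a candidate pointer to the node containing it, tolerating
    // interior pointers; returns nullptr for values hitting no block.
    ShadowNode* Resolve(uintptr_t p) {
        auto it = nodes.upper_bound(p);
        if (it == nodes.begin()) return nullptr;
        --it;
        ShadowNode& n = it->second;
        return (p >= n.start && p < n.start + n.length) ? &n : nullptr;
    }
};
```

The sorted-map lookup is what makes "scan each block for words that might point into other blocks" cheap: one `upper_bound` per candidate word.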
This has various advantages:

* it gives coverage of C++ and JS objects

* without having to get involved with the Cycle Collector

* all the analysis code can be in one place, rather than scattered around

* the analysis code does not need to be aware of the layout of any of the objects

* the only code that does need to be aware of object layouts (and be scattered around) are "abstractor methods", you might say, which take (eg) a JSSquirrellyThing* and give back one or more shadow nodes that represent it

* we can incrementally increase the level of detail in the arbitrary-description field as required to support more detailed queries

* it extends naturally to anything else we write an "abstractor method" for, eg, mmap'd blocks

* initially the description field can be empty. Even like that we get basic stats on the numbers and sizes of allocated objects, plus reachability information.

* the shadow graph doesn't need to be constructed all at the same time, iow we don't necessarily need to stop the entire world to do it. We can do analyses of incomplete or inconsistent graphs if need be.

* as soon as the shadow graph is built, the system can continue. We don't have to wait around for expensive reachability or whatever analyses to complete. That can be done by a separate thread. Or even by a different process, offline.

* when profiling is inactive it's free in space terms and almost free in time terms. "Almost" because it would still require an extra conditional branch on the mozalloc malloc/new functions to decide not to collect the allocated address/lengths.

Anyway. If anyone thinks this is the wrong thing to build, or simply won't work, please yell asap.
(In reply to comment #12)
> Anyway. If anyone thinks this is the wrong thing to build, or simply
> won't work, please yell asap.

If it weren't you saying it, I would say, "you're vastly underestimating the complexity of what you're proposing." But since it *is* you, I'll say, "It sounds too good to be true."
I don't understand what you mean by "abstractor methods", and what they would do.
(In reply to comment #14)
> I don't understand what you mean by "abstractor methods",

I mean a function which takes an object and returns its essential details for inclusion into the shadow graph. For example, given a pointer "p" to one of these

   struct Foo {
      int a;
      Blah* b;
      char c;
      D* d;
   }

the abstractor function must return at a minimum these 3 items:

   p
   sizeof(*p)
   {p->b, p->d}

that is, the address range of the object and the set of pointers found within it. That is the minimum info needed to build the shadow graph.

The point about intercepting moz_xmalloc et al is that we can extract the above info automatically, for malloc'd blocks. This is just as well since it would obviously be totally infeasible to manually annotate every C++ constructor in the tree. When a malloc-style allocation is made we note the returned pointer and the requested size. That means for a 2 words/block overhead we get a list of all the blocks in the C++ heap. Then at analysis time we can scan each block to find any pointers to other blocks it might hold. This is really just conservative C/C++ garbage collection stuff.

I just verified that I can stick a hook in moz_xmalloc et al (thanks cjones!) and see all mallocs and frees going by. So this part of the plan at least looks like it might work.

For JS objects and other stuff in the JS heap I will need to write custom abstractor functions. But that's not such a big deal because there's only a limited set of object formats to consider.
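Spelled out as compilable code, the Foo abstractor from the comment above would be something like this. The `Abstracted` record type is my own invention for illustration; the three returned items are exactly the ones listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Blah { int x; };
struct D { int y; };
struct Foo { int a; Blah* b; char c; D* d; };

// Minimal abstractor result: address range plus outgoing pointers.
// (Hypothetical type, for illustration only.)
struct Abstracted {
    void* start;
    size_t length;
    std::vector<void*> out;
};

// A hand-written abstractor for Foo, per the sketch in the comment.
static Abstracted AbstractFoo(Foo* p) {
    return Abstracted{ p, sizeof(*p), { p->b, p->d } };
}
```

For malloc'd blocks no such hand-written function is needed: the interposed allocator already knows the start and length, and a conservative scan supplies the out-pointer set.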
Julian, how much of a problem would tagged pointers be for this scheme? We use them in some places in the DOM, and I've been trying to decide whether we should stop doing so... Does it matter for this setup one way or another?
(In reply to comment #16)

I don't think that would be a problem -- those bits can just be masked off. It weakens a bit the heuristic used to decide whether something could be a pointer to a block, because we can no longer reject misaligned values as can't-possibly-be-a-pointer. But I don't think that's a particularly safe heuristic anyway, since it would assume that nobody in the entire system uses tagged pointers. Which I don't believe. So the short answer is: it makes no difference.
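The masking itself is a one-liner. A sketch, assuming tag bits live in the low word-alignment bits (the mask width is an assumption; real tagging schemes vary):

```cpp
#include <cassert>
#include <cstdint>

// Strip possible tag bits before treating a word as a candidate pointer.
// Assumes tags occupy the low 2 bits of a 4-byte-aligned address; a real
// scanner would use the alignment of the heap in question.
static uintptr_t StripTagBits(uintptr_t w) {
    return w & ~static_cast<uintptr_t>(0x3);
}
```

The cost noted above is that after stripping, every misaligned value looks like a plausible pointer, so the scanner must rely on the block map alone to reject non-pointers.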
As a devil's advocate type question, what is the advantage of this approach versus forking an inert copy of the program, or performing a non-fatal core-dump and then probing that static memory snapshot using a python-enabled gdb? Assuming debug symbols are available, gdb can introspect objects and tell when a pointer in memory amounts to an nsCOMPtr/nsRefPtr versus a presumed invariant-controlled backlink, and otherwise get a lot of structure knowledge for free.

The 'self-introspection' goal could be satisfied by taking a page from roc's chronicle playbook and having the python/gdb driver expose an HTTP loop that consumes and returns JSON. One might even argue that it's cleaner because the patient is never performing analytical surgery on itself.

In any event, this is very exciting work you are undertaking and I look forward to the results! You may be interested in trace-malloc's leak-soup logic that does some of the conservative pointer analysis stuff already, apparently:
http://mxr.mozilla.org/mozilla-central/source/tools/trace-malloc/
gdb, what's that? I'm a Windows-based JS developer who wants to know where all my memory is going! /be
The use case I proposed when I badgered Julian into filing this bug was "I'm a normal user, and my Firefox is using a lot of memory. How can I find out what's using all that memory?" Ideally we will provide an API that things like about:memory can use which could give a rough idea of how much memory is entrained by each extension and web site. We need to give users tools so that instead of saying "Firefox uses too much memory" they can say "AdBlock uses too much memory when I visit cnn.com". (Then, ideally, the parties responsible can fix their code!)
(In reply to comment #20)
> [...] they can say "AdBlock uses too much memory when I visit cnn.com".

Yes, I completely agree that's where we want to get to eventually, and yes it will no doubt require an API to query the collected data. But first we need the low-level info-collecting machinery in place, since for sure I don't see how we can answer any interesting question without having at a minimum a graph, roots and reachability analysis.

Also .. I'm reluctant to build a profiler which only has partial coverage (eg, JS heap only), hence the thinking about the C++ heap at this early stage.
One thing I'd like to ask the assembled wisdom here is: when should a heap census be performed? My current plan is:

* immediately after the cycle collector has run, and
* immediately after JSGC has run

From sticking printfs in various places it appears that a run of CC necessarily causes a run of JSGC (yes?) but not vice versa. So I'm inclined to hack up some logic which does a census after every CC run and after every JSGC run that isn't bracketed by a CC run.

Better suggestions / your-analysis-is-bogus style comments welcomed.
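The bracketing logic proposed above is small enough to sketch directly. This assumes a flag is set for the duration of a cycle-collector run (names are hypothetical): census after every CC, and after every GC that was not driven by a CC.

```cpp
#include <cassert>

// Sketch of the proposed census trigger: one census per CC run, plus one
// per standalone JSGC run (a GC not bracketed by a CC). Assumes we can
// observe CC begin/end and GC completion.
struct CensusPolicy {
    bool inCC = false;
    int censuses = 0;

    void BeginCC() { inCC = true; }
    void EndCC()   { inCC = false; DoCensus(); }
    void AfterGC() { if (!inCC) DoCensus(); }  // skip GCs driven by a CC
    void DoCensus() { ++censuses; }            // stand-in for a real census
};
```

If the "CC always implies a GC" observation turns out to be wrong, this degrades gracefully: a CC without a nested GC still triggers its own census at EndCC.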
Right, I'm not suggesting you need to implement all that, just the underlying bits! It was more in response to Andrew's "why not just use gdb + python" question.
I raised the gdb solution because it seemed like a way to avoid a slippery slope on the C++ analysis front, and because forking seems like the fastest way to get the system running again. Specifically, implementing the abstractor methods seems like manually reproducing the work that debug information already provides, and using gdb is much less risk-prone than rolling your own DWARF parser. (Of course, gdb-python or dehydra/treehydra could be used to automatically generate those if the need arose.)

It obviously is not the ideal long-term solution for a web developer focused tool. It might be reasonable for a mozilla developer tool, especially as it could be viable for crash-dump analysis. Mainly, I was thinking it could be reasonable as a simpler-to-develop prototype, given that I believe Jim Blandy already has the JS side of the house covered for gdb-python ( http://hg.mozilla.org/users/jblandy_mozilla.com/archer-mozilla/ ). The fundamental assumption that the python-gdb way is simpler might be very wrong on my part...

fwiw, I believe gdb can run on windows, but I am unclear as to whether it understands the MS compiler's debugging symbols format, which would be the most important thing. In any event, an in-tree solution is even more exciting and useful than an external mechanism with dependencies; I eagerly await the fruits!
I have working, a first cut at machinery that creates a combined C++/JS shadow graph, as per proposal in comment #12. Will clean it up and post a wip patch later this week. This is still far from being useful. For a Fx idling and displaying news.bbc.co.uk, there are about 88k nodes in the graph, of which 84k are C++ and 4k are JS.
Depends on: 563305
What this patch demonstrates:

* creation of a combined C++/JS shadow graph
* done periodically, at the end of some JS garbage collections
* thread-safe-ly, more or less, and without deadlocking the system
* reasonable performance (resulting browser is quite usable)
* without the profiler's own dynamic memory allocation skewing the results

Example output: all C/C++ heap allocations via mozalloc are tracked, and running stats of live blocks maintained. Periodically a 1-liner summary is printed:

   HEAPPROF 20747: CH: 1605639/1444361/170M total adds/dels/alloc, 161278/14595k live blocks/bytes

Distribution of ages at death is also computed but not displayed.

After some but not all JS GCs, a full profile is initiated. A shadow graph is built, cleaned up, and some simple stats on nodes and edges computed:

   HEAPPROF 20747: PROFILE #8: BEGIN creating shadow graph
   HEAPPROF 20747: profile #8: analysis begin
   HEAPPROF 20747: raw: 161363 CH blocks, 4200 JS blocks, 207 dups
   HEAPPROF 20747: initial sanity checks ok
   HEAPPROF 20747: refinement done, 434094 mapped, 807262 unmapped
   HEAPPROF 20747: objects/bytes: 4003/46752 JS, 161353/16426453 CH
   HEAPPROF 20747: edges: 279 JS->JS, 225 JS->CH, 5647 CH->JS, 315904 CH->CH
   HEAPPROF 20747: outdegree: 504 from JS, 321551 from CH, 322055 total
   HEAPPROF 20747: indegree: 5926 to JS, 316129 to CH, 322055 total
   HEAPPROF 20747: 18183 objects of 165356 total unreferenced from within the SG
   HEAPPROF 20747: PROFILE #8: END

What next: As it stands, this is more or less useless. It mechanically extracts low level reachability, but knows nothing about each node.

One direction is to try to answer "what kind of objects are in the heap?" questions. This could be provided by the JS world without difficulty. For the C++ heap we might be reduced to guessing object types by looking for pointers to vtables within the objects.

Another direction is to answer "who put these objects in the heap?" questions. This would require tagging each object with its allocation point. For the C++ heap this could possibly be done by getting stack traces from Breakpad.
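The "running stats of live blocks" line above implies a small bookkeeping layer around the allocator. A sketch of that layer under my own naming assumptions (the real patch hooks mozalloc; this just shows what the adds/dels/alloc and live blocks/bytes counters track):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>

// Hypothetical block tracker behind a HEAPPROF-style summary line:
// wrap malloc/free, keep a map of live blocks, maintain running totals.
struct HeapTracker {
    std::map<void*, size_t> live;
    size_t adds = 0, dels = 0, totalAllocated = 0, liveBytes = 0;

    void* Alloc(size_t n) {
        void* p = malloc(n);
        if (p) { live[p] = n; ++adds; totalAllocated += n; liveBytes += n; }
        return p;
    }
    void Free(void* p) {
        auto it = live.find(p);
        if (it != live.end()) { liveBytes -= it->second; live.erase(it); ++dels; }
        free(p);
    }
};
```

The live-block map doubles as the raw material for the shadow graph: at census time each recorded (address, length) pair becomes a node, and its contents are scanned for pointers into other recorded blocks.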
(In reply to comment #26)
> Another direction is to answer "who put these objects in the heap?"
> questions. This would require tagging each object with its allocation
> point. For the C++ heap this could possibly be done by getting
> stack traces from Breakpad.

How about causality tracking? For example, when a DOM click event gets generated it would generate a 'frame' that says "DOM click calling foo.js/func bar()/line 53" and uses that as the tag. When that event-triggered JS initiates a setTimeout, it creates a new frame to use when the setTimeout gets dispatched and links to the currently active (event-origin) frame. In order to avoid creating the world's longest linked list, a setTimeout from inside a setTimeout callback would reference the original progenitor instead and boost a counter to reflect that it's the result of a chained setTimeout. It seems like they could just be JS objects and the normal garbage collector would take care of it?

The same mechanism could potentially be exposed to C++ consumers, with default logic being provided for the event loop/native nsITimers/other low-hanging fruit using breakpad (or external address->symbol resolution). The startup timeline does some of this, and I did a bit more for my recently started and regrettably currently triaged/abandoned Thunderbird memory work. I even went so far as to push state on all XPConnect C++->JS crossings, although I also was interested in the execution and its timing as well as the memory costs.
FYI, Breakpad can only produce stack traces after-the-fact. (We don't ship the debug symbols with the app, which it needs to walk the stack.)
We could expose the symbols via a web service (I don't think the symbol server protocol is fine-grained enough for this) so that the client could request a bunch of symbol/frame factoids and do the walking, if we wanted to. We have the information, can make all sorts of tradeoffs about how to get it to the user, including just pulling them all down the first time the user wants to walk some stack.
Sure, we can do anything (it's just software, right?) I'm just pointing out that as currently implemented, Breakpad isn't really a good fit for the task.
(In reply to comment #29)

For sure we can't do anything much interesting with the C++ heap without having symbols or unwind data. Given those two I think we could answer what's-in-the-heap and who-put-it-there questions, which would be a good start. For x86-{linux,darwin} with -fno-omit-frame-pointer (the default) we might be able to get away without unwind data, but pretty much all other platforms require it.

I like the pre-canned factoid idea for a couple of reasons:

* With careful design of the factoids, they can be platform independent. Almost all of the hassle about unwinding the stack and producing traces is in reading the debuginfo. So this scheme minimises the extra complexity compiled in to the browser, putting it instead in the factoid server.

* At least for ELF/Dwarf3, the frame unwind info is stored in a form which is compact but not suitable for fast unwinding, so the Dwarf3->Factoid server could reformat it to be more suitable for fast unwinding.(*)

There's lots of stuff on the web about Microsoft symbol servers, but I don't see anything for other platforms. Do cross-platform versions of the technology exist?

(*) partially evaluate the Dwarf3 CFI for each address range, producing, for each range, a canned summary of how to compute the integer registers of the calling frame from the values in this frame.
Didn't Vlad do some factoid-like work already? /be
I accidentally opened a dupe on this topic in order to point out http://www.eclipse.org/mat/ , which from my perspective offers a sort of high-point of the UI you might want. (Note in particular the pleasantly informative choice of showing dominator trees)
(In reply to comment #26)
> For the C++ heap we might be reduced to guessing object
> types by looking for pointers to vtables within the objects.

I suspect you can get pretty far with this sort of thing. There are limits of course, but I think you can get a fair bit out of the existing XPCOM infrastructure. Could dig in an allocation looking for a pointer to nsIClassInfo (https://developer.mozilla.org/en/Using_nsIClassInfo), and/or could tailor a small non-vtbl mixin class for decorating other non-XPCOM / non-virtual types with a pointer to a tiny amount of metadata (particularly: "your module and type name, an offset to a member pointer that's your owner, and/or offset to a member pointer to an object from which you inherit a sense of site-origin").

If you can infer an ownership-tree out of the graph dominators, a few unidentified intermediate nodes along a chain leading to, say, self-identifying image data, DOM nodes or JS objects probably isn't a big hazard to comprehension.

I'd focus on this before worrying about the "who allocated you" side of the equation, and building a stack-crawling system. Just getting a characterization of the contents of a really bad heap would be a good start. The nice thing about bundling this in the standard build is we'll get a whole lot more sampling from the field, cases where sore spots are really standing out, not just "developer sandbox testcases".
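The vtable-guessing idea reduces to: read the first word of a block and look it up in a table of known vtable addresses. A sketch, with the table hand-built here; in reality it would come from symbol data, which is exactly the sticking point discussed later in this bug.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Sketch: classify a heap block by matching its first word against a
// table of known vtable addresses. The table contents are placeholders;
// a real implementation would populate it from symbol information.
static std::string GuessClass(const void* block,
                              const std::map<uintptr_t, std::string>& vtables) {
    uintptr_t firstWord = *static_cast<const uintptr_t*>(block);
    auto it = vtables.find(firstWord);
    return it != vtables.end() ? it->second : "(unknown)";
}
```

Note the limits called out later in the thread: this says nothing about plain malloc'd buffers or classes with no virtual functions, since those blocks carry no vtable pointer at all.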
(In reply to comment #35)

I tried for a while to do a quick hack to get ELF symbol information into the process, so as to see if any info can be scraped out of heap blocks. But got nowhere; dlsym on Linux won't say anything about private data symbols, which most vtables are. Plus it's unportable and no use for field runs. I'm sure there's a benefit to be had from reading debug info, but it seems like it'll take serious integration with breakpad (in some way) to make it workable and cross-platform.

> but I think you can get a fair bit out of the existing XPCOM
> infrastructure. Could dig in an allocation looking for a pointer to
> nsIClassInfo,

At profile time I am merely faced with a collection of blocks that malloc/new have noted. I don't know which of those I can safely run QueryInterface on.

What would be nice is for the compiler to automatically pass a uniquely identifying text tag to each allocation. I can't find a way to do that in the general case. But for classes that use XPCOM, we're in luck. Every participating class has NS_DECL_ISUPPORTS, NS_DECL_CYCLE_COLLECTING_ISUPPORTS or NS_DECL_ISUPPORTS_INHERITED in its declaration. By adding to those macros the following

   public:                                                           \
   static void* operator new (size_t size) {                         \
      return moz_malloc_tagged(size, (char*)__PRETTY_FUNCTION__);    \
   }                                                                 \
   static void operator delete (void *p) { moz_free(p); }            \

it is possible to make a class-specific override of new/delete, which passes __PRETTY_FUNCTION__ onwards, and that string contains the class name.

Wiring all that together makes it possible to generate stats like this (edited to make it less wide):

   objects/bytes: 118343/12905257 in CH, 29901/2965424 in CH-tagged
   [ 0]  objects/bytes:  2000 / 480000  sStandardURL
   [ 1]  objects/bytes:  4251 / 408096  CSSStyleRuleImpl
   [ 2]  objects/bytes:  3165 / 278520  nsTextNode
   [ 3]  objects/bytes:  2369 / 265328  XPCWrappedNative
   [ 4]  objects/bytes:   876 / 147168  nsEventListenerManager
   [ 5]  objects/bytes:  1533 / 147168  nsXULElement
   [ 6]  objects/bytes:  2962 / 118480  AtomImpl
   [ 7]  objects/bytes:   673 /  91528  nsHTMLAnchorElement
   [ 8]  objects/bytes:   237 /  45504  nsLocalFile
   [ 9]  objects/bytes:   156 /  41184  imgRequest
   (many lines deleted)

This says: in the C++ heap there are currently 118343 blocks comprising 12905257 bytes. Of these, 29901 blocks (2965424 bytes) have a tag. That is followed by a per-class breakdown for the tagged blocks, most voluminous first. The full profile has 447 lines, so at least 447 classes have been covered.

What this also shows is that only about 1/4 of the blocks are identified. Since it covers (almost) all XPCOM classes, this gives a quick upper bound on what coverage we can get from XPCOM. It would be nice to expand the coverage. I am wondering if adding a single line to other allocation points would not be a bad tradeoff. It could be done incrementally.
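A standalone demonstration of the trick above: a class-specific operator new whose __PRETTY_FUNCTION__ expansion embeds the class name (GCC/Clang behaviour). The class name and the `malloc_tagged` stand-in for moz_malloc_tagged are both hypothetical; the point is only that the tag string carries the class name for free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Stand-in for moz_malloc_tagged: record the tag of the last allocation.
static std::string g_lastTag;

static void* malloc_tagged(size_t size, const char* tag) {
    g_lastTag = tag;   // a real profiler would store this per block
    return malloc(size);
}

// Hypothetical class showing the class-specific new/delete override.
// In a class-static operator new, __PRETTY_FUNCTION__ expands to
// something like "static void* nsExampleWidget::operator new(size_t)",
// which contains the class name.
struct nsExampleWidget {
    int payload = 0;
    static void* operator new(size_t size) {
        return malloc_tagged(size, __PRETTY_FUNCTION__);
    }
    static void operator delete(void* p) { free(p); }
};
```

Because the macro expansion happens once per class declaration, every XPCOM class picks up a distinct tag with no per-class hand editing, which is what makes the approach scale.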
Julian, I hate to suggest this, but I think the biggest benefit here would be to get this working under Win32 initially; there at least I know that the pdb files have all the info we need to resolve vtable pointers, and the dbghelp library makes doing that straightforward.

I went off on a bit of a tangent yesterday and wrote a quick tool to walk the entire Win32 heap (win32, not jemalloc), dumping the first 4 bytes of each allocation for later vtable analysis. It returns all that to js, and then dbghelp or other tools (map files also come to mind -- we could make this completely cross platform if we standardized on one map file format) can be used via ctypes/etc. to translate to symbol names. I'll get that patch up and cc you.
(In reply to comment #37)
> Julian, I hate to suggest this, but I think the biggest benefit here would
> be to get this working under Win32 initially; there at least I know that
> the pdb files have all the info we need to resolve vtable pointers, and the
> dbghelp library makes doing that straightforward.

I think this is a really bad idea. This should be a tool in *every build* of the program. Not "some developer builds, sometimes, when they happen to have set things up right". You should always be able to take a browser -- when it's up at 1gb resident after a week of browsing on a normal desktop and you don't know why -- and just ask it "why?"

The problem we face here is that in "our testing", when we do it on talos or under controlled conditions, we have these lovely graphs that say we don't use much memory and we always release it and everything is lovely. And then users in the field keep posting back screenshots from week-long sessions with a few dozen tabs and a modest set of extensions saying "no, you don't, it's really out of control and my system is paging again and it's firefox's fault". Then we turn around and point at our pretty graphs and tell them they must be imagining things because it's well-behaved when we tested it.

We need to close that gap, and that means gathering better *field data*.
I think we're saying the same thing -- I'm pointing out that 90% of our users are on win32, so getting that data would be a lot more useful than gathering linux data, and that there's a standardized way to obtain all the symbol data that we'd need via the symbol server, plus a simple library that we can use to query that data. All stuff that makes it possible to actually gather field data :-)

We could also look into teaching breakpad how to spit out text map files so that we have a standard "symbol data" format without having to write platform-specific code to get it as well. But right now, as far as I know, win32 is the only platform where we have infrastructure to say "give me symbols for release/nightly build with signature abcd".
(In reply to comment #39)
> I think we're saying the same thing -- I'm pointing out that 90% of our users
> are on win32, so getting that data would be a lot more useful than gathering
> linux data,

I understand what you're saying, and I'm not disagreeing. However, I'm not trying to gather linux data. I'm trying to see how far it's possible to get without any external assistance, a la Graydon's comments. I'm inclining more towards a source-oriented approach -- what can we do to get the compiler to add this info automatically? For one thing, vtable scraping is going to fail to tell us anything for straight malloc'd blocks, or for classes in which no vtable is required.
(In reply to comment #36)
> What this also shows is that only about 1/4 of the blocks are
> identified. Since it covers (almost) all XPCOM classes, this gives a
> quick upper bound on what coverage we can get from XPCOM.

25% coverage of the heap is pretty useless. I wondered how many other allocation points would have to be hand-annotated to improve that significantly. Problem is there are thousands of allocation points (constructors, mostly).

On the assumption that allocation probably follows a 90/10 rule, I bootstrapped the process by hacking the profiler to record the top 5 frames of each allocation, commoned those up, and looked at the sources corresponding to the 5-frame groups that allocated the most. This tells me where to put hand annotations so as to maximise coverage of heap contents.

An afternoon of doing this increased coverage to 80% of heap bytes, sometimes more, via 35 one-liner annotations, mostly like so

  MOZALLOC_PROF_TAG(this);

in constructors, since

  MOZALLOC_EXPORT void moz_prof_tag(void* ptr, const char* tag);

  #define MOZALLOC_PROF_TAG(__ptr) \
      moz_prof_tag((__ptr), __PRETTY_FUNCTION__)

A sample profile is attached (see next comment), for Fx with 7 tabs open. It shows coverage of 74% of bytes and 87% of objects for a set of 177,885 live blocks that have been allocated through mozalloc. The lines marked "static void* C::operator new..." are allocations tagged automatically by the trick in Comment #36. All others are from hand-applied annotations.
Attached file heap profile as per Comment #41 (obsolete) —
I support the standalone-from-all-side-files, works-in-product-builds, source-based approach. When in doubt, use brute force. If we have to mangle our source a bit, it could be worthwhile compared to the old manglings we put up with all over for dubious benefit.

/be
Julian, how does your addition of operator new stuff work for classes that already have NS_DECL_AND_IMPL_ZEROING_OPERATOR_NEW?
This is cool, though I suspect people will immediately say "what's the stack trace for static void* nsCSSCompressedDataBlock::operator new(size_t, size_t)?" and you'll end up reinventing Massif's call tree structure. But that's ok; if it can be done within the browser so that anyone can use it, that would be a big win.

Minor comments about the profile output format:
- Commas in big numbers (eg. 123,456) make them much easier to read;
- Please mark each entry with the percentage of the total that it represents.
(In reply to comment #44)
> Julian, how does your addition of operator new stuff work for classes that
> already have NS_DECL_AND_IMPL_ZEROING_OPERATOR_NEW?

Yes, an excellent question. Basically it doesn't. If a class overrides new and delete itself, then adding a second override in NS_DECL_ISUPPORTS causes the compiler to complain.

What I _really_ did was to rename NS_DECL_ISUPPORTS to NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE (possibly a bad name :-) and to then do this

  #define NEW_OVERRIDES \
  public: \
      static void* operator new (size_t size) { \
          return moz_malloc_tagged(size, (char*)__PRETTY_FUNCTION__); \
      } \
      static void operator delete (void *p) { moz_free(p); }

  #define NS_DECL_ISUPPORTS \
      NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE \
      NEW_OVERRIDES

(ditto with NS_DECL_CYCLE_COLLECTING_ISUPPORTS and NS_DECL_ISUPPORTS_INHERITED)

Then, everywhere where g++ complained about duplicate overrides, I changed NS_DECL_ISUPPORTS to NS_DECL_ISUPPORTS_NO_NEW_OVERRIDE (viz, back to what it was originally). From a bit of greppery, there are about 1700 uses of NS_DECL_ISUPPORTS_*, of which 23 had to be _NO_NEW_OVERRIDE'd. So not a big hit.

That said, there is some complexity I haven't figured out yet. Even after the compiler accepted everything, in a few cases possibly to do with multiple inheritance, I was seeing uses of uninitialised values somehow connected to the automatic overriding, leading to segfaults, and I had to back off to the _NO_NEW_OVERRIDE versions for those classes. Possibly about 6 cases.
(Stuff I forgot to say in Comment #46) ... consequently, classes that are marked _NO_NEW_OVERRIDE have to be hand-annotated using MOZALLOC_PROF_TAG as per Comment #41. MOZALLOC_PROF_TAG is a second way to get tag information into the profiler. It facilitates setting the tag on a block after the block has been allocated. The ideal of course is to set the tag at allocation time, a la moz_malloc_tagged in Comment #46.
(In reply to comment #42)
> Created an attachment (id=449124) [details]
> heap profile as per Comment #41

As a light diversion from the "what should we build?" question: in this profile, no less than 37% of the objects in the heap (66.6k out of a total of 177k objects) have to do with CSS. That seems a lot. Is that expected?

HEAPPROF 16701: objects/bytes: 177885/18446274 in CH, 155018/13662436 in CH-tagged

RANK   OBJECTS / BYTES    ALLOCATED BY
   0      9131 / 1827388  static void* nsCSSCompressedDataBlock::operator new(size_t, size_t)
   1     19084 / 1374048  nsCSSSelector::nsCSSSelector()
   4      8525 /  818400  static void* CSSStyleRuleImpl::operator new(size_t)
   9      8491 /  407568  nsCSSDeclaration::nsCSSDeclaration()
  12     10306 /  247344  nsCSSSelectorList::nsCSSSelectorList()
  17      6484 /  155616  nsCSSValueList::nsCSSValueList()
  27      1747 /   69880  nsCSSValuePairList::nsCSSValuePairList()
  29      1017 /   50568  nsCSSValue::URL::URL(nsIURI*, nsStringBuffer*, nsIURI*, nsIPrincipal*)
  35      1156 /   36992  nsPseudoClassList::nsPseudoClassList(nsIAtom*, nsCSSPseudoClasses::Type)
  59       354 /   16992  static void* nsDOMCSSAttributeDeclaration::operator new(size_t)
  61       114 /   15504  static void* nsCSSStyleSheet::operator new(size_t)
 103        57 /    4560  static void* CSSNameSpaceRuleImpl::operator new(size_t)
 104        81 /    4536  static void* nsCSSRuleProcessor::operator new(size_t)
 126        90 /    2880  static void* CSSImportantRule::operator new(size_t)
That seems like a large proportion, true; but you're also looking at a profile that (if I read correctly) only relates to ~18mb resident? So like .. a very small snapshot. Possibly quite un-representative of the program later on in its life. I assume this is just from a prefix of startup? Can you get a profile once it's been browsing for a while and, say, passes the 150mb resident mark? Might need to up the numbers in your static array.
> Is that expected?

It could be, depending on what's being measured. You mentioned 7 tabs. What were they? Was gmail involved, say?
(In reply to comment #49)
> That seems like a large proportion, true; but you're also looking at a profile
> that (if I read correctly) only relates to ~18mb resident?

Yes .. I was also suspicious about that. By running a profiled build on Massif, which intercepts all mallocs and frees, it's possible to see what proportion of heap allocations are routed through mozalloc and hence are intercepted.

The news isn't good. For a startup with 4 tabs, intercepting mozalloc shows the heap volume peaking at around 29.4M. Massif puts that figure at a more credible 65.0M, meaning more than half the heap allocation bypasses mozalloc and so is invisible to the built-in profiler.

I'm not sure what to do about this. It would be possible to force (eg) jsengine to route allocations through mozalloc, but that would create a new intermodule dependence. And there's a significant amount of heap allocated by libX11.so.6.3.0 and libxcb.so.1.1.0 which we can't reroute at the source level.

One ungood kludge would be to write a shared object redefining malloc, free, etc and LD_PRELOAD it into the process. That of course only works on Linux and perhaps MacOSX.

Anyway, some stats from the above run. I should add, all these figures exclude the overhead of the built-in profiler itself (about 18M), to make the comparison fair.

============
Built-in intercepts in mozalloc give a peak heap use of 29.4M.
Massif says 65.0M. Where did the rest go? Well, approximately:

  12,737,468  0x704B744: moz_calloc (mozalloc.cpp:137)
   8,484,176  0x704B81B: moz_malloc_tagged (mozalloc.cpp:105)
   6,564,989  0x704B87D: moz_xmalloc (mozalloc.cpp:91)

  p 12737468 + 8484176 + 6564989
  $1 = 27786633

So 27.8 MB was noted by Massif as going via mozalloc; that ties in at least approximately with the 29.4M figure above.
For the rest:

  11,023,595  in 735 places, all below massif's threshold (01.00%)

PR_Malloc calls malloc directly:

   4,754,242  PR_Malloc (prmem.c:467)

js_malloc calls malloc directly:

   3,832,065  js_NewScript (jsutil.h:193)
   3,478,750  JS_ArenaAllocate (jsutil.h:193)
   2,269,696  JS_DHashAllocTable (jsutil.h:193)
   1,401,504  JSScope::create(JSContext*, JSObjectOps const*, JSClass*, JSObject*, unsigned int) (jsutil.h:193)
     904,564  js_NewStringCopyN (jsutil.h:193)

system libraries presumably call malloc directly:

   3,142,400  XGetImage (in /usr/lib/libX11.so.6.3.0)
   1,571,264  ??? (in /usr/lib/libxcb.so.1.1.0)

extensions/spellcheck/hunspell/src/hashmgr.cpp calls malloc directly:

   2,685,946  HashMgr::add_word(char const*, int, int, unsigned short*, int, char const*, bool) (hashmgr.cpp:188)

  ## Hmm, does this mean we're allocating 2.7MB merely to initialise
  ## a spell checker that didn't get used in this run?

XPT_ArenaMalloc calls calloc directly:

   1,126,648  XPT_ArenaMalloc (xpt_arena.c:221)

sqlite3 calls malloc directly:

   1,038,456  sqlite3MemMalloc (sqlite3.c:12975)
(In reply to comment #50)
> > Is that expected?
>
> It could be, depending on what's being measured. You mentioned 7 tabs. What
> were they? Was gmail involved, say?

No gmail. I don't recall exactly, but I think the tabs were

  http://lwn.net/Articles
  http://www.youtube.com
  http://en.wikipedia.org/wiki/Main_Page

and the other 4 were articles accessible from http://www.spiegel.de/international
bug 559263 has some changes to make us wrap malloc more thoroughly with mozalloc. I'm not sure if those apply on all platforms or just Android. cjones would be the one to talk to about mozalloc hijinks, in any case.
The other major one is the JS GC chunks, though that may already go through the right place on Linux. That's better now that Vlad has landed the hooks, though.
> No gmail. I don't recall exactly, but I think the tabs were

OK. We'd have to take a look at their source to see whether the numbers make sense, but they're at least plausible...
(In reply to comment #51)
> It would be possible to force (eg) jsengine to route
> allocations through mozalloc, but that would create a new intermodule
> dependence. And there's a significant amount of heap allocated by
> libX11.so.6.3.0 and libxcb.so.1.1.0 which we can't reroute at the
> source level.
>
> One ungood kludge would be to write a shared object redefining malloc,
> free, etc and LD_PRELOAD it into the process. That of course only
> works on Linux and perhaps MacOSX.

So, two notes --- first, mozalloc has no dependencies by design, so using it in js/src would be a relatively simple matter of getting jseng-folk consent and shuffling around sources. This would work on all platforms. Second, we already use linker tricks on linux and windows (!!) to replace system malloc with jemalloc, so we could re-use those hacks to point malloc at a version with your instrumentation. This would take a bit of work. Mac would be left out.

(In reply to comment #53)
> bug 559263 has some changes to make us wrap malloc more thoroughly with
> mozalloc. I'm not sure if those apply on all platforms or just Android. cjones
> would be the one to talk to about mozalloc hijinks, in any case.

The android code uses GNU ld's --wrap, so linux only (for all intents and purposes).
(In reply to comment #56)
> shuffling around sources. This would work on all platforms. Second, we
> already use linker tricks on linux and windows (!!) to replace system malloc
> with jemalloc, so we could re-use those hacks to point malloc at a version with
> your instrumentation.

I tried with --enable-wrap-malloc, but got into various build system/linkage difficulties, and gave up. Sticking with the source route for now.
(In reply to comment #51)
> Built-in intercepts in mozalloc give a peak heap use of 29.4M.
> Massif says 65.0M.

I routed js_malloc, XPT_ArenaMalloc and the allocator in db/sqlite3 through mozalloc. This significantly improves coverage. For a startup of Fx with about 100ish tabs, Massif gives a peak C++ heap allocation size of 342.9MB, and the mozalloc interceptor sees 252.0MB of it (perhaps a bit more). That's 73.5%. Most of the rest escapes via PR_Malloc, and it might be possible to reroute a couple of its high-roller callers to mozalloc, so a figure of 80+% is possible, I'd say.

That's sufficiently encouraging that I'd like to turn my collection of experimental hacks into a proper patch. However, I have one structural problem unsolved. The interceptor (which is in mozalloc) really needs a lock/unlock facility. Right now I'm kludging it by calling pthread_mutex_{lock,unlock} directly; obviously that won't work on Windows. I need to use the locking facilities in NSPR instead, but mozalloc doesn't depend on NSPR. I could, I guess, copy JSThinLock into mozalloc, so as to keep mozalloc standalone, but that's not ideal. Any suggestions?
jslock.{cpp,h} depend on NSPR -- a thin lock can fatten into something requiring a PRLock. If you just need a mutex, but also must stand alone, best to use raw Windows API and ifdef. Shouldn't be too bad. /be
Standalone C++ heap profiler, patch against m-c 46349:4aadb903f941 (Mon Jun 28 09:46:49 2010 -0700).

Provides who-allocated-what-in-the-C++-heap analysis without external assistance. As per comment #51, covers about 80% of the C++ heap volume. Of that 80%, about 95% of the volume is tagged by allocation point, using the schemes in comment #41 and comment #36. In a quick experiment with 6 tabs, it is tracking about 42MB of blocks, whereas "about:memory" lists about 12.5MB at that point.

Linux only for the time being. Working on extending it to MacOS and Windows. Should be jemalloc-agnostic, although that is untested.

HOW TO USE:

Profiling is enabled/disabled at Fx startup, by default disabled. To enable, start with env var MOZALLOC_HEAPPROF set to anything.

When running, it monitors mallocs/frees (etc). Every 20000 such calls, it prints a summary line showing the current total volumes:

  1456357/1183643/945M total allocs/frees/totalloc, 272714/40802k live blocks/bytes

Periodically the (tracked part of the) heap is sampled and a census printed, one line per allocation point ("tag"), showing the number of blocks and bytes billed to each tag. A census is performed when the maximum volume has expanded by more than 4% (over 50MB) or by 2MB (under 50MB). This means that you are guaranteed to have a census within 4% of peak residency, and it will be the most recently displayed one.

There is some space overhead (probably less than 10% of the allocated heap) when profiling is enabled, and some extra processing per allocation. The browser is still usable though. The extra space overhead is not shown in the generated profiles. When disabled, the overheads should be very close to zero (1 extra conditional branch per malloc and per free).
Attachment #445563 - Attachment is obsolete: true
For a startup/quit of the browser, with 6 tabs.
Attachment #449124 - Attachment is obsolete: true
Version that works on both Linux and MacOSX, at least for 10.5.8, 32-bit build with gcc-4.2. Also, rebase to m-c 46969:d9253109fd88 (Thu Jul 01 14:40:49 2010 +0200).
Attachment #454639 - Attachment is obsolete: true
Now with win32 support (at least, works on WinXP, MSVC9, debug and release builds). Patch relative to m-c 46969:d9253109fd88
Attachment #455470 - Attachment is obsolete: true
This now works on the biggest 3 targets. I'd like to move it towards land-ability. At this point, some level of feedback/review seems to me necessary. Patch provides functionality as described in comment #60. However, there are various kinds of rough edges:

(from the user's aspect)

* how should profiling be enabled? atm env var MOZALLOC_HEAPPROF is tested at startup; if set, profiling is enabled. This doesn't strike me as good. I thought about using a pref, but really we want the profiler to be enabled or disabled right from process start.

* where should output go? currently it's sent to stderr (sending it to stdout breaks jsengine building).

(from the Fx code base aspect)

* Patch routes much more malloc/free traffic through mozalloc than previously. This means these modules acquire a new dependency on mozalloc:
    * js
    * db/sqlite3
  I'm sure this totals standalone builds of js (haven't tried). Even with those routed through mozalloc, we only capture about 80% of heap allocation, so potentially more modules should be routed through mozalloc. [IMO, it no longer makes sense to have mozalloc separate from NSPR; they should be combined.]

* When not profiling, there is an extra overhead of 1 memory reference (read) and 1 conditional branch per alloc and free. IMO this is insignificant, in that (according to my own measurements) the behaviour of Fx in big runs is on the order of 1 malloc and 1 free per 50000 instructions.

* As per the 3rd para in comment #60, suppose we wanted to add more allocation tags to cover the 5% of the heap that is currently tagged as UNKNOWN (we know an allocation is happening, but we don't know who is doing it). Finding the points in the sources to annotate by hand requires some auxiliary mechanism to get a stack trace. I have been using a hacked valgrind. Given that the profiler is supposed to be standalone, I can't see how the dependence on some external tool to "bootstrap" it can be avoided. Just mentioning for completeness.
(from the Fx code base aspects in comment #64) += * the scheme described in comment #36 provides automatic tagging of allocations made by all nsCOMPtr classes. It does this by stuffing a per-class override of new and delete into the NS_DECL_*ISUPPORTS macros. Since there are a couple thousand of such classes, there will be some cost in terms of object code size. I'll measure.
* shows profiling results in "about:heap" * rebased to m-c 48706:a4d86c3a3494 (Mon Aug 02 01:28:10 2010 -0700) WIP. Not re-tested on OSX or Win. Appears to have stability problems (occasional heap block overwrites) that weren't there before.
Attachment #464408 - Attachment is obsolete: true
Not pretty, but possibly useful. Output is in two parts: * the most recent census (detailed breakdown of contents). Since a census is only made when the heap high-water mark rises, this is always a picture of the heap at or close to the max for the run so far. * periodic one-line summaries, to give some feel for changes over time.
Revised version of comment 67 patch, with about 180 allocation sites in js/src/*.{cpp,h} (that is, calls to js_malloc/_calloc/_realloc) hand-annotated with allocation tags. This gives much better resolution into jsengine's allocations. Prior to this change the profiler would say (eg) that lots of stuff was allocated by js_realloc, which isn't helpful -- see screenshot in comment #68. Now it can say who is making those calls. Still requires refinement to handle allocated-on-behalf-of scenarios. Eg, saying that 4.5% of the heap is allocated by PL_DHashAllocTable is pretty useless. Am thinking about tagging each block with 2 strings rather than one -- primary and secondary tags. So then it would be able to say, effectively: "X% of the heap was allocated by PL_DHashAllocTable, on behalf of nsFoo::bar" (where nsFoo::bar is using PL_DHashAllocTable to create/maintain a hash table). Still has stability problems. I believe this is the same problem as documented in bug 587610, and is unrelated to the profiler per se.
Attachment #464410 - Attachment is obsolete: true
jseward: given the presence of about:memory and DMD, I think we can safely close this bug. Please reopen if you disagree.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX