615199 - (JaegerShrink) Methodjit enabled causes the browser to use almost twice as much memory

Are the data in comment 0 solid? I thought I saw a comment in bug 598466 saying there might have been some measurement error there. It seemed like the difference might be smaller than given in comment 0, maybe more on the order of the bug 611400 effects.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 4

•

15 years ago

> Are the data in comment 0 solid?

Reasonably. The measurement error is no more than +-20MB or so on the overall number, from what I've seen. So the increase might be only 340MB, not 380MB... or might be 420MB.

The second set of numbers in bug 598466 is a 380MB regression over only 70 tabs, so somewhat bigger per-tab than the numbers in comment 0:

2010-09-11: 530 / 554 / 698
2010-09-12: 910 / 928 / 1080

Boris Zbarsky [:bzbarsky]

Reporter

Comment 5

•

15 years ago

> Does the patch at bug 611400 comment 6 help?

Measuring now.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 6

•

15 years ago

OK, so I see these numbers (64-bit mac, not 32-bit windows like comment 0, fwiw), using 70 tabs; the two numbers are "real mem" / "private mem" from Activity Monitor:

Vanilla: 1270MB / 1000MB
mjit off: 898MB / 603MB
patched mjit: 1230MB / 926MB

So the patch gets us 20% of the way there. ;)

Ed Morley [:emorley]

Comment 7

•

15 years ago

Some additional info from bug 598466 comment 94...

Using:
- The STR from bug 598466 comment 87
- layers.accelerate-none = true
- layers.accelerate-all = false
- image.mem.discardable = false
- image.mem.decodeondraw = false
- javascript.options.methodjit.content = [as below]
- javascript.options.methodjit.chrome = false
- Several runs of each build to verify results

Nightlies from Tracemonkey (figures in MB; private/working/virtual):

2010-09-12: Methodjit=true 912/929/1084 ; Methodjit=false 635/656/809
2010-09-10: Methodjit=true 937/956/1115 ; Methodjit=false 642/665/825
2010-09-04: Methodjit=true 868/889/1050 ; Methodjit=false 635/659/824
2010-09-02: Methodjit=true 866/888/1047 ; Methodjit=false 641/665/832
2010-09-01: Methodjit=true 925/947/1109 ; Methodjit=false 642/666/824
2010-08-31: 651 / 673 / 837 [methodjit pref didn't exist then]
2010-08-30: 645 / 668 / 834 [ditto]

Last good nightly: 2010-08-31
First bad nightly: 2010-09-01

Pushlog:
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=e8ee411dca70&tochange=e2e1ea2a39ce

However, there are 1000+ Jaegermonkey changesets in that pushlog (yay for project branches), so not hugely helpful.

If you want to get some tryserver builds going for win32, I'll be happy to give them a go.

Ed Morley [:emorley]

Comment 8

•

15 years ago

Sorry for the bugspam, should have said, the above results were using 68 tabs.

David Mandelin [:dmandelin]

Updated

•

15 years ago

blocking2.0: ? → beta9+

Boris Zbarsky [:bzbarsky]

Reporter

Comment 9

•

15 years ago

I think at this point we may be better served by doing a malloc trace here. I'll try to do one tomorrow, if nothing goes wrong.

Nicholas Nethercote [inactive]

Comment 10

•

15 years ago

I did some measurements with Massif. Annoyingly, Massif wasn't behaving very well, but I did discover that ExecutablePool::create() is called an awful lot. On my Linux64 box, in a session with 40 cad-comic.com tabs open it is called 9,247 times, and the sum of all the 'n' arguments (which I assume are bytes) is 200,381,190. That's an average of 21,669 per call.

This is a fragment of the stack trace that appears to be responsible:

JSC::ExecutablePool::systemAlloc(unsigned long) (ExecutableAllocatorPosix.cpp:43)
JSC::ExecutablePool::create(unsigned long) (ExecutableAllocator.h:374)
js::mjit::Compiler::finishThisUp(js::mjit::JITScript**) (ExecutableAllocator.h:235)
js::mjit::Compiler::performCompilation(js::mjit::JITScript**) (Compiler.cpp:208)
js::mjit::Compiler::compile() (Compiler.cpp:134)
js::mjit::TryCompile(JSContext*, JSStackFrame*) (Compiler.cpp:245)
js::mjit::stubs::UncachedCallHelper(js::VMFrame&, unsigned int, js::mjit::stubs::UncachedCallResult*) (InvokeHelpers.cpp:387)
js::mjit::ic::Call(js::VMFrame&, js::mjit::ic::CallICInfo*) (MonoIC.cpp:831)

Nicholas Nethercote [inactive]

Comment 11

•

15 years ago

Note that due to inlining there may be some functions elided in that stack trace.

Nicholas Nethercote [inactive]

Comment 12

•

15 years ago

(In reply to comment #10)
> On my Linux64 box, in a session with 40 cad-comic.com
> tabs open it is called 9,247 times, and the sum of all the 'n' arguments
> (which I assume are bytes) is 200,381,190.

That's pretty close to 5MB per tab, BTW, matching comment 0.
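The two derived figures (bytes per call, MB per tab) can be rechecked with a few lines of C++. This is only a sketch: the constant and function names are made up for illustration, the three inputs are the ones reported in comment 10, and "MB" is taken to mean 1,000,000 bytes.

```cpp
#include <cstdint>

// Inputs reported in comment 10 (40 cad-comic.com tabs, Linux64).
constexpr std::uint64_t kCreateCalls = 9247;      // calls to ExecutablePool::create()
constexpr std::uint64_t kTotalBytes = 200381190;  // sum of the 'n' arguments
constexpr std::uint64_t kTabs = 40;

// Average request size per create() call, truncated to whole bytes.
constexpr std::uint64_t averageBytesPerCall() {
    return kTotalBytes / kCreateCalls;
}

// Requested code memory per tab in MB (assumption: 1 MB = 1,000,000 bytes).
constexpr double megabytesPerTab() {
    return static_cast<double>(kTotalBytes) / static_cast<double>(kTabs) / 1e6;
}
```

With these inputs the average is 21,669 bytes per call and the per-tab figure lands just above 5MB.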

Brian Hackett [Laid off!]

Comment 13

•

15 years ago

(In reply to comment #10)
> I did some measurements with Massif. Annoyingly, Massif wasn't behaving
> very well but I did discover that ExecutablePool::create() is called an
> awful lot, though. On my Linux64 box, in a session with 40 cad-comic.com
> tabs open it is called 9,247 times, and the sum of all the 'n' arguments
> (which I assume are bytes) is 200,381,190. That's an average of 21,669 per
> call.
>
> This is a fragment of the stack trace that appears to be responsible:
>
> JSC::ExecutablePool::systemAlloc(unsigned long)
> (ExecutableAllocatorPosix.cpp:43)
> JSC::ExecutablePool::create(unsigned long) (ExecutableAllocator.h:374)
> js::mjit::Compiler::finishThisUp(js::mjit::JITScript**)
> (ExecutableAllocator.h:235)
...

ExecutablePool::create is used to allocate code memory for JM. The stack trace above is the path used when allocating the JIT code for an entire script (as opposed to allocations for PIC stubs).

No easy fix that I know of, but thoughts:

1. Finer grained information would be harder to collect, but tremendously valuable. How much code is for eval/global vs. function scripts, how much is inline vs. OOL code, frequency of different ops and aggregate size of inline and OOL code generated for each op.

2. Bug 577359 should help if there are lots of big initializers in global/eval scripts.

3. Maybe investigate whether to only compile loops in global/eval scripts.

4. Maybe investigate whether to only compile functions after they get hot (hard to do without impacting benchmark perf).

5. At least in benchmarks, from 1/2 to 2/3 of code memory is for OOL stub code, which hardly ever executes.

6. Could reduce point 5 by coalescing side exits in common ops (point 1) to reduce the amount of sync code.

7. Could reduce point 5 with more PICs targeted at common ops (point 1), e.g. arithmetic. When the types of 'y' and 'z' are unknown, 'x = y + z' uses 48 bytes of inline code memory, and 203 bytes of OOL code memory.

Brian Hackett [Laid off!]

Comment 14

•

15 years ago

Numbers in point 7 above are for OSX x86. For OSX x64 I get 81 bytes inline, 289 bytes OOL.

The 8472

Comment 15

•

15 years ago

(In reply to comment #13)
> 4. Maybe investigate whether to only compile functions after they get hot (hard
> to do without impacting benchmark perf).

To take a page out of the Java VM JIT compiler book: not compiling everything at startup actually improves startup time, because compilation itself consumes time. If you start up in interpreted mode you can already execute unoptimized code while you still compile on another thread, even making dynamic optimizations based on the profiling results from the execution thread. This also allows you to perform optimistic optimizations such as eliminating branches that are not visited according to the profiler. If you hit such a branch you can back out into interpreted mode and recompile. The same goes for inlining potentially virtual calls: if they get overloaded you can fall back to interpreted mode and recompile in the meantime.

Just performing static compilation at startup wastes a lot of optimization potential.

Brian Hackett [Laid off!]

Comment 16

•

15 years ago

(In reply to comment #15)
> (In reply to comment #13)
> > 4. Maybe investigate whether to only compile functions after they get hot (hard
> > to do without impacting benchmark perf).
> To take a page out of the Java VM JIT compiler book: Avoiding to compile
> everything at startup actually improves the startup time because compilation
> itself consumes time. If you start up in interpreted mode you can already
> execute unoptimized code while you still compile on another thread, even making
> dynamic optimizations based on the profiling results from the execution thread.
> This also allows you to perform optimistic optimizations such as eliminating
> branches that are not visited according to the profiler. If you hit such a
> branch one can back out into interpreted mode and recompile. Same goes for
> inlining potentially virtual calls, if they get overloaded one can fall back to
> interpreted mode and recompile in the meantime.
>
> Just performing static compilation at startup wastes a lot of optimization
> potential.

Yes, definitely. The SM interpreter is slow compared to a JIT but not *that* slow, and done right partial interpretation should be a wash or net speedup in benchmarks (which don't resemble actual web JS all that much).

Javascript JIT compilation is different from Java in that there is very little information known statically --- the main reason ADD is expensive in memory is that we need to account for any combination of ints, floats, and other data being added. Type inference (bug 557407) helps greatly here, and can figure out what is being added and reduce code memory. Inference incurs its own memory overhead though in storing intermediate structures, and mjit+inference will most likely use more memory than mjit alone. Again, partial interpretation helps here, and also helps inference precision as it can't figure everything out statically.
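To make the ADD point concrete, here is a rough C++ sketch of a fully generic add with a short fast path and a bulkier slow path. It is an illustration only and an assumption throughout: `Value`, `toNumber`, `toText`, `addSlow` and `add` are hypothetical names, not SpiderMonkey's types or the code JM emits, and the string conversion is not JS-accurate for doubles.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <variant>

// Hypothetical stand-in for a dynamically typed value; not SpiderMonkey's Value.
using Value = std::variant<std::int32_t, double, std::string>;

// Numeric view of an int or double value (caller guarantees it is not a string).
static double toNumber(const Value& v) {
    if (const auto* i = std::get_if<std::int32_t>(&v)) return *i;
    return std::get<double>(v);
}

// Crude string conversion; doubles use std::to_string, which differs from JS.
static std::string toText(const Value& v) {
    if (const auto* s = std::get_if<std::string>(&v)) return *s;
    if (const auto* i = std::get_if<std::int32_t>(&v)) return std::to_string(*i);
    return std::to_string(std::get<double>(v));
}

// Slow path: every combination the fast path does not cover ends up here.
Value addSlow(const Value& a, const Value& b) {
    if (std::holds_alternative<std::string>(a) || std::holds_alternative<std::string>(b))
        return Value(toText(a) + toText(b));
    return Value(toNumber(a) + toNumber(b));
}

// Fast path: int + int with an overflow check, everything else goes out of line.
Value add(const Value& a, const Value& b) {
    const auto* x = std::get_if<std::int32_t>(&a);
    const auto* y = std::get_if<std::int32_t>(&b);
    if (x && y) {
        const std::int64_t sum = static_cast<std::int64_t>(*x) + *y;
        if (sum >= std::numeric_limits<std::int32_t>::min() &&
            sum <= std::numeric_limits<std::int32_t>::max())
            return Value(static_cast<std::int32_t>(sum));
        return Value(static_cast<double>(sum));  // overflow: result no longer fits an int
    }
    return addSlow(a, b);
}
```

Even in this toy version most of the code sits on the path that rarely runs, which is the same shape as the inline vs. OOL split reported in comment 13. If the operand types were known ahead of time, only one branch would need to exist.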

Boris Zbarsky [:bzbarsky]

Reporter

Comment 17

•

15 years ago

The 8472, compiling on a different thread would be nice, but not happening for 2.0. And the problem with compiling lazily is that it _is_ likely to hurt benchmark times; a lot of these benchmarks run fast enough that just the context switch overhead of the separate thread would hurt. And yes, they're crappy benchmarks. :(

The 8472

Comment 18

•

15 years ago

(In reply to comment #17)
> compiling on a different thread would be nice
> but not happening for 2.0.

OK, but even if compiling happens on the same thread, lazy compilation can be an advantage in extremely short-running scripts where the compilation overhead would outweigh the performance gain. Think of initializer code that only runs once: compiling it would only produce dead weight.

Of course I'm basing my argument on knowledge about the HotSpot VM; I have no idea how large the speed difference between interpreted and JIT mode is for JavaScript in comparison.

> a lot of these benchmarks run fast enough that just the
> context switch overhead of the separate thread would hurt.

No context switching should be required on a multi-core system. Firefox is under-utilizing those. And passing data from one thread to another can be done without context switches too, by using atomics.

> And yes, they're crappy benchmarks. :(

Then the question is whether we want to optimize for real-world performance or for crappy benchmarks.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 19

•

15 years ago

There are no good measures of JS real-world performance; makes it hard to optimize for.

Brian Hackett [Laid off!]

Comment 20

•

15 years ago

I measured the compilation time and interpreter execution time for this function (release build, added PRMJ_Now before/after compilation): function run(x, y) { var a, b, c; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; } Running this function 10000 times in the interpreter takes 27ms, compiling it takes 350us. So interpreting this straight-line code is ~130 times faster than compiling it. Compilation time on SS is IIRC between 10-20ms, so the cost of interpreting code a few times before compiling should be puny in comparison to the total benchmark time. Switching between the interpreter and mjit is quick so shouldn't affect times either. I get similar numbers if I make these GNAME accesses (interpreter time doubles, but so does compilation time).
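The ~130 figure is the break-even point between interpreting and compiling, and it can be rederived as below. This is a sketch under a stated simplification: it ignores the time the compiled code itself takes to run, so it is a lower bound on when compilation starts to pay off. The struct and function names are made up for illustration; the three inputs are the measurements quoted above.

```cpp
// Measured inputs for one script.
struct Measurement {
    double interpTotalMs;  // total interpreter time for all runs
    int interpRuns;        // number of interpreted runs
    double compileUs;      // one-off compilation time
};

// Interpreter cost of a single run, in microseconds.
constexpr double interpUsPerRun(const Measurement& m) {
    return m.interpTotalMs * 1000.0 / m.interpRuns;
}

// Number of interpreted runs that cost as much as one compilation.
constexpr double breakEvenRuns(const Measurement& m) {
    return m.compileUs / interpUsPerRun(m);
}

// 10000 runs in 27ms, compilation in 350us.
constexpr Measurement kComment20{27.0, 10000, 350.0};
```

With these inputs one interpreted run costs about 2.7us, so roughly 130 interpreted runs fit into the time of one compilation.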

nivtwig

Comment 21

•

15 years ago

In the 80 tabs testcase, and generally if you browse multiple tabs in a specific website, most of the script URLs will be the same for all the tabs. So why is 5MB needed per tab? Is it possible to share the compiled code between the tabs?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 22

•

15 years ago

The compiled code is typically specialized to the specific global object (and state of said global object) that it's compiled for.

Julian Seward [:jseward]

Comment 23

•

15 years ago

w.r.t the extra space use, I suspect we won't find any single culprit.

(In reply to comment #6)
> So the patch gets us 20% of the way there. ;)

I profiled w/ DHAT an x86_64-linux build of (M-C + said patch) opening 20 of the tabs (DHAT is slow). This shows a max C++ heap size of 184MB.

Of this, the top allocation stack now results from property table allocations (js::PropertyTable::init calling calloc), accounting for 22.7 out of the 184MB. This was filed as bug 610070. We might be able to save some space here since the average usage of these blocks is only 44% -- more than half the bytes in them are never accessed.

I also see nearly 4MB of essentially useless allocation in the style of bug 609905 (4MB held live for the entire process, actual usage of these blocks is below 5%).

The 8472

Comment 24

•

15 years ago

(In reply to comment #20)
> Running this function 10000 times in the interpreter takes 27ms, compiling it
> takes 350us. So interpreting this straight-line code is ~130 times faster than
> compiling it. Compilation time on SS is IIRC between 10-20ms, so the cost of
> interpreting code a few times before compiling should be puny in comparison to
> the total benchmark time. Switching between the interpreter and mjit is quick
> so shouldn't affect times either.

Did you run the compiling in a loop too? I would think that 350µs is very low and probably noisy. But if your figures are right then waiting 10-20 method calls before compiling should not only reduce memory usage for initializers but also speed up short-running code segments.

(In reply to comment #23)
> w.r.t the extra space use, I suspect we won't find any single culprit.

How about an alternative solution then? Instead of trying to optimize memory usage for live, active JIT stuff we could just discard all inactive JIT stuff in background tabs. Similar to image discarding or garbage collection.

If for example the minimum age of JIT objects is set to 10 seconds then even erroneous discarding would have a minimal impact on performance. And as long as the scripts are running no discarding would happen anyway.
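The discarding idea could look roughly like the C++ sketch below. Everything in it is hypothetical: `JitCodeEntry` and `discardIdleCode` are invented names, and nothing here reflects how JM actually tracks or frees its code. It only shows the rule "drop code that is not running and has been idle for at least a minimum age".

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical record for one script's compiled code; not a SpiderMonkey type.
struct JitCodeEntry {
    std::size_t codeBytes;      // size of the compiled code
    Clock::time_point lastRun;  // last time the code was executed
    bool running;               // currently on the stack, must not be discarded
};

// Removes every entry that is not running and has been idle for at least minAge.
// Returns the number of code bytes released.
std::size_t discardIdleCode(std::vector<JitCodeEntry>& entries,
                            Clock::time_point now,
                            std::chrono::seconds minAge) {
    std::size_t freed = 0;
    std::vector<JitCodeEntry> kept;
    kept.reserve(entries.size());
    for (const JitCodeEntry& e : entries) {
        const bool idle = !e.running && (now - e.lastRun) >= minAge;
        if (idle)
            freed += e.codeBytes;
        else
            kept.push_back(e);
    }
    entries.swap(kept);
    return freed;
}
```

A discarded script would simply be recompiled the next time it runs, so the cost of a wrong guess is one extra compilation.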

David Mandelin [:dmandelin]

Comment 25

•

15 years ago

Various responses and ideas, in descending order of guessed reward/risk ratio:

1. It looks like Julian has already found a good fraction of the extra space as coming from:

   - big nmaps
   - PropertyTable allocations
   - useless allocations

   Those are straight-out bugs and seem like the most effective place to start work.

2. Bug 577359 could be big. We should get measurements in the context of this problem. I.e., how much code is being generated for constant initializers. Shouldn't be too hard to instrument the compiler to collect that. Or else just get that bugfix landed and measure what it does.

X. Background on what the jitcode allocator does, needed for understanding the next two points:

   - if N >= 64k, call VirtualAlloc, which effectively allocates N rounded up to the nearest multiple of 64k.
   - if N < 64k, try to take space inside the current 64k "small allocation pool". If there is enough space left, bump-allocate inside there. Otherwise, first allocate a new pool. Then,
     - if the new pool will have more space than the previous small pool, then allocate from the new pool and make it the new small pool.
     - otherwise, allocate inside the new pool but leave the old small pool.

   Thus, big scripts (in jitcode size) get their own pool, while small scripts (and generated PIC stubs) get grouped together to fill a 64k pool. The pools are refcounted--the script holds the ref.

3. So, one potential big problem with the allocator is fragmentation. If we have lots of 65k allocations, then we waste about half our space by allocating 128k each time with VirtualAlloc. If we have lots of 33k allocations, then we waste about half our space by allocating a new 64k small pool each time. A retune of the allocation policy might really help.

   It shouldn't be too hard to measure fragmentation with some manual instrumentation on the pool allocator. The main summary statistics would be (bytes of jitcode in current allocations) and (bytes currently allocated for jitcode). Distribution stats might be helpful too.

4. Refcounting might be making us hold on to pools for too long. For example, say script S is 33k, and it ends up sharing a pool with a bunch of PICs from other scripts. In that case, one of the PICs could be 20 bytes, but keep the whole 64k chunk live for a long time.

   This shouldn't be that much harder to measure, and in fact seems like a form of fragmentation. The idea would be to distinguish (bytes of jitcode in current allocations) and (bytes of jitcode for non-destroyed scripts in current allocations).

5. A simple idea for controlling jitcode memory usage is to throw away jitcode if memory pressure gets high. It can just be recompiled if it gets run again. This is a very simple technique that fits in perfectly with existing facilities and needs no new execution mode combos or anything like that. (The only trick is to avoid throwing away memory for running scripts.) And it seems not too hard to make it sensitive to memory pressure or memory usage.

6. Delayed compilation of scripts might be worth a try, but it has some difficulties. bhackett's measurements show that compilation is worthwhile only if the code runs at least 100 iterations. But it may be hard to take advantage of this:

   - it would be really nice and simple to compile on the Nth entry to a function, but that is bad if it contains a loop that will run many iterations--in that case we should compile right away.
   - we could try to compile after N runs of a loop, but then we need to compile while we are running the script, and then jump into the compiled code in the middle. That doesn't sound *too* hard, but the tracer integration code had to do things like that and it was not easy to get right. dvander probably has more insight on this issue.
   - and then there are the inevitable tuning problems. They might not be so bad with simple schemes like compiling after 10 iterations, though.

   Compiling global or eval scripts only if they contain a loop (similar to what bhackett suggested) seems like an easier starting point. But even there there are some issues: some benchmarks eval the same script many times, in which case compilation may be worthwhile even if the script doesn't contain a loop.

   Maybe compiling loop-containing scripts right away and others on the Nth iteration (for N ~= 10) would be really easy and possibly helpful?

7. Delayed compilation of OOL paths should definitely be in our long-term plans, but I don't know if we can reasonably get that right for Fx4. (dvander?) To me, the easiest way to do that seems to be to not generate them until they are actually called, IC-style. We know that's among the hardest code to get right, though, so it seems risky.
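The allocation policy from point X, and the two fragmentation cases from point 3, can be modelled in a few lines of C++. This is a toy model built only from the description in this comment, not the real ExecutableAllocator: the class name, its members and the "stranded" counter are assumptions made for illustration, and refcounting (pools being freed) is left out entirely.

```cpp
#include <algorithm>
#include <cstddef>

// Toy model of the policy described in point X. Tracks requested vs. reserved bytes.
class JitAllocatorModel {
public:
    explicit JitAllocatorModel(std::size_t poolSize) : poolSize_(poolSize) {}

    void alloc(std::size_t n) {
        requested_ += n;
        if (n >= poolSize_) {
            // Large request: dedicated region, rounded up to a multiple of the pool size.
            reserved_ += (n + poolSize_ - 1) / poolSize_ * poolSize_;
            return;
        }
        if (n <= smallFree_) {
            smallFree_ -= n;  // bump-allocate inside the current small pool
            return;
        }
        // Does not fit: a new pool is created and this request is served from it.
        reserved_ += poolSize_;
        const std::size_t newFree = poolSize_ - n;
        // The pool left with less free space is never bump-allocated from again.
        stranded_ += std::min(newFree, smallFree_);
        // The pool with more free space is (or stays) the small pool.
        smallFree_ = std::max(newFree, smallFree_);
    }

    std::size_t requestedBytes() const { return requested_; }
    std::size_t reservedBytes() const { return reserved_; }
    std::size_t strandedBytes() const { return stranded_; }

    // Fraction of reserved memory that was actually requested.
    double utilization() const {
        return reserved_ == 0 ? 1.0 : static_cast<double>(requested_) / reserved_;
    }

private:
    std::size_t poolSize_;
    std::size_t smallFree_ = 0;  // free bytes left in the current small pool
    std::size_t requested_ = 0;
    std::size_t reserved_ = 0;
    std::size_t stranded_ = 0;
};
```

Feeding the model 100 requests of 65k with a 64k pool size reserves 128k per request, and 100 requests of 33k reserve a fresh 64k pool each time; both give a utilization of roughly one half, as point 3 predicts. 100 requests of 200 bytes all fit into a single 64k pool.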

Julian Seward [:jseward]

Comment 26

•

15 years ago

(In reply to comment #10)
> I did some measurements with Massif. [...]

Yeah, I see the same thing. (--tool=massif --pages-as-heap=yes). I had no problems w/ massif, btw. For 20 tabs I'm seeing 79.8MB held live by JSC::ExecutablePool::systemAlloc, which is just about 4MB per tab.

I'm surprised the mjit manages to generate 4-5MB of executable code per tab (if that's the correct diagnosis). I wonder if there's some overallocation of space going on.

David Mandelin [:dmandelin]

Comment 27

•

15 years ago

(In reply to comment #26)
> (In reply to comment #10)
> > I did some measurements with Massif. [...]
>
> Yeah, I see the same thing. (--tool=massif --pages-as-heap=yes). I
> had no problems w/ massif, btw. For 20 tabs I'm seeing 79.8MB held
> live by JSC::ExecutablePool::systemAlloc, which is just about 4MB per
> tab.
>
> I'm surprised the mjit manages to generate 4-5MB of executable code
> per tab (if that's the correct diagnosis). I wonder if there's some
> overallocation of space going on.

That seems to give hope that it's due to fragmentation, or better yet, just some simple bug in the allocator. Maybe we're allocating a whole 64k chunk for each PIC entry or something dumb like that. (Btw, those chunks are 16k on Linux/Mac for those testing there.)

Another thing I forgot to mention previously is that detailed stats on the allocations would be nice: (1) distributions, so we know whether we are doing a zillion small allocations or a few huge ones, and (2) where they come from: PICs vs. scripts etc.

Luke Wagner [:luke]

Comment 28

•

15 years ago

Another patch that may be worth measuring is billm's in bug 547327: it decreases JSObject::SLOT_CAPACITY_MIN from 8 to 2 (relying instead on learning object size). In the best case, this could be saving 6 * sizeof(Value) == 48 bytes per JSObject.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 29

•

15 years ago

(In reply to comment #25)
> Various responses and ideas, in descending order of guessed reward/risk ratio:

This ordering seems right. We should look at all the easy places first, especially the allocator where its real-world behavior seems to be basically unknown. It's also fairly easy to replace, and worst case, our JIT code is almost already relocatable. We could compact it, or as you said, just throw it out if it's not live (we must be mindful of call ICs - maybe those should refcount).

Side exits are a problem. Sometimes there are multiple per op, like the tracer, but we also sink stores. This was a conscious decision (knowing the code bloat) so we could allocate registers in ops incrementally. Post 4.0 as we experiment with new register allocation techniques and type inference, I suspect this will change, and we can just reduce side exits and not bother with the complicated coalescing problem. We could also try to reduce the size of exits, for example, if we know the frame pointer has been sunk to VMFrame early on in a script, we never need to sink it again.

On point #6, I'd worry about tuning the most. I agree that the technical problems aren't too hard. The parser can tell us if there's a loop, if it comes to it we can get statistics on how much memory we'd save delaying compilation further on loopless scripts.

On point #7, yeah, I've wanted to IC the ADD path ever since it became the horrifying monstrosity it is. It sounds risky for 4.0. No IC lands without bugs. On the other hand, it would not take long to implement, and it would likely be a perf win. Before tackling that though we should measure how much memory we'd really save.
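For point #6, the simplest policy floated in comment 25 (compile loop-containing scripts right away, everything else on the Nth entry) is small enough to sketch in C++. All names here are hypothetical and the threshold is the tuning knob being worried about above; this is not how JM decides when to compile.

```cpp
// Hypothetical per-script bookkeeping; not a SpiderMonkey structure.
struct ScriptState {
    bool hasLoop;           // reported by the parser
    unsigned entries = 0;   // how many times the script has been entered
    bool compiled = false;  // whether JIT code exists
};

// Called on every entry to a script. Returns true if this entry should run
// compiled code, false if it should run in the interpreter.
bool enterScript(ScriptState& s, unsigned threshold) {
    if (s.compiled)
        return true;
    ++s.entries;
    // Loops are compiled immediately; loopless scripts wait for the Nth entry.
    if (s.hasLoop || s.entries >= threshold)
        s.compiled = true;
    return s.compiled;
}
```

With a threshold of 10, a loopless initializer that runs once never gets compiled, while a loopless script that is eval'd many times gets compiled on its tenth entry.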

The 8472

Comment 30

•

15 years ago

(In reply to comment #26)
> I'm surprised the mjit manages to generate 4-5MB of executable code
> per tab (if that's the correct diagnosis). I wonder if there's some
> overallocation of space going on.

Have a look at bug 598466 comment 94: in addition to the 4-5MB per tab that you can save by switching methodjit off, there is another 75MB (~1MB per tab) that was introduced shortly before methodjit was turned on in the tracemonkey branch.

This might be some related (management?) code that also needs cleanup and can't be tested by turning methodjit on/off; the overhead is always there.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 31

•

15 years ago

Please don't drag that into this bug. We should file a separate bug on that issue.

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 577359

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 611400

Julian Seward [:jseward]

Comment 32

•

15 years ago

(In reply to comment #27)
> That seems to give hope that it's due to fragmentation, or better
> yet, just some simple bug in the allocator.

Yeah, I'm peering at ExecutablePool* poolForSize(size_t n) and the logic looks a bit funny:

    // If the new allocator will result in more free space than in
    // the current small allocator, then we will use it instead
    if ((pool->available() - n) > m_smallAllocationPool->available()) {
        m_smallAllocationPool->release();
        m_smallAllocationPool = pool;
        pool->addRef();
    }

Seems like this abandons the current small allocation pool regardless of how much space is left in it, whenever satisfying the allocation from a new pool would result in more free space.

Not sure tho. Will dig more.

Nicholas Nethercote [inactive]

Comment 33

•

15 years ago

This bug is heavy on speculation and light on data. We need more measurements. In particular, I'd like to know if cad-comic.com is typical, or if it's doing something unusual. I'll start doing more measurements, but I hope others will do likewise, as my JM knowledge is scant.

The 8472

Comment 34

•

15 years ago

(In reply to comment #33)
> In particular, I'd like to know if cad-comic.com is typical, or if it's doing
> something unusual.

I selected cad purely for two properties:
a) the random button allowing me to quickly spawn a bunch of different pages
b) the fact that it contains somewhat large images

I.e. it was not picked for being an especially bad case. I'm also getting similar savings per tab when disabling mjit on my real browsing session, which contains tabs from many different domains.

nivtwig

Comment 35

•

15 years ago

(In reply to comment #33)
> In particular, I'd like to know if cad-comic.com is typical, or if it's doing
> something unusual.

You may want to try the reduced 3 tab testcase from bug 598466 comment 15, which is from a totally different website. (Measurements for it are in comment 28 of the same bug.)

Julian Seward [:jseward]

Comment 36

•

15 years ago

(In reply to comment #32)
> (In reply to comment #27)
> > That seems to give hope that it's due to fragmentation, or better
> > yet, just some simple bug in the allocator. [...]
>
> Yeah, I'm peering at ExecutablePool* poolForSize(size_t n) and the
> logic looks a bit funny:

I'm getting the impression that there is no (serious) fragmentation problem in the executable allocator.

From adding manual instrumentation, in the 20 tab case, mjit::Compiler::finishThisUp requests 74.9MB from execPool->alloc(totalSize). This turns into a total request of 81.0MB in poolForSize (not sure where the 6.1MB increase comes from). That in turn results in a total 88.4MB request to ExecutablePool::create.

So it could do a bit better (88.4MB resulting from 81.0MB of requests) but it's not fundamentally the cause of the large amount of allocation.

> Seems like this abandons the current small allocation pool regardless
> of how much space is left in it,

I also measured that. The amount of space left in abandoned pools is 5.40MB. Not great, but not a disaster either. I don't think we can do better unless ExecutableAllocator is modified so as to keep track of multiple small allocation pools, rather than just one.

--------

From this (and watching the numbers when opening new c.a.d tabs) it does appear that the mjit generates ~4MB of code per tab. A good question seems to be: why? Looking at the page source for a tab, it looks pretty harmless, although presumably it drags lots of .js in from elsewhere.
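To put the three totals in proportion, the growth at each stage can be worked out as below. This is a C++ sketch only: the constant and function names are invented for illustration, and the inputs are the MB figures measured in this comment.

```cpp
// Totals measured for the 20-tab case, in MB.
constexpr double kCompilerRequestMB = 74.9;  // finishThisUp -> execPool->alloc(totalSize)
constexpr double kPoolForSizeMB = 81.0;      // total requested in poolForSize
constexpr double kCreateMB = 88.4;           // total requested from ExecutablePool::create
constexpr double kAbandonedMB = 5.40;        // space left in abandoned small pools

// Relative growth from one stage to the next, in percent.
constexpr double growthPercent(double from, double to) {
    return (to - from) / from * 100.0;
}

// Share of a part in a whole, in percent.
constexpr double sharePercent(double part, double whole) {
    return part / whole * 100.0;
}
```

On these numbers the allocator adds about 9% on top of what poolForSize is asked for, and about 18% on top of what the compiler asks for. The abandoned pool space is about 6% of the final total, which fits the "could do a bit better, but not the cause" reading.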

Julian Seward [:jseward]

Comment 37

•

15 years ago

(In reply to comment #36)
> Looking at the page source for a
> tab, it looks pretty harmless, although presumably it drags lots of
> .js in from elsewhere.

I wget'd all the scripts I could find from one of the tabs (http://www.cad-comic.com/cad/20050715.htm). They don't amount to a lot of source code:

 5205 2010-10-21 01:58 widgets.js
43130 2010-11-05 00:53 1289620584.js
28975 2010-11-18 12:00 widget.php?v=10
 2391 2010-12-01 23:56 spcjs.php?id=12&target=_blank

Is there a way to get debug spew from mjit when it's embedded in a browser, a la JMFLAGS= ? It'd be instructive to see what's going through the compilation pipeline each time a new cad-comic.com tab is opened.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 38

•

15 years ago

(In reply to comment #37)
> Is there a way to get debug spew from mjit when it's embedded in a
> browser, a la JMFLAGS= ? I'd be instructive to see what's going
> through the compilation pipeline each time a new cad-comic.com
> tab is opened.

I think JMFLAGS should work on browser debug builds, although there will be a lot of spew. But I think it prints the filename of the script, so it should still be helpful if you redirect to a file.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 39

•

15 years ago

Julian, comment 0 has a list of script urls that are possibly worth looking at. In addition to the ones you tried, there's the 100KB facebook thing, the 20KB Google API thing, 80KB of minified jquery code, 23KB of Google analytics. You should be able to use JMFLAGS in browser. Just start the browser from a shell with that env var set.

Mike Shaver (:shaver emeritus)

Comment 40

•

15 years ago

For ease of debugging, I recommend preffing the jit off, starting with JMFLAGS set, then preffing the jit on and reloading.

Nicholas Nethercote [inactive]

Comment 41

•

15 years ago

If I change this line:

    # define JIT_ALLOCATOR_LARGE_ALLOC_SIZE (ExecutableAllocator::pageSize * 4)

to use 1 as the multiple instead of 4 I get a ~5% reduction in create() call totals for the 40 tab cad-comic.com case on Linux64. If I increase it to 16 (as it is on Windows) I get a ~4% increase. So there's some fragmentation there.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Updated

•

15 years ago

Assignee: general → dvander

Nicholas Nethercote [inactive]

Comment 42

•

15 years ago

Attached file: create() and poolForSize() histograms

To follow up on comment 25, attached are some stats for the 40-tab/cad-comic/Linux64 case: histograms showing the sizes passed to create() and poolForSize(). The short version:

- For create(), 75% of the calls have the minimum size (16KB). On Windows, where the minimum size is 64KB, I estimate 97% would have the minimum size.

- For poolForSize(), something like 80%+ of the sizes are small, eg. < 200 bytes. Most of the rest are a few thousand. The biggest is 134268.

Nicholas Nethercote [inactive]

Comment 43

•

15 years ago

Attached file: scripts histogram

Same experimental setup as comment 42. This shows a histogram of the URLs of the scripts compiled, as reported by the "compiling script" line produced with JMFLAGS=scripts, with the "line" and "length" part removed. Basically, this agrees with what bz said in comment 0. There's lots of standard stuff there: jquery, facebook, twitter, google-analytics, etc.

Nicholas Nethercote [inactive]

Comment 44

•

15 years ago

For the same case, 60% of the code is stubs (stubcc.size()), 40% is the rest (masm.size()).

Nicholas Nethercote [inactive]

Comment 45

•

15 years ago

The |cx->calloc(totalBytes)| call in finishThisUp() is responsible for a lot of memory use as well. I just did a not-quite-40-tabs Massif run in which create() was responsible for 108MB and the cx->calloc() was responsible for 67MB. (Massif works much better with Firefox if you use --smc-check=all; of all people I should have remembered this.) Presumably fragmentation is less of an issue there(?), which means it's just a lot of data.

Here's a breakdown of the different components of the calloc() in a 40-tab run:

  sizeof(JITScript);                                     5,214,888
  sizeof(void *) * script->length;                      34,990,208
#if defined JS_MONOIC
  sizeof(ic::MICInfo) * mics.length();                   2,532,624
  sizeof(ic::CallICInfo) * callICs.length();            10,332,480
  sizeof(ic::EqualityICInfo) * equalityICs.length();       339,944
  sizeof(ic::TraceICInfo) * traceICs.length();             333,408
#endif
#if defined JS_POLYIC
  sizeof(ic::PICInfo) * pics.length();                  37,873,416
  sizeof(ic::GetElementIC) * getElemICs.length();        3,216,416
  sizeof(ic::SetElementIC) * setElemICs.length();          716,616
#endif
  sizeof(CallSite) * callSites.length();                 1,332,024
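To see which components dominate, the breakdown above can be turned into shares of the total. A minimal sketch: the byte counts are copied from this comment, while the dictionary layout and variable names are just illustrative choices.

```python
# Byte counts from the 40-tab calloc() breakdown in this comment.
CALLOC_BYTES = {
    "sizeof(JITScript)": 5_214_888,
    "sizeof(void *) * script->length": 34_990_208,
    "ic::MICInfo": 2_532_624,
    "ic::CallICInfo": 10_332_480,
    "ic::EqualityICInfo": 339_944,
    "ic::TraceICInfo": 333_408,
    "ic::PICInfo": 37_873_416,
    "ic::GetElementIC": 3_216_416,
    "ic::SetElementIC": 716_616,
    "CallSite": 1_332_024,
}

total = sum(CALLOC_BYTES.values())
shares = {name: size / total for name, size in CALLOC_BYTES.items()}

# Largest components first, with each one's share of the total.
for name, size in sorted(CALLOC_BYTES.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:34s} {size:>12,d} {shares[name]:6.1%}")
print(f"{'total':34s} {total:>12,d}")
```

Two entries, the PICInfo array and the per-bytecode pointer table, account for roughly three quarters of the total between them.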

Luke Wagner [:luke]

Comment 46

•

15 years ago

On an OSX64 build, sizeof(PICInfo) is 136 while sizeof(BasePolyIC::ExecPoolVector) is 32. Similarly sizeof(CallICInfo) is 96 while sizeof(CallICInfo::pools) is 24. Extrapolating, this is 8.9MB + 2.5MB = 11.4MB (or 11%) of the calloc() breakdown reported in comment 45. IIUC, these fields are only used to release ExecutablePools when the whole JITScript is released. Thus, it seems like these fields could be removed and the corresponding ExecutablePool*'s stored in the ics' JITScript's execPools.
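The extrapolation can be reproduced from the comment 45 totals. A sketch, assuming the OSX64 struct sizes quoted here also apply to the run measured in comment 45 (the totals do divide evenly by 136 and 96, which is at least consistent with that):

```python
PIC_TOTAL = 37_873_416      # sizeof(ic::PICInfo) * pics.length(), comment 45
CALL_IC_TOTAL = 10_332_480  # sizeof(ic::CallICInfo) * callICs.length(), comment 45
CALLOC_TOTAL = 96_882_024   # sum of all components listed in comment 45

SIZEOF_PICINFO = 136        # OSX64 figures quoted in this comment
SIZEOF_EXEC_POOL_VECTOR = 32
SIZEOF_CALLICINFO = 96
SIZEOF_CALLIC_POOLS = 24

pic_count = PIC_TOTAL // SIZEOF_PICINFO
call_ic_count = CALL_IC_TOTAL // SIZEOF_CALLICINFO

# Bytes spent purely on the per-IC pool bookkeeping fields.
pic_pool_bytes = pic_count * SIZEOF_EXEC_POOL_VECTOR
call_pool_bytes = call_ic_count * SIZEOF_CALLIC_POOLS
pool_bytes = pic_pool_bytes + call_pool_bytes

print(pic_pool_bytes, call_pool_bytes, pool_bytes)
print(f"{pool_bytes / CALLOC_TOTAL:.1%} of the calloc() total")
```

This lands within rounding of the 8.9MB + 2.5MB estimate given in the comment.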

Julian Seward [:jseward]

Comment 47

•

15 years ago

(In reply to comment #45) > sizeof(void *) * script->length; 34,990,208 611400 should improve that significantly.

Julian Seward [:jseward]

Comment 48

•

15 years ago

> Here's a breakdown of the different components of the calloc() in a
> 40-tab run:

That's interesting. The effect of the 611400 fix is shown below (20 tabs). With that in place, the PICInfo is by far the largest remaining component, so we should next see if we can get rid of them as per Luke's comment 46.

We're still skirting around the central issue of why (or indeed, does?) the jit create so much code, but I guess we'll get to that.

Pre 611400

   2327400  sizeof(JITScript)
  15604256  sizeof(void *) * script->length
   1130976  sizeof(ic::MICInfo) * mics.length()
   4619040  sizeof(ic::CallICInfo) * callICs.length()
    185504  sizeof(ic::EqualityICInfo) * equalityICs.length()
    148944  sizeof(ic::TraceICInfo) * traceICs.length()
  16858832  sizeof(ic::PICInfo) * pics.length()
   1432368  sizeof(ic::GetElementIC) * getElemICs.length()
    315936  sizeof(ic::SetElementIC) * setElemICs.length()
    595368  sizeof(CallSite) * callSites.length()

Post 611400

   2415392  sizeof(JITScript)
    742240  sizeof(NativeMapEntry) * nNmapLive
   1131312  sizeof(ic::MICInfo) * mics.length()
   4621536  sizeof(ic::CallICInfo) * callICs.length()
    185504  sizeof(ic::EqualityICInfo) * equalityICs.length()
    149040  sizeof(ic::TraceICInfo) * traceICs.length()
  16872976  sizeof(ic::PICInfo) * pics.length()
   1432928  sizeof(ic::GetElementIC) * getElemICs.length()
    316008  sizeof(ic::SetElementIC) * setElemICs.length()
    595680  sizeof(CallSite) * callSites.length()
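Summing the two 20-tab tables makes the size of the 611400 win explicit. A sketch using only the numbers in this comment; the short dictionary keys are labels chosen for illustration.

```python
PRE_611400 = {
    "JITScript": 2_327_400,
    "native map": 15_604_256,   # sizeof(void *) * script->length
    "MICInfo": 1_130_976,
    "CallICInfo": 4_619_040,
    "EqualityICInfo": 185_504,
    "TraceICInfo": 148_944,
    "PICInfo": 16_858_832,
    "GetElementIC": 1_432_368,
    "SetElementIC": 315_936,
    "CallSite": 595_368,
}
POST_611400 = {
    "JITScript": 2_415_392,
    "native map": 742_240,      # sizeof(NativeMapEntry) * nNmapLive
    "MICInfo": 1_131_312,
    "CallICInfo": 4_621_536,
    "EqualityICInfo": 185_504,
    "TraceICInfo": 149_040,
    "PICInfo": 16_872_976,
    "GetElementIC": 1_432_928,
    "SetElementIC": 316_008,
    "CallSite": 595_680,
}

pre_total = sum(PRE_611400.values())
post_total = sum(POST_611400.values())
saved = pre_total - post_total
pic_share_after = POST_611400["PICInfo"] / post_total

print(pre_total, post_total, saved)
print(f"saved {saved / pre_total:.1%}; PICInfo is now {pic_share_after:.1%} of the total")
```

Nearly all of the saving comes from the native map entry; the other rows differ only by run-to-run noise.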

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 49

•

15 years ago

(In reply to comment #46) > Thus, it seems like these fields could be removed and > the corresponding ExecutablePool*'s stored in the ics' JITScript's execPools. Then we'd have to flush all ICs on GC. Maybe not a big deal, but we can thread the execPool pointers in the IC executable code instead. Most PICInfo structs are wasting that space since they're not even polymorphic.

Luke Wagner [:luke]

Comment 50

•

15 years ago

(In reply to comment #49) Still, the releasePools calls are done for all ics in a given script, so if the JITScript held a union of all the ics' pool vectors/arrays (in a new per-JITScript vector), we could achieve the same effect.
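The shape of this proposal can be modelled in a few lines. This is a toy model only: the class and method names below are hypothetical stand-ins, not the SpiderMonkey types, and it ignores the GC-flushing concern raised in comment 49.

```python
class ExecutablePool:
    """Stand-in for a reference-counted chunk of executable memory."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1

    def release(self):
        self.refcount -= 1
        return self.refcount == 0


class JITScript:
    """Holds one shared pool list instead of one pool vector per IC."""

    def __init__(self):
        self.exec_pools = []

    def add_ic_pool(self, pool):
        # An IC that generates a stub registers the stub's pool here
        # rather than in a field of its own.
        self.exec_pools.append(pool)

    def release_pools(self):
        # Called once when the whole JITScript is destroyed.
        released = len(self.exec_pools)
        for pool in self.exec_pools:
            pool.release()
        self.exec_pools.clear()
        return released


script = JITScript()
pools = [ExecutablePool(f"stub-{i}") for i in range(3)]
for pool in pools:
    script.add_ic_pool(pool)
released = script.release_pools()
print(released)
```

ICs that never generate a stub then cost nothing in pool bookkeeping, which is where the per-IC saving estimated in comment 46 would come from.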

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 51

•

15 years ago

Good idea! Patch soon.

Julian Seward [:jseward]

Comment 52

•

15 years ago

Attached file JMFLAGS=scripts output created by loading one extra cad tab — Details

Julian Seward [:jseward]

Comment 53

•

15 years ago

Attached file JMFLAGS=scripts,jsops output created by loading one extra cad tab — Details

Julian Seward [:jseward]

Comment 54

•

15 years ago

(In reply to comment #52, comment #53) > JMFLAGS=scripts{,jsops} output created by loading one extra cad tab As an attempt to figure out what extra stuff is compiled for each new tab. I could also include the JMFLAGS=insns output, but the above two are already pretty huge and I don't have a clue what they signify, if anything.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 55

•

15 years ago

> Good idea! Patch soon. Could we please keep this bug as a metabug and put all patches in bugs blocking this one? That way we don't run into trouble with partial fixes landing and then not knowing what to do with this bug.

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 616310

Nicholas Nethercote [inactive]

Comment 56

•

15 years ago

I filed bug 616310 to reduce the fragmentation in the allocator.

Nicholas Nethercote [inactive]

Comment 57

•

15 years ago

Also, this is blocking beta9, but it's not clear what the criteria is for deciding that it's been fixed.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 58

•

15 years ago

Yeah. The method JIT is going to add *some* memory usage, we can't block on 3.6 parity. Let's investigate where we can easily reduce bad memory use, file bugs on those, and then unblock this.

Nicholas Nethercote [inactive]

Comment 59

•

15 years ago

The calloc'd space is being worked on, as is fragmentation. That just leaves the actual JITted native code. We still need more data on how that is broken up; currently we only have comment 44 which isn't much.

David Mandelin [:dmandelin]

Comment 60

•

15 years ago

(In reply to comment #57) > Also, this is blocking beta9, but it's not clear what the criteria is for > deciding that it's been fixed. (In reply to comment #58) > Yeah. The method JIT is going to add *some* memory usage, we can't block on 3.6 > parity. Let's investigate where we can easily reduce bad memory use, file bugs > on those, and then unblock this. Clarification: I set this to block beta9 as an indicator that we should be working on it now, because it's important and will take an unknown amount of time. So I agree with dvander that we should unblock it once we have a good analysis of the problem and have filed well-defined sub-bugs.

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Updated

•

15 years ago

Depends on: 616367

Nicholas Nethercote [inactive]

Comment 61

•

15 years ago

(In reply to comment #59)
> The calloc'd space is being worked on, as is fragmentation. That just leaves
> the actual JITted native code. We still need more data on how that is broken
> up; currently we only have comment 44 which isn't much.

<njn> dvander: any ideas how to space-profile JM JITted code?
<dvander> njn, there's probably a few things of interest: sync blocks (code emitted by FrameState::sync/syncAndKill), code generated by fallibleVMCall, code generated by FrameState::merge, and then Everything Else
<dvander> Assembler has a size() function so it should be easy to compute before/after
<njn> cool, that's a good start

Nicholas Nethercote [inactive]

Comment 62

•

15 years ago

Attached patch patch instrumenting JM code creation — Details — Splinter Review

- fallibleVMCall: 13,326,945
- sync:            3,635,958
- syncAndKill:     1,961,004
- merge:           3,241,445
- everything:     44,001,154

I used the attached patch to get these numbers, which are for a 10 tab cad-comic.com session. Looks like ~50% of the code size isn't covered by the above four functions.

Nicholas Nethercote [inactive]

Comment 63

•

15 years ago

In case it wasn't clear, in comment 62 the "everything" line counts *all* code generated; it's *not* "everything else". If you subtract the first four counts from the last count you get 21,835,802 bytes. That is the "everything else" number.
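The subtraction described above, spelled out with the comment 62 numbers (a sketch; only the variable names are new):

```python
# Bytes of generated code attributed to each instrumented function (comment 62).
COUNTED = {
    "fallibleVMCall": 13_326_945,
    "sync": 3_635_958,
    "syncAndKill": 1_961_004,
    "merge": 3_241_445,
}
EVERYTHING = 44_001_154  # all code generated, including the four above

counted_total = sum(COUNTED.values())
everything_else = EVERYTHING - counted_total

for name, size in COUNTED.items():
    print(f"{name:16s} {size:>12,d} {size / EVERYTHING:6.1%}")
print(f"{'everything else':16s} {everything_else:>12,d} {everything_else / EVERYTHING:6.1%}")
```

Of the four instrumented sources, fallibleVMCall is the largest at about 30% of all generated code.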

Robert Sayre

Comment 64

•

15 years ago

Attached image massif visualizer screenshot — Details

Just to share some knowledge, I think massif in combination with the KDE program massif-visualizer <https://projects.kde.org/projects/kdereview/massif-visualizer> can help us get our heads around this problem. The attached screen shot has an issue in that it is only identifying our allocator wrapper functions, but valgrind has a flag to consider a function an allocator. Doing enough of that should get us some pretty informative data that many engineers can understand, without having to wade through massif logs.

ff-massif/dist/bin$> LD_LIBRARY_PATH=. valgrind --tool=massif --smc-check=all ./firefox-bin

mozconfig:

mk_add_options MOZ_MAKE_FLAGS=-j8
. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-massif
ac_add_options --enable-optimize=-O1
ac_add_options --disable-debug
ac_add_options --enable-tests
ac_add_options --enable-valgrind
ac_add_options --disable-jemalloc

Julian Seward [:jseward]

Comment 65

•

15 years ago

Attached file JMFLAGS=insns output w/ boundaries of njn's annotations shown — Details

(In reply to comment #62) > Looks like ~50% of the code size isn't covered by the > above four functions. Yes. This attachment is the JMFLAGS=insns output for a -j -m -p run of bitops-3bit-bits-in-byte.js. I enhanced your c62 patch so as to make it clear in the output which insns are covered and which aren't (aren't = the areas not inside an "XXXX BEGIN" ... "XXXX END" section) dvander, can you glance at this and see if the non-counted areas are generated by any specific part of the compiler, that we can add counter(s) for?

Nicholas Nethercote [inactive]

Comment 66

•

15 years ago

(In reply to comment #64) > > Just to share some knowledge, I think massif in combination with the KDE > program massif-visualizer > <https://projects.kde.org/projects/kdereview/massif-visualizer> can help us get > our heads around this problem. The attached screen shot has an issue in that it > is only identifying our allocator wrapper functions, but valgrind has a flag to > consider a function an allocator. Doing enough of that should get us some > pretty informative data that many engineers can understand, without having to > wade through massif logs. > > ff-massif/dist/bin$> LD_LIBRARY_PATH=. valgrind --tool=massif --smc-check=all > ./firefox-bin The function-is-an-allocator flag might not work quite like you expect... basically it assumes that the function you name is a wrapper for malloc/new. In a lot of cases (ie. the pertinent ones here) the functions are wrappers for mmap, IIRC Massif will ignore them by default. However, there is an option --pages-as-heap which basically says "ignore the malloc/new level, just measure everything at the mmap/page level". The results are more coarse-grained and a bit harder to interpret but it includes *all* memory allocations. I've found that profiling Firefox without this flag is pretty useless because so much allocation doesn't go via malloc/new; indeed, I implemented that option exactly because of this. Anyway, it's clear from looking at --pages-as-heap output that the vast majority of the increase is due to two allocations in finishThisUp() -- the calloc() call, which is being attacked in multiple ways (bug 611400, bug 616367), and the executable code allocation, which has had less attention so far (bug 616310).

The 8472

Comment 67

•

15 years ago

There seems to be a lot of work on optimizing memory allocation, reducing the footprint of the compiled code etc. But ultimately memory usage will still increase in a linear manner with the number of tabs.

Wouldn't it be better to just keep the compiled code around where it's necessary and discard it everywhere else? Considering that only a finite amount of code can be run at any point in time (since we can't have more than 100% CPU load anyway) there should only be a finite amount of code that needs compiling in most cases, i.e. usually whatever is running in the inner loops. Background tabs are either completely idle or only run lightweight scripts at long intervals; otherwise having a bunch of such tabs open would lead to intolerable JS load anyway.

I don't know when mjit-ed code is normally discarded (when the window object is GCed?). But I think it would make more sense if it was subject to some kind of garbage collection. If compiling all the code on an average website takes around 30ms, as mentioned in a previous comment, then even a rather simple scheme such as "discard compiled method if it has not been used for N seconds" would already result in most memory being freed without much of a performance penalty. That is of course assuming we have method-level hit counters that could be used for such a scheme or that adding them wouldn't be a significant performance hit.

TL;DR: Most memory optimizations discussed don't improve scalability, they only improve the footprint by a constant factor. Actively discarding compiled code on the other hand can turn it from O(n) to O(1) memory usage without a significant performance hit.
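The "discard compiled method if it has not been used for N seconds" policy could look roughly like the following. This is a toy model, not how the method JIT tracks code: `JitCodeCache` and its methods are hypothetical names, and time is passed in explicitly so the behaviour is easy to follow.

```python
class JitCodeCache:
    """Tracks compiled code per script and drops entries idle for too long."""

    def __init__(self, max_idle_seconds):
        self.max_idle_seconds = max_idle_seconds
        self._entries = {}  # script id -> (code size in bytes, last-used time)

    def record_use(self, script_id, code_bytes, now):
        # Compiling or running a script refreshes its last-used timestamp.
        self._entries[script_id] = (code_bytes, now)

    def discard_idle(self, now):
        """Drop code unused for more than max_idle_seconds; return bytes freed."""
        idle = [sid for sid, (_, last_used) in self._entries.items()
                if now - last_used > self.max_idle_seconds]
        freed = 0
        for sid in idle:
            freed += self._entries.pop(sid)[0]
        return freed

    def live_bytes(self):
        return sum(size for size, _ in self._entries.values())


cache = JitCodeCache(max_idle_seconds=30)
cache.record_use("background-tab.js", 1000, now=0)
cache.record_use("foreground-tab.js", 2000, now=25)
freed = cache.discard_idle(now=40)  # only the background script is idle > 30s
print(freed, cache.live_bytes())
```

With a policy like this, live code size tracks what has run recently rather than the number of open tabs, which is the O(n) to O(1) argument made in the TL;DR.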

David Mandelin [:dmandelin]

Comment 68

•

15 years ago

(In reply to comment #67) > TL;DR: Most memory optimizations discussed don't improve scalability, they only > improve the footprint by a constant factor. Actively discarding compiled code > on the other hand can turn it from O(n) to O(1) memory usage without a > significant performance hit. This is exactly my #5 in comment 25. I do think it's the next thing to try after the things we're doing now. It's non-trivial, though. If someone has the time to try it out, that would be nice.

David Mandelin [:dmandelin]

Comment 69

•

15 years ago

The key dependent bugs are now blocking, so this one doesn't need to anymore.

blocking2.0: beta9+ → ---

Nicholas Nethercote [inactive]

Comment 70

•

15 years ago

(In reply to comment #69) > The key dependent bugs are now blocking, so this one doesn't need to anymore. We have bugs filed to improve fragmentation, and to reduce the size of the calloc() in finishThisUp(). But we don't have anything for the "JM generates an awful lot of native code" issue, other than bug 577359, but we don't have any data on whether that'll actually help. So should we have another bug(s) on the "lots of native code" issue? Either to reduce it, or to discard it more aggressively as per comment 67?

David Mandelin [:dmandelin]

Comment 71

•

15 years ago

(In reply to comment #70) > (In reply to comment #69) > > The key dependent bugs are now blocking, so this one doesn't need to anymore. > > We have bugs filed to improve fragmentation, and to reduce the size of the > calloc() in finishThisUp(). But we don't have anything for the "JM generates > an awful lot of native code" issue, other than bug 577359, but we don't have > any data on whether that'll actually help. > > So should we have another bug(s) on the "lots of native code" issue? Either to > reduce it, or to discard it more aggressively as per comment 67? That is an excellent question. I asked for data in bug 598466 to help guide that decision, and bsmedberg posted some good advice for how to get the data, but nothing has come back yet. Reducing the memory usage seems pretty hard at this stage, because it has to be something low-risk for us to meet our release target. If someone has some simple ideas and spare time, by all means take a crack at it. Throwing away code seems like the easiest approach, but there are a couple of problems to solve there: we'd need some good code mem usage metering, and we need to purge call ICs when we do that. We might take a perf hit if we discard too early, but I don't think we'll be able to cook up memory pressure detection on short notice, except the easy kind of VirtualAlloc returning NULL in the code memory allocator. If we're going to try that, we should really start now.

Dão Gottwald [:dao]

Updated

•

15 years ago

blocking2.0: --- → ?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 72

•

15 years ago

see comment #69 - We've identified and filed a few individual, short-term bugs where the method JIT allocates memory unnecessarily. I don't think we can justify blocking a release on this meta bug. It's not clear what the goal would be (the method JIT must use *some* memory), and each additional fix gets increasingly more risky and difficult to implement.

blocking2.0: ? → -

The 8472

Comment 73

•

15 years ago

(In reply to comment #72) > It's not clear what the goal would > be (the method JIT must use *some* memory), and each additional fix gets > increasingly more risky and difficult to implement. That does not justify a *+100% memory usage increase* that also scales with the number of tabs. If you take a look at various measurements in bug 598466 you'll notice that methodjit is the worst offender by far. You can't just bloat up firefox and then say "it's too late to fix it". Especially not for a single feature that's not directly visible to the user in many cases. Even more so when you consider that there are bunch of other features which *also* increase memory usage and it is most likely that they won't be able to fully fix that either. Just to emphasize: We have an overall increase of approximately +200% to +250% in memory footprint since firefox 3.6. And methodjit is responsible for half of that increase.

Michael Lefevre

Comment 74

•

15 years ago

(In reply to comment #72) > I don't think we can > justify blocking a release on this meta bug. It's not clear what the goal would > be (the method JIT must use *some* memory), and each additional fix gets > increasingly more risky and difficult to implement. The goal would be not regressing memory usage by hundreds of megabytes compared to 3.6. Maybe it's just that annoying vocal minority, but there's a good chunk of people that would rather just lose the few milliseconds that methodjit gains in order to reduce the memory usage. If it's too late to fix it, then it could be disabled for 4.0 and enabled in a later release, or the 4.0 release could be pushed back another month or two...

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 75

•

15 years ago

I think it does justify it. Firefox 4 is blocked on being fast, which feedback indicates is a feature people like. We're certainly not going to pref that off, though if you're using many, many tabs - and the browser is crashing due to out-of-memory, the pref is there to toggle :) Just to be clear: we've measured some sizable memory wins that are easy targets for 4.0. They're blocking b9 (since it really is "too late", and we need the feedback sooner) and hanging off this bug. The real risk is that users will run out of the limited 2GB Windows address space faster, and if that ends up being a major problem, we can dig deeper. There are a bunch of great ideas here.

nivtwig

Comment 76

•

15 years ago

(In reply to comment #21)
> In the 80 tabs testcase and generally if you browse multiple tabs in a specific
> website, most of the script URLs will be the same for all the tabs.
>
> So why is 5MB needed per tab, is it possible to share the compiled code between
> the tabs?

Boris Zbarsky replied to that in comment #22:
> The compiled code is typically specialized to the specific global object (and
> state of said global object) that it's compiled for.

But I am not sure I completely understand why compiled code can't be shared in some situations. Can someone or Boris elaborate on that with a more detailed answer?

1. Boris writes that the compiled code is *typically* specialized to the specific global object. What does "typically" mean? I guess typically is not "always", so does it mean that a certain percentage of methods are specialized, but the other methods are not specialized to the global object, and therefore can be compiled once and stored once, and their compiled code be shareable among all tabs that use the same script where the method resides? In this case memory can be saved by not making copies of the compiled code of the method for each tab. Can someone estimate what is the percentage of specialized methods vs non specialized? Can you give small examples of cases where it needs to be specialized, and cases where it doesn't?

2. How much performance does the specialization of the methods to the global object bring vs compiling the methods but not specializing to the global object (i.e. accesses to the global object will not be hard-coded in the compiled code). If the added performance is not much, it may be possible to save hundreds of megabytes of memory by having a single copy of the compiled methods instead of 80 copies (in the case of 80 tabs), with a small performance hit. Will this (small?) performance hit bring the javascript compiler performance to be non-competitive with the other JS engines out there?

Dão Gottwald [:dao]

Comment 77

•

15 years ago

(In reply to comment #75)
> I think it does justify it. Firefox 4 is blocked on being fast, which feedback
> indicates is a feature people like.

Methodjit managed to massively spoil the speed of Firefox 4 for me several times when it caused swapping...

> We're certainly not going to pref that off,
> though if you're using many, many tabs - and the browser is crashing due to
> out-of-memory, the pref is there to toggle :)

Pointing users being confronted with an unusable Firefox to a hidden pref is not a solution.

blocking2.0: - → ?

David Anderson [:dvander] - inactive, e-mail if emergency

Assignee

Comment 78

•

15 years ago

How did you narrow down the method JIT as the cause of swapping problems?

Dão Gottwald [:dao]

Comment 79

•

15 years ago

I disabled it and the problem vanished.

The 8472

Comment 80

•

15 years ago

(In reply to comment #76)
> > So why is 5MB needed per tab, is it possible to share the compiled code between
> > the tabs?

This approach is nice on paper for this particular test case, but in practice it won't save as much because large sessions contain content from many different sites. You would basically fix the testcase and we could pat ourselves on the back... and the real problem would still be there. And I doubt we can share code for, let's say, Google Analytics across different compartments.

Things get even dicier if you consider that any method could be overloaded; even many of the built-in objects (Array, String, Object) could be modified on some tabs. Making code shareable would basically prevent inlining. The Java HotSpot VM does make overly optimistic optimizations like these, but it has the ability to back out and recompile if it detects overloading that makes inlined JITed code invalid; methodjit doesn't have that ability.

The 8472

Comment 81

•

15 years ago

(In reply to comment #75)
> I think it does justify it. Firefox 4 is blocked on being fast, which feedback
> indicates is a feature people like.

Well, the problem can easily be framed as a performance problem too, such as the swapping problem described in comment 77. Or consider a hypothetical benchmark that would measure performance for tons of dynamically generated scripts (eval, new Function(), generated script tags). If we compile everything and then run it only a few times this would actually be slower than other browsers/firefox without methodjit (see comment 16).

Boris Zbarsky [:bzbarsky]

Reporter

Comment 82

•

15 years ago

> What does "typically" mean? I guess typically is not "always" It's "always", to a first approximation. Any code that uses any global variable is specialized to the global.

Robert Sayre

Comment 83

•

15 years ago

(In reply to comment #79) > I disabled it and the problem vanished. I don't think an edit-war with the blocking flag is going to be productive. Why don't you file a bug on the actual problem you're experiencing instead of making noise in this one? The data in this bug does show some overhead problems, and we're working on them. But your swapping problem could be just a leak, or yet another problem.

Julian Seward [:jseward]

Comment 84

•

15 years ago

(In reply to comment #64) > Just to share some knowledge, I think massif in combination with the KDE > program massif-visualizer Yeah, +1 for that. It facilitates doing something that isn't really feasible from the text-only output: seeing how space use evolves over time. Some stuff hangs around forever, other stuff spikes and then disappears. For example, having loaded 20 cad-comic tabs, I see from the profile that the peak heap load is caused by 26.3MB of transient allocations from sqlite3malloc, driven by nsURLClassifierDBService. And I'm thinking (1) that seems pretty extravagant! and (2) I wouldn't easily have noticed it in the text only output. Recommended.

David Mandelin [:dmandelin]

Comment 85

•

15 years ago

This bug doesn't block because it doesn't have a well-defined fixed state. JITs are explicitly a trade of space for time, so space must increase by some amount, which is hard to predict in advance. I hope that people who are concerned about this issue can see that we have put a high priority on mitigating the memory increase, as evidenced by all the work in the discrete bugs that help the problem, and the fact that they are marked beta9+. I think things are going to look a lot better once those bugs are fixed, and also bug 617505. I will file a bug for the discarding-jitcode idea. I think it's doable in the time we have, and I think it will more or less satisfy the demands being made here.

blocking2.0: ? → ---

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 617656

Nicholas Nethercote [inactive]

Comment 86

•

15 years ago

(In reply to comment #75) > I think it does justify it. Firefox 4 is blocked on being fast, which feedback > indicates is a feature people like. Speed is definitely one of the most common complaints about Firefox 3.6. "Bloat" is another one. And although "bloat" is a vague and subjective term, I bet lots of people will fire up their memory measurement tool of choice and conclude "bah, Firefox 4.0 is even more bloated, I'm going back to Chrome" > The real risk is that users will run out of the limited 2GB Windows address > space faster, and if that ends up being a major problem, we can dig deeper. A million times yes. This has enormous potential to (legitimately) hurt people's perception of Firefox.

The 8472

Comment 87

•

15 years ago

(In reply to comment #86) > Speed is definitely one of the most common complaints about Firefox 3.6. > "Bloat" is another one. Have similar memory comparisons been done between 2.0 and 3.x? I wonder if there might be some lower-hanging fruit that have been introduced in the past and not been found because nobody bothered to look for them in the first place.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 88

•

15 years ago

Between 2.0 and 3.0 we made a major effort to reduce memory usage and fragmentation (including switching to jemalloc), no? Between 3.0 and 3.6 is an interesting question. But not for this bug...

Nicholas Nethercote [inactive]

Comment 89

•

15 years ago

(In reply to comment #66) > > The function-is-an-allocator flag might not work quite like you expect... > basically it assumes that the function you name is a wrapper for malloc/new. > In a lot of cases (ie. the pertinent ones here) the functions are wrappers for > mmap, IIRC Massif will ignore them by default. > > However, there is an option --pages-as-heap which basically says "ignore the > malloc/new level, just measure everything at the mmap/page level". I should add that you can use the --alloc-fn option in concert with the --pages-as-heap=yes option. Just in case anyone wanted to try.

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 619622

Nicholas Nethercote [inactive]

Comment 90

•

15 years ago

Bug 616367 didn't end up having much effect, so I filed bug 619622 as an alternative approach.

Nicholas Nethercote [inactive]

Comment 91

•

15 years ago

Looking again at the calloc'd space in finishThisUp(): Bug 611400 has been completed. Bug 619622 (which shrinks the size of an IC from 88 to 64 bytes on 32-bit and 144 to 112 on 64-bit) is almost ready. If I apply that and measure a 20-tab, 64-bit cad-comics run, I see these numbers:

   1  sizeof(JITScript);                                     2,588,992
   2  sizeof(NativeMapEntry) * nNmapLive;                      797,776
   3  sizeof(ic::MICInfo) * mics.length();                   1,242,960
   4  sizeof(ic::CallICInfo) * callICs.length();             4,984,704
   5  sizeof(ic::EqualityICInfo) * equalityICs.length();       166,408
   6  sizeof(ic::TraceICInfo) * traceICs.length();             161,904
   7  sizeof(ic::PICInfo) * pics.length();                  14,878,528
   8  sizeof(ic::GetElementIC) * getElemICs.length();        1,273,800
   9  sizeof(ic::SetElementIC) * setElemICs.length();          392,832
  10  sizeof(CallSite) * callSites.length();                   642,060

  total:                                                    27,129,964

Compare this with comment 45 which was a 40-tab run with a total of 96,882,024. If we halve that to account for 40 tabs vs 20 tabs we get 48,441,012. So 27,129,964 is a 1.79x reduction. Pretty good!

PICs still dominate, CallICs are also important, the rest probably aren't worth bothering with. I'll see if there's any more fat to trim.
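The 1.79x figure can be checked directly. A sketch that, like the comment, assumes the calloc'd space scales linearly with tab count so the 40-tab total can simply be halved:

```python
# Component sizes from this comment (20 tabs, 64-bit, with bug 619622 applied).
NEW_COMPONENTS = [
    2_588_992, 797_776, 1_242_960, 4_984_704, 166_408,
    161_904, 14_878_528, 1_273_800, 392_832, 642_060,
]
OLD_TOTAL_40_TABS = 96_882_024  # sum of the comment 45 breakdown

new_total = sum(NEW_COMPONENTS)
old_scaled = OLD_TOTAL_40_TABS * 20 // 40  # halve to compare 20 tabs with 20 tabs
ratio = old_scaled / new_total

print(new_total, old_scaled, round(ratio, 2))
print(f"per tab: {new_total / 20:,.0f} bytes")
```

On these numbers the calloc'd data is now a little under 1.4MB per tab.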

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 619849

Nicholas Nethercote [inactive]

Comment 92

•

15 years ago

I just filed bug 619849 which shrinks the size of JITScript on 64-bit platforms. It's a small but easy win. Beyond that, I'm out of ideas -- I can't see how to shrink PICInfo or CallICInfo. I wonder if fewer ICs could be allocated? Probably too hard for Fx 4.0.

Brian Hackett [Laid off!]

Comment 93

•

15 years ago

(In reply to comment #92) > I just filed bug 619849 which shrinks the size of JITScript on 64-bit > platforms. It's a small but easy win. > > Beyond that, I'm out of ideas -- I can't see how to shrink PICInfo or > CallICInfo. I wonder if fewer ICs could be allocated? Probably too hard for > Fx 4.0. Bug 617656, which I'll land tomorrow, will take a lot of pressure off of the JITScript calloc and code memory allocation for long lived browser sessions.

Nicholas Nethercote [inactive]

Comment 94

•

15 years ago

(In reply to comment #93)
>
> Bug 617656, which I'll land tomorrow, will take a lot of pressure off of the
> JITScript calloc and code memory allocation for long lived browser sessions.

Definitely, that patch looks like it'll be really good for keeping code size down over time. And we've made good progress on reducing the peak size as well. I'm feeling much better about this bug now.

Nicholas Nethercote [inactive]

Comment 95

•

15 years ago

All the bugs blocking this meta-bug have been fixed! Closing.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

LegNeato

Updated

•

15 years ago

No longer depends on: 617656

Ed Morley [:emorley]

Comment 96

•

15 years ago

Were there any other viable usage saving ideas in the 90 comments above that haven't been filed yet? ie: Comment 61 onwards? (Having trouble telling since there were lots of different ideas in comment 60+ and a lot of it is over my head).

LegNeato

Updated

•

15 years ago

No longer depends on: 619849

Nochum Sossonko [:Natch]

Updated

•

15 years ago

Depends on: 619849

Ed Morley [:emorley]

Updated

•

15 years ago

Depends on: 617656

The 8472

Comment 97

•

15 years ago

Numbers for the current nightly:

962/988/1360 with methodjit on (initial value after startup and 1st GC cycle)
793/818/1218 with methodjit on (after waiting 15+ minutes. those extremely long GC cycles make it hard to test repeatedly)
679/704/1084 with methodjit off

Mike Shaver (:shaver emeritus)

Comment 98

•

15 years ago

I've got an about:memory reporter for the method JIT code stuff underway in bug 623281, as well.

Nicholas Nethercote [inactive]

Comment 99

•

15 years ago

The 8472, what's the workload for comment 97? I'm trying to compare those numbers with the ones you put in bug 598466 comment 133. Also, can you remind me what the three measurements are? Virtual/something/private?

The 8472

Comment 100

•

15 years ago

(In reply to comment #99)
> The 8472, what's the workload for comment 97? I'm trying to compare those
> numbers with the ones you put in bug 598466 comment 133.
>
> Also, can you remind me what the three measurements are?
> Virtual/something/private?

private/working set/virtual

The workload in comment 97 is 80 tabs; in bug 598466 comment 133 it's 40 tabs, to compare with old data.

Nicholas Nethercote [inactive]

Comment 101

•

15 years ago

I'll reopen, since the method JIT still uses a lot of memory. Please note that reopening the bug doesn't mean much by itself. For the situation to change we'll need concrete ideas and follow-up. Reading back through the comments I don't see anything in the way of low-hanging fruit.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

The 8472

Comment 102

•

15 years ago

(In reply to comment #101)
> I'll reopen, since the method JIT still uses a lot of memory. Please note that
> reopening the bug doesn't mean much by itself. For the situation to change
> we'll need concrete ideas and follow-up. Reading back through the comments I
> don't see anything in the way of low-hanging fruit.

First of all we need to figure out what's left (compared to methodjit=off) after everything has been GCed. Are those recompiled scripts or something else that isn't covered by the GC?

If it's recompiled scripts then lazy compiling would help, but that's not exactly low-hanging fruit according to comment 25. If lazy compilation were implemented, we would need hit counters for code (compile after 100 hits). Those counters could also be used to discard compiled code more aggressively and keep only hot code.

If it's something else then we should see if it can be included in the garbage collector cycles too.

Another issue is that bug 617656 does not help with peak virtual memory usage, which is currently a problem on 32-bit systems, leading to OOM crashes. The discarding is strictly time-based right now, as is garbage collection.
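A rough sketch of the hit-counter idea, written in Python rather than SpiderMonkey's C++ and purely illustrative. Every name here (`Script`, `JitCache`, `COMPILE_THRESHOLD`, `on_gc`) is hypothetical, and the threshold of 100 is only the figure mentioned in the comment. A script is compiled once its counter reaches the threshold, and at GC time any compiled script that was not hit since the previous cycle loses its code, so only hot code stays resident.

```python
from dataclasses import dataclass
from typing import List

COMPILE_THRESHOLD = 100  # hypothetical: "compile after 100 hits"


@dataclass
class Script:
    name: str
    hits: int = 0            # runs counted toward the compile threshold
    hits_since_gc: int = 0   # runs since the last GC cycle
    compiled: bool = False   # stands in for "has JIT code attached"


class JitCache:
    def __init__(self, threshold: int = COMPILE_THRESHOLD):
        self.threshold = threshold

    def run(self, script: Script) -> str:
        """Record one execution and report which engine would run it."""
        script.hits += 1
        script.hits_since_gc += 1
        if not script.compiled and script.hits >= self.threshold:
            script.compiled = True  # lazy compile once the script is hot
        return "jit" if script.compiled else "interpreter"

    def on_gc(self, scripts: List[Script], keep_if_hits: int = 1) -> int:
        """Discard JIT code for scripts that went cold; return how many."""
        discarded = 0
        for s in scripts:
            if s.compiled and s.hits_since_gc < keep_if_hits:
                s.compiled = False
                s.hits = 0  # must become hot again before recompiling
                discarded += 1
            s.hits_since_gc = 0
        return discarded
```

Raising `keep_if_hits` would make discarding more aggressive. Unlike the strictly time-based discarding described above, this variant is driven by use counts, which is the point of the suggestion.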

Matt Brubeck (:mbrubeck)

Updated

•

15 years ago

Keywords: footprint

Tomas

Comment 103

•

15 years ago

All the dependent bugs are fixed. Any change on this?

Nicholas Nethercote [inactive]

Comment 104

•

15 years ago

(In reply to comment #103)
> All the dependent bugs are fixed. Any change on this?

See comment 101. Just because all dependent bugs have been fixed doesn't mean the overall problem has been fixed; it just means all the good ideas we have had for fixing it have been done.

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 629601

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 630445

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 630447

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 631139

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 630738

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 631045

David Mandelin [:dmandelin]

Updated

•

15 years ago

Alias: JaegerShrink

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 631578

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 631706

David Mandelin [:dmandelin]

Updated

•

15 years ago

Depends on: 631714

Nicholas Nethercote [inactive]

Comment 105

•

15 years ago

(In reply to comment #25)
> 6. Delayed compilation of scripts might be worth a try, but it has some
> difficulties. bhackett's measurements show that compilation is worthwhile only
> if the code runs at least 100 iterations. But it may be hard to take advantage
> of this:
>
> - it would be really nice and simple to compile on the Nth entry to a
> function, but that is bad if it contains a loop that will run many
> iterations--in that case we should compile right away.
>
> - we could try to compile after N runs of a loop, but then we need to compile
> while we are running the script, and then jump into the compiled code in the
> middle. That doesn't sound *too* hard, but the tracer integration code had to
> do things like that and it was not easy to get right. dvander probably has more
> insight on this issue.
>
> - and then there are the inevitable tuning problems. They might not be so bad
> with simple schemes like compiling after 10 iterations, though.
>
> Compiling global or eval scripts only if they contain a loop (similar to what
> bhackett suggested) seems like an easier starting point. But even there there
> are some issues: some benchmarks eval the same script many times, in which case
> compilation may be worthwhile even if the script doesn't contain a loop.
>
> Maybe compiling loop-containing scripts right away and others on the Nth
> iteration (for N ~= 10) would be really easy and possibly helpful?

Various measurements have shown this approach is likely to make a huge improvement in JM's memory usage, i.e. somewhere between 2x and 10x depending on the details. The discussion is currently split across four bugs; see bug 631706 comment 6 for a guide.
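The last suggestion in the quoted text (compile loop-containing scripts right away, others on the Nth entry) can be sketched as a single decision function. This is a minimal illustration in Python, not the policy that any of the linked bugs implemented; `should_compile`, `has_loop`, `entry_count` and `ENTRY_THRESHOLD` are hypothetical names, and 10 is just the N suggested in the quote.

```python
ENTRY_THRESHOLD = 10  # hypothetical N, from "N ~= 10" in the quoted comment


def should_compile(has_loop: bool, entry_count: int,
                   threshold: int = ENTRY_THRESHOLD) -> bool:
    """Decide at script entry whether to method-JIT the script.

    has_loop:    the script's bytecode contains at least one loop
    entry_count: how many times the script has been entered, counting this one
    """
    if has_loop:
        # A loop may run many iterations, so compile before the first run.
        return True
    # Loop-free code has to be entered often enough to be worth compiling.
    return entry_count >= threshold


# A loop-free eval script that is run repeatedly gets compiled on its 10th entry.
first_compiled_entry = next(n for n in range(1, 100) if should_compile(False, n))
print(first_compiled_entry)
```

Because the decision is made only at entry, this variant never has to jump into compiled code in the middle of a running loop, which is the difficulty the second bullet of the quote describes.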

Nicholas Nethercote [inactive]

Updated

•

15 years ago

Depends on: 631951

Nicholas Nethercote [inactive]

Comment 106

•

15 years ago

(In reply to comment #105)
> The discussion is currently split across four bugs, see bug
> 631706 comment 6 for a guide.

Bug 631951 has been created to subsume those four bugs.

Nicholas Nethercote [inactive]

Comment 107

•

14 years ago

> Bug 631951 has been created to subsume those four bugs.

This bug has landed in the tracemonkey repository, with big reductions in the amount of space used by JM; e.g. I've seen figures ranging from 2.5x to 4.5x. Ed, The 8472, anyone else: new measurements would be welcome.

All the other open bugs [*] that are still blocking this bug will produce improvements that are tiny compared to bug 631951. For Firefox 4.0, this is as good as it's going to get (which is a lot better than it was in late November). After 4.0 is released I'll close this bug and start a new one for improvements that can go into Firefox 5 (including the aforementioned tiny-improvement bugs).

Many thanks to everyone who has contributed.

[*] The exception to that is bug 631578, which is about measurements and may lead to further bugs being filed, but not for 4.0.

The 8472

Comment 108

•

14 years ago

(In reply to comment #107)
> This bug has landed in the tracemonkey repository
> Ed, The 8472, anyone else: new measurements would be welcome.

this one?
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-02-15-03-tracemonkey/firefox-4.0b12pre.en-US.win32.zip

Nicholas Nethercote [inactive]

Comment 109

•

14 years ago

> this one?
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-02-15-03-tracemonkey/firefox-4.0b12pre.en-US.win32.zip

The patch landed on Feb 11, so that should be fine.

The 8472

Comment 110

•

14 years ago

ok, using 2011-02-15-03-tracemonkey. 40 cad-comic tabs, image discarding off, HW acceleration off, ctrl+tabbing through all tabs.

data in the following format:

                                  private/working/virtual   js/gc-heap    js/string-data   js/mjit-code
methodjit always:                 913/922/1144              90,177,536    9,542,568        237,386,742
methodjit on:                     648/657/870               80,740,352    10,222,868       38,223,577
methodjit on, 20 minutes later:   676/687/889               105,906,176   5,495,670        4,799,560
methodjit off:                    612/621/833               88,080,384    5,529,310        0
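To make the comparison explicit, the figures above can be reduced to an overhead relative to the methodjit-off run. This is only arithmetic on the posted numbers; the dictionary layout and the helper names (`private_overhead`, `mjit_code_ratio`) are made up for illustration.

```python
# (private, working, virtual, js/mjit-code) copied from the figures above
runs = {
    "methodjit always": (913, 922, 1144, 237_386_742),
    "methodjit on": (648, 657, 870, 38_223_577),
    "methodjit on, 20 minutes later": (676, 687, 889, 4_799_560),
    "methodjit off": (612, 621, 833, 0),
}

baseline_private = runs["methodjit off"][0]


def private_overhead(label: str) -> int:
    """Private-memory cost of a configuration relative to methodjit off."""
    return runs[label][0] - baseline_private


def mjit_code_ratio(before: str, after: str) -> float:
    """How many times smaller js/mjit-code is in `after` than in `before`."""
    return runs[before][3] / runs[after][3]


for label in runs:
    print(f"{label}: private overhead {private_overhead(label)}")

ratio = mjit_code_ratio("methodjit always", "methodjit on")
print(f"js/mjit-code shrinks by a factor of about {ratio:.1f}")
```

On these numbers the private overhead falls from 301 with always-compile to 36 with the default setting, and js/mjit-code is roughly 6.2 times smaller.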

Nicholas Nethercote [inactive]

Comment 111

•

14 years ago

Thanks, The 8472! That looks pretty satisfactory :)

Nicholas Nethercote [inactive]

Comment 112

•

14 years ago

Marking this fixed as discussed in comment 107. See bug 640457 for follow-ups; note that bug is about memory usage reductions in Firefox in general, not just JaegerMonkey. Please CC yourself if you're interested.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 14 years ago

Resolution: --- → FIXED

create() and poolForSize() histograms 15 years ago Nicholas Nethercote [inactive] 55.23 KB, text/plain
scripts histogram 15 years ago Nicholas Nethercote [inactive] 60.76 KB, text/plain
JMFLAGS=scripts output created by loading one extra cad tab 15 years ago Julian Seward [:jseward] 7.44 KB, application/x-bzip
JMFLAGS=scripts,jsops output created by loading one extra cad tab 15 years ago Julian Seward [:jseward] 342.25 KB, application/x-bzip
patch instrumenting JM code creation 15 years ago Nicholas Nethercote [inactive] 5.69 KB, patch
massif visualizer screenshot 15 years ago Robert Sayre 383.04 KB, image/png
JMFLAGS=insns output w/ boundaries of njn's annotations shown 15 years ago Julian Seward [:jseward] 86.65 KB, text/plain