Closed Bug 615199 (JaegerShrink) Opened 14 years ago Closed 13 years ago

Methodjit enabled causes the browser to use almost twice as much memory

Component: Core :: JavaScript Engine (defect)
Platform: x86, Windows 7
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Reporter: bzbarsky
Assigned: dvander
Keywords: memory-footprint
Attachments: 7 files

See bug 598466 comment 13 for the steps being used to reproduce, using 80 tabs instead of 40.

The relevant numbers are in bug 598466 comment 70:

baseline: 989/1006/1360
methodjit disabled: 605/635/973

The numbers are private/working/virtual memory on Windows, in MB.

So methodjit is using about 5MB per tab here.
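
As a rough check on the per-tab figure, the difference between the two private-memory numbers can be divided by the tab count. This is only a sketch: the variable names are illustrative, and the inputs are the private-memory figures and the 80-tab count quoted above.

```python
# Private-memory figures (MB) quoted above, from bug 598466 comment 70.
baseline_private_mb = 989   # default build, methodjit enabled
mjit_off_private_mb = 605   # methodjit disabled
tabs = 80                   # tab count used in the steps to reproduce

overhead_mb = baseline_private_mb - mjit_off_private_mb
per_tab_mb = overhead_mb / tabs
print(f"methodjit overhead: {overhead_mb} MB total, {per_tab_mb:.1f} MB per tab")
```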

The <script> elements that might be relevant on the page are loading these scripts:

  http://openx.blindferret.com/www/delivery/spcjs.php?id=12&amp;target=_blank
  http://platform.twitter.com/widgets.js
  http://connect.facebook.net/en_US/all.js#xfbml=1
  http://ajax.googleapis.com/ajax/libs/jquery/1.4/jquery.min.js
  http://www.google.com/jsapi?key=ABQIAAAAQ8CNTaqzPpNZ-cOX8L976hRkQ4u0u7eQGNeqk4bpF3SI_sm34hS7f9Ck-EcNDO_ay9qMUuK9sutsHA
  http://cdn.cad-comic.com/js/1289620584.js
  http://s9.addthis.com/js/widget.php?v=10
  http://www.google-analytics.com/urchin.js

plus I guess whatever subframes load...
blocking2.0: --- → ?
OS: Mac OS X → Windows 7
Blocks: 598466
Oh, and another open question here: why didn't this issue show up on talos?
Could this be caused by bug 611400?  That identifies some wasted space
allocated by mjit.  Does the patch at bug 611400 comment 6 help?
Are the data in comment 0 solid? I thought I saw a comment in bug 598466 saying there might have been some measurement error there. It seemed like maybe the difference was smaller than given in comment 0, maybe more on the order of bug 611400 effects.
> Are the data in comment 0 solid? 

Reasonably.  The measurement error is no more than +-20MB or so on the overall number, from what I've seen.  So the increase might be only 340MB, not 380MB...  or might be 420MB.

The second set of numbers in bug 598466 is a 380MB regression over only 70 tabs, so somewhat bigger per-tab than the numbers in comment 0:

  2010-09-11: 530 / 554 / 698
  2010-09-12: 910 / 928 / 1080
> Does the patch at bug 611400 comment 6 help?

Measuring now.
OK, so I see these numbers (64-bit mac, not 32-bit windows like comment 0, fwiw), using 70 tabs; the two numbers are "real mem" / "private mem" from Activity Monitor:

Vanilla: 1270MB / 1000MB
mjit off: 898MB / 603MB
patched mjit: 1230MB / 926MB

So the patch gets us 20% of the way there.  ;)
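
To make the "20% of the way there" estimate concrete, the share of the methodjit overhead removed by the patch can be computed from the three measurements above. This is a sketch using only the quoted Activity Monitor figures; the helper name `fraction_recovered` is made up for illustration.

```python
def fraction_recovered(vanilla_mb, mjit_off_mb, patched_mb):
    """Share of the methodjit overhead that the patch removes."""
    overhead = vanilla_mb - mjit_off_mb   # cost of enabling methodjit
    saved = vanilla_mb - patched_mb       # what the patch gives back
    return saved / overhead

# (vanilla, mjit off, patched mjit) in MB, 70 tabs, 64-bit Mac.
real = fraction_recovered(1270, 898, 1230)
private = fraction_recovered(1000, 603, 926)
print(f"real mem: {real:.0%} recovered, private mem: {private:.0%} recovered")
```

On these numbers the private-memory column comes out just under 20%, while the real-memory column is closer to 11%.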
Some additional info from bug 598466 comment 94...

Using:
- The STR from bug 598466 comment 87
- layers.accelerate-none = true
- layers.accelerate-all = false
- image.mem.discardable = false
- image.mem.decodeondraw = false
- javascript.options.methodjit.content = [as below]
- javascript.options.methodjit.chrome = false
- Several runs of each build to verify results

Nightlies from Tracemonkey: (figures in MB; private/working/virtual)
2010-09-12: Methodjit=true 912/929/1084 ; Methodjit=false 635/656/809
2010-09-10: Methodjit=true 937/956/1115 ; Methodjit=false 642/665/825
2010-09-04: Methodjit=true 868/889/1050 ; Methodjit=false 635/659/824
2010-09-02: Methodjit=true 866/888/1047 ; Methodjit=false 641/665/832
2010-09-01: Methodjit=true 925/947/1109 ; Methodjit=false 642/666/824
2010-08-31: 651 / 673 / 837 [methodjit pref didn't exist then]
2010-08-30: 645 / 668 / 834 [ditto]

Last good nightly: 2010-08-31 First bad nightly: 2010-09-01
Pushlog:
http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=e8ee411dca70&tochange=e2e1ea2a39ce

However, there are 1000+ Jaegermonkey changesets in that pushlog (yey for project branches), so not hugely helpful.

If you want to get some tryserver builds going for win32, I'll be happy to give them a go.
Sorry for the bugspam, should have said, the above results were using 68 tabs.
blocking2.0: ? → beta9+
I think at this point we may be better served by doing a malloc trace here.  I'll try to do one tomorrow, if nothing goes wrong.
I did some measurements with Massif.  Annoyingly, Massif wasn't behaving
very well but I did discover that ExecutablePool::create() is called an
awful lot, though.  On my Linux64 box, in a session with 40 cad-comic.com
tabs open it is called 9,247 times, and the sum of all the 'n' arguments
(which I assume are bytes) is 200,381,190.  That's an average of 21,669 per
call.

This is a fragment of the stack trace that appears to be responsible:

  JSC::ExecutablePool::systemAlloc(unsigned long) (ExecutableAllocatorPosix.cpp:43)
  JSC::ExecutablePool::create(unsigned long) (ExecutableAllocator.h:374)
  js::mjit::Compiler::finishThisUp(js::mjit::JITScript**) (ExecutableAllocator.h:235)
  js::mjit::Compiler::performCompilation(js::mjit::JITScript**) (Compiler.cpp:208)
  js::mjit::Compiler::compile() (Compiler.cpp:134)
  js::mjit::TryCompile(JSContext*, JSStackFrame*) (Compiler.cpp:245)
  js::mjit::stubs::UncachedCallHelper(js::VMFrame&, unsigned int, js::mjit::stubs::UncachedCallResult*) (InvokeHelpers.cpp:387)
  js::mjit::ic::Call(js::VMFrame&, js::mjit::ic::CallICInfo*) (MonoIC.cpp:831)
Note that due to inlining there may be some functions elided in that stack trace.
(In reply to comment #10)
> On my Linux64 box, in a session with 40 cad-comic.com
> tabs open it is called 9,247 times, and the sum of all the 'n' arguments
> (which I assume are bytes) is 200,381,190.

That's pretty close to 5MB per tab, BTW, matching comment 0.
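
The arithmetic behind that observation, as a small sketch. It assumes, as comment 10 does, that the 'n' arguments are byte counts; the variable names are illustrative.

```python
total_bytes = 200_381_190   # sum of the 'n' arguments to ExecutablePool::create()
calls = 9_247               # number of create() calls observed
tabs = 40                   # cad-comic.com tabs in the session

avg_per_call = total_bytes / calls
per_tab_mb = total_bytes / tabs / 1_000_000   # decimal megabytes
print(f"average request: {avg_per_call:.0f} bytes, per tab: {per_tab_mb:.2f} MB")
```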
(In reply to comment #10)
> I did some measurements with Massif.  Annoyingly, Massif wasn't behaving
> very well but I did discover that ExecutablePool::create() is called an
> awful lot, though.  On my Linux64 box, in a session with 40 cad-comic.com
> tabs open it is called 9,247 times, and the sum of all the 'n' arguments
> (which I assume are bytes) is 200,381,190.  That's an average of 21,669 per
> call.
> 
> This is a fragment of the stack trace that appears to be responsible:
> 
>   JSC::ExecutablePool::systemAlloc(unsigned long)
> (ExecutableAllocatorPosix.cpp:43)
>   JSC::ExecutablePool::create(unsigned long) (ExecutableAllocator.h:374)
>   js::mjit::Compiler::finishThisUp(js::mjit::JITScript**)
> (ExecutableAllocator.h:235)
...

ExecutablePool::create is used to allocate code memory for JM.  The stack trace above is the path used when allocating the JIT code for an entire script (as opposed to allocations for PIC stubs).  No easy fix that I know of, but thoughts:

1. Finer grained information would be harder to collect, but tremendously valuable.  How much code is for eval/global vs. function scripts, how much is inline vs. OOL code, frequency of different ops and aggregate size of inline and OOL code generated for each op.

2. Bug 577359 should help if there are lots of big initializers in global/eval scripts.

3. Maybe investigate whether to only compile loops in global/eval scripts.

4. Maybe investigate whether to only compile functions after they get hot (hard to do without impacting benchmark perf).

5. At least in benchmarks, from 1/2 to 2/3 of code memory is for OOL stub code, which hardly ever executes.

6. Could reduce point 5 by coalescing side exits in common ops (point 1) to reduce the amount of sync code.

7. Could reduce point 5 with more PICs targeted at common ops (point 1), e.g. arithmetic.  When the types of 'y' and 'z' are unknown, 'x = y + z' uses 48 bytes of inline code memory, and 203 bytes of OOL code memory.
Numbers in point 7 above are for OSX x86.  For OSX x64 I get 81 bytes inline, 289 bytes OOL.
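
For comparison with point 5, the out-of-line share of the code generated for the untyped 'x = y + z' case can be worked out from the byte counts above. This is a sketch using only the four quoted numbers; the dictionary layout is illustrative.

```python
# (inline bytes, OOL bytes) for 'x = y + z' with unknown operand types.
add_code_bytes = {"OSX x86": (48, 203), "OSX x64": (81, 289)}

ool_share = {}
for platform, (inline, ool) in add_code_bytes.items():
    ool_share[platform] = ool / (inline + ool)
    print(f"{platform}: {inline + ool} bytes total, {ool_share[platform]:.1%} OOL")
```

For this one op, roughly four fifths of the generated code is out-of-line on both platforms, which is above the 1/2 to 2/3 range reported for whole benchmarks.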
(In reply to comment #13)
> 4. Maybe investigate whether to only compile functions after they get hot (hard
> to do without impacting benchmark perf).
To take a page out of the Java VM JIT compiler book: Avoiding to compile everything at startup actually improves the startup time because compilation itself consumes time. If you start up in interpreted mode you can already execute unoptimized code while you still compile on another thread, even making dynamic optimizations based on the profiling results from the execution thread.
This also allows you to perform optimistic optimizations such as eliminating branches that are not visited according to the profiler. If you hit such a branch one can back out into interpreted mode and recompile. Same goes for inlining potentially virtual calls, if they get overloaded one can fall back to interpreted mode and recompile in the meantime.

Just performing static compilation at startup wastes a lot of optimization potential.
(In reply to comment #15)
> (In reply to comment #13)
> > 4. Maybe investigate whether to only compile functions after they get hot (hard
> > to do without impacting benchmark perf).
> To take a page out of the Java VM JIT compiler book: Avoiding to compile
> everything at startup actually improves the startup time because compilation
> itself consumes time. If you start up in interpreted mode you can already
> execute unoptimized code while you still compile on another thread, even making
> dynamic optimizations based on the profiling results from the execution thread.
> This also allows you to perform optimistic optimizations such as eliminating
> branches that are not visited according to the profiler. If you hit such a
> branch one can back out into interpreted mode and recompile. Same goes for
> inlining potentially virtual calls, if they get overloaded one can fall back to
> interpreted mode and recompile in the meantime.
> 
> Just performing static compilation at startup wastes a lot of optimization
> potential.

Yes, definitely.  The SM interpreter is slow compared to a JIT but not *that* slow, and done right partial interpretation should be a wash or net speedup in benchmarks (which don't resemble actual web JS all that much).

Javascript JIT compilation is different from Java in that there is very little information known statically --- the main reason ADD is expensive in memory is that we need to account for any combination of ints, floats, and other data being added.  Type inference (bug 557407) helps greatly here, and can figure out what is being added and reduce code memory.  Inference incurs its own memory overhead though in storing intermediate structures, and mjit+inference will most likely use more memory than mjit alone.  Again, partial interpretation helps here, and also helps inference precision as it can't figure everything out statically.
The 8472, compiling on a different thread would be nice, but not happening for 2.0.  And the problem with compiling lazily is that it _is_ likely to hurt benchmark times; a lot of these benchmarks run fast enough that just the context switch overhead of the separate thread would hurt.  And yes, they're crappy benchmarks.  :(
(In reply to comment #17)
> compiling on a different thread would be nice
> but not happening for 2.0.
Ok, but even if compiling happens on the same thread lazy compilation can be of advantage in extremely short-running scripts where the compilation overhead would outweigh the performance gain. Think of initializer code that only runs once, compiling it would only cause dead weight.

Of course I'm basing my argument on knowledge about the HotSpot VM; I have no idea how large the speed difference between JavaScript interpreted and JIT mode is in comparison.

> a lot of these benchmarks run fast enough that just the
> context switch overhead of the separate thread would hurt.
No context switching should be required on a multi-core system. Firefox is under-utilizing those. And passing data from one thread to another can be done without context switches too by using atomics.

> And yes, they're crappy benchmarks.  :(
Then the question is if we want to optimize for real-world performance or for crappy benchmarks.
There are no good measures of JS real-world performance; makes it hard to optimize for.
I measured the compilation time and interpreter execution time for this function (release build, added PRMJ_Now before/after compilation):

function run(x, y)
{
  var a, b, c;
  a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y; a = x + y; b = a + x; c = b + y;
}

Running this function 10000 times in the interpreter takes 27ms, compiling it takes 350us.  So interpreting this straight-line code is ~130 times faster than compiling it.  Compilation time on SS is IIRC between 10-20ms, so the cost of interpreting code a few times before compiling should be puny in comparison to the total benchmark time.  Switching between the interpreter and mjit is quick so shouldn't affect times either.

I get similar numbers if I make these GNAME accesses (interpreter time doubles, but so does compilation time).
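
The ratio quoted above follows directly from the two timings. The sketch below only restates that arithmetic; it does not model how fast the compiled code runs, so the result is the number of interpreted runs that cost as much as one compilation, not a full break-even point.

```python
interp_total_ms = 27     # 10000 runs of run() in the interpreter
runs = 10_000
compile_us = 350         # one compilation of run()

interp_per_run_us = interp_total_ms * 1000 / runs
runs_per_compile = compile_us / interp_per_run_us
print(f"{interp_per_run_us:.1f} us per interpreted run; "
      f"one compile costs about {runs_per_compile:.0f} interpreted runs")
```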
In the 80 tabs testcase and generally if you browse multiple tabs in a specific website, most of the script URLs will be the same for all the tabs.

So why is 5MB needed per tab? Is it possible to share the compiled code between the tabs?
The compiled code is typically specialized to the specific global object (and state of said global object) that it's compiled for.
w.r.t the extra space use, I suspect we won't find any single culprit.

(In reply to comment #6)

> So the patch gets us 20% of the way there. ;)

I profiled w/ DHAT an x86_64-linux build of (M-C + said patch) opening
20 of the tabs (DHAT is slow).

This shows a max C++ heap size of 184MB.  Of this, the top allocation
stack now results from property table allocations
(js::PropertyTable::init calling calloc), accounting for 22.7 out of
the 184MB.  This was filed as bug 610070.  We might be able to save
some space here since the average usage of these blocks is only 44% --
more than half the bytes in them are never accessed.

I also see nearly 4MB of essentially useless allocation in the style
of bug 609905 (4MB held live for the entire process, actual usage of
these blocks is below 5%).
(In reply to comment #20)
> Running this function 10000 times in the interpreter takes 27ms, compiling it
> takes 350us.  So interpreting this straight-line code is ~130 times faster than
> compiling it.  Compilation time on SS is IIRC between 10-20ms, so the cost of
> interpreting code a few times before compiling should be puny in comparison to
> the total benchmark time.  Switching between the interpreter and mjit is quick
> so shouldn't affect times either.
Did you run the compiling in a loop too? I would think that 350µs is very low and probably noisy.

But if your figures are right then waiting 10-20 method calls before compiling should not only reduce memory usage for initializers but also speed up short-running code segments.

(In reply to comment #23)
> w.r.t the extra space use, I suspect we won't find any single culprit.
How about an alternative solution then? Instead of trying to optimize memory usage for live, active JIT stuff we could just discard all inactive JIT stuff in background tabs. Similar to image discarding or garbage collection.

If for example the minimum age of JIT objects is set to 10 seconds then even erroneous discarding would have a minimal impact on performance. And as long as the scripts are running no discarding would happen anyway.
Various responses and ideas, in descending order of guessed reward/risk ratio:

1. It looks like Julian has already found a good fraction of the extra space as coming from:

 - big nmaps
 - PropertyTable allocations
 - useless allocations

Those are straight-out bugs and seem like the most effective place to start work.

2. Bug 577359 could be big. We should get measurements in the context of this problem. I.e., how much code is being generated for constant initializers. Shouldn't be too hard to instrument the compiler to collect that. Or else just get that bugfix landed and measure what it does.

X. Background on what the jitcode allocator does, needed for understanding the next two points:

 - if N >= 64k, call VirtualAlloc, which effectively allocates N rounded up to the nearest multiple of 64k.
 - if N < 64k, try to take space inside the current 64k "small allocation pool". If there is enough space left, bump-allocate inside there. Otherwise, first allocate a new pool. Then,
     - if the new pool will have more space than the previous small pool, then allocate from the new pool and make it the new small pool.
     - otherwise, allocate inside the new pool but leave the old small pool.

Thus, big scripts (in jitcode size) get their own pool, while small scripts (and generated PIC stubs) get grouped together to fill a 64k pool. The pools are refcounted--the script holds the ref.

3. So, one potential big problem with the allocator is fragmentation. If we have lots of 65k allocations, then we waste about half our space by allocating 128k each time with VirtualAlloc. If we have lots of 33k allocations, then we waste about half our space by allocating a new 64k small pool each time. A retune of the allocation policy might really help.

It shouldn't be too hard to measure fragmentation with some manual instrumentation on the pool allocator. The main summary statistics would be (bytes of jitcode in current allocations) and (bytes currently allocated for jitcode). Distribution stats might be helpful too.
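
The policy described under X and the two fragmentation cases in point 3 can be modelled roughly as follows. This is a toy model, not the real allocator: the class and method names are invented, the 64k pool size is the figure from the description, and refcounting and freeing are ignored. It tracks the two summary statistics suggested above (bytes requested versus bytes reserved).

```python
class JitAllocatorModel:
    """Toy model of the pool policy described above (not the real allocator)."""

    def __init__(self, pool_size=64 * 1024):
        self.pool_size = pool_size
        self.small_pool_free = 0   # bytes left in the current small pool
        self.reserved = 0          # bytes obtained from the system
        self.requested = 0         # bytes callers actually asked for

    def alloc(self, n):
        self.requested += n
        if n >= self.pool_size:
            # Large request: dedicated region, rounded up to a pool multiple.
            pools = -(-n // self.pool_size)   # ceiling division
            self.reserved += pools * self.pool_size
            return
        if n <= self.small_pool_free:
            # Fits: bump-allocate inside the current small pool.
            self.small_pool_free -= n
            return
        # Does not fit: take a fresh pool and allocate from it.
        self.reserved += self.pool_size
        new_free = self.pool_size - n
        if new_free > self.small_pool_free:
            # The new pool has more room left, so it becomes the small pool.
            self.small_pool_free = new_free
        # Otherwise the old small pool stays current.

    def utilisation(self):
        return self.requested / self.reserved


small = JitAllocatorModel()
large = JitAllocatorModel()
for _ in range(100):
    small.alloc(33 * 1024)   # each request ends up in a fresh 64k pool
    large.alloc(65 * 1024)   # each request is rounded up to 128k

print(f"33k requests: {small.utilisation():.1%} of reserved bytes used")
print(f"65k requests: {large.utilisation():.1%} of reserved bytes used")
```

Both runs use only about half of the reserved bytes, matching the "waste about half our space" estimate in point 3.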

4. Refcounting might be making us hold on to pools for too long. For example, say script S is 33k, and it ends up sharing a pool with a bunch of PICs from other scripts. In that case, one of the PICs could be 20 bytes, but keep the whole 64k chunk live for a long time.

This shouldn't be that much harder to measure, and in fact seems like a form of fragmentation. The idea would be to distinguish (bytes of jitcode in current allocations) and (bytes of jitcode for non-destroyed scripts in current allocations).

5. A simple idea for controlling jitcode memory usage is to throw away jitcode if memory pressure gets high. It can just be recompiled if it gets run again. This is a very simple technique that fits in perfectly with existing facilities and needs no new execution mode combos or anything like that. (The only trick is to avoid throwing away memory for running scripts.) And it seems not too hard to make it sensitive to memory pressure or memory usage.

6. Delayed compilation of scripts might be worth a try, but it has some difficulties. bhackett's measurements show that compilation is worthwhile only if the code runs at least 100 iterations. But it may be hard to take advantage of this:

 - it would be really nice and simple to compile on the Nth entry to a function, but that is bad if it contains a loop that will run many iterations--in that case we should compile right away.

 - we could try to compile after N runs of a loop, but then we need to compile while we are running the script, and then jump into the compiled code in the middle. That doesn't sound *too* hard, but the tracer integration code had to do things like that and it was not easy to get right. dvander probably has more insight on this issue.

 - and then there are the inevitable tuning problems. They might not be so bad with simple schemes like compiling after 10 iterations, though.

Compiling global or eval scripts only if they contain a loop (similar to what bhackett suggested) seems like an easier starting point. But even there there are some issues: some benchmarks eval the same script many times, in which case compilation may be worthwhile even if the script doesn't contain a loop.

Maybe compiling loop-containing scripts right away and others on the Nth iteration (for N ~= 10) would be really easy and possibly helpful?
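
A minimal sketch of that last policy, purely for illustration. The function names and the `threshold` parameter are hypothetical and do not correspond to anything in the engine; the default of 10 is the N ~= 10 suggested above.

```python
def should_compile(has_loop, entry_count, threshold=10):
    """Hypothetical policy: compile loop-containing scripts immediately,
    others only once they have been entered `threshold` times."""
    return has_loop or entry_count >= threshold


def first_compiled_entry(has_loop, entries, threshold=10):
    """Return the 1-based entry on which the script gets compiled, or None."""
    for entry in range(1, entries + 1):
        if should_compile(has_loop, entry, threshold):
            return entry
    return None


print(first_compiled_entry(has_loop=True, entries=5))    # script with a loop
print(first_compiled_entry(has_loop=False, entries=5))   # run-once style code
print(first_compiled_entry(has_loop=False, entries=12))  # repeatedly entered
```

Under this policy a loop-free script that runs only a handful of times is never compiled, which is the initializer case raised earlier in the thread.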

7. Delayed compilation of OOL paths should definitely be in our long-term plans, but I don't know if we can reasonably get that right for Fx4. (dvander?) To me, the easiest way to do that seems to be to not generate them until they are actually called, IC-style. We know that's among the hardest code to get right, though, so it seems risky.
(In reply to comment #10)
> I did some measurements with Massif.  [...]

Yeah, I see the same thing.  (--tool=massif --pages-as-heap=yes).  I
had no problems w/ massif, btw.  For 20 tabs I'm seeing 79.8MB held
live by JSC::ExecutablePool::systemAlloc, which is just about 4MB per
tab.

I'm surprised the mjit manages to generate 4-5MB of executable code
per tab (if that's the correct diagnosis).  I wonder if there's some
overallocation of space going on.
(In reply to comment #26)
> (In reply to comment #10)
> > I did some measurements with Massif.  [...]
> 
> Yeah, I see the same thing.  (--tool=massif --pages-as-heap=yes).  I
> had no problems w/ massif, btw.  For 20 tabs I'm seeing 79.8MB held
> live by JSC::ExecutablePool::systemAlloc, which is just about 4MB per
> tab.
> 
> I'm surprised the mjit manages to generate 4-5MB of executable code
> per tab (if that's the correct diagnosis).  I wonder if there's some
> overallocation of space going on.

That seems to give hope that it's due to fragmentation, or better yet, just some simple bug in the allocator. Maybe we're allocating a whole 64k chunk for each PIC entry or something dumb like that. (Btw, those chunks are 16k on Linux/Mac for those testing there.)

Another thing I forgot to mention previously is that detailed stats on the allocations would be nice: (1) distributions, so we know whether we are doing a zillion small allocations or a few huge ones, and (2) where they come from: PICs vs. scripts etc.
Another patch that may be worth measuring is billm's in bug 547327: it decreases JSObject::SLOT_CAPACITY_MIN from 8 to 2 (relying instead on learning object size).  In the best case, this could be saving 6 * sizeof(Value) == 48 bytes per JSObject.
(In reply to comment #25)
> Various responses and ideas, in descending order of guessed reward/risk ratio:

This ordering seems right. We should look at all the easy places first, especially the allocator where its real-world behavior seems to be basically unknown. It's also fairly easy to replace, and worst case, our JIT code is almost already relocatable. We could compact it, or as you said, just throw it out if it's not live (we must be mindful of call ICs - maybe those should refcount).

Side exits are a problem. Sometimes there are multiple per op, like the tracer, but we also sink stores. This was a conscious decision (knowing the code bloat) so we could allocate registers in ops incrementally. Post 4.0 as we experiment with new register allocation techniques and type inference, I suspect this will change, and we can just reduce side exits and not bother with the complicated coalescing problem.

We could also try to reduce the size of exits, for example, if we know the frame pointer has been sunk to VMFrame early on in a script, we never need to sink it again.

On point #6, I'd worry about tuning the most. I agree that the technical problems aren't too hard. The parser can tell us if there's a loop, if it comes to it we can get statistics on how much memory we'd save delaying compilation further on loopless scripts.

On point #7, yeah, I've wanted to IC the ADD path ever since it became the horrifying monstrosity it is. It sounds risky for 4.0. No IC lands without bugs. On the other hand, it would not take long to implement, and it would likely be a perf win. Before tackling that though we should measure how much memory we'd really save.
(In reply to comment #26)
> I'm surprised the mjit manages to generate 4-5MB of executable code
> per tab (if that's the correct diagnosis).  I wonder if there's some
> overallocation of space going on.
Have a look at bug 598466 comment 94. In addition to the 4-5MB per tab that you can save by switching methodjit off, another 75MB (~1MB per tab) was introduced shortly before methodjit was turned on in the tracemonkey branch. This might be some related (management?) code that also needs cleanup. It can't be tested by turning methodjit on/off, because the overhead is always there.
Please don't drag that into this bug.  We should file a separate bug on that issue.
Depends on: 577359
Depends on: 611400
(In reply to comment #27)
> That seems to give hope that it's due to fragmentation, or better
> yet, just some simple bug in the allocator.

Yeah, I'm peering at ExecutablePool* poolForSize(size_t n) and the
logic looks a bit funny:

   // If the new allocator will result in more free space than in
   // the current small allocator, then we will use it instead
   if ((pool->available() - n) > m_smallAllocationPool->available()) {
       m_smallAllocationPool->release();
       m_smallAllocationPool = pool;
       pool->addRef();
   }

Seems like this abandons the current small allocation pool regardless
of how much space is left in it, whenever satisfying the allocation
from a new pool would result in more free space.

Not sure tho.  Will dig more.
This bug is heavy on speculation and light on data.  We need more measurements.  In particular, I'd like to know if cad-comic.com is typical, or if it's doing something unusual.

I'll start doing more measurements, but I hope others will do likewise, as my JM knowledge is scant.
(In reply to comment #33)
> In particular, I'd like to know if cad-comic.com is typical, or if it's doing
> something unusual.
I selected cad purely for two properties:
a) the random button allowing me to quickly spawn a bunch of different pages 
b) the fact that it contains somewhat large images

I.e. it was not picked for being an especially bad case. I'm also getting similar savings per tab when disabling mjit on my real browsing session, which contains tabs from many different domains.
(In reply to comment #33)
>  In particular, I'd like to know if cad-comic.com is typical, or if it's doing
> something unusual.

You may want to try the reduced 3-tab testcase from bug 598466 comment 15,
which is from a totally different website. (Measurements for it are in comment 28 of the same bug.)
(In reply to comment #32)
> (In reply to comment #27)
> > That seems to give hope that it's due to fragmentation, or better
> > yet, just some simple bug in the allocator.  [...]
>
> Yeah, I'm peering at ExecutablePool* poolForSize(size_t n) and the
> logic looks a bit funny:

I'm getting the impression that there is no (serious) fragmentation
problem in the executable allocator.  From adding manual
instrumentation, in the 20 tab case, mjit::Compiler::finishThisUp
requests 74.9MB from execPool->alloc(totalSize).

This turns into a total request of 81.0MB in poolForSize (not sure
where the 6.1MB increase comes from).

That in turn results in a total 88.4MB request to
ExecutablePool::create.

So it could do a bit better (88.4MB resulting from 81.0MB of requests)
but it's not fundamentally the cause of the large amount of
allocation.


> Seems like this abandons the current small allocation pool regardless
> of how much space is left in it,

I also measured that.  The amount of space left in abandoned pools
is 5.40MB.  Not great, but not a disaster either.  I don't think we
can do better unless ExecutableAllocator is modified so as to keep
track of multiple small allocation pools, rather than just one.
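
The overhead at each stage can be put in relative terms with a few lines of arithmetic. This sketch uses only the four figures measured above (20 tabs); the variable names are illustrative, and it does not attempt to attribute the abandoned-pool space to any particular stage.

```python
compiler_request_mb = 74.9    # finishThisUp -> execPool->alloc(totalSize)
pool_for_size_mb = 81.0       # total requested in poolForSize
create_mb = 88.4              # total requested from ExecutablePool::create
abandoned_mb = 5.40           # space left in abandoned small pools

unexplained_mb = pool_for_size_mb - compiler_request_mb
create_overhead = create_mb / pool_for_size_mb - 1
end_to_end_overhead = create_mb / compiler_request_mb - 1
abandoned_share = abandoned_mb / create_mb

print(f"increase before poolForSize: {unexplained_mb:.1f} MB")
print(f"create() vs poolForSize: +{create_overhead:.1%}")
print(f"create() vs compiler requests: +{end_to_end_overhead:.1%}")
print(f"abandoned space as share of create(): {abandoned_share:.1%}")
```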

--------

From this (and watching the numbers when opening new c.a.d tabs)
it does appear that the mjit generates ~4MB of code per tab.

A good question seems to be: why?  Looking at the page source for a
tab, it looks pretty harmless, although presumably it drags lots of
.js in from elsewhere.
(In reply to comment #36)
> Looking at the page source for a
> tab, it looks pretty harmless, although presumably it drags lots of
> .js in from elsewhere.

I wget'd all the scripts I could find from one of the tabs
(http://www.cad-comic.com/cad/20050715.htm).  They don't
amount to a lot of source code:

    5205 2010-10-21 01:58 widgets.js
   43130 2010-11-05 00:53 1289620584.js
   28975 2010-11-18 12:00 widget.php?v=10
    2391 2010-12-01 23:56 spcjs.php?id=12&amp;target=_blank

Is there a way to get debug spew from mjit when it's embedded in a
browser, a la JMFLAGS= ?  It'd be instructive to see what's going
through the compilation pipeline each time a new cad-comic.com
tab is opened.
(In reply to comment #37)
> Is there a way to get debug spew from mjit when it's embedded in a
> browser, a la JMFLAGS= ?  It'd be instructive to see what's going
> through the compilation pipeline each time a new cad-comic.com
> tab is opened.

I think JMFLAGS should work on browser debug builds, although there will be a lot of spew. But I think it prints the filename of the script, so it should still be helpful if you redirect to a file.
Julian, comment 0 has a list of script urls that are possibly worth looking at.

In addition to the ones you tried, there's the 100KB facebook thing, the 20KB Google API thing, 80KB of minified jquery code, 23KB of Google analytics.

You should be able to use JMFLAGS in browser.  Just start the browser from a shell with that env var set.
For ease of debugging, I recommend preffing the JIT off, starting with JMFLAGS set, then preffing the JIT on and reloading.
If I change this line:

# define JIT_ALLOCATOR_LARGE_ALLOC_SIZE (ExecutableAllocator::pageSize * 4)

to use 1 as the multiple instead of 4 I get a ~5% reduction in create() call totals for the 40 tab cad-comic.com case on Linux64.  If I increase it to 16 (as it is on Windows) I get a ~4% increase.  So there's some fragmentation there.
Assignee: general → dvander
To follow up on comment 25, attached are some stats for the 40-tab/cad-comic/Linux64 case:  histograms showing the sizes passed to create() and poolForSize().

The short version:
- For create(), 75% of the calls have the minimum size (16KB).  On Windows, where the minimum size is 64KB, I estimate 97% would have the minimum size.

- For poolForSize(), something like 80%+ of the sizes are small, e.g. < 200 bytes.  Most of the rest are a few thousand.  The biggest is 134268.
Attached file scripts histogram
Same experimental setup as comment 42.  This shows a histogram of the URLs of the scripts compiled, as reported by the "compiling script" line produced with JMFLAGS=scripts, with the "line" and "length" part removed.

Basically, this agrees with what bz said in comment 0.  There's lots of standard stuff there:  jquery, facebook, twitter, google-analytics, etc.
For the same case, 60% of the code is stubs (stubcc.size()), 40% is the rest (masm.size()).
The |cx->calloc(totalBytes)| call in finishThisUp() is responsible for a lot of memory use as well.  I just did a not-quite-40-tabs Massif run in which create() was responsible for 108MB and the cx->calloc() was responsible for 67MB.

(Massif works much better with Firefox if you use --smc-check=all;  of all people I should have remembered this.) 

Presumably fragmentation is less of an issue there(?), which means it's just a lot of data.

Here's a breakdown of the different components of the calloc() in a 40-tab run:

  sizeof(JITScript);                                     5,214,888
  sizeof(void *) * script->length;                      34,990,208
#if defined JS_MONOIC
  sizeof(ic::MICInfo) * mics.length();                   2,532,624
  sizeof(ic::CallICInfo) * callICs.length();            10,332,480
  sizeof(ic::EqualityICInfo) * equalityICs.length();       339,944
  sizeof(ic::TraceICInfo) * traceICs.length();             333,408
#endif
#if defined JS_POLYIC
  sizeof(ic::PICInfo) * pics.length();                  37,873,416
  sizeof(ic::GetElementIC) * getElemICs.length();        3,216,416
  sizeof(ic::SetElementIC) * setElemICs.length();          716,616
#endif  
  sizeof(CallSite) * callSites.length();                 1,332,024
On an OSX64 build, sizeof(PICInfo) is 136 while sizeof(BasePolyIC::ExecPoolVector) is 32.  Similarly sizeof(CallICInfo) is 96 while sizeof(CallICInfo::pools) is 24.  Extrapolating, this is 8.9MB + 2.5MB = 11.4MB (or 11%) of the calloc() breakdown reported in comment 45.
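The extrapolation can be reproduced roughly as follows. This is only a sketch: it assumes the OSX64 struct sizes quoted above also apply to the build that produced the comment 45 totals, which may not hold exactly, and the variable names are mine.

```python
# Totals from the comment 45 breakdown (40-tab run), in bytes.
PIC_TOTAL = 37_873_416
CALL_IC_TOTAL = 10_332_480
CALLOC_TOTAL = 96_882_024  # sum of every row in that breakdown

# Struct sizes quoted for an OSX64 build (assumed to match the measured build).
SIZEOF_PICINFO = 136
SIZEOF_PIC_POOL_VECTOR = 32
SIZEOF_CALLICINFO = 96
SIZEOF_CALLIC_POOLS = 24

# Number of ICs implied by the totals, then the bytes spent on pool fields.
pic_pool_bytes = PIC_TOTAL // SIZEOF_PICINFO * SIZEOF_PIC_POOL_VECTOR
call_pool_bytes = CALL_IC_TOTAL // SIZEOF_CALLICINFO * SIZEOF_CALLIC_POOLS
pool_bytes = pic_pool_bytes + call_pool_bytes
share = pool_bytes / CALLOC_TOTAL

print(pic_pool_bytes, call_pool_bytes, pool_bytes, f"{share:.1%}")
```

With these inputs the pool fields come out at a bit over 11MB, in the same ballpark as the estimate above.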

IIUC, these fields are only used to release ExecutablePools when the whole JITScript is released.  Thus, it seems like these fields could be removed and the corresponding ExecutablePool*'s stored in the ics' JITScript's execPools.
(In reply to comment #45)

>   sizeof(void *) * script->length;                      34,990,208

611400 should improve that significantly.
> Here's a breakdown of the different components of the calloc() in a
> 40-tab run:

That's interesting.

The effect of the 611400 fix is shown below (20 tabs).  With that
in place, the PICInfo is by far the largest remaining component, so
we should next see if we can get rid of them as per Luke's comment 46.

We're still skirting around the central issue of why (or indeed,
does?) the jit create so much code, but I guess we'll get to that.

Pre 611400

    2327400  sizeof(JITScript)
   15604256  sizeof(void *) * script->length
    1130976  sizeof(ic::MICInfo) * mics.length()
    4619040  sizeof(ic::CallICInfo) * callICs.length()
     185504  sizeof(ic::EqualityICInfo) * equalityICs.length()
     148944  sizeof(ic::TraceICInfo) * traceICs.length()
   16858832  sizeof(ic::PICInfo) * pics.length()
    1432368  sizeof(ic::GetElementIC) * getElemICs.length()
     315936  sizeof(ic::SetElementIC) * setElemICs.length()
     595368  sizeof(CallSite) * callSites.length()

Post 611400

    2415392  sizeof(JITScript)
     742240  sizeof(NativeMapEntry) * nNmapLive
    1131312  sizeof(ic::MICInfo) * mics.length()
    4621536  sizeof(ic::CallICInfo) * callICs.length()
     185504  sizeof(ic::EqualityICInfo) * equalityICs.length()
     149040  sizeof(ic::TraceICInfo) * traceICs.length()
   16872976  sizeof(ic::PICInfo) * pics.length()
    1432928  sizeof(ic::GetElementIC) * getElemICs.length()
     316008  sizeof(ic::SetElementIC) * setElemICs.length()
     595680  sizeof(CallSite) * callSites.length()
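For anyone who wants the totals, the two tables above can be summed with a quick script like the one below. The row labels are shortened by me, and the two tables come from separate runs, so the small differences in the IC rows are noise rather than an effect of the patch.

```python
# Bytes per component, copied from the two 20-tab tables above.
pre = {
    "JITScript": 2_327_400,
    "script->length table": 15_604_256,
    "MICInfo": 1_130_976,
    "CallICInfo": 4_619_040,
    "EqualityICInfo": 185_504,
    "TraceICInfo": 148_944,
    "PICInfo": 16_858_832,
    "GetElementIC": 1_432_368,
    "SetElementIC": 315_936,
    "CallSite": 595_368,
}
post = dict(pre)
del post["script->length table"]
post.update({
    "JITScript": 2_415_392,
    "NativeMapEntry": 742_240,
    "MICInfo": 1_131_312,
    "CallICInfo": 4_621_536,
    "TraceICInfo": 149_040,
    "PICInfo": 16_872_976,
    "GetElementIC": 1_432_928,
    "SetElementIC": 316_008,
    "CallSite": 595_680,
})

pre_total = sum(pre.values())
post_total = sum(post.values())
largest = max(post, key=post.get)
pic_share = post["PICInfo"] / post_total

print(pre_total, post_total, largest, f"{pic_share:.0%}")
```

On these numbers the calloc total drops by roughly a third, and PICInfo accounts for well over half of what remains.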
(In reply to comment #46)
> Thus, it seems like these fields could be removed and
> the corresponding ExecutablePool*'s stored in the ics' JITScript's execPools.

Then we'd have to flush all ICs on GC. Maybe not a big deal, but we can thread the execPool pointers in the IC executable code instead. Most PICInfo structs are wasting that space since they're not even polymorphic.
(In reply to comment #49)
Still, the releasePools calls are done for all ics in a given script, so if the JITScript held a union of all the ics' pool vectors/arrays (in a new per-JITScript vector), we could achieve the same effect.
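As a toy model of that ownership change (not the real classes; everything here is a hypothetical stand-in written in Python for brevity), the ICs would hand their pools to one per-script list instead of each carrying its own vector:

```python
class ExecutablePool:
    """Stand-in for a ref-counted pool of executable memory."""

    def __init__(self):
        self.refs = 1

    def release(self):
        self.refs -= 1


class JITScript:
    """Stand-in script that owns the pools for all of its ICs."""

    def __init__(self):
        self.exec_pools = []  # union of what each IC used to hold itself

    def add_pool(self, pool):
        self.exec_pools.append(pool)

    def release_pools(self):
        # One pass over one vector when the whole script is released.
        for pool in self.exec_pools:
            pool.release()
        self.exec_pools.clear()


class PolyIC:
    """Stand-in IC: no per-IC pool vector, just a pointer to its script."""

    def __init__(self, script):
        self.script = script

    def attach_stub(self):
        pool = ExecutablePool()
        self.script.add_pool(pool)
        return pool
```

The per-IC saving is the pool vector itself; the cost is that pools can no longer be released for one IC independently of the rest of the script.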
Good idea! Patch soon.
(In reply to comment #52, comment #53)
> JMFLAGS=scripts{,jsops} output created by loading one extra cad tab

As an attempt to figure out what extra stuff is compiled for each
new tab.  I could also include the JMFLAGS=insns output, but the
above two are already pretty huge and I don't have a clue what they
signify, if anything.
> Good idea! Patch soon.

Could we please keep this bug as a metabug and put all patches in bugs blocking this one?  That way we don't run into trouble with partial fixes landing and then not knowing what to do with this bug.
Depends on: 616310
I filed bug 616310 to reduce the fragmentation in the allocator.
Also, this is blocking beta9, but it's not clear what the criteria is for deciding that it's been fixed.
Yeah. The method JIT is going to add *some* memory usage, we can't block on 3.6 parity. Let's investigate where we can easily reduce bad memory use, file bugs on those, and then unblock this.
The calloc'd space is being worked on, as is fragmentation.  That just leaves the actual JITted native code.  We still need more data on how that is broken up;  currently we only have comment 44 which isn't much.
(In reply to comment #57)
> Also, this is blocking beta9, but it's not clear what the criteria is for
> deciding that it's been fixed.

(In reply to comment #58)
> Yeah. The method JIT is going to add *some* memory usage, we can't block on 3.6
> parity. Let's investigate where we can easily reduce bad memory use, file bugs
> on those, and then unblock this.

Clarification: I set this to block beta9 as an indicator that we should be working on it now, because it's important and will take an unknown amount of time. So I agree with dvander that we should unblock it once we have a good analysis of the problem and have filed well-defined sub-bugs.
(In reply to comment #59)
> The calloc'd space is being worked on, as is fragmentation.  That just leaves
> the actual JITted native code.  We still need more data on how that is broken
> up;  currently we only have comment 44 which isn't much.

<njn>       dvander: any ideas how to space-profile JM JITted code?
<dvander>   njn, there's probably a few things of interest: sync blocks
            (code emitted by FrameState::sync/syncAndKill), code generated
            by fallibleVMCall, code generated by FrameState::merge, and then
            Everything Else
<dvander>   Assembler has a size() function so it should be easy to compute
            before/after
<njn>       cool, that's a good start
- fallibleVMCall: 13,326,945
- sync:            3,635,958
- syncAndKill:     1,961,004
- merge:           3,241,445

- everything:     44,001,154

I used the attached patch to get these numbers, which are for a 10 tab cad-comic.com session.  Looks like ~50% of the code size isn't covered by the above four functions.
In case it wasn't clear, in comment 62 the "everything" line counts *all* code generated;  it's *not* "everything else".

If you subtract the first four counts from the last count you get 21,835,802 bytes.  That is the "everything else" number.
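The subtraction, spelled out as a small script (the dictionary keys are just labels for the four counters in comment 62):

```python
# Bytes of generated code per counter, 10-tab cad-comic.com session.
covered = {
    "fallibleVMCall": 13_326_945,
    "sync": 3_635_958,
    "syncAndKill": 1_961_004,
    "merge": 3_241_445,
}
everything = 44_001_154  # all code generated, including the four above

covered_total = sum(covered.values())
everything_else = everything - covered_total
uncovered_share = everything_else / everything

print(covered_total, everything_else, f"{uncovered_share:.1%}")
```

So the four counters account for about half of the generated code, and the other half is still unattributed.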
Just to share some knowledge, I think massif in combination with the KDE program massif-visualizer <https://projects.kde.org/projects/kdereview/massif-visualizer> can help us get our heads around this problem. The attached screen shot has an issue in that it is only identifying our allocator wrapper functions, but valgrind has a flag to consider a function an allocator. Doing enough of that should get us some pretty informative data that many engineers can understand, without having to wade through massif logs.

ff-massif/dist/bin$> LD_LIBRARY_PATH=. valgrind --tool=massif --smc-check=all ./firefox-bin 


mozconfig:

mk_add_options MOZ_MAKE_FLAGS=-j8
. $topsrcdir/browser/config/mozconfig
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/ff-massif
ac_add_options --enable-optimize=-O1
ac_add_options --disable-debug 
ac_add_options --enable-tests
ac_add_options --enable-valgrind
ac_add_options --disable-jemalloc
(In reply to comment #62)
> Looks like ~50% of the code size isn't covered by the
> above four functions.

Yes.  This attachment is the JMFLAGS=insns output for a -j -m -p run
of bitops-3bit-bits-in-byte.js.  I enhanced your c62 patch so as to
make it clear in the output which insns are covered and which aren't
(aren't = the areas not inside an "XXXX BEGIN" ... "XXXX END" section)

dvander, can you glance at this and see if the non-counted areas 
are generated by any specific part of the compiler, that we can add
counter(s) for?
(In reply to comment #64)
> 
> Just to share some knowledge, I think massif in combination with the KDE
> program massif-visualizer
> <https://projects.kde.org/projects/kdereview/massif-visualizer> can help us get
> our heads around this problem. The attached screen shot has an issue in that it
> is only identifying our allocator wrapper functions, but valgrind has a flag to
> consider a function an allocator. Doing enough of that should get us some
> pretty informative data that many engineers can understand, without having to
> wade through massif logs.
> 
> ff-massif/dist/bin$> LD_LIBRARY_PATH=. valgrind --tool=massif --smc-check=all
> ./firefox-bin 

The function-is-an-allocator flag might not work quite like you expect... basically it assumes that the function you name is a wrapper for malloc/new.  In a lot of cases (ie. the pertinent ones here) the functions are wrappers for mmap, IIRC Massif will ignore them by default.

However, there is an option --pages-as-heap which basically says "ignore the malloc/new level, just measure everything at the mmap/page level".  The results are more coarse-grained and a bit harder to interpret but it includes *all* memory allocations.  I've found that profiling Firefox without this flag is pretty useless because so much allocation doesn't go via malloc/new;  indeed, I implemented that option exactly because of this.

Anyway, it's clear from looking at --pages-as-heap output that the vast majority of the increase is due to two allocations in finishThisUp() -- the calloc() call, which is being attacked in multiple ways (bug 611400, bug 616367), and the executable code allocation, which has had less attention so far (bug 616310).
There seems to be a lot of work on optimizing memory allocation, reducing the footprint of the compiled code etc. But ultimately memory usage will still increase in a linear manner with the number of tabs.

Wouldn't it be better to just keep the compiled code around where it's necessary and discard it everywhere else? Considering that only a finite amount of code can be run at any point in time (since we can't have more than 100% CPU load anyway) there should only be a finite amount of code that needs compiling in most cases, i.e. usually whatever is running in the inner loops.

Background tabs are either completely idle or only run lightweight scripts in long intervals otherwise having a bunch of such tabs open would lead to intolerable JS load anyway.


I don't know when mjit-ed code is normally discarded (when the window object is GCed?). But I think it would make more sense if it was subject to some kind of garbage collection.


If compiling of all code on an average website takes around 30ms as mentioned in a previous comment then even a rather simple scheme such as "discard compiled method if it has not been used for N seconds" would already result in most memory being freed without much of a performance penalty. That is of course assuming we have method-level hit counters that could be used for such a scheme or that adding them wouldn't be a significant performance hit.


TL;DR: Most memory optimizations discussed don't improve scalability, they only improve the footprint by a constant factor. Actively discarding compiled code on the other hand can turn it from O(n) to O(1) memory usage without a significant performance hit.
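To illustrate the "not used for N seconds" scheme, here is a minimal sketch of the bookkeeping it would need. It is a toy model with hypothetical names (`CodeCache`, `call`, `sweep`), assuming a per-method last-used timestamp is cheap enough to maintain; it says nothing about how the engine would actually track or free code.

```python
import time


class CodeCache:
    """Toy model: compiled code is kept per method and dropped when idle."""

    def __init__(self, idle_limit, clock=time.monotonic):
        self.idle_limit = idle_limit  # the "N seconds" threshold
        self.clock = clock
        self.last_used = {}           # method name -> time of last call
        self.compiles = 0

    def call(self, method):
        if method not in self.last_used:
            self.compiles += 1        # stands in for (re)compiling the method
        self.last_used[method] = self.clock()

    def sweep(self):
        """Discard every method that has been idle longer than the limit."""
        now = self.clock()
        stale = [m for m, t in self.last_used.items()
                 if now - t > self.idle_limit]
        for m in stale:
            del self.last_used[m]
        return stale
```

The trade-off is visible even in the toy: a method that is discarded and then called again costs one extra compile, so the threshold has to be long enough that this is rare for code that is actually hot.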
(In reply to comment #67)
> TL;DR: Most memory optimizations discussed don't improve scalability, they only
> improve the footprint by a constant factor. Actively discarding compiled code
> on the other hand can turn it from O(n) to O(1) memory usage without a
> significant performance hit.

This is exactly my #5 in comment 25. I do think it's the next thing to try after the things we're doing now. It's non-trivial, though. If someone has the time to try it out, that would be nice.
The key dependent bugs are now blocking, so this one doesn't need to anymore.
blocking2.0: beta9+ → ---
(In reply to comment #69)
> The key dependent bugs are now blocking, so this one doesn't need to anymore.

We have bugs filed to improve fragmentation, and to reduce the size of the calloc() in finishThisUp().  But we don't have anything for the "JM generates an awful lot of native code" issue, other than bug 577359, but we don't have any data on whether that'll actually help.

So should we have another bug(s) on the "lots of native code" issue?  Either to reduce it, or to discard it more aggressively as per comment 67?
(In reply to comment #70)
> (In reply to comment #69)
> > The key dependent bugs are now blocking, so this one doesn't need to anymore.
> 
> We have bugs filed to improve fragmentation, and to reduce the size of the
> calloc() in finishThisUp().  But we don't have anything for the "JM generates
> an awful lot of native code" issue, other than bug 577359, but we don't have
> any data on whether that'll actually help.
> 
> So should we have another bug(s) on the "lots of native code" issue?  Either to
> reduce it, or to discard it more aggressively as per comment 67?

That is an excellent question. I asked for data in bug 598466 to help guide that decision, and bsmedberg posted some good advice for how to get the data, but nothing has come back yet.

Reducing the memory usage seems pretty hard at this stage, because it has to be something low-risk for us to meet our release target. If someone has some simple ideas and spare time, by all means take a crack at it.

Throwing away code seems like the easiest approach, but there are a couple of problems to solve there: we'd need some good code mem usage metering, and we need to purge call ICs when we do that. We might take a perf hit if we discard too early, but I don't think we'll be able to cook up memory pressure detection on short notice, except the easy kind of VirtualAlloc returning NULL in the code memory allocator. If we're going to try that, we should really start now.
blocking2.0: --- → ?
see comment #69 - We've identified and filed a few individual, short-term bugs where the method JIT allocates memory unnecessarily. I don't think we can justify blocking a release on this meta bug. It's not clear what the goal would be (the method JIT must use *some* memory), and each additional fix gets increasingly more risky and difficult to implement.
blocking2.0: ? → -
(In reply to comment #72)
> It's not clear what the goal would
> be (the method JIT must use *some* memory), and each additional fix gets
> increasingly more risky and difficult to implement.
That does not justify a *+100% memory usage increase* that also scales with the number of tabs.

If you take a look at various measurements in bug 598466 you'll notice that methodjit is the worst offender by far. You can't just bloat up firefox and then say "it's too late to fix it". Especially not for a single feature that's not directly visible to the user in many cases. Even more so when you consider that there are a bunch of other features which *also* increase memory usage and it is most likely that they won't be able to fully fix that either.

Just to emphasize: We have an overall increase of approximately +200% to +250% in memory footprint since firefox 3.6. And methodjit is responsible for half of that increase.
(In reply to comment #72)
> I don't think we can
> justify blocking a release on this meta bug. It's not clear what the goal would
> be (the method JIT must use *some* memory), and each additional fix gets
> increasingly more risky and difficult to implement.

The goal would be not regressing memory usage by hundreds of megabytes compared to 3.6. Maybe it's just that annoying vocal minority, but there's a good chunk of people that would rather just lose the few milliseconds that methodjit gains in order to reduce the memory usage. If it's too late to fix it, then it could be disabled for 4.0 and enabled in a later release, or the 4.0 release could be pushed back another month or two...
I think it does justify it. Firefox 4 is blocked on being fast, which feedback indicates is a feature people like. We're certainly not going to pref that off, though if you're using many, many tabs - and the browser is crashing due to out-of-memory, the pref is there to toggle :)

Just to be clear: we've measured some sizable memory wins that are easy targets for 4.0. They're blocking b9 (since it really is "too late", and we need the feedback sooner) and hanging off this bug.

The real risk is that users will run out of the limited 2GB Windows address space faster, and if that ends up being a major problem, we can dig deeper. There are a bunch of great ideas here.
(In reply to comment #21)
> In the 80 tabs testcase and generally if you browse multiple tabs in a specific
> website, most of the script URLs will be the same for all the tabs.
> 
> So why is 5MB needed per tab, is it possible to share the compiled code between
> the tabs?

Boris Zbarsky replied to that in comment #22 :

> The compiled code is typically specialized to the specific global object (and
> state of said global object) that it's compiled for.

But I am not sure I completely understand why compiled code can't be shared in some situations. Can someone or Boris elaborate on that with a more detailed answer?

1. Boris writes that the compiled code is *typically* specialized to the specific global object . 
What does "typically" mean? I guess typically is not "always", so does it mean that a certain percentage of methods are specialized, but the other methods are not specialized to the global object, and therefore can be compiled once and stored once, and their compiled code be shareable among all tabs that use the same script where the method resides? In this case memory can be saved by not making copies of the compiled code of the method for each tab.

Can someone estimate what is the percentage of specialized methods vs non specialized?
Can you give small examples of cases where it needs to be specialized , and cases where it doesn't ?

2. How much performance does the specialization of the methods to the global object bring vs compiling the methods but not specializing to the global object (i.e. accesses to the global object will not be hard-coded in the compiled code). If the added performance is not much, it may be possible to save hundreds of megabytes of memory by having a single copy of the compiled methods instead of 80 copies (in the case of 80 tabs), with a small performance hit.
Will this (small?) performance hit, bring the javascript compiler performance to be non-competitive with the other JS engines out there ?
(In reply to comment #75)
> I think it does justify it. Firefox 4 is blocked on being fast, which feedback
> indicates is a feature people like.

Methodjit managed to massively spoil the speed of Firefox 4 for me several times when it caused swapping...

> We're certainly not going to pref that off,
> though if you're using many, many tabs - and the browser is crashing due to
> out-of-memory, the pref is there to toggle :)

Pointing users being confronted with an unusable Firefox to a hidden pref is not a solution.
blocking2.0: - → ?
How did you narrow down the method JIT as the cause of swapping problems?
I disabled it and the problem vanished.
(In reply to comment #76)
> > So why is 5MB needed per tab, is it possible to share the compiled code between
> > the tabs?

This approach is nice on paper for this particular test case, but in practice it won't save as much because in practice large sessions contain content from many different sites. You would basically fix the testcase and we could pat ourselves on each other's back... and the real problem would still be there.


And I doubt we can share code for... let's say google analytics across different compartments. Which

Things get even dicier if you consider that any method could be overloaded, even many of the built-in objects (Array, String, Object) could be modified on some tabs. Making code shareable basically would prevent inlining.

The Java Hotspot VM does make overly optimistic optimizations like these, but it has the ability to back out and recompile if it detects overloading that makes inlined JITed code invalid, methodjit doesn't have that ability.
(In reply to comment #75)
> I think it does justify it. Firefox 4 is blocked on being fast, which feedback
> indicates is a feature people like. 

Well, the problem can be easily framed as being a performance problem too. Such as the swapping problem described in comment 77. Or consider a hypothetical benchmark that would measure performance for tons of dynamically generated scripts (eval, new Function(), generated script tags). If we compile everything and then run it only a few times this would actually be slower than other browsers/firefox without methodjit (see comment 16).
> What does "typically" mean? I guess typically is not "always"

It's "always", to a first approximation.   Any code that uses any global variable is specialized to the global.
(In reply to comment #79)
> I disabled it and the problem vanished.

I don't think an edit-war with the blocking flag is going to be productive. Why don't you file a bug on the actual problem you're experiencing instead of making noise in this one? The data in this bug does show some overhead problems, and we're working on them. But your swapping problem could be just a leak, or yet another problem.
(In reply to comment #64)
> Just to share some knowledge, I think massif in combination with the KDE
> program massif-visualizer

Yeah, +1 for that.  It facilitates doing something that isn't really
feasible from the text-only output: seeing how space use evolves over
time.  Some stuff hangs around forever, other stuff spikes and then
disappears.  For example, having loaded 20 cad-comic tabs, I see from
the profile that the peak heap load is caused by 26.3MB of transient
allocations from sqlite3malloc, driven by nsURLClassifierDBService.
And I'm thinking (1) that seems pretty extravagant! and (2) I wouldn't
easily have noticed it in the text only output.

Recommended.
This bug doesn't block because it doesn't have a well-defined fixed state. JITs are explicitly a trade of space for time, so space must increase by some amount, which is hard to predict in advance.

I hope that people who are concerned about this issue can see that we have put a high priority on mitigating the memory increase, as evidenced by all the work in the discrete bugs that help the problem, and the fact that they are marked beta9+. I think things are going to look a lot better once those bugs are fixed, and also bug 617505.

I will file a bug for the discarding-jitcode idea. I think it's doable in the time we have, and I think it will more or less satisfy the demands being made here.
blocking2.0: ? → ---
Depends on: 617656
(In reply to comment #75)
> I think it does justify it. Firefox 4 is blocked on being fast, which feedback
> indicates is a feature people like.

Speed is definitely one of the most common complaints about Firefox 3.6.  "Bloat" is another one.  And although "bloat" is a vague and subjective term, I bet lots of people will fire up their memory measurement tool of choice and conclude "bah, Firefox 4.0 is even more bloated, I'm going back to Chrome"
 
> The real risk is that users will run out of the limited 2GB Windows address
> space faster, and if that ends up being a major problem, we can dig deeper.

A million times yes.  This has enormous potential to (legitimately) hurt people's perception of Firefox.
(In reply to comment #86)
> Speed is definitely one of the most common complaints about Firefox 3.6. 
> "Bloat" is another one.
Have similar memory comparisons been done between 2.0 and 3.x? I wonder if there might be some lower-hanging fruit that have been introduced in the past and not been found because nobody bothered to look for them in the first place.
Between 2.0 and 3.0 we made a major effort to reduce memory usage and fragmentation (including switching to jemalloc), no?

Between 3.0 and 3.6 is an interesting question.  But not for this bug...
(In reply to comment #66)
> 
> The function-is-an-allocator flag might not work quite like you expect...
> basically it assumes that the function you name is a wrapper for malloc/new. 
> In a lot of cases (ie. the pertinent ones here) the functions are wrappers for
> mmap, IIRC Massif will ignore them by default.
> 
> However, there is an option --pages-as-heap which basically says "ignore the
> malloc/new level, just measure everything at the mmap/page level".

I should add that you can use the --alloc-fn option in concert with the --pages-as-heap=yes option.  Just in case anyone wanted to try.
Depends on: 619622
Bug 616367 didn't end up having much effect, so I filed bug 619622 as an alternative approach.
Looking again at the calloc'd space in finishThisUp():  Bug 611400 has been completed. Bug 619622 (which shrinks the size of an IC from 88 to 64 bytes on 32-bit and 144 to 112 on 64-bit) is almost ready.  If I apply that and measure a 20-tab, 64-bit cad-comics run, I see these numbers:

  sizeof(JITScript);                                     2,588,992
  sizeof(NativeMapEntry) * nNmapLive;                      797,776
  sizeof(ic::MICInfo) * mics.length();                   1,242,960
  sizeof(ic::CallICInfo) * callICs.length();             4,984,704
  sizeof(ic::EqualityICInfo) * equalityICs.length();       166,408
  sizeof(ic::TraceICInfo) * traceICs.length();             161,904
  sizeof(ic::PICInfo) * pics.length();                  14,878,528
  sizeof(ic::GetElementIC) * getElemICs.length();        1,273,800
  sizeof(ic::SetElementIC) * setElemICs.length();          392,832
  sizeof(CallSite) * callSites.length();                   642,060

  total:                                                27,129,964

Compare this with comment 45 which was a 40-tab run with a total of 96,882,024.  If we halve that to account for 40 tabs vs 20 tabs we get 48,441,012.  So 27,129,964 is a 1.79x reduction.  Pretty good!
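The comparison as a quick check. Halving the 40-tab figure assumes the calloc total scales linearly with the number of tabs, which is only an approximation:

```python
TOTAL_40_TABS_BEFORE = 96_882_024  # comment 45, 40-tab run
TOTAL_20_TABS_AFTER = 27_129_964   # this comment, 20-tab run

# Scale the older run down to 20 tabs (assumes linear scaling with tab count).
scaled_before = TOTAL_40_TABS_BEFORE // 2
reduction = scaled_before / TOTAL_20_TABS_AFTER
saved = scaled_before - TOTAL_20_TABS_AFTER

print(scaled_before, saved, f"{reduction:.2f}x")
```

That works out to about 21MB less calloc'd data for the 20-tab case.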

PICs still dominate, CallICs are also important, the rest probably aren't worth bothering with.  I'll see if there's any more fat to trim.
Depends on: 619849
I just filed bug 619849 which shrinks the size of JITScript on 64-bit platforms.  It's a small but easy win.

Beyond that, I'm out of ideas -- I can't see how to shrink PICInfo or CallICInfo.  I wonder if fewer ICs could be allocated?  Probably too hard for Fx 4.0.
(In reply to comment #92)
> I just filed bug 619849 which shrinks the size of JITScript on 64-bit
> platforms.  It's a small but easy win.
> 
> Beyond that, I'm out of ideas -- I can't see how to shrink PICInfo or
> CallICInfo.  I wonder if fewer ICs could be allocated?  Probably too hard for
> Fx 4.0.

Bug 617656, which I'll land tomorrow, will take a lot of pressure off of the JITScript calloc and code memory allocation for long lived browser sessions.
(In reply to comment #93)
> 
> Bug 617656, which I'll land tomorrow, will take a lot of pressure off of the
> JITScript calloc and code memory allocation for long lived browser sessions.

Definitely, that patch looks like it'll be really good for keeping code size down over time.  And we've made good progress on reducing the peak size as well.  I'm feeling much better about this bug now.
All the bugs blocking this meta-bug have been fixed!  Closing.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
No longer depends on: 617656
Were there any other viable usage saving ideas in the 90 comments above that
haven't been filed yet? ie: Comment 61 onwards? (Having trouble telling since
there were lots of different ideas in comment 60+ and a lot of it is over my
head).
No longer depends on: 619849
Depends on: 619849
Depends on: 617656
Numbers for the current nightly:


962/988/1360 with methodjit on (initial value after startup and 1st GC cycle)
793/818/1218 with methodjit on (after waiting 15+ minutes. those extremely long GC cycles make it hard to test repeatedly)
679/704/1084 with methodjit off
I've got an about:memory reporter for the method JIT code stuff underway in bug 623281, as well.
The 8472, what's the workload for comment 97?  I'm trying to compare those numbers with the ones you put in bug 598466 comment 133.

Also, can you remind me what the three measurements are?  Virtual/something/private?
(In reply to comment #99)
> The 8472, what's the workload for comment 97?  I'm trying to compare those
> numbers with the ones you put in bug 598466 comment 133.
> 
> Also, can you remind me what the three measurements are? 
> Virtual/something/private?

private/working set/virtual

The workload in comment 97 is 80 tabs, in bug 598466 comment 133 it's 40 tabs to compare with old data.
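For a rough per-tab reading of the comment 97 numbers, the private-memory column can be compared against the methodjit-off run. This is only a sketch: it assumes the methodjit-off figure is a fair baseline for both methodjit-on measurements, and it ignores measurement noise.

```python
TABS = 80  # workload size for comment 97

# Private memory in MB, from comment 97.
MJIT_ON_INITIAL = 962     # after startup and first GC cycle
MJIT_ON_AFTER_WAIT = 793  # after waiting 15+ minutes
MJIT_OFF = 679

overhead_initial = MJIT_ON_INITIAL - MJIT_OFF
overhead_after_wait = MJIT_ON_AFTER_WAIT - MJIT_OFF

per_tab_initial = overhead_initial / TABS
per_tab_after_wait = overhead_after_wait / TABS

print(overhead_initial, overhead_after_wait,
      round(per_tab_initial, 2), round(per_tab_after_wait, 2))
```

On that reading the method JIT costs a few MB per tab right after loading and somewhat over 1MB per tab once the time-based discarding has run.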
I'll reopen, since the method JIT still uses a lot of memory.  Please note that reopening the bug doesn't mean much by itself.  For the situation to change we'll need concrete ideas and follow-up.  Reading back through the comments I don't see anything in the way of low-hanging fruit.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #101)
> I'll reopen, since the method JIT still uses a lot of memory.  Please note that
> reopening the bug doesn't mean much by itself.  For the situation to change
> we'll need concrete ideas and follow-up.  Reading back through the comments I
> don't see anything in the way of low-hanging fruit.

First of all we need to figure out what's left (compared to methodjit=off) after everything has been GCed. Are those recompiled scripts or something else that isn't covered by the GC?

If it's recompiled scripts then lazy compiling would help, but that's not exactly low-hanging fruit according to comment 25. If lazy compilation were implemented we would need hit-counters for code (compile after 100 hits). Those counters could also be used to discard compiled code more aggressively and only keep hot code.

If it's something else then we should see if it can be included in the garbage collector cycles too.


Another issue is that bug 617656 does not help with peak virtual memory usage, which currently is an issue on 32bit systems, leading to OOM crashes. The discarding is strictly time-based right now, as is garbage-collecting.
Keywords: footprint
All dependent bugs are fixed. Any change on this?
(In reply to comment #103)
> All dependent bugs are fixed. Any change on this?

See comment 101.  Just because all dependent bugs have been fixed doesn't mean the overall problem has been fixed;  it just means all the good ideas we have had for fixing have been done.
Depends on: 629601
Depends on: 630445
Depends on: 630447
Depends on: 631139
Depends on: 630738
Depends on: 631045
Alias: JaegerShrink
Depends on: 631578
Depends on: 631706
Depends on: 631714
(In reply to comment #25)
> 
> 6. Delayed compilation of scripts might be worth a try, but it has some
> difficulties. bhackett's measurements show that compilation is worthwhile only
> if the code runs at least 100 iterations. But it may be hard to take advantage
> of this:
> 
>  - it would be really nice and simple to compile on the Nth entry to a
> function, but that is bad if it contains a loop that will run many
> iterations--in that case we should compile right away.
> 
>  - we could try to compile after N runs of a loop, but then we need to compile
> while we are running the script, and then jump into the compiled code in the
> middle. That doesn't sound *too* hard, but the tracer integration code had to
> do things like that and it was not easy to get right. dvander probably has more
> insight on this issue.
> 
>  - and then there are the inevitable tuning problems. They might not be so bad
> with simple schemes like compiling after 10 iterations, though.
> 
> Compiling global or eval scripts only if they contain a loop (similar to what
> bhackett suggested) seems like an easier starting point. But even there there
> are some issues: some benchmarks eval the same script many times, in which case
> compilation may be worthwhile even if the script doesn't contain a loop.
> 
> Maybe compiling loop-containing scripts right away and others on the Nth
> iteration (for N ~= 10) would be really easy and possibly helpful?

Various measurements have shown this approach is likely to make a huge improvement in JM's memory usage, i.e. somewhere between 2x and 10x depending on the details.  The discussion is currently split across four bugs, see bug 631706 comment 6 for a guide.
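
For anyone skimming, the "compile loop-containing scripts right away and others on the Nth iteration" idea quoted above can be sketched roughly as follows. This is only an illustration, not JaegerMonkey's actual code: the `Script` class, the `on_entry` function and the `CALL_THRESHOLD` name are all hypothetical, and the value 10 is just the "N ~= 10" guess from the quote.

```python
from dataclasses import dataclass

# Hypothetical threshold, taken from the "N ~= 10" suggestion in the quote.
CALL_THRESHOLD = 10


@dataclass
class Script:
    """Hypothetical stand-in for a script; not a real engine type."""
    has_loop: bool
    call_count: int = 0
    compiled: bool = False


def on_entry(script, threshold=CALL_THRESHOLD):
    """Record one entry to the script and return True if it should run compiled."""
    if script.compiled:
        return True
    script.call_count += 1
    # Scripts with loops are compiled immediately; others wait for the Nth entry.
    if script.has_loop or script.call_count >= threshold:
        script.compiled = True
    return script.compiled


if __name__ == "__main__":
    loopy = Script(has_loop=True)
    print("loop script, first entry:", on_entry(loopy))

    straight = Script(has_loop=False)
    print("loop-free script:", [on_entry(straight) for _ in range(12)])
```

Note this sketch deliberately sidesteps the harder case raised in the quote (compiling after N iterations of a loop and jumping into compiled code mid-script); it only covers the per-entry counter.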
Depends on: 631951
(In reply to comment #105)
> The discussion is currently split across four bugs, see bug
> 631706 comment 6 for a guide.

Bug 631951 has been created to subsume those four bugs.
> Bug 631951 has been created to subsume those four bugs.

This bug has landed in the tracemonkey repository, with big reductions in the amount of space used by JM; e.g. I've seen reductions ranging from 2.5x to 4.5x.

Ed, The 8472, anyone else:  new measurements would be welcome.

All the other open bugs [*] that are still blocking this bug will produce improvements that are tiny compared to bug 631951.  For Firefox 4.0, this is as good as it's going to get (which is a lot better than it was in late November).

After 4.0 is released I'll close this bug and start a new one for improvements that can go into Firefox 5 (including the aforementioned tiny-improvement bugs).  Many thanks to everyone who has contributed.

[*] The exception to that is bug 631578 which is about measurements, which may lead to further bugs being filed, but not for 4.0.
(In reply to comment #107)
> This bug has landed in the tracemonkey repository
> Ed, The 8472, anyone else:  new measurements would be welcome.
this one? http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2011-02-15-03-tracemonkey/firefox-4.0b12pre.en-US.win32.zip
ok, using 2011-02-15-03-tracemonkey. 40 cad-comic tabs, image discarding off, HW acceleration off, ctrl+tabbing through all tabs.


data in the following format:
private/working/virtual
js/gc-heap 
js/string-data 
js/mjit-code


methodjit always:
913/922/1144
90,177,536
9,542,568
237,386,742

methodjit on:
648/657/870
80,740,352
10,222,868
38,223,577

methodjit on, 20 minutes later:
676/687/889
105,906,176
5,495,670
4,799,560

methodjit off:
612/621/833
88,080,384
5,529,310
0
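
A quick sketch of the arithmetic on the figures above, in case it helps when comparing runs. The numbers are copied from this comment; the dictionary layout and helper names are made up for illustration, and the byte counts are converted assuming 1 MiB = 1024 * 1024 bytes.

```python
# Private memory (MB) and js/mjit-code (bytes), copied from the figures above.
measurements = {
    "methodjit always": {"private_mb": 913, "mjit_code_bytes": 237_386_742},
    "methodjit on": {"private_mb": 648, "mjit_code_bytes": 38_223_577},
    "methodjit on, 20 minutes later": {"private_mb": 676, "mjit_code_bytes": 4_799_560},
    "methodjit off": {"private_mb": 612, "mjit_code_bytes": 0},
}

# Use the "methodjit off" run as the baseline for private memory.
baseline_mb = measurements["methodjit off"]["private_mb"]


def private_overhead_mb(config):
    """Private memory above the methodjit-off baseline, in MB."""
    return measurements[config]["private_mb"] - baseline_mb


def mjit_code_mib(config):
    """js/mjit-code converted from bytes to MiB."""
    return measurements[config]["mjit_code_bytes"] / (1024 * 1024)


# How much smaller js/mjit-code is with "methodjit on" than with "methodjit always".
code_ratio = (measurements["methodjit always"]["mjit_code_bytes"]
              / measurements["methodjit on"]["mjit_code_bytes"])

if __name__ == "__main__":
    for name in measurements:
        print(f"{name}: +{private_overhead_mb(name)} MB private, "
              f"{mjit_code_mib(name):.1f} MiB mjit code")
    print(f"always/on mjit-code ratio: {code_ratio:.2f}x")
```

Keep in mind the private-memory differences also include changes in js/gc-heap and js/string-data between runs, so they are not purely JIT code.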
Thanks, The 8472!  That looks pretty satisfactory :)
Marking this fixed as discussed in comment 107.  See bug 640457 for follow-ups;  note that bug is about memory usage reductions in Firefox in general, not just JaegerMonkey.  Please CC yourself if you're interested.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED