Last Comment Bug 670596 - GC: Allow unused arenas to be marked as uncommitted
: GC: Allow unused arenas to be marked as uncommitted
Status: RESOLVED FIXED
[MemShrink:P1]
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal with 5 votes (vote)
: mozilla10
Assigned To: Terrence Cole [:terrence]
:
Mentors:
Depends on: 692748 700343 752339 670594 699279 700291
Blocks: 697970 698588 700357 702251 703067
  Show dependency treegraph
 
Reported: 2011-07-10 22:35 PDT by Bill McCloskey (:billm)
Modified: 2012-05-06 10:37 PDT (History)
17 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
WIP - works on Mac (7.26 KB, patch)
2011-07-13 16:01 PDT, Paul Biggar
no flags Details | Diff | Splinter Review
WIP v2: updated to current tip. (7.35 KB, patch)
2011-10-11 13:51 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
WIP v3: fix most crashers. (9.42 KB, patch)
2011-10-11 16:30 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
WIP v4: performance improved to match baseline when body commented (17.36 KB, patch)
2011-10-14 17:23 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
Synthetic benchmark. (3.21 KB, text/plain)
2011-10-14 17:56 PDT, Terrence Cole [:terrence]
no flags Details
v5 - Memory decommit on GC_SHRINK. (21.90 KB, patch)
2011-10-17 17:39 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v6 - Getting close. (22.45 KB, patch)
2011-10-18 15:37 PDT, Terrence Cole [:terrence]
wmccloskey: feedback+
Details | Diff | Splinter Review
v7: With working Windows memory decommit. (31.84 KB, patch)
2011-10-25 10:23 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v8: Simplified arenaAllocate (32.31 KB, patch)
2011-10-31 10:11 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v9: With review feedback. (33.40 KB, patch)
2011-10-31 18:44 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v10: Hooked up about:memory (44.61 KB, patch)
2011-11-02 13:44 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v11: rebased and fixed a critical bug (44.58 KB, patch)
2011-11-02 15:12 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
This appears to work in win7. (604.79 KB, image/png)
2011-11-03 18:24 PDT, Terrence Cole [:terrence]
no flags Details
v12: final version (46.02 KB, patch)
2011-11-03 18:32 PDT, Terrence Cole [:terrence]
n.nethercote: review+
wmccloskey: review+
Details | Diff | Splinter Review
v13: final final version (45.62 KB, patch)
2011-11-04 18:15 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review
v13: with a fixed reviewer string (45.62 KB, patch)
2011-11-04 18:24 PDT, Terrence Cole [:terrence]
no flags Details | Diff | Splinter Review

Description Bill McCloskey (:billm) 2011-07-10 22:35:45 PDT
Right now it sounds like we have a fair amount of fragmentation in the JS GC heap. The reason this happens is that we have a big chunk that's mostly unused. One idea to solve this would be to allow individual arenas in the chunk to be marked as uncommitted (so that the address space is reserved but they're not backed by anything).

I haven't considered it in depth, but naively this doesn't seem super difficult. We'd have to have a way of knowing which arenas are committed or not, so that we don't accidentally try to access an uncommitted arena's header or something. This would probably impose some additional expense in conservative stack scanning (maybe an additional hashtable lookup or something). The arena allocation and deallocation code would also get a little slower. However, I think that most of the time we only access the header of an arena where we know object's are stored.

Before doing this, it would be good to have more data about the fragmentation. It would be useful to be able to separate out these two things: the unused space in a chunk that belongs to an unused arena versus the unused cells in an allocated arena. If most of the unused space is of the former kind, the optimization will work well. If it's of the latter kind, it will work poorly.

Note that we would still pay a cost for having lots of mostly empty chunks. Each chunk has a fairly large amount of fixed overhead---mostly the mark bits, I think. I'm not sure how big this is offhand.

Gregor, Igor, does this seem feasible to you guys?
Comment 1 Nicholas Nethercote [:njn] 2011-07-10 22:41:37 PDT
(In reply to comment #0)
> 
> Before doing this, it would be good to have more data about the
> fragmentation. It would be useful to be able to separate out these two
> things: the unused space in a chunk that belongs to an unused arena versus
> the unused cells in an allocated arena.

Bug 670594 is open for this.  I'm partway through a patch.
Comment 2 Paul Biggar 2011-07-10 22:51:04 PDT
I mentioned this to Nick briefly recently, but it seems that managing the JS arenas seems to be something we could allow jemalloc to do. jemalloc already does the decommit work at least, in much the same way that we would do it ourselves.
Comment 3 Paul Biggar 2011-07-12 16:19:44 PDT
I looked at the platform-specific decommiting code that jemalloc does. It actually seems like it would be trivial to port.

Windows:

    VirtualFree(addr, size, MEM_DECOMMIT);

Mac 10.6+:

                madvise((void *)((uintptr_t)chunk + (i <<
                    pagesize_2pow)), (npages << pagesize_2pow),
                    MADV_FREE);

jemalloc doesn't do this for all decommits, but rather keeps a track of dirty pages and does it when that hits some threshold. It could be that this is expensive, or it could be that the myriad hacks built into our jemalloc prevented us from doing this in a nice way.

There may be a way on 10.5, but I've still to solve that bug for jemalloc (bug 670492). I dont think madvise works on 10.5).


Other (linux+android I guess):

    if (mmap(addr, size, PROT_NONE, MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1,
        0) == MAP_FAILED)
        abort();
Comment 4 Bill McCloskey (:billm) 2011-07-12 16:24:42 PDT
Paul, would you mind checking what this call does on Linux and Mac?
    mprotect(addr, len, PROT_NONE);
Comment 5 Paul Biggar 2011-07-12 16:55:05 PDT
On 10.6, that instantly decommits (that is, Activity Monitor does not count the decommited memory, instantly). Tested with:


#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <strings.h>

int main() {
  int size = 100 * 1024 * 1024;
  char* x = malloc (size);
  memset(x, 1, size);

  printf("Ready...\n");
  sleep (10);

  printf("Go!!\n");
  mprotect(x, size/2, PROT_NONE);

  sleep (10);
}
Comment 6 Bill McCloskey (:billm) 2011-07-12 17:20:41 PDT
OK, I just tried it on Linux myself. The mprotect thing didn't work by itself, but this does:
    mprotect(addr, len, PROT_NONE);
    madvise(addr, len, MADV_DONTNEED);

I'm not sure about Mac.
Comment 7 Paul Biggar 2011-07-12 17:57:49 PDT
mprotect works the same on 10.5 as 10.6.
Comment 8 Bill McCloskey (:billm) 2011-07-13 09:25:33 PDT
Hmm, I realized I may have been wrong here, Paul. In theory, if you do this:
  mprotect(addr, len, PROT_NONE);
  mprotect(addr, len, PROT_READ);
then the data stored at addr should still be accessible. That means that the OS is not allowed to throw the page away. It's possible that on MacOS, it's just being paged out to disk, which is not what we want.

On Linux, the madvise(DONTNEED) call actually causes the page to be thrown away. I wonder if it has the same effect on the Mac. One way to test this is as follows:
  memset(addr, 0x37, len);
  mprotect(addr, len, PROT_NONE);
  mprotect(addr, len, PROT_READ);
  printf("%x\n", *addr);
If you get 0x37, then it was just paged out and not thrown away. In Linux, adding the DONTNEED call between the mprotect calls causes you to get 0.

Also, it turns out you don't even need mprotect on Linux. The madvise is sufficient.
Comment 9 Paul Biggar 2011-07-13 12:53:06 PDT
(In reply to comment #8)
> Hmm, I realized I may have been wrong here, Paul. In theory, if you do this:
>   mprotect(addr, len, PROT_NONE);
>   mprotect(addr, len, PROT_READ);
> then the data stored at addr should still be accessible. That means that the
> OS is not allowed to throw the page away. It's possible that on MacOS, it's
> just being paged out to disk, which is not what we want.

Ah yes, that's right. Back to the drawing board.

> On Linux, the madvise(DONTNEED) call actually causes the page to be thrown
> away. I wonder if it has the same effect on the Mac. One way to test this is
> as follows:
>   memset(addr, 0x37, len);
>   mprotect(addr, len, PROT_NONE);
>   mprotect(addr, len, PROT_READ);
>   printf("%x\n", *addr);
> If you get 0x37, then it was just paged out and not thrown away. In Linux,
> adding the DONTNEED call between the mprotect calls causes you to get 0.
> 
> Also, it turns out you don't even need mprotect on Linux. The madvise is
> sufficient.

This doesn't work on Mac. However, I tested the mmap call from comment 3, and that _does_ decommit, so I think we have a winner there. I'm going to experiment on doing that for jemalloc, and watch whether Tp5 drops. That will definitely confirm it.
Comment 10 Igor Bukanov 2011-07-13 14:16:39 PDT
See bug 669245 comment 13 for alternative ways to deal with the fragmentation issue.
Comment 11 Paul Biggar 2011-07-13 16:01:16 PDT
Created attachment 545769 [details] [diff] [review]
WIP - works on Mac

This seems to work on Mac (no crashes). I'm going to benchmark shortly, but after that I think we should see how bug 669245 comment 13 plays out before finishing this.
Comment 12 Paul Biggar 2011-07-13 20:07:20 PDT
So this implementation completely destroys the performance. Running earley-boyer in the shell goes from .26s to 2.5s, v8-v6 takes 9s instead of 1.5s.

So Earley-boyer is 2.25s slower, and has 29584*2 arena allocations/deallocations. That seems to be about 4.35*10^-5s per mmap.

I tried to see if that's reasonable, and wrote a benchmark that does 10,000,000 mmaps (half committing+memset, half decommtting), and takes 26 seconds (vs 0.5s for just 5,000,000 memsets). So that means a single mmap call takes 2.6*10^-6s.

Assuming that's the best case, we'd optimally add 0.4s to a benchmark which currently takes .26s. So the number of mmaps would need to be improved 30x or so. Since each chunk has 32 arenas, the only thing that sounds plausible is to limit this to the case when we have only 1 or two arenas live in a chunk, and we want to mmap the rest of the chunk.

All in all, this doesn't seem a good way to go, at least compared to the promise of bug 669245 comment 13 and beyond.
Comment 13 Gregor Wagner [:gwagner] 2011-07-13 20:16:11 PDT
Are you doing it for every GC?
I think we should only consider it for the "idle-GC" (gckind == GC_SHRINK) and on the background thread.
Comment 14 Bill McCloskey (:billm) 2011-07-13 20:52:44 PDT
(In reply to comment #13)
> Are you doing it for every GC?
> I think we should only consider it for the "idle-GC" (gckind == GC_SHRINK)
> and on the background thread.

This seems like a great idea. Then we'll never trigger this at all during the benchmarks. This would make it so that arenas could be in three states (used, free, or free and uncommitted) rather than two, but that should be easy to handle.
Comment 15 Paul Biggar 2011-07-14 00:04:36 PDT
That seems a good idea. I'll fix that up tomorrow and how it goes.
Comment 16 Paul Biggar 2011-07-15 15:09:44 PDT
Actually, I'm going to wait and see how bug 671702 goes, it looks like it's simpler and will work better.
Comment 17 Terrence Cole [:terrence] 2011-10-11 13:51:59 PDT
Created attachment 566330 [details] [diff] [review]
WIP v2: updated to current tip.

Updated to apply against current mozilla-central.  Left to address: over-allocates memory in the chunk header, how much memory do we actually save, what is the performance implication?
Comment 18 Justin Lebar (not reading bugmail) 2011-10-11 13:59:29 PDT
You may want to look into using madvise(MADV_DONTNEED) on Linux and madvise(MADV_FREE) on Mac.  In jemalloc, we've found that using these is a lot faster than mmap.

Note that measuring RSS on mac with MADV_FREE is...hard.  See bug 693404.
Comment 19 Terrence Cole [:terrence] 2011-10-11 16:30:19 PDT
Created attachment 566390 [details] [diff] [review]
WIP v3: fix most crashers.

This minimizes the excess memory we require in the chunk headers for the arena bitmap.  It also catches more places where we attempt to access an uncommitted header and does a better job of bringing arenas safely back to life after they have been uncommitted for a time.  We can now pass all of the shell tests with no aborts.

Unfortunately, this still has a catastrophic effect on v8 performance, including an OOM when running splay:

Before:
[terrence@kaapuri v8]$ ../js-opt/js -m -n run.js
Richards: 7117
DeltaBlue: 8622
Crypto: 10957
RayTrace: 3510
EarleyBoyer: 7824
RegExp: 2104
Splay: 10007
----
Score (version 6): 6288

After:
[terrence@kaapuri v8]$ ../js-opt/js -m -n run.js
Richards: 7173
DeltaBlue: 4850
Crypto: 9636
RayTrace: 2230
EarleyBoyer: 3455
RegExp: 1889
splay.js:50: out of memory

Todo:
  - figure out why we are OOMing on splay
  - find out how much of the overhead is in the OS's page management and how much is in the new arena management
Comment 20 Terrence Cole [:terrence] 2011-10-11 17:24:51 PDT
(In reply to Justin Lebar [:jlebar] from comment #18)
> You may want to look into using madvise(MADV_DONTNEED) on Linux and
> madvise(MADV_FREE) on Mac.  In jemalloc, we've found that using these is a
> lot faster than mmap.
> 
> Note that measuring RSS on mac with MADV_FREE is...hard.  See bug 693404.

I played with madvise(MADV_DONTNEED), mprotect(PROT_NONE), and mmap(PROT_NONE) yesterday afternoon.  Only mmap causes linux to count the uncommitted memory towards our RSS.  I worry about the utility of tracking unused pages in userspace if we cannot point to an easily visible number that we are improving.  This is a job that the kernel is already doing, with the help of dedicated hardware.

Ideally, providing the kernel with more accurate information from userspace should help it do a better job of managing our memory.  The fact that we are making things worse with this patch is probably entirely our own fault still -- e.g. we could be keeping the page table from helping by aggressively thrashing the PROT bits on a handful of arenas.  I'll do some serious profiling once I get things fully working to see if we can find and remedy the root cause of the slowdown.
Comment 21 Justin Lebar (not reading bugmail) 2011-10-11 17:43:29 PDT
> Only mmap causes linux to count the uncommitted memory towards our RSS.

Really?  That's surprising.

What's the RSS of the following program after it goes to sleep?  For me, it's 256MB, indicating that memory freed by MADV_DONTNEED is not counted against RSS.

#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#define SIZE (512 * 1024 * 1024)

int main()
{
  char *x = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
  int i;
  for (i = 0; i < SIZE; i += 1024) {
    x[i] = i;
  }
  madvise(x, SIZE / 2, MADV_FREE);
  fprintf(stderr, "Sleeping.  Now check my RSS.  Hopefully it's %dMB.\n", SIZE / (2 * 1024 * 1024));
  sleep(1024);
  return 0;
}


$ ps -p 4770 -F
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
jlebar    4770  4677  0 132067 262520 7 20:41 pts/0    00:00:00 ./temp
Comment 22 Terrence Cole [:terrence] 2011-10-11 18:03:08 PDT
You are quite right: that works.  I was using a malloc'd block when I performed the test.  I thought that glibc's malloc used madvise to reserve addresses -- I will have to go find out what is really going on there now.
Comment 23 Terrence Cole [:terrence] 2011-10-14 17:23:38 PDT
Created attachment 567229 [details] [diff] [review]
WIP v4: performance improved to match baseline when body commented

When we comment out the actual decommit/recommit functionality (they are not by default in this patch), this patch matches the performance of mozilla-central.  E.g. we _can_ get equivalent performance managing the free set outside of the arena bodies with a bit of work to find pages efficiently.  This patch includes code for using mmap to decommit pages (commented) and madvise to decommit pages.  The second seems to be a bit faster and, more importantly, does not fail with ENOMEM when we have too much fragmentation (i.e. on v8 splay).  The performance of both however is cripplingly bad, almost halving our performance on v8 splay and EarleyBoyer.
Comment 24 Justin Lebar (not reading bugmail) 2011-10-14 17:35:21 PDT
I don't think you need to madvise(MADV_NORMAL) memory to recommit it.  Unless this actually improves performance?  (I doubt it does much of anything...)
Comment 25 Justin Lebar (not reading bugmail) 2011-10-14 17:41:41 PDT
I also don't understand why you're EOM'ing with "hard" commit/decommit (which is what you'll need to use on Windows...) but not with "soft" decommit (madvise).  Do you understand what's going on here?
Comment 26 Justin Lebar (not reading bugmail) 2011-10-14 17:46:02 PDT
It looks like this patch decommits an arena as soon as it's freed; I assume the plan, now that you've showed that it's possible for this code to perform well, is to decommit less aggressively?
Comment 27 Terrence Cole [:terrence] 2011-10-14 17:56:49 PDT
Created attachment 567233 [details]
Synthetic benchmark.

Build with: g++ -std=c++0x -O3 -o test test.cpp

This synthetic benchmark allocates 1GiB of memory with mmap, fills a vector with pointers to the first byte of every page in the chunk of memory, then shuffles those pointers.  It then walks the list (e.g. random walk of the memory area) and commits the page, decommits the page, or touches the first byte of the page, as listed, recording the time taken for the entire operation (e.g. time of all accesses/commits/decommits in the 1GiB chunk).

The results from a typical run are:
Allocate:            5 us
Touch:          127993 us
Touch:            2340 us
Touch:            2340 us
Commit:          38980 us
Commit:          39210 us
Commit:          38910 us
Touch:            2292 us
Touch:            2303 us
Touch:            2321 us
Decommit:       247002 us
Decommit:        71764 us
Decommit:        71452 us
Commit:          40491 us
Commit:          40383 us
Commit:          40039 us
Touch:          189611 us
Touch:            4063 us
Touch:            2712 us
Decommit:       160546 us
Decommit:        71747 us
Decommit:        71436 us

At one point, the rss at the end of this test was ~500MiB, but I cannot reproduce that -- at the moment it is sitting at 3KiB.

Observations:
- Decommitting has about the same overhead as a page fault.
- Decommitting causes a page fault (after recommit).
- Committing is fast, but not all that fast.

Things still to do:
- Investigate cost of commit/decommit of larger batches of pages (with touch on all subpages).
- Figure out how jemalloc does this without incurring disastrous overhead.
Comment 28 Justin Lebar (not reading bugmail) 2011-10-14 19:00:39 PDT
> - Figure out how jemalloc does this without incurring disastrous overhead.

1. On Linux/Mac, there's no explicit commit step.  Memory freed with MADV_DONTNEED (Linux) or MADV_FREE (Mac) doesn't need to be explicitly re-madvise'd before it's touched.  When you modify an MADV_DONTNEED'ed page, you incur a page fault which gives you a physical page there.

2. jemalloc keeps around some amount (a few mb) of committed but unused memory.  (This is reported in about:memory as heap-dirty.)

3. When jemalloc decommits, it decommits adjacent pages in a single operation.
Comment 29 Justin Lebar (not reading bugmail) 2011-10-14 19:05:15 PDT
I don't know how easy this would be to test, but I wonder whether malloc'ing arenas directly would be comparable in speed here.  It's a shame to have to reinvent this wheel.
Comment 30 Justin Lebar (not reading bugmail) 2011-10-14 20:09:48 PDT
(In reply to Justin Lebar [:jlebar] from comment #29)
> I don't know how easy this would be to test, but I wonder whether malloc'ing
> arenas directly would be comparable in speed here.  It's a shame to have to
> reinvent this wheel.

Note that the new version of jemalloc (bug 580408) appears to have very fast, lockless paths for malloc and free.  So if the current malloc is too slow, the new one might be fast enough.  The memory reductions from this bug, plus the speed increases of the new jemalloc, might be enough to convince us to assign to it the resources it needs.
Comment 31 Terrence Cole [:terrence] 2011-10-17 17:39:18 PDT
Created attachment 567630 [details] [diff] [review]
v5 - Memory decommit on GC_SHRINK.

This is two patches, one part commented, the other uncommented.  

1) The commented bits are the sane implementation using two bitmaps to track the free/uncommitted states.  This version is slower because of the extra math in the fast path (commenting the committedArenas.get() check restores our performance). 

2) The uncommented bits are the same computations, but tracking the free/committed state of each arena in a full byte in the chunk header.  This version is the same speed as tip, even though it wastes a huge amount of memory (we spill 41 bytes into our last arena and end up with 4055 bytes of padding per chunk).

The correct solution here is probably going to be: use the 'next' pointers in the arena headers to track free arenas (like tip) and add a single bitmap (which fits in our current padding) to track committed arenas.  This will complicate the state tracking a bit, but should reduce our fragmentation and allow us to preserve our current performance on splay.

If that does not work, the last thing we can try is to smash the free/committed bitmaps into one oversize bitmap.  This will reduce the amount of math we need to do back to the single-bitmap case, but will require us to write some custom bit twiddling.
Comment 32 Terrence Cole [:terrence] 2011-10-17 17:52:59 PDT
(In reply to Justin Lebar [:jlebar] from comment #24)
> I don't think you need to madvise(MADV_NORMAL) memory to recommit it. 
> Unless this actually improves performance?  (I doubt it does much of
> anything...)
No, this is just ported forward from Paul's original patch with mmap.  It is good to have there so that we can easily compare performance between methods that require this step and those that don't.

(In reply to Justin Lebar [:jlebar] from comment #25)
> I also don't understand why you're EOM'ing with "hard" commit/decommit
> (which is what you'll need to use on Windows...) but not with "soft"
> decommit (madvise).  Do you understand what's going on here?
Yes, it's pretty clear from the man page.  When we decommit an arena from the middle of one memory mapping that the kernel is tracking, it now has to track two memory maps.  If we split enough memory maps, we overflow the number of memory mappings that the kernel will allow one process to have.

(In reply to Justin Lebar [:jlebar] from comment #26)
> It looks like this patch decommits an arena as soon as it's freed; I assume
> the plan, now that you've showed that it's possible for this code to perform
> well, is to decommit less aggressively?
Yes, on GC_SHRINK in particular.  This will help the browser out without impacting shell performance.

(In reply to Justin Lebar [:jlebar] from comment #29)
> I don't know how easy this would be to test, but I wonder whether malloc'ing
> arenas directly would be comparable in speed here.  It's a shame to have to
> reinvent this wheel.
I agree that this is a shame, but it is not something we can get rid of.  The reason we have 4K aligned arenas inside of 1M aligned chunks is so that we can have insanely fast containment testing for words on the stack when doing conservative scanning.  This is a reasonable tradeoff for the js engine, but not for jemalloc.
Comment 33 Justin Lebar (not reading bugmail) 2011-10-17 18:14:13 PDT
> I agree that this is a shame, but it is not something we can get rid of.  The reason we have 4K 
> aligned arenas inside of 1M aligned chunks is so that we can have insanely fast containment testing 
> for words on the stack when doing conservative scanning.

It's a hash lookup, right?  Is the writing on the wall that a hashtable with 256 times as many entries would be unacceptably slow?  If so, I bet there are more clever algorithms we could use to reduce cache misses.  For example, do a hash lookup using the address rounded to 1MB as the key.  The value is a bitmap with 256 bits, indicating whether we're a GC page.

I'm not saying we definitely 100% for sure can get rid of chunks without slowing down, but I don't think we've really investigated.  (And I don't think we really can until we land the new jemalloc and see how fast it is.)

> If we split enough memory maps, we overflow the number of memory mappings that the kernel will 
> allow one process to have.

Interesting!  I wonder what's going to happen on Windows, where we don't have madvise.
Comment 34 Terrence Cole [:terrence] 2011-10-18 10:42:43 PDT
(In reply to Justin Lebar [:jlebar] from comment #33)
> It's a hash lookup, right?  Is the writing on the wall that a hashtable with
> 256 times as many entries would be unacceptably slow?  If so, I bet there
> are more clever algorithms we could use to reduce cache misses.  For
> example, do a hash lookup using the address rounded to 1MB as the key.  The
> value is a bitmap with 256 bits, indicating whether we're a GC page.
> 
> I'm not saying we definitely 100% for sure can get rid of chunks without
> slowing down, but I don't think we've really investigated.  (And I don't
> think we really can until we land the new jemalloc and see how fast it is.)

Bill already tried this: the overhead of the hash lookup was too high.

The real, long-term, solution to this bug is to have a moving GC and do compaction.  Using malloc, even if it gives us a speedup in the short term, will make it harder to solve the fundamental problem later.

> Interesting!  I wonder what's going to happen on Windows, where we don't
> have madvise.

Are you volunteering to do the porting work? :-)
Comment 35 Justin Lebar (not reading bugmail) 2011-10-18 11:04:06 PDT
> Bill already tried this

He tried both the naive hashtable and the hybrid hashtable / trie approach I suggested?  Can you point me to a bug?

(There are all sorts of things one might do here if the data structure is a bottleneck.  I'm skeptical that we can't make the operation "is page X contained in our data structure?" very fast, even for a table of ~50,000 elements.)

I don't want to distract us from the bug at hand; I think what you're doing is the right short-term solution; it's easier than getting jemalloc2 to work.  But I don't understand how a compacting GC changes the game.  You still have to manage chunks and decommit memory you're not using, right?  My point is that this is hard to do right.
Comment 36 Bill McCloskey (:billm) 2011-10-18 11:10:03 PDT
Just to clarify, I haven't tried replacing chunks with malloc. Igor might have tried something like that, but it was a while ago.

A moving GC would pretty much make this bug obsolete, since compaction eliminates fragmentation and there's no need to decommit. We won't have that for a while, though. Let's stay focused on this bug.
Comment 37 Terrence Cole [:terrence] 2011-10-18 15:37:58 PDT
Created attachment 567909 [details] [diff] [review]
v6 - Getting close.

This patch tracks free arenas in a linked list, as with tip, and tracks decommitted arenas in the chunk header in a bitmap, as in prior patches.  Arena decommit is done (in the simplest possible way currently) when we take a GC_SHRINK event.  With the re-arrangement I have made to allocateArena, this patch actually gives a very slight speedup on shell tests.

Open Questions:
1) How is the style and naming?
2) What do we want/need to do when our (De)CommitMemory fails?  On linux, we can ignore this (and should be skipping the CommitMemory call).  On windows, these are not advisory, so it is more serious there.  Do we want to throw an OOM exception up, investigate our recovery options, abort?
3) Where should I really put the call to DecommitFreePages?  Right now it is right after the MarkAndSweep call, but that's a wild guess on my part.  This seems like a reasonable place to put it, but there is a ton of THREADSAFE in this area that I don't totally understand yet.
4) How do we want to reflect this change in about:memory?  I think they still get reflected in the dirty chunk counts, but we probably want to track them separately so that we can distinguish them in bug reports, etc.
Comment 38 Justin Lebar (not reading bugmail) 2011-10-18 16:04:28 PDT
> 4) How do we want to reflect this change in about:memory?  I think they still get reflected in the 
> dirty chunk counts, but we probably want to track them separately so that we can distinguish them in 
> bug reports, etc.

We are, like, *so* unconcerned with mappings which don't contribute to RSS.

I'd add a new js-heap-decommitted field and make the rest of the numbers reflect resident memory.
Comment 39 Bill McCloskey (:billm) 2011-10-19 12:24:30 PDT
Comment on attachment 567909 [details] [diff] [review]
v6 - Getting close.

Looks good! Terrence and I talked about this in person.
Comment 40 Terrence Cole [:terrence] 2011-10-25 10:23:39 PDT
Created attachment 569413 [details] [diff] [review]
v7: With working Windows memory decommit.

This appeared to work in Windows, but crashed once in an opt build and had many assertions in debug build.  However, tip had the same assertions in a debug build.  I think my windows build environment may just be wrong somehow.  

I have commented out the "only in GC_SHRINK" check (so we always decommit when collecting) in the attached patch -- could I get a try run of this patch/configuration to see if any of the windows buildbots explode?  If they don't explode it will be strong evidence that the problem is on my end and increase our confidence in the correctness of this patch.

Still TODO: update about:memory counters.  The sizes we report in about:memory do not seem to match up with any of the OS reported memory sizes.  Is this expected.
Comment 41 Justin Lebar (not reading bugmail) 2011-10-25 10:30:09 PDT
> Still TODO: update about:memory counters.  The sizes we report in about:memory do not seem to match 
> up with any of the OS reported memory sizes.  Is this expected.

Is what, exactly, expected?

I'd add an additional js-gc-heap-chunk-decommitted field and make js-gc-heap-chunk-{dirty,clean}-unused track only committed pages.
Comment 42 Bill McCloskey (:billm) 2011-10-25 15:31:46 PDT
I pushed this to try:
https://tbpl.mozilla.org/?tree=Try&rev=65a4d5bb6d68
Comment 43 Justin Lebar (not reading bugmail) 2011-10-25 18:38:22 PDT
On Mac, we're going to want something like bug 693404 to unmap and remap the MADV_FREE'd pages before we measure RSS, so we get accurate measurements there.
Comment 44 Bill McCloskey (:billm) 2011-10-27 10:25:51 PDT
Pushed again:
https://tbpl.mozilla.org/?tree=Try&rev=cbffb065a02f
Comment 45 Terrence Cole [:terrence] 2011-10-31 10:11:49 PDT
Created attachment 570746 [details] [diff] [review]
v8: Simplified arenaAllocate

The use of JS_(UN)LIKELY is actually a significant perf increase with this branch structure, which it was not for the previous.  With the JS_(UN)LIKELY annotations, this version is actually just a tad faster.
Comment 46 Bill McCloskey (:billm) 2011-10-31 13:20:08 PDT
Comment on attachment 570746 [details] [diff] [review]
v8: Simplified arenaAllocate

Review of attachment 570746 [details] [diff] [review]:
-----------------------------------------------------------------

Nice patch. My comments are just nits. I think it's almost done, although we still have to do the about:memory stuff. Do you have time to do that? If not, I can take care of it soon.

::: js/src/ds/BitArray.h
@@ +40,5 @@
> +
> +#ifndef BitArray_h__
> +#define BitArray_h__
> +
> +#include "js/TemplateLib.h" 

It's not really necessary, but it would be nice to #include jstypes.h, since it's what defines JS_BITS_PER_WORD.

@@ +59,5 @@
> +    }
> +
> +    inline bool get(size_t offset) const {
> +        return map[offset >> tl::FloorLog2<JS_BITS_PER_WORD>::result] &
> +                   ((size_t)1 << (offset & (JS_BITS_PER_WORD - 1)));

It's kind of nice how the GC does this, with getMarkWordAndMask. Maybe we could do that here too.

::: js/src/jsgc.cpp
@@ +551,3 @@
>  
>      /* We clear the bitmap to guard against xpc_IsGrayGCThing being called on
>         uninitialized data, which would happen before the first GC cycle. */

While you're here, could you fix this comment to have the usual style?
  /*
   *
   */

@@ +555,5 @@
>  
> +    /* Initialize the arena tracking bitmap */ 
> +    decommittedArenas.clear(false);
> +
> +    /* Initialize the chunk info */

Period at the end of the comment, here and elsewhere.

@@ +565,1 @@
>      /* The rest of info fields are initialized in PickChunk. */

This comment should probably stay at the end of the function.

@@ +568,5 @@
> +    for (jsuint i = 0; i < ArenasPerChunk; i++) {
> +        arenas[i].aheader.init();
> +        arenas[i].aheader.next = (i + 1 < ArenasPerChunk)
> +                                 ? &arenas[i + 1].aheader
> +                                 : NULL;

This takes one extra branch per loop iteration, but hopefully it won't matter. It is clearer this way.

@@ +626,5 @@
> +            return i;
> +    for (jsuint i = 0; i < info.lastDecommittedArenaOffset; i++)
> +        if (decommittedArenas.get(i))
> +            return i;
> +    JS_ASSERT(false);

It's more idiomatic to use JS_NOT_REACHED here.

@@ +630,5 @@
> +    JS_ASSERT(false);
> +    return -1;
> +}
> +
> +ArenaHeader*

This should be |ArenaHeader *|. Only the tracer code leaves out the space (and not for long).

@@ +669,5 @@
> +    ArenaHeader *aheader = JS_LIKELY(info.numArenasFree > 0)
> +                           ? fetchNextFreeArena()
> +                           : fetchNextDecommittedArena();
> +    aheader->prepare(comp, thingKind);
> +    if (JS_UNLIKELY(info.numArenasUnused == 0))

This could check !hasAvailableArenas(), as before, right?

@@ +2621,5 @@
> +             */
> +            chunk->info.freeArenasHead = NULL;
> +
> +            while (aheader) {
> +                /* store next so that we have it after we decommit */

Please capitalize at the beginning and add a period at the end. Also below.

::: js/src/jsgc.h
@@ +365,5 @@
>  
>      JSCompartment   *compartment;
> +
> +    /*
> +     * ArenaHeader->next has two purposes: when unallocated, it points to the next

ArenaHeader::next

@@ +367,5 @@
> +
> +    /*
> +     * ArenaHeader->next has two purposes: when unallocated, it points to the next
> +     * available Arena's header. When allocated, it points to the next arena of the
> +     * same size class.

...of the same size class and compartment.

@@ +558,5 @@
> +    /* Free arenas are linked together with aheader.next */
> +    ArenaHeader     *freeArenasHead;
> +
> +    /*
> +     * Decommitted arenas are tracked by a bitmap in the chunk header.  We use

We use one space after a period. Please fix this elsewhere too.

@@ +567,5 @@
> +
> +    /*
> +     * We need to track num freed, decommitted (at OS level) and the total.  We
> +     * track free and the total and compute decommitted, since free and total 
> +     * unused are used in a fast path and decommitted is not.

For the names, I would suggest numArenasFree and numArenasFreeCommitted. Then the comment could be like this:
    numArenasFree: number of free arenas, either committed or decommitted
    numArenasFreeCommitted: number of free committed arenas
Right now, it's hard to tell what they mean.

@@ +572,5 @@
> +     */
> +    uint32          numArenasFree;
> +    uint32          numArenasUnused;
> +
> +    /* Number of gc cycles this chunk has survived. */

gc -> GC

@@ +580,5 @@
> +/*
> + * Our chunk overhead takes up less than 8 arenas (32KiB), so there is no point
> + * computing the exact number of bits we need to track the arenas in a chunk --
> + * it will always be the same number of bytes as tracking all possible arenas
> + * in a chunk.

Above this paragraph, it would be good to explain the basic calculation. Something about how you over-estimate the number of arenas per chunk, and use that to compute the size of the decommit bitmap. And then you compute the actual number of arenas per chunk by dividing the remaining space by the amount of per-arena data (including its mark bitmap).

Then there's more context for describing how the over-estimate is actual about as good as we're going to get in practice.

@@ +587,5 @@
> +const size_t MaxArenasPerChunk = ChunkSize / ArenaSize;
> +
> +/* Number of bytes we allocate to bitmap bits to track per-arena state. */
> +JS_STATIC_ASSERT(MaxArenasPerChunk % JS_BITS_PER_BYTE == 0);
> +const size_t ChunkArenaBitmapBytes = MaxArenasPerChunk / JS_BITS_PER_BYTE;

I like the name ChunkDecommitBitmapBytes better.

@@ +593,5 @@
> +/* Number of bytes an arena uses in a chunk (including cell bitmap bytes). */
> +const size_t BytesPerArenaWithHeader = ArenaSize + ArenaBitmapBytes;
> +
> +/* Number of bytes available to a store arenas (and cell bitmap bytes). */
> +const size_t ChunkArenaBytesAvail = ChunkSize - sizeof(ChunkInfo) - \

No need for the line continuation here. You only need that when using the preprocessor. Also, I like the name ChunkBytesAvailable better.

@@ +804,5 @@
>  }
>  #endif
>  
>  inline void
> +ArenaHeader::init()

This is the same as setAsNotAllocated now. I'd rather leave the code as before, without changing prepare or init.

@@ +826,5 @@
>      firstFreeSpanOffsets = FreeSpan::FullArenaOffsets;
>  }
>  
> +inline void
> +ArenaHeader::setAsNotAllocated()

Why was this moved outside the class definition? I'm not sure why any of these function are outside, but I'd rather not have more of them.

::: js/src/jsgcchunk.cpp
@@ +272,3 @@
>  #endif
>  
>  namespace js {

Could you just wrap this whole thing in namespace gc? Then you wouldn't need the scoping below.

@@ +273,5 @@
>  
>  namespace js {
>  
>  inline void *
>  FindChunkStart(void *p)

Can this function be static?

@@ +339,5 @@
>  } /* namespace js */
>  
> +#ifdef XP_WIN
> +bool
> +js::gc::CommitMemory(void *addr, size_t size)

These functions should go inside the js and the gc namespace declarations.

@@ +341,5 @@
> +#ifdef XP_WIN
> +bool
> +js::gc::CommitMemory(void *addr, size_t size)
> +{
> +    JS_ASSERT((size_t)addr % 4096UL == 0);

I think you want uintptr_t rather than size_t here in the case. Also, I prefer to use the uintptr_t(addr) syntax for casts like this, but it's up to you.

@@ +371,5 @@
> +    int result = madvise(addr, size, MADV_DONTNEED);
> +    return result != -1;
> +}
> +#else
> +JS_STATIC_ASSERT(1 == 0);

It's more typical to use something like |#error "No CommitMemory defined"|.
Comment 47 Terrence Cole [:terrence] 2011-10-31 18:44:40 PDT
Created attachment 570900 [details] [diff] [review]
v9: With review feedback.

This is not yet ready for re-evaluation: I will figure out the memory reporters for about:memory tomorrow, at which point it should be good for a final review.

(In reply to Bill McCloskey (:billm) from comment #46)
> Comment on attachment 570746 [details] [diff] [review] [diff] [details] [review]
> v8: Simplified arenaAllocate
> 
> Review of attachment 570746 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> @@ +669,5 @@
> > +    ArenaHeader *aheader = JS_LIKELY(info.numArenasFree > 0)
> > +                           ? fetchNextFreeArena()
> > +                           : fetchNextDecommittedArena();
> > +    aheader->prepare(comp, thingKind);
> > +    if (JS_UNLIKELY(info.numArenasUnused == 0))
> 
> This could check !hasAvailableArenas(), as before, right?

This was actually a significant slowdown -- !>0 is not equal to ==0 and GCC gets the first slower.  Replaced this with a new inline method noAvailableArenas.
 
> @@ +558,5 @@
> > +    /* Free arenas are linked together with aheader.next */
> > +    ArenaHeader     *freeArenasHead;
> > +
> > +    /*
> > +     * Decommitted arenas are tracked by a bitmap in the chunk header.  We use
> 
> We use one space after a period. Please fix this elsewhere too.

My English teacher twitches anxiously in her grave.
 
> @@ +593,5 @@
> > +/* Number of bytes an arena uses in a chunk (including cell bitmap bytes). */
> > +const size_t BytesPerArenaWithHeader = ArenaSize + ArenaBitmapBytes;
> > +
> > +/* Number of bytes available to a store arenas (and cell bitmap bytes). */
> > +const size_t ChunkArenaBytesAvail = ChunkSize - sizeof(ChunkInfo) - \
> 
> No need for the line continuation here. You only need that when using the
> preprocessor. Also, I like the name ChunkBytesAvailable better.

I have Python on the brain still, it seems.

> @@ +826,5 @@
> >      firstFreeSpanOffsets = FreeSpan::FullArenaOffsets;
> >  }
> >  
> > +inline void
> > +ArenaHeader::setAsNotAllocated()
> 
> Why was this moved outside the class definition? I'm not sure why any of
> these function are outside, but I'd rather not have more of them.

ArenaHeader is about half one way and half the other, so I made the things I touched consistent.  Looks like I moved things the wrong direction.  :-)
 
> ::: js/src/jsgcchunk.cpp
> @@ +272,3 @@
> >  #endif
> >  
> >  namespace js {
> 
> Could you just wrap this whole thing in namespace gc? Then you wouldn't need
> the scoping below.

It makes sense to me.  I thought we had changed our style to not wrap our cpp files with namespaces last week -- I'm probably just not understanding exactly what was decided.  Everything in jsgcchunk needs a thorough scrubbing to remove the C/jemalloc look.

> @@ +273,5 @@
> >  
> >  namespace js {
> >  
> >  inline void *
> >  FindChunkStart(void *p)
> 
> Can this function be static?

Fixed.  Everything in jsgcchunk also needs a thorough scrubbing to remove the Copy/Paste-itis.

> @@ +341,5 @@
> > +#ifdef XP_WIN
> > +bool
> > +js::gc::CommitMemory(void *addr, size_t size)
> > +{
> > +    JS_ASSERT((size_t)addr % 4096UL == 0);
> 
> I think you want uintptr_t rather than size_t here in the case. Also, I
> prefer to use the uintptr_t(addr) syntax for casts like this, but it's up to
> you.

My C roots are showing.
Comment 48 Nicholas Nethercote [:njn] 2011-10-31 19:23:13 PDT
> I will figure out the memory reporters for about:memory tomorrow

What's the plan?  It's not clear to me what the best thing to do is.
Comment 49 Justin Lebar (not reading bugmail) 2011-10-31 19:28:38 PDT
(In reply to Nicholas Nethercote [:njn] from comment #48)
> > I will figure out the memory reporters for about:memory tomorrow
> 
> What's the plan?  It's not clear to me what the best thing to do is.

From comment 41:
> I'd add an additional js-gc-heap-chunk-decommitted field and make 
> js-gc-heap-chunk-{dirty,clean}-unused track only committed pages.

Also, from comment 43:
> On Mac, we're going to want something like bug 693404 to unmap and remap the MADV_FREE'd pages 
> before we measure RSS, so we get accurate measurements there.
Comment 50 Justin Lebar (not reading bugmail) 2011-10-31 19:48:43 PDT
I'm not wild about the nomenclature in this patch, fwiw.  Everything I've read (e.g. [1], jemalloc) uses "committed" to mean "you can touch this without segfaulting".  So when I mmap a chunk, the whole thing is "committed", even though the mapping doesn't have any physical pages until I touch it.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx
Comment 51 Bill McCloskey (:billm) 2011-10-31 20:25:26 PDT
Justin, could you explain what went on in bug 693404? Reading it made my head hurt.

Regarding nomenclature, I think that on Linux and Mac we do exactly what jemalloc does. On Windows, my understanding is that MEM_RESET does pretty much the same thing as decommitting from a memory usage perspective. If that's not true, it's easy to change--we never actually touch the page anyway. But it seems confusing to invent a new word to describe a concept that differs from decommitted in only a very esoteric way.
Comment 52 Justin Lebar (not reading bugmail) 2011-10-31 20:39:58 PDT
My concern about nomenclature stems mainly from MSDN's description of MEM_RESET:

> Indicates that data in the memory range specified by lpAddress and dwSize is no longer of interest. 
> The pages should not be read from or written to the paging file. However, the memory block will be 
> used again later, so it should not be decommitted.

It seems weird to say that we're "decommitting" memory by calling something which says it's not doing that.  (jemalloc uses the words in the same way; it explicitly doesn't "decommit" memory when it calls madvise(MADV_DONTNEED).)

But I agree that making up a new name might not make things clearer.  It may depend on the name.  :)

> Justin, could you explain what went on in bug 693404? Reading it made my head hurt.

Er, yes.  I didn't explain things very well in that bug.

The problem is that on Mac, when you madvise(MADV_FREE), you're hinting to the OS, "I don't need these pages, so you can take them back whenever."  The OS is lazy about this, and only takes back the pages when there's memory pressure on the machine.  Until then, they count against your RSS.

This means that whatever memory gains you're getting from MADV_FREE, you can't measure with RSS.  You could say that your "true RSS" is RSS minus the number of pages you've MADV_FREE'd, but then when you have memory pressure and the OS frees those pages, you're under-counting your RSS!

The hack we did in jemalloc is to add a function which explicitly purges all MADV_FREE'd pages, by decommitting and recommitting them (with mmap(PROT_NONE) and mmap(PROT_READ | PROT_WRITE)).  We then call this function before we read RSS on Mac [1].

This way, when you open about:memory, you'll see the correct RSS (in about:memory and in the Mac task manager).  Telemetry also calls this function, so it'll report the correct result.

[1] https://bug693404.bugzilla.mozilla.org/attachment.cgi?id=567107 (see GetResident implementation).
Comment 53 Justin Lebar (not reading bugmail) 2011-10-31 20:42:32 PDT
I should add that I'm quite guilty of being loose with this nomenclature.  In about:memory, we use "heap-committed" to mean "heap that's in memory".  In fact, jemalloc_stats.committed is, on Linux and Mac, the type of "committed" in this patch, rather than the segfault type of committed.

Anyway, if we're going to abuse the word, we should at least be clear about it in the code.
Comment 54 Terrence Cole [:terrence] 2011-11-01 09:32:01 PDT
I agree with Justin that the word decommit is wrong here; it's what I've been using because that's what the patch I started with called it, and that was probably the original intent.  We know now that full decommission doesn't work well enough.  I switched to "advisory decommission", but didn't update the name of the function.  I'd love to change it now before this gets, err, committed.  What color should this bikeshed be: What is the right word for this memory state?
Comment 55 Justin Lebar (not reading bugmail) 2011-11-01 13:23:46 PDT
Based on an e-mail exchange with jasone, the author of jemalloc, we should double-check that VirtualAlloc(MEM_RESET) frees memory and is reflected in Firefox's RSS as we expect on all supported platforms.
Comment 56 Terrence Cole [:terrence] 2011-11-02 13:44:18 PDT
Created attachment 571440 [details] [diff] [review]
v10: Hooked up about:memory

Now that about:memory is hooked up, we can finally view the full effects of this patch in the real world.

Result: useless.

<1MB returned to the OS after closing all tabs after MemBench.  If we run the decommit aggressively on every GC, this improves to 54MB, but immediately drops to <1MB after a second cycle, even with only 130MB used of the 800MB allocated.  Unfortunately, our fragmentation appears to reside as heavily within arenas as it does within chunks.  The Memshrink work, will, doubtless, make this even worse by putting more objects in each arena.
Comment 57 Terrence Cole [:terrence] 2011-11-02 15:12:52 PDT
Created attachment 571473 [details] [diff] [review]
v11: rebased and fixed a critical bug

Things may not be as bleak as I thought.  The analysis appeared to be working -- it got some results.  However, Bill spotted a typo that would have usually, but not always, have kept us from counting all the decommitted memory.  Re-testing now.
Comment 58 Nicholas Nethercote [:njn] 2011-11-02 15:46:27 PDT
Comment on attachment 571473 [details] [diff] [review]
v11: rebased and fixed a critical bug

Review of attachment 571473 [details] [diff] [review]:
-----------------------------------------------------------------

You haven't updated the computation of gcHeapUnusedPercentage.  Maybe that's deliberate, but you've changed it's meaning.  I'm not sure if that's a good thing or not.  The description of "js-gc-heap-unused-fraction" needs changing to make clear it no longer counts decommitted memory.  Perhaps it should be renamed "js-gc-heap-committed-unused-fraction" (similar to "heap-committed-unallocated-fraction").  Or, the computation should include the decommitted counts.

You also haven't updated the computation of numDirtyChunks, you probably should.

I wonder if it's better to not change any of the counters under "explicit", and just add a "js-gc-heap-decommitted" counter to the "Other Measurements" list.  That would be less likely to introduce errors in the existing counters.

::: js/xpconnect/src/XPCJSRuntime.cpp
@@ +1573,4 @@
>  
>          js::IterateCompartmentsArenasCells(cx, data, CompartmentCallback,
>                                             ArenaCallback, CellCallback);
> +        js::IterateChunks(cx, data, ChunkCallback);

Don't you have to call this before using data->gcHeapChunkCleanDecommitted above?

@@ +1947,5 @@
> +                      JS_GC_HEAP_KIND,
> +                      data.gcHeapChunkCleanDecommitted + data.gcHeapChunkDirtyDecommitted,
> +                      "Memory in the address space of the garbage-collected JavaScript heap that "
> +                      "is currently returned to the OS.",
> +                      callback, closure);

The description should say "The same as 'explicit/js/gc-heap-decommitted'.  Shown here for easy comparison with other 'js-gc' reporters.", similar to the description for "js-gc-heap-chunk-dirty-unused".

You should also change the description of gc-heap-chunk-*-unused to make it clear that they're mutually exclusive with the gc-heap-decommitted.
Comment 59 Justin Lebar (not reading bugmail) 2011-11-02 16:08:16 PDT
> I wonder if it's better to not change any of the counters under "explicit", and
> just add a "js-gc-heap-decommitted" counter to the "Other Measurements" list.
> That would be less likely to introduce errors in the existing counters.

I think it's Good and Right if explicit only counts memory which counts against our RSS.  Nick and I are talking about this on IRC and I *think* he agrees, but I'm not sure.  :)
Comment 60 Gregor Wagner [:gwagner] 2011-11-02 16:28:44 PDT
Do you have any performance numbers for low-end machines? Using less memory should be a big win there but also interesting is the performance when we have to use the memory again.
Comment 61 Justin Lebar (not reading bugmail) 2011-11-02 17:03:21 PDT
So with respect to the memory reporter, Nick and I agreed that it's OK if you report it in explicit, so long as the not-in-core memory is a separate line item from the in-core memory.

The plan is to add a new memory reporter type for not-in-core (what we've been loosely calling "decommitted") memory, so we can distinguish it from in-core memory.  But we don't need to block this work on that.
Comment 62 Terrence Cole [:terrence] 2011-11-03 18:24:12 PDT
Created attachment 571863 [details]
This appears to work in win7.

In this screenshot, memory usage plunged in the windows system monitor after we closed all tabs and did a decommit cycle.
Comment 63 Terrence Cole [:terrence] 2011-11-03 18:32:25 PDT
Created attachment 571866 [details] [diff] [review]
v12: final version

I believe this should be the last version.  Nicholas and I discussed the changes I made to the XPConnect code on IRC, so I think that part is good now.  I would also like feedback on the big block of text I added to jsgc.h to explain the ArenasPerChunk calculation: is this still clear when read by someone who is not me?  I think we can land this for FF10 with time to spare.
Comment 64 Nicholas Nethercote [:njn] 2011-11-03 19:28:31 PDT
Comment on attachment 571866 [details] [diff] [review]
v12: final version

Review of attachment 571866 [details] [diff] [review]:
-----------------------------------------------------------------

The memory reporter stuff looks ok.  There's room for improvement in the terminology:  we talk about "used", "unused", and "decommitted".  More accurate would be "used", "used-and-committed", "unused-and-decommitted".  (And better names would obviate the need for the "decommitted is mutually exclusive with unused" part of the descriptions.)  Those names are rather wordy, though.  If you can think of better names, that'd be great.  If not, what you have in the patch will do.

::: js/xpconnect/src/XPCJSRuntime.cpp
@@ +1289,5 @@
> +{
> +    IterateData *data = static_cast<IterateData *>(vdata);
> +    for (uint32 i = 0; i < js::gc::ArenasPerChunk; i++)
> +        if (chunk->decommittedArenas.get(i))
> +            data->gcHeapChunkDirtyDecommitted += js::gc::ArenaSize;

Should decommittedArenas be called dirtyDecommittedArenas?

@@ +1561,5 @@
>          data->compartmentStatsVector.SetCapacity(rt->compartments.length());
>  
> +        data->gcHeapChunkCleanDecommitted =
> +            rt->gcChunkPool.countDecommittedArenas(rt) *
> +            js::gc::ArenaSize;

Should countDecommittedArenas() be called countCleanDecommittedArenas()?

@@ +1573,4 @@
>  
>          js::IterateCompartmentsArenasCells(cx, data, CompartmentCallback,
>                                             ArenaCallback, CellCallback);
> +        js::IterateChunks(cx, data, ChunkCallback);

A comment above this call saying that it sets data->gcHeapChunkDirtyDecommitted would be nice.

@@ +1640,5 @@
>          data->totalAnalysisTemp  += stats.typeInferenceMemory.temporary;
>      }
>  
>      size_t numDirtyChunks = (data->gcHeapChunkTotal -
>                               data->gcHeapChunkCleanUnused) /

I think you need to subtract data->gcHeapChunkCleanDecommitted as well?
Comment 65 Justin Lebar (not reading bugmail) 2011-11-03 22:25:57 PDT
I'm fine doing the Mac measurement fixes for this (comment 52) as a followup, but we do need to do it.

Can you please file a blocking bug?
Comment 66 Bill McCloskey (:billm) 2011-11-03 22:51:53 PDT
(In reply to Justin Lebar [:jlebar] from comment #65)
> I'm fine doing the Mac measurement fixes for this (comment 52) as a
> followup, but we do need to do it.
> 
> Can you please file a blocking bug?

This is only a problem on 10.5, right?
Comment 67 Justin Lebar (not reading bugmail) 2011-11-04 07:42:22 PDT
I'm pretty sure it's a problem on all versions of Mac OS.

We initially didn't turn jemalloc on for 10.5 because jemalloc appeared to be a regression.  But on 10.6, it just so happened that jemalloc's high water mark was close to the system malloc's memory usage, so we didn't see a regression.
Comment 68 Terrence Cole [:terrence] 2011-11-04 09:32:27 PDT
My understanding is that this will not be a _regression_ on MacOS, correct?  If not, I think we should push forward with this for the win it gives us on Windows and Linux.  We then have 6 weeks to hammer out real improvements to the entire allocation path, rather than just papering over this difference.  There are just 2 work days remaining in this merge window and it would be nice to have broader testing on at least one nightly before it becomes difficult to fix any regressions that crop up.
Comment 69 Justin Lebar (not reading bugmail) 2011-11-04 09:48:51 PDT
> My understanding is that this will not be a _regression_ on MacOS, correct? 

Correct.  It should be a memory win on MacOS, but we won't be able to measure it.  In a sense, we are regressing our ability to measure memory usage accurately, but I don't think that's enough to hold up this patch, which should be a big win.
Comment 70 Terrence Cole [:terrence] 2011-11-04 15:51:30 PDT
https://tbpl.mozilla.org/?tree=Try&rev=f236d314c5bc
Comment 71 Bill McCloskey (:billm) 2011-11-04 18:02:20 PDT
Comment on attachment 571866 [details] [diff] [review]
v12: final version

Review of attachment 571866 [details] [diff] [review]:
-----------------------------------------------------------------

Great, thanks a lot!

::: js/src/jsgc.cpp
@@ +2906,5 @@
>  
>      MarkAndSweep(cx, gckind);
>  
> +    if (gckind == GC_SHRINK)
> +        DecommitFreePages(cx);

Could you move these lines to SweepPhase, right after the chunk expiration stuff near the bottom? That way it will get timed as part of the PHASE_DESTROY stuff. It also makes more sense there.

::: js/src/jsgc.h
@@ +612,5 @@
> + */
> +const size_t BytesPerArenaWithHeader = ArenaSize + ArenaBitmapBytes;
> +const size_t ChunkDecommitBitmapBytes = ChunkSize / ArenaSize / JS_BITS_PER_BYTE;
> +const size_t ChunkBytesAvailable = ChunkSize - sizeof(ChunkInfo)
> +                                    - ChunkDecommitBitmapBytes;

This will fit on one line (we allow up to 100 characters).

@@ +727,5 @@
>      bool unused() const {
> +        return info.numArenasFree == ArenasPerChunk;
> +    }
> +
> +    bool noAvailableArenas() const {

Let's just have one function to do this and reverse the polarity, as we discussed.
Comment 72 Terrence Cole [:terrence] 2011-11-04 18:15:07 PDT
Created attachment 572145 [details] [diff] [review]
v13: final final version

Applied review feedback.
Comment 73 Terrence Cole [:terrence] 2011-11-04 18:24:58 PDT
Created attachment 572146 [details] [diff] [review]
v13: with a fixed reviewer string

Missed this when refreshing the patch header.
Comment 74 Terrence Cole [:terrence] 2011-11-04 18:41:22 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/940adaea65a1
Comment 75 Marco Bonardo [::mak] 2011-11-05 03:12:04 PDT
https://hg.mozilla.org/mozilla-central/rev/940adaea65a1
Comment 76 Marco Bonardo [::mak] 2011-11-05 05:05:25 PDT
this may have caused a 1.5% regression on Domaeo (DOM) on Linux, but any other platform (included Linux 64) seem to be unaffected, is that something that may be expected? May even be some kind of noise...
http://graphs-new.mozilla.org/graph.html#tests=[[73,131,14]]&sel=1320427594972.9185,1320494696222&displayrange=7&datatype=running
Comment 77 Justin Lebar (not reading bugmail) 2011-11-05 07:14:39 PDT
The change is very small, but it may be statistically significant.  Before the change there were no scores of 177, and after the change 3 out of 8 runs have that score.
Comment 78 Justin Lebar (not reading bugmail) 2011-11-07 07:07:18 PST
In normal use of the browser, I see 0mb decommitted in about:memory (latest trunk).  Is this normal?

Also, have you filed the followup bug for measurement on Mac?
Comment 79 Terrence Cole [:terrence] 2011-11-07 09:14:29 PST
(In reply to Justin Lebar [:jlebar] from comment #78)
> In normal use of the browser, I see 0mb decommitted in about:memory (latest
> trunk).  Is this normal?

Yes, this is normal.  The decommit code only runs on GC_SHRINK, which currently only happens when the browser is idle and allocates a new chunk.  The most significant impact right now is going to be for people who leave their browser on overnight with many tabs open.  There are plans underway, as in the bug you linked, to make GC_SHRINK happen more frequently.

> Also, have you filed the followup bug for measurement on Mac?

No, I thought you were going to do that.
Comment 80 Justin Lebar (not reading bugmail) 2011-11-07 09:41:36 PST
> There are plans underway, as in the bug you linked, to make GC_SHRINK happen more frequently.

Bug 700291 has to do only with about:memory, so normal users aren't going to get a memory boost from that.  Are there bugs on the other plans which are underway to make this GC_SHRINK happen more often?

> Also, have you filed the followup bug for measurement on Mac?

Sure, I'd be happy to.
Comment 81 Terrence Cole [:terrence] 2011-11-08 13:57:59 PST
(In reply to Marco Bonardo [:mak] from comment #76)
> this may have caused a 1.5% regression on Domaeo (DOM) on Linux, but any
> other platform (included Linux 64) seem to be unaffected, is that something
> that may be expected? May even be some kind of noise...
> http://graphs-new.mozilla.org/graph.html#tests=[[73,131,
> 14]]&sel=1320427594972.9185,1320494696222&displayrange=7&datatype=running

I was not able to reproduce this.  On a linux32 opt build I get before and after this patch times of:
before (79812:ec39a58d7f39) - 2379.68runs/s (Total)
after  (79813:940adaea65a1) - 2403.43runs/s (Total)

That noise must have been rats or something.  Nothing there now.
Comment 82 Igor Bukanov 2011-11-14 07:19:33 PST
I see that DecommitMemory can returns false. But do we have any indications that this can happen in practice? For example, we assume that munmap/VirtualFree is infallible, so I suppose that the decommit should be similar.
Comment 83 Terrence Cole [:terrence] 2011-11-14 11:55:18 PST
You are quite correct!  The reason I left it returning bool is that it is not clear that we want a strictly advisory decommit on all platforms.  We are currently doing only advisory decommit because stricter decommission can fail on Linux, and MEM_RESET has equivalent semantics on Windows: it was the path of least resistance.  On MacOS, however, the advisory decommit does not show up in RSS, so we probably want to investigate doing a stronger, fallable decommit there at some point.  I left this as future work because the stronger requirements will require significantly more testing, and it is on a platform I do not use regularly.  I don't consider this a super-high priority, but the decommit isn't on a fast-path, so I didn't bother changing it after we decided to only go with the less-strict decommit at first.
Comment 84 Igor Bukanov 2011-11-14 14:16:29 PST
(In reply to Terrence Cole [:terrence] from comment #83)
> The reason I left it returning bool is that it is
> not clear that we want a strictly advisory decommit on all platforms.  We
> are currently doing only advisory decommit because stricter decommission can
> fail on Linux,

What is the stricter decomission?
Comment 85 Nicholas Nethercote [:njn] 2011-11-14 15:40:02 PST
Hughman noticed that the description for the "js-gc-heap-unused-fraction" reporter says this:

 "Fraction of the garbage-collected JavaScript heap that is unused. "
 "Computed as ('js-gc-heap-chunk-clean-unused' + "
 "'js-gc-heap-chunk-dirty-unused' + 'js-gc-heap-arena-unused') / "
 "'js-gc-heap'.",

It should include "js-gc-heap-decommitted" in the addition.  Terrence, can you fix this?  Thanks.
Comment 86 Terrence Cole [:terrence] 2011-11-14 16:11:03 PST
(In reply to Igor Bukanov from comment #84)
> What is the stricter decomission?

VirtualFree/munmap: removal of the pages from the process' memory mapping.  What we are currently doing is VirtualAlloc(..., MEM_RESET, ...) and mmadvise(..., MADV_DONTNEED, ...) which keeps the pages in the process' management, but tells the kernel that the pages are no longer in use.  The effect of this is that if the kernel needs to free up physical memory in the future, it will preferentially use these pages and not bother swapping them out before overwriting them.

(In reply to Nicholas Nethercote [:njn] from comment #85)
> Hughman noticed that the description for the "js-gc-heap-unused-fraction"
> reporter says this:
> 
>  "Fraction of the garbage-collected JavaScript heap that is unused. "
>  "Computed as ('js-gc-heap-chunk-clean-unused' + "
>  "'js-gc-heap-chunk-dirty-unused' + 'js-gc-heap-arena-unused') / "
>  "'js-gc-heap'.",
> 
> It should include "js-gc-heap-decommitted" in the addition.  Terrence, can
> you fix this?  Thanks.

On it.
Comment 87 Justin Lebar (not reading bugmail) 2011-11-14 16:32:22 PST
The stricter decommit is not needed with MADV_DONTNEED on Linux.  On Linux, MADV_DONTNEED'ed pages don't count against your RSS.  It is needed with MADV_FREE'd pages on Mac.  I don't know about Windows; we should test...
Comment 88 Terrence Cole [:terrence] 2011-11-14 16:32:53 PST
Created bug 702480.
Comment 89 Igor Bukanov 2011-11-16 09:08:13 PST
(In reply to Justin Lebar [:jlebar] from comment #87)
> The stricter decommit is not needed with MADV_DONTNEED on Linux.  On Linux,
> MADV_DONTNEED'ed pages don't count against your RSS. 

I saw few times  on Linux that when GC decommits, say, 300, MB of pages, then RSS shrinks just by 30-50 MB max. Does MADV_DONTNEED really works on Linux as advertised?
Comment 90 Justin Lebar (not reading bugmail) 2011-11-16 09:17:33 PST
> Does MADV_DONTNEED really works on Linux as advertised?

Perhaps not! Or perhaps its behavior is different in the browser than in the simple tests I've run.  (It's also possible that 250mb of the 300mb decommitted was not in physical memory.)

I filed bug 702979 on checking this out.
Comment 91 Randell Jesup [:jesup] 2011-11-23 07:21:01 PST
Just a tickler: note that (without having read the patch or every comment here) that saying you don't need the data in a page doesn't mean that the VM space is freed - 32-bit (Win32 in particular) will still be in danger of address-space exhaustion if the pages are not eventually freed (or used instead of new allocations).
Comment 92 Igor Bukanov 2011-11-23 08:12:18 PST
(In reply to Randell Jesup [:jesup] from comment #91)
> Just a tickler: note that (without having read the patch or every comment
> here) that saying you don't need the data in a page doesn't mean that the VM
> space is freed - 32-bit (Win32 in particular) will still be in danger of
> address-space exhaustion if the pages are not eventually freed (or used
> instead of new allocations).

The patch can only improve the situations - it tells the system that a particular page is unused and its physical memory can be used for other things. Without the patch that page will be sitting on the GC empty list to be only released when the whole 1MB GC chunk is released.

Note You need to log in before you can comment on or make changes to this bug.