TM: don't return GCChunks immediately

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: gwagner, Assigned: gwagner)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(12 attachments, 12 obsolete attachments)

30.08 KB, patch
Details | Diff | Splinter Review
36.94 KB, patch
Details | Diff | Splinter Review
36.96 KB, patch
Details | Diff | Splinter Review
6.99 KB, image/png
Details
5.95 KB, image/png
Details
8.83 KB, image/png
Details
9.48 KB, image/png
Details
8.89 KB, image/png
Details
9.60 KB, image/png
Details
9.60 KB, image/png
Details
10.15 KB, image/png
Details
9.03 KB, patch
Details | Diff | Splinter Review
(Assignee)

Description

7 years ago
The measurements in bug 539532 show that we spend a lot of time returning GC pages to the OS during each GC. These pages have to be allocated again. 
The idea is to keep the pages around. This will reduce the GC pause time and object allocation time in general.
(Assignee)

Updated

7 years ago
Assignee: general → anygregor
Blocks: 539532
(Assignee)

Comment 1

7 years ago
Created attachment 423624 [details] [diff] [review]
WiP
(Assignee)

Comment 2

7 years ago
Created attachment 423842 [details] [diff] [review]
patch

Is there a better way to get the chunk address from the chunkinfo? The code in DestoyEmptyGCChunks is a little bit complicated.
Attachment #423624 - Attachment is obsolete: true
Attachment #423842 - Flags: review?(igor)

Comment 3

7 years ago
(In reply to comment #2)
> Created an attachment (id=423842) [details]
> patch

Hm, I thought the idea is to offload chunk destruction to another thread and see how that would affect performance on different platforms. For example, OS X is known for a slow VM, on a faster VM the overhead of having of up to 50 MB unused memory may slow things down.  Also, if munmap is bad even if it is done on another thread, an alternative is to use posix_memalign(big_chunk). That at least would return unused memory into the malloc heap.

> 
> Is there a better way to get the chunk address from the chunkinfo? The code in
> DestoyEmptyGCChunks is a little bit complicated.

Unused chunks can be linked through a link at the first chunk word. They do not need chunkinfo.

Comment 4

7 years ago
We don't want to free the memory. Under load the pages will be required again soon. Gregor's benchmark shows a SunSpider speedup. We never GC during sunspider runs (at least not as part of the SunSpider time). The savings is from avoiding page reallocation. In short, we don't want to free here. Not in the foreground. Not in the background.
(Assignee)

Comment 5

7 years ago
(In reply to comment #3)
> (In reply to comment #2)
> > Created an attachment (id=423842) [details] [details]
> > patch
> 
> Hm, I thought the idea is to offload chunk destruction to another thread and
> see how that would affect performance on different platforms. For example, OS X
> is known for a slow VM, on a faster VM the overhead of having of up to 50 MB
> unused memory may slow things down.  Also, if munmap is bad even if it is done
> on another thread, an alternative is to use posix_memalign(big_chunk). That at
> least would return unused memory into the malloc heap.

I agree that a background thread would reduce the pause time but some benchmarks clearly suffer from the slow reallocation. 
The top is the base64 benchmark with a 13% performance win on my Mac Pro.

It looks like an overall win on sunspider: 
TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.016x as fast    734.0ms +/- 0.3%   722.6ms +/- 0.2%     significant

=============================================================================

  3d:                  1.022x as fast    115.8ms +/- 0.8%   113.3ms +/- 0.3%     significant
    cube:              -                  37.7ms +/- 0.9%    37.3ms +/- 0.9% 
    morph:             1.080x as fast     21.6ms +/- 1.7%    20.0ms +/- 0.0%     significant
    raytrace:          1.009x as fast     56.5ms +/- 0.7%    56.0ms +/- 0.0%     significant

  access:              1.043x as fast     96.8ms +/- 0.8%    92.8ms +/- 0.6%     significant
    binary-trees:      1.026x as fast     19.7ms +/- 1.8%    19.2ms +/- 1.6%     significant
    fannkuch:          1.043x as fast     45.9ms +/- 0.5%    44.0ms +/- 0.0%     significant
    nbody:             1.102x as fast     19.4ms +/- 1.9%    17.6ms +/- 2.1%     significant
    nsieve:            ??                 11.8ms +/- 2.6%    12.0ms +/- 0.0%     not conclusive: might be *1.017x as slow*

  bitops:              1.032x as fast     31.8ms +/- 0.9%    30.8ms +/- 1.0%     significant
    3bit-bits-in-byte: ??                  1.2ms +/- 25.1%     1.4ms +/- 26.4%     not conclusive: might be *1.167x as slow*
    bits-in-byte:      1.078x as fast      9.7ms +/- 3.6%     9.0ms +/- 0.0%     significant
    bitwise-and:       ??                  1.9ms +/- 11.9%     2.0ms +/- 0.0%     not conclusive: might be *1.053x as slow*
    nsieve-bits:       1.033x as fast     19.0ms +/- 0.0%    18.4ms +/- 2.0%     significant

  controlflow:         -                   7.5ms +/- 5.0%     7.2ms +/- 4.2% 
    recursive:         -                   7.5ms +/- 5.0%     7.2ms +/- 4.2% 

  crypto:              1.018x as fast     45.8ms +/- 1.0%    45.0ms +/- 0.7%     significant
    aes:               1.023x as fast     26.2ms +/- 1.2%    25.6ms +/- 1.4%     significant
    md5:               1.008x as fast     13.0ms +/- 0.0%    12.9ms +/- 1.8%     significant
    sha1:              -                   6.6ms +/- 5.6%     6.5ms +/- 5.8% 

  date:                1.029x as fast    103.1ms +/- 0.5%   100.2ms +/- 0.3%     significant
    format-tofte:      1.041x as fast     53.4ms +/- 0.7%    51.3ms +/- 0.7%     significant
    format-xparb:      1.016x as fast     49.7ms +/- 0.7%    48.9ms +/- 0.5%     significant

  math:                -                  34.8ms +/- 1.6%    34.8ms +/- 0.9% 
    cordic:            ??                 15.8ms +/- 1.9%    15.9ms +/- 1.4%     not conclusive: might be *1.006x as slow*
    partial-sums:      -                  12.5ms +/- 3.0%    12.5ms +/- 3.0% 
    spectral-norm:     -                   6.5ms +/- 5.8%     6.4ms +/- 5.8% 

  regexp:              *1.016x as slow*   36.5ms +/- 1.0%    37.1ms +/- 0.6%     significant
    dna:               *1.016x as slow*   36.5ms +/- 1.0%    37.1ms +/- 0.6%     significant

  string:              -                 261.9ms +/- 0.3%   261.4ms +/- 0.2% 
    base64:            1.130x as fast     11.3ms +/- 3.1%    10.0ms +/- 0.0%     significant
    fasta:             1.037x as fast     50.1ms +/- 0.5%    48.3ms +/- 0.7%     significant
    tagcloud:          ??                 86.2ms +/- 0.3%    86.3ms +/- 0.4%     not conclusive: might be *1.001x as slow*
    unpack-code:       *1.031x as slow*   79.6ms +/- 0.5%    82.1ms +/- 0.3%     significant
    validate-input:    -                  34.7ms +/- 1.0%    34.7ms +/- 1.0% 


> 
> > 
> > Is there a better way to get the chunk address from the chunkinfo? The code in
> > DestoyEmptyGCChunks is a little bit complicated.
> 
> Unused chunks can be linked through a link at the first chunk word. They do not
> need chunkinfo.

Sure they could be linked separately. But do we want another data structure?

Comment 6

7 years ago
The code works and is only called once during shutdown. Why does the complexity matter?

Comment 7

7 years ago
(Note: we should add an API hook to flush cashes during memory pressure, we could release empty pages in that hook.)

Comment 8

7 years ago
(In reply to comment #5)
> I agree that a background thread would reduce the pause time but some
> benchmarks clearly suffer from the slow reallocation. 
> The top is the base64 benchmark with a 13% performance win on my Mac Pro.
> 
> It looks like an overall win on sunspider: 
....
> ** TOTAL **:           1.016x as fast    734.0ms +/- 0.3%   722.6ms +/- 0.2%   
>  significant

I did the test on Linux for Intel Atom N270 1.6 GHz and Core-2-based Xenon 2.6 GHz. I see no difference with the patch for susnspider. Could we get the results on Windows?

Comment 9

7 years ago
(In reply to comment #4)
> In short, we don't want to free here. Not in the foreground. Not in the background.

That depends on the quality of VM. I have not observed any win on Linux. Thus for Linux the patch makes things worse overall. It does bring the speedup yet it makes the browser to consume more memory.

Comment 10

7 years ago
(In reply to comment #9)
> That depends on the quality of VM. I have not observed any win on Linux. Thus
> for Linux the patch makes things worse overall. It does bring the speedup yet
> it makes the browser to consume more memory.

I meant "It does not bring the speedup yet it makes the browser to consume more memory."

Comment 11

7 years ago
Did you measure how much more memory is consumed as a percentage of the total memory use over time?

Comment 12

7 years ago
(In reply to comment #11)
> Did you measure how much more memory is consumed as a percentage of the total
> memory use over time?

No. For reliable stats one need a browser session that ressemble real one in number of open windows and in duration. I will try to get such stats tomorrow.
(Assignee)

Comment 13

7 years ago
I can't see a slowdown on Linux on my Intel Q9300 2.5GHz Quadcore:

** TOTAL **:      -            788.8ms +/- 0.7%   785.1ms +/- 0.6% 

I don't have a windows machine to test right now.

Comment 14

7 years ago
(In reply to comment #11)
> Did you measure how much more memory is consumed as a percentage of the total
> memory use over time?

After some testing a memory consumption for my browser session in GC arenas is about 10-15MB. Opening Google docs and maps adds about 5-7 MB to that. Running http://www.chromeexperiments.com/detail/canopy/ adds 10-20 MB to that with an observed peak about 40MB. Closing the corresponding tabs returns the memory to the initial 10-15. Thus if the GC would not return the memory back to the system, this would waste 20MB. That memory would not be available not only for other application, but also for the browser itself. 

Hence the suggestion to try to use posix_memalign or even plain malloc. The corresponding free would at least make the memory available for other browser needs while avoiding mmap/munmap overhead. Once the code has supported that option, but only to allocate 4K arenas. It would be nice to try to use it with current 64K or bigger chunks.

Comment 15

7 years ago
I don't think this approach makes sense for high-throughput applications. You add a bunch of unnecessary malloc/free and mmap/unmap roundtrips (malloc implementations will unmap when you release enough memory in a row). I don't see a point to aggressively save 10-20mb footprint. We should rather GC & shrink when there is real memory pressure.

Comment 16

7 years ago
(In reply to comment #15)
> I don't think this approach makes sense for high-throughput applications. You
> add a bunch of unnecessary malloc/free and mmap/unmap roundtrips (malloc
> implementations will unmap when you release enough memory in a row). I don't
> see a point to aggressively save 10-20mb footprint. We should rather GC &
> shrink when there is real memory pressure.

I made a quick hack to change the GC to use posix_memalign again but allocate in chunks. With the crrent 16 GC arenas per chunk I see no difference, but with 8 arenas and 32K chunks I got in sunspider on Xeon X3353 2.66GHz CPU 

** TOTAL **:           1.016x as fast    763.5ms +/- 0.7%   751.8ms +/- 1.3%     significant

That is, on Linux it make sence to use posix_memalign/free compared with mmap even when munmap does not release the memory.

Comment 17

7 years ago
To confirm that it is important to call "free" to share GC-ed memory with the rest of of the system allocator I changed the patch from comment 2 to use memalign/free. Then the gains from the comment 16 disappeared on Linux. 

I also see similar results with running sunspider in jemalloc-compiled xpcshell. There with posix_memalign/free the speedup is 1.2-1.3% and, again, no speedup if the memory is only released at shutdown.
(Assignee)

Comment 18

7 years ago
How about delaying the free? We could count how many GC cycles the chunk is empty and once it reaches a certain threshold we give it back to the OS.
Yeah, I'm more confortable with some sort of decay model here.  I wanted to try some with arrays when I did the dense array work, so that I could amortize the growth over fewer allocations without having long-term memory impact, but never got to it.

We don't get reliable memory pressure signals, and some of our memory pressure is from the market rather than the OS.

Comment 20

7 years ago
(In reply to comment #18)
> How about delaying the free?

Well, my tests indicates that doing the free in time makes things faster on Linux both with GLIBC allocator and jemalloc.

Comment 21

7 years ago
Created attachment 424051 [details] [diff] [review]
hack to restore posix_memalign

Gregor, could you try this on Mac to see how it would affect sunspider?
(Assignee)

Updated

7 years ago
Attachment #424051 - Attachment mime type: application/octet-stream → text/plain

Comment 22

7 years ago
(In reply to comment #21)
> Created an attachment (id=424051) [details]
> hack to restore posix_memalign
> 
> Gregor, could you try this on Mac to see how it would affect sunspider?

Also, could you try different values of GC_ARENAS_PER_CHUNK?
(Assignee)

Comment 23

7 years ago
(In reply to comment #21)
> Created an attachment (id=424051) [details]
> hack to restore posix_memalign
> 
> Gregor, could you try this on Mac to see how it would affect sunspider?

Yes this makes SS faster on Mac. It goes down from 734 to 716ms. Adding my path is 713ms.

GC_ARENAS_PER_CHUNK
8   717ms
32  720
64  715
128 715

Comment 24

7 years ago
Do we know why its faster? Better cache utilization? We are running strictly more cycles, so it would be worth while finding out where the speedup comes from. Maybe we can have both. More cycles and better cache effects. In what order are you re-using chunks in your patch gregor?
Attachment #424051 - Attachment is patch: true
I will test these patches on Windows. But do we have a better benchmark? It seems that pause times for long-running JS are much more interesting here than SS scores.
There are pause-time benchmarks in the GC suite from Q4, I think, but I don't know where exactly.  Sayre probably will!

Comment 27

7 years ago
(In reply to comment #24)
> Do we know why its faster? Better cache utilization?

gregor's data indicates that on Mac the situation is intuitive one. There VM is slow so either avoiding munmap or replacing it with free brings similar benefits. Now, using free is perhaps faster then not using munmap (715ms versus 722 in the comment 5) since now the GC data is allocated together with the rest of data allocated using malloc. Still not using free at all makes things even faster - 713 ms.

I will recheck my results on Linux tomorrow, but for now the data indicates that the decision to remove posix_memalign support from js_GC was a wrong one. We should have gone in another direction, that is, to remove mmap code.  

 We are running strictly
> more cycles, so it would be worth while finding out where the speedup comes
> from. Maybe we can have both. More cycles and better cache effects. In what
> order are you re-using chunks in your patch gregor?

Comment 28

7 years ago
I am not in favor of mmap over posix_memalign, not at all. I am in favor of 1 mechanism. I still don't understand why we are faster with posix_memalign. I think we should try to understand that. Also, have these numbers been confirmed in the browser? A single-threaded shell has vastly different performance behavior from a multi-threaded browser when it comes to malloc/free (ballpark 5% sunspider difference).

Comment 29

7 years ago
(In reply to comment #25)
> I will test these patches on Windows. 

Note that the patch from the comment 21 still uses VirtualAlloc on Windows. So to test posix_memalign one would need jemalloc shell and hack the patch farther to call posix_memalign when MOZ_MEMORY is defined on Windows.

> But do we have a better benchmark?

To test just the GC performance I often use the following synthetic benchmark or its variations:

function test(N)
{
    var N = 6e6;
    var a = [];
    var time0 = Date.now();
    for(var i = 0; i != N; ++i) {
        a[i] = i + 0.1;
    }
    var time0 = Date.now() - time0;
    gc();
    var time1 = Date.now();
    gc();
    time1 = Date.now() - time1;
    a = null;
    var time2 = Date.now();
    gc();
    time2 = Date.now() - time2;  
    return [time0, time1, time2];
}

//warmup                                                                                                                                                          
test();

var min_times = [Infinity, Infinity, Infinity];

for (var i = 0; i != 10; ++i) {
    var times = test();
    min_times[0] = Math.min(min_times[0], times[0]);
    min_times[1] = Math.min(min_times[1], times[1]);
    min_times[2] = Math.min(min_times[2], times[2]);
}
print("Alloc time: "+min_times[0]);
print("GC time all used: "+min_times[1]);
print("GC time all free: "+min_times[2]);

Comment 30

7 years ago
(In reply to comment #28)
> Also, have these numbers been confirmed
> in the browser?

I have seen the same effect in JS_THREADSAFE js shell and xpcshell from browser build. The problem with running sunspider in the browser itself is that the noise is big and hides the small effect.

Comment 31

7 years ago
Ok, then lets measure JS_THREADSAFE in the shell. I have seen the same malloc/free penalty there as in the browser. It should be a fair enough approximation for this benchmark.
(In reply to comment #25)
> But do we have a better benchmark?

Would :alix's animation/GC demo be of use for that? (Or by chance is that what's being used already?)
http://people.mozilla.com/~afranquet/clock.html

That link is via this page -- http://hacks.mozilla.org/2010/01/javascript-speedups-in-firefox-3-6/ -- which also has this description: 

"Besides animating the dial, this demo creates one million 100-character strings per second, so it requires frequent GC. [...] The estimated GC delay meter gives the average estimated GC delay, based on the assumption that if a frame has a delay of 1.7 times the average delay or more, then exactly one GC ran during that frame."
Ah, yeah, that might work.  One would hope that it would show up in the pause rates there, given the binge/purge allocation behaviour that it manifests.

(I think that demo really creates a million String objects per second, each one pointing at the same 100 character string primitive, so it's not really churning through 100MB/sec, but more like 8MB/sec on a 32-bit machine, right?)

Comment 34

7 years ago
I did the test again and they confirm the initial observation that delaying free slowing down things on Linux. Here for reference susnspider results on Xeon X3353 2.66GHz CPU with xpcshell builda gainst jemalloc.

applying patch from the comment 21 to hg revison 3c3b005de959 improve timing from 742 to 731:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           1.015x as fast    742.0ms +/- 0.4%   730.9ms +/- 0.5%     significant

=============================================================================

  3d:                  ??                122.4ms +/- 1.6%   124.0ms +/- 2.0%     not conclusive: might be *1.013x as slow*
    cube:              ??                 38.5ms +/- 3.8%    38.9ms +/- 3.8%     not conclusive: might be *1.010x as slow*
    morph:             ??                 29.5ms +/- 2.4%    30.2ms +/- 2.9%     not conclusive: might be *1.024x as slow*
    raytrace:          *1.009x as slow*   54.4ms +/- 0.4%    54.9ms +/- 1.4%     significant

  access:              1.045x as fast     93.8ms +/- 0.9%    89.8ms +/- 1.4%     significant
    binary-trees:      1.112x as fast     25.6ms +/- 1.5%    23.0ms +/- 1.2%     significant
    fannkuch:          -                  41.3ms +/- 0.9%    41.0ms +/- 2.2% 
    nbody:             1.053x as fast     16.2ms +/- 1.7%    15.4ms +/- 1.3%     significant
    nsieve:            1.035x as fast     10.7ms +/- 2.2%    10.3ms +/- 1.7%     significant

  bitops:              -                  31.4ms +/- 1.5%    31.1ms +/- 1.7% 
    3bit-bits-in-byte: *1.152x as slow*    1.3ms +/- 10.1%     1.5ms +/- 10.2%     significant
    bits-in-byte:      ??                  9.4ms +/- 3.1%     9.6ms +/- 2.6%     not conclusive: might be *1.023x as slow*
    bitwise-and:       1.074x as fast      3.8ms +/- 4.7%     3.5ms +/- 5.5%     significant
    nsieve-bits:       1.033x as fast     16.9ms +/- 1.3%    16.4ms +/- 1.2%     significant

  controlflow:         ??                  7.4ms +/- 2.0%     7.5ms +/- 2.1%     not conclusive: might be *1.019x as slow*
    recursive:         ??                  7.4ms +/- 2.0%     7.5ms +/- 2.1%     not conclusive: might be *1.019x as slow*

  crypto:              1.019x as fast     45.1ms +/- 0.8%    44.3ms +/- 0.7%     significant
    aes:               1.012x as fast     25.6ms +/- 0.8%    25.3ms +/- 0.6%     significant
    md5:               1.026x as fast     12.9ms +/- 1.8%    12.5ms +/- 1.4%     significant
    sha1:              1.037x as fast      6.6ms +/- 2.2%     6.4ms +/- 2.2%     significant

  date:                1.011x as fast    109.3ms +/- 0.5%   108.1ms +/- 0.3%     significant
    format-tofte:      -                  62.5ms +/- 0.4%    62.5ms +/- 0.4% 
    format-xparb:      1.025x as fast     46.8ms +/- 1.1%    45.6ms +/- 0.5%     significant

  math:                ??                 34.1ms +/- 1.5%    34.3ms +/- 0.8%     not conclusive: might be *1.005x as slow*
    cordic:            -                  12.9ms +/- 2.1%    12.8ms +/- 1.2% 
    partial-sums:      ??                 15.5ms +/- 1.6%    15.7ms +/- 1.1%     not conclusive: might be *1.014x as slow*
    spectral-norm:     -                   5.7ms +/- 2.5%     5.7ms +/- 2.3% 

  regexp:              -                  47.1ms +/- 1.3%    47.0ms +/- 0.8% 
    dna:               -                  47.1ms +/- 1.3%    47.0ms +/- 0.8% 

  string:              1.027x as fast    251.4ms +/- 0.4%   244.9ms +/- 0.5%     significant
    base64:            ??                 11.3ms +/- 1.2%    11.5ms +/- 1.6%     not conclusive: might be *1.014x as slow*
    fasta:             -                  48.9ms +/- 0.6%    48.7ms +/- 0.5% 
    tagcloud:          1.035x as fast     79.7ms +/- 0.5%    77.0ms +/- 0.5%     significant
    unpack-code:       1.039x as fast     78.3ms +/- 0.9%    75.3ms +/- 1.2%     significant
    validate-input:    1.024x as fast     33.3ms +/- 0.9%    32.5ms +/- 0.7%     significant


applying the patch from the comment 2 on top of that slow things down from 731 to 738:


TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.010x as slow*  730.9ms +/- 0.5%   738.3ms +/- 0.7%     significant

=============================================================================

  3d:                  -                 124.0ms +/- 2.0%   122.7ms +/- 2.0% 
    cube:              -                  38.9ms +/- 3.8%    38.0ms +/- 3.6% 
    morph:             -                  30.2ms +/- 2.9%    29.4ms +/- 3.0% 
    raytrace:          ??                 54.9ms +/- 1.4%    55.2ms +/- 1.6%     not conclusive: might be *1.006x as slow*

  access:              ??                 89.8ms +/- 1.4%    91.2ms +/- 0.9%     not conclusive: might be *1.016x as slow*
    binary-trees:      *1.036x as slow*   23.0ms +/- 1.2%    23.8ms +/- 1.9%     significant
    fannkuch:          -                  41.0ms +/- 2.2%    41.0ms +/- 1.2% 
    nbody:             -                  15.4ms +/- 1.3%    15.2ms +/- 1.2% 
    nsieve:            *1.072x as slow*   10.3ms +/- 1.7%    11.1ms +/- 1.1%     significant

  bitops:              *1.025x as slow*   31.1ms +/- 1.7%    31.8ms +/- 1.9%     significant
    3bit-bits-in-byte: ??                  1.5ms +/- 10.2%     1.5ms +/- 9.3%     not conclusive: might be *1.013x as slow*
    bits-in-byte:      -                   9.6ms +/- 2.6%     9.5ms +/- 2.2% 
    bitwise-and:       *1.26x as slow*     3.5ms +/- 5.5%     4.4ms +/- 5.2%     significant
    nsieve-bits:       ??                 16.4ms +/- 1.2%    16.4ms +/- 1.7%     not conclusive: might be *1.001x as slow*

  controlflow:         ??                  7.5ms +/- 2.1%     7.6ms +/- 2.3%     not conclusive: might be *1.005x as slow*
    recursive:         ??                  7.5ms +/- 2.1%     7.6ms +/- 2.3%     not conclusive: might be *1.005x as slow*

  crypto:              *1.015x as slow*   44.3ms +/- 0.7%    44.9ms +/- 1.9%     significant
    aes:               ??                 25.3ms +/- 0.6%    25.4ms +/- 1.4%     not conclusive: might be *1.004x as slow*
    md5:               ??                 12.5ms +/- 1.4%    12.7ms +/- 2.1%     not conclusive: might be *1.016x as slow*
    sha1:              *1.056x as slow*    6.4ms +/- 2.2%     6.8ms +/- 8.3%     significant

  date:                ??                108.1ms +/- 0.3%   108.4ms +/- 1.4%     not conclusive: might be *1.003x as slow*
    format-tofte:      -                  62.5ms +/- 0.4%    62.1ms +/- 1.5% 
    format-xparb:      *1.016x as slow*   45.6ms +/- 0.5%    46.3ms +/- 1.4%     significant

  math:                ??                 34.3ms +/- 0.8%    34.4ms +/- 1.6%     not conclusive: might be *1.004x as slow*
    cordic:            ??                 12.8ms +/- 1.2%    12.9ms +/- 1.8%     not conclusive: might be *1.005x as slow*
    partial-sums:      -                  15.7ms +/- 1.1%    15.7ms +/- 1.7% 
    spectral-norm:     ??                  5.7ms +/- 2.3%     5.8ms +/- 2.7%     not conclusive: might be *1.011x as slow*

  regexp:              1.017x as fast     47.0ms +/- 0.8%    46.2ms +/- 1.4%     significant
    dna:               1.017x as fast     47.0ms +/- 0.8%    46.2ms +/- 1.4%     significant

  string:              *1.025x as slow*  244.9ms +/- 0.5%   251.1ms +/- 0.5%     significant
    base64:            1.049x as fast     11.5ms +/- 1.6%    10.9ms +/- 1.9%     significant
    fasta:             -                  48.7ms +/- 0.5%    48.3ms +/- 1.0% 
    tagcloud:          *1.029x as slow*   77.0ms +/- 0.5%    79.2ms +/- 0.6%     significant
    unpack-code:       *1.062x as slow*   75.3ms +/- 1.2%    80.0ms +/- 1.0%     significant
    validate-input:    ??                 32.5ms +/- 0.7%    32.6ms +/- 0.7%     not conclusive: might be *1.004x as slow*
Sunspider on Vista for tip TM vs. first patch applied:

TEST                   COMPARISON            FROM                 TO
 DETAILS

=============================================================================

** TOTAL **:           1.017x as fast    865.3ms +/- 0.3%   850.5ms +/- 0.2%
 significant

=============================================================================

  3d:                  1.009x as fast    138.2ms +/- 0.7%   136.9ms +/- 0.8%
 significant
    cube:              1.034x as fast     36.6ms +/- 1.0%    35.4ms +/- 1.0%
 significant
    morph:             -                  45.7ms +/- 0.8%    45.4ms +/- 2.0%
    raytrace:          ??                 55.9ms +/- 0.7%    56.1ms +/- 1.5%
 not conclusive: might be *1.004x as slow*

  access:              1.021x as fast    122.9ms +/- 0.7%   120.4ms +/- 0.8%
 significant
    binary-trees:      -                  21.9ms +/- 1.0%    21.6ms +/- 3.6%
    fannkuch:          1.018x as fast     66.2ms +/- 1.1%    65.0ms +/- 0.7%
 significant
    nbody:             1.040x as fast     23.4ms +/- 1.6%    22.5ms +/- 1.7%
 significant
    nsieve:            -                  11.4ms +/- 3.2%    11.3ms +/- 3.1%

  bitops:              ??                 32.8ms +/- 2.0%    33.5ms +/- 2.3%
 not conclusive: might be *1.021x as slow*
    3bit-bits-in-byte: ??                  1.4ms +/- 26.4%     1.6ms +/- 23.1%
   not conclusive: might be *1.143x as slow*
    bits-in-byte:      *1.083x as slow*    8.4ms +/- 4.4%     9.1ms +/- 4.5%
 significant
    bitwise-and:       -                   2.1ms +/- 10.8%     2.0ms +/- 0.0%
    nsieve-bits:       -                  20.9ms +/- 2.5%    20.8ms +/- 2.7%

  controlflow:         ??                  7.2ms +/- 4.2%     7.5ms +/- 5.0%
 not conclusive: might be *1.042x as slow*
    recursive:         ??                  7.2ms +/- 4.2%     7.5ms +/- 5.0%
 not conclusive: might be *1.042x as slow*

  crypto:              ??                 49.6ms +/- 1.0%    49.7ms +/- 1.0%
 not conclusive: might be *1.002x as slow*
    aes:               -                  28.1ms +/- 0.8%    27.9ms +/- 0.8%
    md5:               ??                 13.5ms +/- 2.8%    13.8ms +/- 2.2%
 not conclusive: might be *1.022x as slow*
    sha1:              -                   8.0ms +/- 0.0%     8.0ms +/- 0.0%

  date:                -                 133.7ms +/- 0.4%   133.7ms +/- 0.4%
    format-tofte:      1.008x as fast     62.3ms +/- 0.6%    61.8ms +/- 0.5%
 significant
    format-xparb:      ??                 71.4ms +/- 0.7%    71.9ms +/- 0.6%
 not conclusive: might be *1.007x as slow*

  math:                -                  47.0ms +/- 1.9%    46.2ms +/- 1.0%
    cordic:            1.015x as fast     27.4ms +/- 1.3%    27.0ms +/- 0.0%
 significant
    partial-sums:      -                  14.0ms +/- 5.4%    13.6ms +/- 2.7%
    spectral-norm:     -                   5.6ms +/- 6.6%     5.6ms +/- 6.6%

  regexp:              1.066x as fast     45.2ms +/- 1.8%    42.4ms +/- 0.9%
 significant
    dna:               1.066x as fast     45.2ms +/- 1.8%    42.4ms +/- 0.9%
 significant

  string:              1.030x as fast    288.7ms +/- 0.5%   280.2ms +/- 0.3%
 significant
    base64:            ??                 11.4ms +/- 3.2%    11.5ms +/- 3.3%
 not conclusive: might be *1.009x as slow*
    fasta:             1.039x as fast     58.6ms +/- 1.3%    56.4ms +/- 0.7%
 significant
    tagcloud:          1.032x as fast     95.4ms +/- 1.6%    92.4ms +/- 0.4%
 significant
    unpack-code:       1.032x as fast     86.3ms +/- 0.6%    83.6ms +/- 0.7%
 significant
    validate-input:    1.019x as fast     37.0ms +/- 1.3%    36.3ms +/- 1.0%
 significant
Again on Vista, TM tip vs. posix_memalign patch applied:

TEST                   COMPARISON            FROM                 TO
 DETAILS

=============================================================================

** TOTAL **:           1.011x as fast    865.3ms +/- 0.3%   855.5ms +/- 0.4%
 significant

=============================================================================

  3d:                  1.009x as fast    138.2ms +/- 0.7%   136.9ms +/- 0.4%
 significant
    cube:              1.014x as fast     36.6ms +/- 1.0%    36.1ms +/- 0.6%
 significant
    morph:             1.013x as fast     45.7ms +/- 0.8%    45.1ms +/- 0.5%
 significant
    raytrace:          -                  55.9ms +/- 0.7%    55.7ms +/- 0.6%

  access:              1.016x as fast    122.9ms +/- 0.7%   121.0ms +/- 0.6%
 significant
    binary-trees:      1.019x as fast     21.9ms +/- 1.0%    21.5ms +/- 1.8%
 significant
    fannkuch:          -                  66.2ms +/- 1.1%    65.6ms +/- 0.6%
    nbody:             1.026x as fast     23.4ms +/- 1.6%    22.8ms +/- 2.0%
 significant
    nsieve:            -                  11.4ms +/- 3.2%    11.1ms +/- 2.0%

  bitops:              ??                 32.8ms +/- 2.0%    33.0ms +/- 3.1%
 not conclusive: might be *1.006x as slow*
    3bit-bits-in-byte: ??                  1.4ms +/- 26.4%     1.7ms +/- 20.3%
   not conclusive: might be *1.21x as slow*
    bits-in-byte:      ??                  8.4ms +/- 4.4%     8.6ms +/- 5.8%
 not conclusive: might be *1.024x as slow*
    bitwise-and:       -                   2.1ms +/- 10.8%     2.0ms +/- 0.0%
    nsieve-bits:       -                  20.9ms +/- 2.5%    20.7ms +/- 2.3%

  controlflow:         ??                  7.2ms +/- 4.2%     7.3ms +/- 6.6%
 not conclusive: might be *1.014x as slow*
    recursive:         ??                  7.2ms +/- 4.2%     7.3ms +/- 6.6%
 not conclusive: might be *1.014x as slow*

  crypto:              -                  49.6ms +/- 1.0%    49.0ms +/- 1.0%
    aes:               1.014x as fast     28.1ms +/- 0.8%    27.7ms +/- 1.2%
 significant
    md5:               -                  13.5ms +/- 2.8%    13.5ms +/- 2.8%
    sha1:              1.026x as fast      8.0ms +/- 0.0%     7.8ms +/- 3.9%
 significant

  date:                -                 133.7ms +/- 0.4%   133.3ms +/- 0.5%
    format-tofte:      1.011x as fast     62.3ms +/- 0.6%    61.6ms +/- 0.6%
 significant
    format-xparb:      ??                 71.4ms +/- 0.7%    71.7ms +/- 0.8%
 not conclusive: might be *1.004x as slow*

  math:                -                  47.0ms +/- 1.9%    46.1ms +/- 1.4%
    cordic:            -                  27.4ms +/- 1.3%    27.2ms +/- 1.1%
    partial-sums:      -                  14.0ms +/- 5.4%    13.6ms +/- 2.7%
    spectral-norm:     -                   5.6ms +/- 6.6%     5.3ms +/- 6.5%

  regexp:              1.051x as fast     45.2ms +/- 1.8%    43.0ms +/- 1.1%
 significant
    dna:               1.051x as fast     45.2ms +/- 1.8%    43.0ms +/- 1.1%
 significant

  string:              1.010x as fast    288.7ms +/- 0.5%   285.9ms +/- 0.6%
 significant
    base64:            ??                 11.4ms +/- 3.2%    11.5ms +/- 3.3%
 not conclusive: might be *1.009x as slow*
    fasta:             -                  58.6ms +/- 1.3%    58.5ms +/- 1.9%
    tagcloud:          -                  95.4ms +/- 1.6%    93.8ms +/- 1.1%
    unpack-code:       -                  86.3ms +/- 0.6%    85.7ms +/- 1.2%
    validate-input:    1.016x as fast     37.0ms +/- 1.3%    36.4ms +/- 1.0%
 significant
I measured GC pause times on the Canopy Chrome experiment. My procedure:

  - Measure time for each call to js_GC with rdtsc() and a class that takes
    timestamps in ctor/dtor
  - Run Canopy for several seconds
  - To analyze data, throw away pause times much shorter than usual
    (less than 20M cycles).
  - Report average and max pause time in ms

Results:

                      mean (ms)       max (ms)
TM tip:                  21.5           26.9
first patch:             17.5           25.7
posix_memalign patch:    22.2           26.1
(Assignee)

Comment 38

7 years ago
In order to get a better understanding for Linux I was measuring clock cycles (measured with rdtsc) for the clock benchmark Alex mentioned.
I take the first 10 GC runs once I start the benchmark.
All numbers represent clockcycles * 1E6:
chunk and destroy represent clock-cycles spend in createChunk and destroyChunk function until the next GC.
GC total means clock cycles spend in js_GC.
never return means the approach where I never return chunks to the OS.
Looks like a combination of memalign and delayed returning pages is the way to go for Linux but not for Windows as Davids number show.

Tip:
GC total: 59.984250
create chunk: 1.283946, destroy: 5.357551
GC total: 65.148690
create chunk: 1.702142, destroy: 8.668120
GC total: 65.102940
create chunk: 1.668936, destroy: 8.571407
GC total: 66.635302
create chunk: 1.785963, destroy: 8.642302
GC total: 67.303733
create chunk: 1.823751, destroy: 9.096835
GC total: 65.943045
create chunk: 1.647631, destroy: 8.743144
GC total: 71.483768
create chunk: 1.833246, destroy: 11.906154
GC total: 69.165578
create chunk: 1.959102, destroy: 9.234659
GC total: 62.773845
create chunk: 1.602108, destroy: 8.140037
GC total: 69.744675
create chunk: 1.851726, destroy: 9.220635

memalign
GC total: 56.798535
create chunk: 1.078626, destroy: 4.140181
GC total: 67.185240
create chunk: 2.071883, destroy: 7.399031
GC total: 66.639907
create chunk: 2.088105, destroy: 7.072140
GC total: 66.712673
create chunk: 2.148167, destroy: 7.076758
GC total: 65.022098
create chunk: 1.877782, destroy: 7.274674
GC total: 62.263410
create chunk: 3.300468, destroy: 6.702548
GC total: 62.094457
create chunk: 3.134920, destroy: 6.511204
GC total: 62.282565
create chunk: 3.174125, destroy: 6.583060
GC total: 62.116913
create chunk: 3.093809, destroy: 6.525590
GC total: 60.745748
create chunk: 1.510684, destroy: 3.266253


Tip + never return:
GC total: 53.453002
create chunk: 1.550957, destroy: 0.000000
GC total: 56.476117
create chunk: 0.585223, destroy: 0.000000
GC total: 62.370593
create chunk: 0.256883, destroy: 0.000000
GC total: 55.096665
create chunk: 0.000000, destroy: 0.000000
GC total: 55.641825
create chunk: 0.000000, destroy: 0.000000
GC total: 55.118168
create chunk: 0.000000, destroy: 0.000000
GC total: 55.079535
create chunk: 0.000000, destroy: 0.000000
GC total: 55.285597
create chunk: 0.000000, destroy: 0.000000
GC total: 58.120110
create chunk: 0.000000, destroy: 0.000000
GC total: 58.995465
create chunk: 0.000000, destroy: 0.000000

memalign + never return:
GC total: 34.332472
create chunk: 0.463917, destroy: 0.000000
GC total: 43.483253
create chunk: 0.956575, destroy: 0.000000
GC total: 44.499712
create chunk: 0.051564, destroy: 0.000000
GC total: 44.070758
create chunk: 0.000000, destroy: 0.000000
GC total: 44.366948
create chunk: 0.000000, destroy: 0.000000
GC total: 44.334638
create chunk: 0.000000, destroy: 0.000000
GC total: 44.319345
create chunk: 0.000000, destroy: 0.000000
GC total: 50.329492
create chunk: 0.302865, destroy: 0.000000
GC total: 43.602270
create chunk: 0.000000, destroy: 0.000000
GC total: 45.702323
create chunk: 0.000000, destroy: 0.000000

Comment 39

7 years ago
Created attachment 424198 [details] [diff] [review]
memalign + free on background thread

The new patch moves the free call to the background thread.

Comment 40

7 years ago
(In reply to comment #38)
> In order to get a better understanding for Linux I was measuring clock cycles
> (measured with rdtsc) for the clock benchmark Alex mentioned.

Could you post that measuring code?

Comment 41

7 years ago
Created attachment 424209 [details] [diff] [review]
memalign + free on background thread v2

The new patch increases the chunk size to 128K and uses as before memalign + free on the background thread.

For the benchmark from the comment 29 that exercise only GC allocator with no malloc calls I have with the patch in jemalloc+xpcshell:

base:
Alloc time: 209
GC time all used: 58
GC time all free: 15

memalign + background free:
Alloc time: 214
GC time all used: 59
GC time all free: 6

mmap + never return memory
Alloc time: 200
GC time all used: 59
GC time all free: 7

This shows that on Linux mmap/munmap is faster than memalign allocation but that is completely offset by slower finalization. Still mmap/no-release wins on all counts.

If I change the benchmark so allocation uses the same number of GC and malloc allocations:

function test(N)
{
    var N = 6e6;
    var a = [];
    var time0 = Date.now();
    for(var i = 0; i != N; ++i) {
        a[i] = "aa".toUpperCase();
    }
    var time0 = Date.now() - time0;
    gc();
    var time1 = Date.now();
    gc();
    time1 = Date.now() - time1;
    a = null;
    var time2 = Date.now();
    gc();
    time2 = Date.now() - time2;
    return [time0, time1, time2];
}

// warmup                                                                                                                                                          
test();


var min_times = [Infinity, Infinity, Infinity];

for (var i = 0; i != 100; ++i) {
    var times = test();
    min_times[0] = Math.min(min_times[0], times[0]);
    min_times[1] = Math.min(min_times[1], times[1]);
    min_times[2] = Math.min(min_times[2], times[2]);
}
print("Alloc time: "+min_times[0]);
print("GC time all used: "+min_times[1]);
print("GC time all free: "+min_times[2]);

The results:

base:
Alloc time: 760
GC time all used: 97
GC time all free: 227

memalign + background free:
Alloc time: 745
GC time all used: 97
GC time all free: 197

mmap + never return memory
Alloc time: 722
GC time all used: 97
GC time all free: 196

That is, now memalign wins with the base on allocation while mmap/no-free is still the fastest. My theory regarding memalign allocation speedup is that memalign somehow warmup jemalloc code and data caches. This speedup is not enough to offset the wins from mmap/no-free in this benchmark but in sunspider it is enough to make the allocation faster.


Still if one focuses purely on GC mark and finally timing then memalign+free on the background thread on Linux is at least as fast as mmap/no-free but has an advantage that is does not leave unused memory arround.

Gregor/David, could you repeat the tests with this patch?
Attachment #424051 - Attachment is obsolete: true
Attachment #424198 - Attachment is obsolete: true

Comment 42

7 years ago
Created attachment 424219 [details] [diff] [review]
background munmap

The patch moves munmup calls to the background thread. On Linux it worse then other proposals. There is no changes in sunspider, allocation tests shows slowdown and the mark/sweep phases are not faster then other patches.
(Assignee)

Comment 43

7 years ago
MacPro OS X 10.6.2

I calculated the (arith) mean of 10 GC calls once I started the benchmarks.
All numbers (except Count) represent (rdtsc) cycles * 1E6 

GC Total: accumulated cycles in js_GC (mean)
Alloc since last GC: accumulate cycles in NewGCChunk (mean)
Count: number of NewGCChunk calls between last GC call and current call (no mean value)
per alloc: current accumulated NewGCChunk cycles / Count (no mean value)
Mark, Sweep, Finalize Obj, Doubles: cycles in mark, sweep... (mean)
Destroy: cycles in DestroyGCArenas (mean)
Count: number of DestroyGCArenas calls between last GC call and current one (no mean value)
per Dist: current accumulated DestroyGCArenas cycles / Count (no mean value)


Tip: mmap + 16 arenas per chunk

Clock
GC Total: 86.277183
Alloc since last GC: 11.231231, Count: 490 per alloc: 0.022659
Mark: 20.773317, Sweep: 63.651558
Finalize Obj: 40.383799, Doubles: 0.002311
Destroy: 21.227316, Count: 491, per Dest: 0.037003

Canopy
GC Total: 108.236756
Alloc since last GC: 97.377330, Count: 985 per alloc: 0.100792
Mark: 35.407740, Sweep: 69.874921
Finalize Obj: 19.989628, Doubles: 6.034316
Destroy: 41.343824, Count: 984, per Dest: 0.03649

It is very interesting that the time per allocation increases dramatically if we call mmap over 600 times.


Attachment 424209 [details] [diff]
32 arenas per chunk
posix_memalign ?
free on background thread ?

Clock
GC Total: 84.069372
Alloc since last GC: 5.045846, Count: 254 per alloc: 0.020286
Mark: 20.815956, Sweep: 61.436992
Finalize Obj: 40.309536, Doubles: 0.002079
Destroy: 19.095992, Count: 254, per Dest: 0.065833

Canopy
GC Total: 99.394880
Alloc since last GC: 49.907864, Count: 495 per alloc: 0.115329
Mark: 35.076904, Sweep: 61.967926
Finalize Obj: 18.955368, Doubles: 5.883493
Destoy: 34.705608, Count: 497, per Dest: 0.064539


Attachment 424219 [details] [diff]
Background mmunmap

Clock
GC Total: 71.826731
Alloc since last GC: 11.787724, Count: 490 per alloc: 0.022092
Mark: 20.867233, Sweep: 48.612205
Finalize Obj: 43.314811, Doubles: 0.002473
Destoy: 3.230002, Count: 486, per Dest: 0.054199

Canopy:
GC Total: 71.387655
Alloc since last GC: 95.360449, Count: 1048 per alloc: 0.103057
Mark: 35.334600, Sweep: 33.680048
Finalize Obj: 18.939819, Doubles: 6.190844
Destoy: 6.139078, Count: 1012, per Dest: 0.059629

Allocation time kills here!!!!

never release chunks

Clock
GC Total: 73.801036
Alloc since last GC: 0.000000, Count: 0 per alloc: 0.000000
Mark: 23.834874, Sweep: 47.563324
Finalize Obj: 42.371644, Doubles: 0.003180
Destoy: 3.149600, Count: 0, per Dest: 0.000000

Canopy
GC Total: 58.777888
Alloc since last GC: 1.598225, Count: 0 per alloc: 0.000000
Mark: 27.906838, Sweep: 28.843181
Finalize Obj: 16.121037, Doubles: 5.332098
Destoy: 5.215941, Count: 0, per Dest: 0.000000
(Assignee)

Comment 44

7 years ago
And the same for Ubuntu 9.10 2.6.30

Tip: mmap + 16 arenas per chunk

Clock:
GC Total: 70.850034
Alloc since last GC: 2.257380, Count: 466 per alloc: 0.004596
Mark: 19.527767, Sweep: 50.059779
Finalize Obj: 35.943663, Doubles: 0.001456
Destroy: 12.141015, Count: 467, per Dest: 0.020657

Canopy:
GC Total: 72.883884
Alloc since last GC: 8.125352, Count: 779 per alloc: 0.011639
Mark: 26.552469, Sweep: 44.960441
Finalize Obj: 8.169489, Doubles: 3.656163
Destroy: 19.895777, Count: 776, per Dest: 0.021073
GC Total: 71.099658


Attachment 424209 [details] [diff]
32 arenas per chunk
posix_memalign
free on background thread

Clock:
GC Total: 71.766398
Alloc since last GC: 1.622382, Count: 235 per alloc: 0.007643
Mark: 18.974574, Sweep: 45.849586
Finalize Obj: 34.197816, Doubles: 0.002181
Destroy: 2.356967, Count: 234, per Dest: 0.000039

Canopy:
GC Total: 78.486300
Alloc since last GC: 2.612945, Count: 388 per alloc: 0.006740
Mark: 25.261387, Sweep: 49.146618
Finalize Obj: 8.961127, Doubles: 4.070297
Destroy: 4.361608, Count: 393, per Dest: 0.000038

Sweep is unstable between 30 and 60!


Attachment 424219 [details] [diff]
Background mmunmap

Clock:
GC Total: 70.966204
Alloc since last GC: 2.115593, Count: 474 per alloc: 0.004514
Mark: 19.397501, Sweep: 50.368943
Finalize Obj: 35.884453, Doubles: 0.001558
Destroy: 2.411190, Count: 473, per Dest: 0.021089

Canopy:
GC Total: 71.389255
Alloc since last GC: 9.345766, Count: 752 per alloc: 0.011788
Mark: 26.901378, Sweep: 43.113534
Finalize Obj: 7.819234, Doubles: 3.657568
Destroy: 3.910578, Count: 0, per Dest: 0.000000


never release Chunk

Clock:
GC Total: 61.238915
Alloc since last GC: 0.000000, Count: 0 per alloc: 0.000000
Mark: 19.263335, Sweep: 40.673392
Finalize Obj: 36.356200, Doubles: 0.001970
Destroy: 2.414596, Count: 0, per Dest: 0.000000

Canopy:
GC Total: 53.569553
Alloc since last GC: 0.000000, Count: 0 per alloc: 0.000000
Mark: 27.715081, Sweep: 24.418610
Finalize Obj: 8.086296, Doubles: 3.615567
Destroy: 3.709132, Count: 0, per Dest: 0.000000
(Assignee)

Comment 45

7 years ago
Created attachment 424382 [details] [diff] [review]
patch

A quick hack that delays the chunk release. If a chunk is empty for 3 GC runs, it will be returned to the OS.

I can't see at all how this is slower on Linux:
Tip:
Alloc time: 381
GC time all used: 78
GC time all free: 13

This patch:
Alloc time: 323
GC time all used: 78
GC time all free: 5

Comment 46

7 years ago
For code simplicity I would like to try yet another idea - the background free and allocation. The idea is to have a free arena pool that the background thread would try to replenish when it would drop beyond certain limit. Also the background thread would also free a part of the pool if it becomes too big.

I hope that with big enough minimal pool it would allow to use posix_memalig (_aligned_malloc on Windows when jemalloc is not available) to allocate the arenas directly and remove the code complexity associated with the current chunk management while approaching the performance of never releasing arenas.

Comment 47

7 years ago
I like #46. We should prepare pools in the background and have them ready for the foreground thread to pickup instead of allocating and preparing them on the spot as we run out of pools in the foreground. If we this in the background, why do we need posix_memalign? Once mmap/munmap are not on the hot path we should be able to use them.

Comment 48

7 years ago
(In reply to comment #47)
> Once mmap/munmap are not on the hot path we should be able to use them.

On Windows (at least on XP and Vista) the VirtualAlloc allocation granularity is 64K (see http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx for reasons behind it). Allocating anything less just waists memory. Thus, if we use arenas that are less in size, then the code would still need to deal with chunks. I would like to avoid that to simplify the GC.

Comment 49

7 years ago
Sounds good. I am all for getting rid of chunks.

Comment 50

7 years ago
(In reply to comment #46)
> I hope that with big enough minimal pool it would allow to use posix_memalig
> (_aligned_malloc on Windows when jemalloc is not available)

That does not work. _aligned_malloc(size, alignment) on Windows is just a wrapper on top of malloc that calls malloc(size + alignment + extra_word). Thus using it like _aligned_malloc(GC_PAGE_SIZE, GC_PAGE_SIZE) just waste memory.

The bottom line is that on Windows the minimal alignment that one can get from the system is 64K and one form or another of chunked allocation is necessary as we cannot require that jemalloc must be used, right? So I will try to keep chunks while adding their background allocation and trimming.

Comment 51

7 years ago
Created attachment 425802 [details] [diff] [review]
background mmap/munmap

This patch implements background allocation and release of GC chunks using mmap/munmap.
Attachment #424209 - Attachment is obsolete: true
Attachment #424219 - Attachment is obsolete: true

Comment 52

7 years ago
Created attachment 425803 [details] [diff] [review]
background oversized malloc and free

This patch replaces mmap/munmap for chunks with oversized malloc. This wastes about one arena per chunk, but the plus is that any released chunk is immediately available for other caller of malloc. As with the previous patch, both the allocation and release of chunks is done in the background.

Comment 53

7 years ago
Created attachment 425804 [details] [diff] [review]
foreground oversized malloc + background free

The new patch effectively disables the background allocation in the previous patch so all chunk allocations is done using straight oversized malloc call.

Comment 54

7 years ago
Here are the benchmark results on Linux for 4-core Xeon X3353 2.66 GHz CPU for the last 3 patches compared TM tip (revison ce654228dabe). I have used jemalloc-enabled xpcshell to stay as close to a browser as possible. The table lists the speed up factors compared with the base, so a number less than 1 is a slowdown .

                       susnspider    alloc1    free1    alloc2   free2
background mmap/munmap      1.000      0.84     1.14      1.00    2.14
background malloc/free      1.016      0.91     1.14      1.01    2.14
malloc/background free      1.011      1.02     1.14      0.98    2.14

Here alloc1 and free1 are numbers for the benchmark from the comment 41 and alloc2 and free2 are for the benchmark from the comment 29. In the first benchmark there is a 50% mix of GC and malloc allocations. In the second benchmark only GC allocations are exercised.

The numbers indicates that background munmap/free makes things better accross all patches. 

The pure allocation measurements shows a slowdown with background mmap/malloc done on separated thread. One possibility for that could be an extra memory bandwidth that is required to populate caches on two threads. 

To explain sunspider results that shows a win for malloc over mmap one has to remember that after each test the sunspider driver runs the gc(). When that returns the background thread would continue to call the free. It seems that such free improves malloc performance on another thread. I do not have a reason for that, but I have observed that effect with jemalloc and GLIBC-malloc in other benchmarks and patches. 

From code simplicity point of view on Linux the winner is oversized malloc done in foreground with background free. It wastes some space, but some of that can be recovered with moving JSGCArenInfo there.
(Assignee)

Comment 55

7 years ago
I tried attachment 425804 [details] [diff] [review] on the clock benchmark on my Mac Pro. I just filed a bug 545729 because canopy does not work right now.
I also noticed a big GC pause difference between the first start of the browser after a recompile and restarts afterwards. The GC Pause time decreases by about 20% after the first restart. These numbers are now collected after a restart.

Again all numbers except call counts are cycles * 1E6 and a mean value of 10 GC runs after the benchmark is running for about one minute.

Tip
GC Total: 72.612444
NewGCChunk: 9.6939
NewGCChunk calls: 457, per alloc: 0.023627
Mark: 15.872284, Sweep: 55.327894
Finalize Obj: 35.034302, Doubles: 0.002316
Desrtoy Chunk: 18.502619, Count: 458, per Dest: 0.037205

attachment 425804 [details] [diff] [review]
GC Total: 55.300570
NewGCChunk: 4.147308
NewGCChunk called: 447 per alloc: 0.010420
Mark: 16.602647, Sweep: 34.473739
Finalize Obj: 32.812583, Doubles: 0.001686
Destroy GCChunk calls: 421, per Dest: 0.081875

return GCChunks if they remain empty for 2 GC runs
GC Total: 53.816886
NewGCChunk: 0, 
Count: 0 per alloc: 0.000000
Mark: 16.472776, Sweep: 35.943617
Finalize Obj: 31.837134, Doubles: 0.002064
Destroy GCChunk: 0
Count: 0, per Dest: 0.000000


The GC pause time is reduced thats good but we still give back all free chunks and reallocate them right away. All benchmarks that generate short-lived objects will suffer due to the bad allocation job OS X does. Also SS suffers as shown in comment 5.
I don't understand the code simplicity argument. JSReplenishGCChunkPoolTask is simpler than adding a count variable to each chunk?
Returning chunks after they remained empty between two GC calls shouldn't be that complicated.

Comment 56

7 years ago
(In reply to comment #55)
> The GC pause time is reduced thats good but we still give back all free chunks
> and reallocate them right away. All benchmarks that generate short-lived
> objects will suffer due to the bad allocation job OS X does. Also SS suffers as
> shown in comment 5.

That is a platform-specific. On Linux I see the opposite. So I guess we need to get the results on Windows before we can proceed. Also note that a particular way to delay chunk release is orthogonal to whether we use mmap or malloc. 

Could you also try the attachment  attachment 425803 [details] [diff] [review]?

> I don't understand the code simplicity argument. JSReplenishGCChunkPoolTask is
> simpler than adding a count variable to each chunk?

Sorry I was not clear about that. I was referring to the difference between the attachment 425803 [details] [diff] [review] and the attachment 425804 [details] [diff] [review]. The former implements background chunk allocation on top of the latter. That background allocation brings extra complexity that AFAICS does not justify the gains on Linux.

Comment 57

7 years ago
Engine returns GC chunks to the OS way to fast.
This fix would yield big speedup's in a threaded world where GC is run much more often.

Ideas on when to expect this to land?
thanks,

Mike M.
(Assignee)

Comment 58

7 years ago
Well as you can see with the discussion in this bug, its not so easy.
It seems on Mac its a huge performance win if we don't return the chunks right away but on Linux it's a little bit different. Neither Igor, nor me has a windows machine to get some numbers. So we are waiting for a windows user but I agree we should be more active here again.

Comment 59

7 years ago
I'm your windows guy.

Update your patch for tip (make sure to use js_malloc() instead of malloc() and js_free() instead of free())

Just tell me exactly how to test it and I'll get you some numbers for WIN32.
Does your testing method involve threads?

Mike M.
(Assignee)

Comment 60

7 years ago
Created attachment 433224 [details]
Current Tip Canopy

I want to go back to this bug again and show how important it would be to solve the problem for Mac.
This Graph represents GC times of the Canopy benchmark for current Tip.
(Assignee)

Comment 61

7 years ago
Created attachment 433225 [details]
Delayed Return of GC Chunks on Mac

And the same for my patch where I return GC chunks if they remain empty for 2 GC runs. The average GC time reduces from about 120 [1E6 cycles] to 80.
We should decide which direction we want to go here....

My patch shouldn't need a refresh but I can create one if it does.

Comment 62

7 years ago
Can we get some numbers for windows? That would be great. As long we don't make things slower there, we should take the patch.

Comment 63

7 years ago
I can get you numbers.
Just tell me what i need to do to get those timings. Be as specific as possible.

Greg, updating the patch(s) might help too.  Seems to be two needed for this?

Comment 64

7 years ago
(In reply to comment #61)
> We should decide which direction we want to go here....

I would like to move to the background thread the finalization of doubles and strings. For most practical cases this will effectively moves the chunk release from the GC thread. When this will be done we may revisit this bug to see if background allocation or delayed release of the chunks would be necessary.

Comment 65

7 years ago
Igor,

In an embedding with many threads (where GC runs way more often) this patch will probably make a big difference. I'm not sure if background finalization will have the same impact.

Your thoughts?

Comment 66

7 years ago
(In reply to comment #65)
> In an embedding with many threads (where GC runs way more often) this patch
> will probably make a big difference. I'm not sure if background finalization
> will have the same impact.

I am not against proposed changes. I just would like to do them after implementing the background finalization as that may require to rewrite any proposed patch here.

Comment 67

7 years ago
Igor,

Do you have a bug# for the background finalization you are speaking about?
If so are you in active development or is this an idea for future.
ETA?

Thanks.
(Assignee)

Comment 68

7 years ago
(In reply to comment #66)
> (In reply to comment #65)
> > In an embedding with many threads (where GC runs way more often) this patch
> > will probably make a big difference. I'm not sure if background finalization
> > will have the same impact.
> 
> I am not against proposed changes. I just would like to do them after
> implementing the background finalization as that may require to rewrite any
> proposed patch here.

When will this patch be ready?
So if I understand correctly you want to move the finalization and chunk release to the background thread.
This means for the canopy benchmark for example that we free and alloc the same amount of memory in parallel. I don't think this will be very fast. Didn't we see the impact on the GC benchmark we had to remove from the test-suite?

Comment 69

7 years ago
(In reply to comment #67)
> Igor,
> 
> Do you have a bug# for the background finalization you are speaking about?
> If so are you in active development or is this an idea for future.
> ETA?

Bug 543036

Comment 70

7 years ago
(In reply to comment #68)
> When will this patch be ready?

For doubles (bug 543036) I should have the patch before Monday. Strings shold be ready during the next week.

> This means for the canopy benchmark for example that we free and alloc the same
> amount of memory in parallel.

We already do the free/malloc in parallel. But this is still faster then free from one thread as it minimizes the GC pause and a possible slowdown is spread over all allocations. Still for the GC chunks we may need to do something special. But then again, lets consider this after the background changes.
(Assignee)

Comment 71

7 years ago
Created attachment 433974 [details]
Tip: Canopy

Added GCChunk count.
(Assignee)

Comment 72

7 years ago
Created attachment 433975 [details]
Canopy with this patch.

Added GCChunk count.

Comment 73

7 years ago
In the bug 553812 I am going to bump the chunk size to 2 MB among other refactorings. The patch will also use vm_allocate on MAC. That will alter the stats here.
Depends on: 553812

Comment 74

7 years ago
Gregor, could you try on Mac thepatch from the bug 553812 to see how it performs?
(Assignee)

Comment 75

7 years ago
Created attachment 434631 [details]
Canopy with patch from bug 553812

Performance for the patch of bug 553812. Igor asked in this bug and since all other graphs are in this bug I post the result here.

I can't see a reduction of the GC pause time. We allocate and deallocate less chunks but we still give them back and allocate them right away.

It might be a improvement that we perform less GC runs but I have to verify that.
(Assignee)

Comment 76

7 years ago
Created attachment 434735 [details]
Clock Benchmark Tip

Comparison of the current tip, Igors patch for bug 553812 and the patch from this bug with the delayed chunk deallocation.
The main difference is that the chunk destruction gets cheaper. Its 20E6 cycles for Tip, 10E6 cycles for Igors patch and 4E6 cycles for this patch.
The overall GC pause is about 68E6 cycles for Tip, 58E6 cycles with Igors patch and 53 with the delayed Chunk deallocation.


PS: adding multiple file at once would be a nice feature :)
(Assignee)

Comment 77

7 years ago
Created attachment 434737 [details]
Clock Benchmark with patch from bug 553812
(Assignee)

Comment 78

7 years ago
Created attachment 434738 [details]
Clock Benchmark with delayed chunk deallocation

Comment 79

7 years ago
Gregor: thanks for benchmarking!

(In reply to comment #75)
> I can't see a reduction of the GC pause time. We allocate and deallocate less
> chunks but we still give them back and allocate them right away.

Yes, sure. But with bigger chunks a simple strategy of having one extra chunk available (perhaps via background allocation) may just work. Plus these chunks can be shared with jemalloc chunks so making a common pool of chunks for jemalloc and the GC could be very beneficial.

Comment 80

7 years ago
We have a proposed patch that is a strict improvement and a lot of maybes. I side with the patch in hand, and possible future improvements on top of that.

Comment 81

7 years ago
Would it be possible to make this value:

static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;

settable via API?

If you are running 100 concurrent threads GC gets call A LOT.
The thrashing is pretty bad.  It might be nice to be able to tune this by hand depending on the embedding.

Other than that...land it.

Comment 82

7 years ago
(In reply to comment #80)
> We have a proposed patch that is a strict improvement and a lot of maybes.

The patch is a strict improvement on Mac. On Linux with its better VM the situation is not that clear. That is why we need data from a Windows.

> I side with the patch in hand, and possible future improvements on top of that.

On top of the patch from the bug 553812 I would like to implement a pool of chunks that is shared between jemalloc and the GC. For such pool any internal GC pooling would be counterproductive. Also, even for platforms where the pooling is necessary (MAC does not use jemalloc), big 1MB chunks may need/allow for a different strategy of pooling. That is why I would like to land first platform-neutral changes and then add platform-specific improvements.

Comment 83

7 years ago
Can we land patch from bug 553682? That would help me get you some numbers for windows.

> On top of the patch from the bug 553812 I would like to implement a pool of
> chunks that is shared between jemalloc and the GC. For such pool any internal
> GC pooling would be counterproductive.

Igor that might help the browser but not spider monkey embedders in general.  We don't use jemalloc.  
(If JS_USE_CUSTOM_ALLOCATOR patch ever lands we might be able to) Your scope is too narrow.
Also please do some multi-threaded testing on Linux before you make up your mind.

Try testing with many threads running and GC firing every couple of seconds.

Comment 84

7 years ago
(In reply to comment #83)
> Can we land patch from bug 553682? That would help me get you some numbers for
> windows.

That is already done.

> Igor that might help the browser but not spider monkey embedders in general. 
> We don't use jemalloc.  

This emphases that what is good for the browser may not be good for other embed dings. I suspect that the ultimate solution would be to do the chunk pooling outside the engine via some easy to use API especially given huge platform differences in mmap implementations. 

One way to do that would be to support rerouting of chunk allocation/deallocation through an embedding supplied callback. This will support both the browser implementing common pooling via jemalloc or a heavy multi-threaded embedding that just preallocate all the GC heap memory for best possible performance. If embedding has enough memory and does not care if it is not swappable it could event try on Windows VirtualAloc(MEM_LARGE_PAGES).

Comment 85

7 years ago
> This emphases that what is good for the browser may not be good for other embed
> dings. I suspect that the ultimate solution would be to do the chunk pooling
> outside the engine via some easy to use API especially given huge platform
> differences in mmap implementations. 
> One way to do that would be to support rerouting of chunk
> allocation/deallocation through an embedding supplied callback.

Yes.  That was the purpose of bug 549532 (custom allocator support).
Once it lands, I will create a follow up patch that will handle the need for aligned memory need by GC such as VirtualAlloc.  That patch got too big so I left it out for now.
Then you can put in jemalloc or any other custom allocator, for only GC memory, or all allocations if you wish.

That will help a lot.  However the "ultimate solution" is to have thread specific GC (or greatly reduced locking) AND a custom allocator in play.
(Assignee)

Comment 86

7 years ago
(In reply to comment #81)
> Would it be possible to make this value:
> 
> static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
> 
> settable via API?
> 

Sounds like a good idea. I will refresh it later today.
We could even set it to 0 for linux if we don't find a solution for the platform problem.

Comment 87

7 years ago
(In reply to comment #86)
> (In reply to comment #81)
> > Would it be possible to make this value:
> > 
> > static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
> > 
> > settable via API?
> > 
> 
> Sounds like a good idea. I will refresh it later today.
> We could even set it to 0 for linux if we don't find a solution for the
> platform problem.

I suggest to add instead an API that would allow to override NewChunk/DestroyChunk calls and implement the pooling outside the engine. This way the browser can implement the chunk release on memory pressure without the need to provide yet another hook into the engine.

Comment 88

7 years ago
> I suggest to add instead an API that would allow to override
> NewChunk/DestroyChunk calls and implement the pooling outside the engine.

See comment #85.

> This way the browser can implement the chunk release on memory pressure
> without the need to provide yet another hook into the engine.

I think the patch is still valid. Even with a custom allocator.
An allocator could introduce more locking contention.  And not very many embedders are going to use one. Its sort of "advanced" I think.
It would be nice to improve speed without the need for a custom allocator.

Comment 89

7 years ago
(In reply to comment #88)
> An allocator could introduce more locking contention.

The lock contention is not an issue for GC chunks.

>  And not very many
> embedders are going to use one. Its sort of "advanced" I think.
> It would be nice to improve speed without the need for a custom allocator.

Since on Linux the benefits are not that clear with the proposed API the engine is going to gain a few more lines of platform-dependent code while exposing a chunk management policy that may not be that useful outside a browser. Plus it may not be useful even with the browser on Windows and Linux if we could implement a shared pool of chunks for the GC and jemalloc. Also consider that in the browser we have all the necessary infrastructure to do the time-based and memory-pressure based chunk allocations that are hard to do in the engine.

These are the grounds for my belief that a chunk allocation callback would be a better idea. I do not think that a simple to use API together with working examples would make things very complex for embedders. For example, the documentation or a test source could provide an example how to write a trivial custom "allocator" that just VirtualAlloc all the necessary memory in one call and releases it only on the shutdown. For server-style embeddings that would be the fastest way to deal with chunks.

Updated

7 years ago
Depends on: 556589
(Assignee)

Comment 90

7 years ago
Well I want to come back to the starting point of this bug where we found out that for certain webpages or benchmark sweeping really kills us. 

Lets take the numbers from bug 539532 and 556133 [in cycles 1E6] for OS X:

                    mark       sweep       total
  Firefox Tip 1:      25         110         135
  Firefox Tip 2:      25         210         235
  WebKit:              7           0           7

Firefox Tip 1 is cross-compiled with all the flags that were needed a few months ago to build the browser on 10.6.
Firefox Tip 2 is a simple optimized browser build without cross compilation.

I don't have the time right now to measure every detail again why we are so slow but I can only see a reduction of the GC pause time by avoiding work and not adding further complexity.

Comment 91

7 years ago
(In reply to comment #90)
> Well I want to come back to the starting point of this bug where we found out
> that for certain webpages or benchmark sweeping really kills us. 
> 
> Lets take the numbers from bug 539532 and 556133 [in cycles 1E6] for OS X:
> 
>                     mark       sweep       total
>   Firefox Tip 1:      25         110         135
>   Firefox Tip 2:      25         210         235
>   WebKit:              7           0           7
> 
> Firefox Tip 1 is cross-compiled with all the flags that were needed a few
> months ago to build the browser on 10.6.
> Firefox Tip 2 is a simple optimized browser build without cross compilation.

Could you clarify what is the difference between tip1 and tip2? Is it different OS or compiler flags? 

> 
> I don't have the time right now to measure every detail again why we are so
> slow but I can only see a reduction of the GC pause time by avoiding work and
> not adding further complexity.

I agree with that. But if the pause time is platform-dependent, then putting more platform-dependent code would just increase the complexity. That is why I would prefer first try platform neutral optimization like using bigger chunk size, better layout of data structures etc only then consider platform-specific tuning.
(Assignee)

Comment 92

7 years ago
> Could you clarify what is the difference between tip1 and tip2? Is it different
> OS or compiler flags? 

Tip1 and 2 were compiled with OS X 10.6, 2x2.66GHz Dual-Core Intel Xeon MacPro.
Tip 1 was compiled with following compiler flags in the .mozconfig:

CC="gcc-4.2 -arch i386"
CXX="g++-4.2 -arch i386"
ac_add_options --target=i386-apple-darwin8.0.0
ac_add_options --with-macos-sdk=/Developer/SDKs/MacOSX10.5.sdk
ac_add_options --enable-macos-target=10.5
ac_add_options --disable-crashreporter

HOST_CC="gcc-4.2"
HOST_CXX="g++-4.2"
RANLIB=ranlib
AR=ar
AS=$CC
LD=ld
STRIP="strip -x -S"
CROSS_COMPILE=1

For Tip 2 I removed everything and only have a disable-debug, enable-optimize and enable-gctimer.
(Assignee)

Comment 93

7 years ago
> 
> I agree with that. But if the pause time is platform-dependent, then putting
> more platform-dependent code would just increase the complexity. That is why I
> would prefer first try platform neutral optimization like using bigger chunk
> size, better layout of data structures etc only then consider platform-specific
> tuning.

I don't think we need platform specific code here. We are talking about 30% GC pause time reduction in the browser versus 1% slowdown on Sunspider for certain Linux distributions in the shell. I can't even see a slowdown for SS on my Ubuntu linux machine. 

With patch:  Total:                  715.9ms +/- 0.3%
Tip:         Total:                  715.8ms +/- 0.4%

I am not a big fan of optimizations where we give back an reallocate memory right away rather than reusing memory. I would love to see numbers from mobile devices because that sounds like a major bottleneck.

Comment 94

7 years ago
(In reply to comment #93)
> I don't think we need platform specific code here.

If we would implement a common pool for jemalloc and the GC, there would be inevitably platform dependent code in the engine even just to disable internal chunk pooling on Linux and Windows. Given this and that shared pool for GC and jemalloc would need to be implemented outside the engine, I would prefer to put GC-only pooling in the same place.

Comment 95

7 years ago
(In reply to comment #93)
> 
> I don't think we need platform specific code here. We are talking about 30% GC
> pause time reduction in the browser versus 1% slowdown on Sunspider for certain
> Linux distributions in the shell. I can't even see a slowdown for SS on my
> Ubuntu linux machine. 

This bug seems stuck, but I can't see why. Gregor has provided a lot of data. What are the concrete steps we need to take to complete this work?

Updated

7 years ago
Blocks: 557538

Comment 96

7 years ago
(In reply to comment #95)
> This bug seems stuck, but I can't see why. Gregor has provided a lot of data.
> What are the concrete steps we need to take to complete this work?

This is my bad, but I have not realized until late in the game that using pooling of GC chunks should be done outside the engine as different platforms requires different strategies. 

For example, when using jemalloc we should use a common pool of GC chunks and jemalloc chunks for optimum performance. Another observation is that with own chunk pooling we should provide a way to release the allocated chunks on the memory pressure, like real OOM especially in view of infallible malloc. Custom chunk allocation would provide just that.

So as a way forward I would like first to implement the bug 557538 and move the suggested pooling outside the engine while properly integrating it with infallible malloc.

Comment 97

7 years ago
(In reply to comment #96)
> 
> So as a way forward I would like first to implement the bug 557538

How complex is that bug? Do you have an estimate of when you'll get to it?

Comment 98

7 years ago
(In reply to comment #97)
> (In reply to comment #96)
> > 
> > So as a way forward I would like first to implement the bug 557538
> 
> How complex is that bug? Do you have an estimate of when you'll get to it?

I have a patch but that is build on top of the bug 553812. I guess I can rebase it and ask for an review tomorrow.
(Assignee)

Comment 99

7 years ago
(In reply to comment #98)
> (In reply to comment #97)
> > (In reply to comment #96)
> > > 
> > > So as a way forward I would like first to implement the bug 557538
> > 
> > How complex is that bug? Do you have an estimate of when you'll get to it?
> 
> I have a patch but that is build on top of the bug 553812. I guess I can rebase
> it and ask for an review tomorrow.

How are these two bugs related? How does the deallocation and reallocation of chunks for the Mac browser get optimized? Is the idea to reimplement this idea with delayed chunk returns in an outside defined callback?

Comment 100

7 years ago
(In reply to comment #99)
> How are these two bugs related? How does the deallocation and reallocation of
> chunks for the Mac browser get optimized? Is the idea to reimplement this idea
> with delayed chunk returns in an outside defined callback?

The current patch needs integration with the allocation code to release pooled GC chunks on memory pressure especially given that the browser has switched to use malloc that never returns null. One or another way this has to be implemented and implementing the chunk management together with the code that implements infallible malloc makes more sense.

So yes, I would like to move the the pooling outside the engine via a callback.
Does it get harder to do that if this patch has landed?  It seems like it will change our GC timing quite a bit, in pleasant ways, and that will let us start to get data on the next hurdles to tackle.  If landing this patch doesn't represent a regression for the browser, and it doesn't make moving to a callback model massively more difficult, I would be very much in favour of landing it, getting the baseline win, and gathering more data while the refactoring and jemalloc chunk-sharing patches wend their way through the system.

We have a patch here with a lot of good data, and it seems like we've stalled on it mostly because we think there's something even *better* in the future.  I think we should try to avoid that, especially where we're talking about a major win on a major pain point!
(Assignee)

Comment 102

7 years ago
Created attachment 437774 [details] [diff] [review]
patch

bugfix.
Attachment #423842 - Attachment is obsolete: true
Attachment #424382 - Attachment is obsolete: true
Attachment #437774 - Flags: review?(igor)
Attachment #423842 - Flags: review?(igor)

Comment 103

7 years ago
(In reply to comment #101)
> Does it get harder to do that if this patch has landed?  It seems like it will
> change our GC timing quite a bit, in pleasant ways, 

It helps definitely MAC OSX, but on other platforms this is less clear. Theoretically with fast enough mmap/munmap the patch can even harm things since the system cannot use the physical memory taken by chunks (which will be cached after the GC) and use it for malloc chunks. But OK, lets proceed with this and do alternative solutions later.

Comment 104

7 years ago
(In reply to comment #102)
> Created an attachment (id=437774) [details]
> patch

We need to add two things here. A public API like JS_ClearCaches that would for a start clear pooled chunks and a preprocessor define to switch on the pooling that for now will be always true. The later will help to move the pooling outside the engine later. I will promptly review the patch with this changes.
Why not put the ifdef in later when we have the patch for the pooling, and the public API in when we have a consumer ready for it?  Again, I don't think we need those things for this to avoid being a regression, so why hold up the patch on it?

When we have something that is a win, and does not have regressions, we should get it *in*, not hold it up because we might want to do even more later.  It is no harder to add the #ifdef afterwards, nor to add the public API afterwards, and working with smaller patches reduces risk.  We really need to get out of this world where possible-future-addition-perfection keeps us from taking things that are, on their own, improvements.

Comment 106

7 years ago
I agree with shaver.

1.) We shouldn't add to the public API without a caller
2.) the #ifdef should be on the patch that uses it

Comment 107

7 years ago
Comment on attachment 437774 [details] [diff] [review]
patch

>+AddChunkToEmptyList(JSRuntime *rt, JSGCChunkInfo *ci)
>+{
>+    ci->next = rt->gcEmptyChunkList;
>+    rt->gcEmptyChunkList = ci;
>+}

In bug 553812 I have seen rather bad effects of using a list to store empty arenas as working over the list brings all cachelines and TLB entries pages containing the list link fields stored in the arenas. So to avoid that for GC chunks and for simpler code use js::vector to store empty chunks. 

This way working over the empty chunk list would touch only the vector of pointers itself. Any OOM failure when adding chunks to the vector should be treated as a signal to release the chunk immediately.
Attachment #437774 - Flags: review?(igor)

Comment 108

7 years ago
Igor, I would like to see profile data for #107. Especially after your change to use 1MB chunks, we have very few chunks (8-16 in a browser session, maybe). I have a hard time believing that can measurably impact TLB performance.

Lets optimize real performance problems (like the one the patch does), not imaginary ones. Even if this was a measurable problem (I am confident its not), lets take a follow-up patch. I think we have given Gregor enough run-around.
The data for comment 107 came from bug 550373 comment 18, a synthetic benchmark. Indeed we need to study real Firefox GC chunk populations including with hundreds of tabs open (8-16 in a browser session? not with that many tabs, I suspect -- but let's measure).

A bug over 100 comments with a patch that wins good performance, with synthetic benchmarks wanting a data structure change, suggests doing the data structure change in a followup bug. And doing the real Firefox population measurements.

Usually you can do such measurements by hand. For this and many other studies, we also have TestPilot. We should consider using it.

/be
(Assignee)

Comment 110

7 years ago
Created attachment 438668 [details] [diff] [review]
patch

new version due to changes in GC code.
Added vector for empty chunks.
Attachment #437774 - Attachment is obsolete: true

Comment 111

7 years ago
Comment on attachment 438668 [details] [diff] [review]
patch

>+    
>+    js::Vector<JSGCChunkInfo*, 0, js::SystemAllocPolicy> gcEmptyChunkVector;

Nit: I suggest gcEmptyChunks. We don't use Hungarian notation around here :)

>+static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
>+

Nit: I don't much like "survives". GC_EMPTY_CHUNK_RESERVE = 3 maybe? Its also overloaded. RESERVOIR maybe? Pick something.

>+    size_t          gcSurvived;

Nit: Same here. Name sucks. I suggest something more expressive.

>     JSGCChunkInfo *ci = rt->gcChunkList;
>+
>+    if (!ci && !rt->gcEmptyChunkVector.empty()) {
>+        ci = rt->gcEmptyChunkVector.back();
>+        rt->gcEmptyChunkVector.popBack();
>+        JS_ASSERT(ci);
>+        ci->gcSurvived = 0;
>+        ci->init(rt);
>+    }
>+
>     jsuword chunk;
>     if (!ci) {

Merge these conditions somehow. This seems redundant if you hit the fast path above.

>         if (ci->numFreeArenas == GC_ARENAS_PER_CHUNK) {
>             ci->removeFromList(rt);
>-            ci->next = emptyChunkList;
>-            emptyChunkList = ci;
>+            rt->gcEmptyChunkVector.append(ci);

Not setting gcSurvived = 0 here?

Comment 112

7 years ago
RESERVOIR comment above is nonsense of course. So looks like thats an age counter? Or generations?
(Assignee)

Comment 113

7 years ago
Created attachment 438673 [details] [diff] [review]
patch

Age is better. Now with append failure handling.
Attachment #438668 - Attachment is obsolete: true

Comment 114

7 years ago
(In reply to comment #113)
> Igor, I would like to see profile data for #107. Especially after your change
> to use 1MB chunks, we have very few chunks (8-16 in a browser session, maybe).

I was referring to pre-big chunk state of things.

> I have a hard time believing that can measurably impact TLB performance.

There is code simplicity argument also. Vector is just more suitable data structure for doing appending/bulk removal than the list. I should have seen that long time ago. See the bug 559141 which eliminates the doubly-linked list of chunks based on your suggestion. As another proof consider that doing the delayed chunk release on top of that patch would mean adding few lines of code to chunk scanning loop without the need to add any helper methods.

Comment 115

7 years ago
Oh I completely agree Vector is much better and Gregor is adding that. I am just not buying the TLB argument, and that it could be done as a separate patch (will be one patch now though).

Comment 116

7 years ago
(In reply to comment #115)
> I am
> just not buying the TLB argument

With big chunks those arguments are moot indeed. But with smaller ones it is not that straightforward. For example, for smaller chunks the cachegring has shown that during double finalization walking over the list of empty arenas to see if the GC can release the chunks that contains those arenas was the second biggest source of L1/L2 cache misses during the GC (the biggest source of the misses came from arena->hasMakedDoubles checks). Clearly those misses were also TLB msses since the GC arena takes the whole CPU page.
(Assignee)

Comment 117

7 years ago
Created attachment 438919 [details] [diff] [review]
patch
Attachment #438673 - Attachment is obsolete: true

Comment 118

7 years ago
Comment on attachment 438919 [details] [diff] [review]
patch

>+        if (!rt->gcEmptyChunks.empty()) {
>+            ci = rt->gcEmptyChunks.back();
>+            rt->gcEmptyChunks.popBack();
>+            JS_ASSERT(ci);
>+            chunk = ci->getChunk();

Get a reference straight to gcEmptyChunks if you talk to it repeatedly to avoid the rt-> dereference every time. Its also easier to read.

>+            ci->gcChunkAge = 0;
>+            if (!rt->gcEmptyChunks.append(ci)) {
>+                jsuword chunk = ci->getChunk();
>+                PutGCChunk(rt, (void *)chunk);
>+            }

PutGCChunk is the worst name in the world. Please make that something else. Free or release. Anything.

>+void DestroyEmptyGCChunks(JSRuntime *rt, bool releaseAll) 
>+{
>+    size_t newLength = 0;
>+    JSGCChunkInfo **array = rt->gcEmptyChunks.begin();
>+    size_t length = rt->gcEmptyChunks.length();
>+    
>+    for(size_t i = 0; i < length; ++i){
>+        JSGCChunkInfo *ci = array[i];

The proper pattern is probably

for (JSGCChunkInfo *ci = ->begin(); ci != ->end(); ++ci)

   ^ also space here

>+             ci->gcChunkAge++;
>+             array[newLength++] = ci;

You can keep this one I guess.

r=me with nits picked at your leisure.
Attachment #438919 - Flags: review+
(Assignee)

Comment 119

7 years ago
Created attachment 438929 [details] [diff] [review]
patch
Attachment #438919 - Attachment is obsolete: true
(Assignee)

Comment 120

7 years ago
Created attachment 438930 [details] [diff] [review]
patch

white space fix
Attachment #438929 - Attachment is obsolete: true
(Assignee)

Comment 121

7 years ago
http://hg.mozilla.org/tracemonkey/rev/15955e3513e0 r=gal
(Assignee)

Updated

7 years ago
Whiteboard: fixed-in-tracemonkey

Comment 122

7 years ago
http://hg.mozilla.org/mozilla-central/rev/15955e3513e0
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Depends on: 604676
(Assignee)

Updated

7 years ago
No longer depends on: 604676
You need to log in before you can comment on or make changes to this bug.