Closed Bug 548388 Opened 11 years ago Closed 11 years ago

GC Benchmark Suite


(Core :: JavaScript Engine, defect)

(Reporter: sayrer, Assigned: gwagner)


(Blocks 1 open bug)


(Whiteboard: fixed-in-tracemonkey)


(7 files, 4 obsolete files)

We have a bunch of scattered GC benchmarks. Let's consolidate them in one suite.
Assignee: general → gwagner
Assignee: gwagner → anygregor
For motivation, a short excerpt from "JSMeter: Characterizing Real-World Behavior of JavaScript Programs":
Real applications allocate a significant amount of memory, ranging from one to almost twenty megabytes of data, in the relatively short interactions we had with them. As with bytecode execution behavior, google is the most lean of the applications, while bingmap and amazon allocate the most data. Of the real applications that have the most application-like characteristics, bingmap, facebook, gmail, and googlemap, we see that allocating megabytes of data in a short period of time is common.

Benchmarks: The overall allocation of the benchmark programs is highly variable, with many benchmarks hardly allocating any data at all (e.g., richards, deltablue, controlflow, math-cordic, etc.) and others allocating ten or more megabytes (e.g., earley, splay, and regexp). Only six of the benchmarks allocate more data than google, the real application that allocates the least data. The SunSpider benchmarks, in particular, have total allocation behavior that is highly unrepresentative of the real applications, and as a result, performance comparisons based on them will be highly skewed to the performance of code execution without regard to the efficiency of the object representation or memory management.

Lessons: One conclusion that can be reached across both the real applications and the benchmarks is that the only object types of significance are script functions, strings, arrays, and objects. The other types rarely, if ever, contribute substantively to the overall memory allocation of the applications.
Another conclusion we can reach from the real web applications is that many make substantial use of all four major data types, with the mix of types varying between the applications.
Some other points to consider from the paper:

Live Heap Content:
The real applications allocate a diverse collection of strings, functions, objects and arrays with strings being the most short-lived and functions being the most long-lived.
Some real applications have short-lived heaps that are destroyed when one page is unloaded and regenerated when a new page is loaded.
Live heap contents in the benchmarks do not reflect real applications.

Object Allocation Discussion:
• First, we observed that the mix of types allocated by the real applications is much different from that of most of the benchmarks, containing a large quantity of script functions and strings. Objects are less frequently allocated in the real applications, and the lifetime of objects is considerably longer than that of strings in many cases.
• Second, our analysis of the contents of the live heaps suggests that current web applications fall into two categories: those with page transitions that clear the JavaScript heap, and those that do not. In applications that do not have many page transitions, such as gmail, we observe that arrays and objects are relatively long-lived compared to strings. Of applications with many page transitions, such as amazon, by definition almost all objects are short-lived. Such sites do not require sophisticated memory management and would benefit most from a very fast and simple allocator. Being able to predict what class a site falls into and using an appropriate allocator might have performance benefits.
• Finally, in considering object lifetimes, we see that strings are by far the shortest lived types in JavaScript and that functions are commonly long-lived. Except for earley and splay, object lifetimes in the V8 and SunSpider benchmarks are extremely short-lived, suggesting that performance results of these benchmarks will not reliably reflect the effectiveness of the JavaScript engine’s memory management implementation. Even in earley, object lifetimes are significantly shorter than is observed in many of the real web applications, while in splay objects are almost never freed.
Back to this bug...

I want to start with some basic benchmarks that measure a single GC functionality.
- mark performance
- sweep performance
- finalize objects (with and without dslots)
- allocation performance
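Such micro-benchmarks would be plain JS shell scripts. As an illustration (a hypothetical sketch, not one of the attached files), an allocation micro-benchmark can be as simple as timing a loop of short-lived object allocations:

```javascript
// Hypothetical sketch of an allocation micro-benchmark: allocate n
// short-lived objects and report the elapsed wall-clock time. Repeated
// runs expose allocation cost plus the GC pauses it triggers.
function allocBench(n) {
  var start = Date.now();
  for (var i = 0; i < n; i++) {
    var o = { x: i };  // dead on the next iteration, so pure GC load
  }
  return Date.now() - start;  // elapsed milliseconds
}

var elapsed = allocBench(1e5);
```

The harness would run this under varying heap sizes and record per-phase GC times rather than just the total.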

Later I want to add real-workload benchmarks and generational stuff, including benchmarks that simulate page transition and recreation.
Igor posted this benchmark in another bug and I used it for measuring the marking performance.
Measures the time to create, traverse and free a big object graph.
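Igor's actual attachment isn't reproduced here, but the create/traverse/free pattern such a benchmark uses might be sketched like this (all names hypothetical):

```javascript
// Build a large linked structure: the traversal touches every node
// (approximating mark load), and dropping the only root makes the whole
// graph garbage so the next GC must sweep it all.
function buildGraph(n) {
  var head = null;
  for (var i = 0; i < n; i++)
    head = { next: head, payload: i };
  return head;
}

function traverse(node) {
  var count = 0;
  while (node) { count++; node = node.next; }
  return count;
}

var root = buildGraph(1e5);
var visited = traverse(root);
root = null;  // graph is now unreachable; the next GC frees it
```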
Attached file DSlot Benchmark
This benchmark measures the overhead of allocating and deallocating objects with and without dslots. 1e5 objects are created and 0-5 properties are set.
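The shape of that benchmark can be sketched as follows (a hypothetical reconstruction, not the attachment itself): in the SpiderMonkey of that era, properties beyond the inline slots forced a separate dslots allocation, so finalization cost grows with the property count.

```javascript
// Create n objects, each with the given number of properties, so that
// property counts above the inline-slot capacity exercise the dslots
// allocation and finalization paths.
function makeObjects(n, props) {
  var objs = new Array(n);
  for (var i = 0; i < n; i++) {
    var o = {};
    for (var p = 0; p < props; p++)
      o["p" + p] = p;
    objs[i] = o;
  }
  return objs;
}

for (var props = 0; props <= 5; props++) {
  var objs = makeObjects(1e5, props);
  objs = null;  // make the whole batch collectable before the next round
}
```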
Attached image Dslot GC Graph
The GC Graph shows the overhead of dslots deallocation and that the GC pause is mainly caused by the object finalization. 
The finalization of objects with only 1-3 slots set is still very expensive.
I just saw that the clock benchmark is no longer online. Does anybody know who wrote this benchmark? The URL was:
Attached file Clock GC benchmark
(In reply to comment #7)
> I just saw that the clock benchmark is no longer online. Does anybody know who
> wrote this benchmark? The URL was:

Heh, turns out I wrote it. I put it in the attachment.
Attachment #434576 - Attachment mime type: text/plain → text/html
Attached image Clock GC Graph Tip
The internals of the GC for the Clock benchmark. The GC pause is dominated by Object finalization and Chunk destruction.
I tried to get an idea of the scalability of the current GC approach.
I opened ca. 30 tabs with popular websites, including benchmark pages with high allocation throughput.
What I see in the graph is that long-lived objects combined with high throughput lead to a GC pause-time explosion.
A generational GC looks like a good solution for this problem.

Furthermore, object finalization is also a very significant factor.
I know it's not that easy, but others have solved it with lazy finalization.

Looks like an optimal GC for the web needs these two features.
Attached patch first draft (obsolete) — Splinter Review
A first try at how it might look...
I am still not sure how the output should look.
What are meaningful things to measure? Right now I calculate the max and mean of the total, marking, and sweep time.

A sample output looks like:

Total max: '69.275696'
Total mean: '64.365382'
Mark max: '0.832184'
Mark mean: '0.711289'
Sweep max: '68.342440'
Sweep mean: '63.494899'
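The statistics themselves are simple: for each phase, take the max and mean over the per-GC pause samples. A sketch (function name hypothetical, assuming non-negative pause times):

```javascript
// Reduce a list of per-GC pause samples (in ms) for one phase to the
// two numbers the harness reports: the worst pause and the average.
function stats(samples) {
  var max = 0, sum = 0;  // pauses are non-negative, so 0 is a safe floor
  for (var i = 0; i < samples.length; i++) {
    if (samples[i] > max) max = samples[i];
    sum += samples[i];
  }
  return { max: max, mean: sum / samples.length };
}
```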

I still have to rewrite the benchmark files, and yeah, it's my first Python code...
Attached patch update (obsolete) — Splinter Review
now with JSON output.
Attachment #442507 - Attachment is obsolete: true
A sample output looks like:
   clock.js: {"Total max": 69.5, "Total mean": 65.1, "Mark max":  1.0, "Mark mean":  0.7, "Sweep max": 68.4, "Sweep mean": 64.1}
  dslots.js: {"Total max": 61.9, "Total mean": 32.1, "Mark max":  0.1, "Mark mean":  0.1, "Sweep max": 61.6, "Sweep mean": 31.8}
   empty.js: {"Total max":  0.5, "Total mean":  0.2, "Mark max":  0.1, "Mark mean":  0.0, "Sweep max":  0.1, "Sweep mean":  0.1}
objGraph.js: {"Total max": 60.0, "Total mean": 32.9, "Mark max": 46.8, "Mark mean": 14.9, "Sweep max": 59.7, "Sweep mean": 17.8}
Attached patch update (obsolete) — Splinter Review
now with comparison mode...

                      loops.js: faster:  48.03 < baseline  48.30 ( -0.56%)
                   objGraph.js: faster:  32.01 < baseline  32.10 ( -0.29%)
                      clock.js: faster:  64.84 < baseline  65.00 ( -0.25%)
                     dslots.js: SLOWER:  32.01 > baseline  32.00 ( +0.04%)
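The percentages in comparison mode reduce to a relative difference against the baseline mean, e.g. for loops.js: (48.03 - 48.30) / 48.30 * 100 ≈ -0.56%. A sketch (function name hypothetical):

```javascript
// Percent change of a run's mean pause time versus the baseline;
// negative means the run was faster than the baseline.
function percentChange(current, baseline) {
  return (current - baseline) / baseline * 100;
}
```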
Attachment #442785 - Attachment is obsolete: true
Attached patch update (obsolete) — Splinter Review
The basic framework. More benchmarks to come.
Attachment #442864 - Attachment is obsolete: true
Attachment #443271 - Flags: review?(jorendorff)
jorendorff: any comments? Is that what you expected, or should it look completely different?
I might have to focus more on micro-benchmarks that measure just one thing, like marking time or sweep time, because I calculate based on the average GC pause time.
I am also working on getting real web examples. I'm trying to get a tool that can reproduce the JS stuff offline.
Also, the average GC pause time might not be the best thing to measure, but the longest pause time is too noisy and the shortest pause time is mostly just an "empty" GC.
Attachment #443271 - Flags: review?(jorendorff) → review+
dmandelin wrote in an email:
Since these are measurements, not tests, I would suggest something like
js/src/metrics/gc. Hopefully we can grow more types of measurements. We
could also consider moving the 't' and 'v8' directories there.

I will move and push it today.
Attached patch update — Splinter Review
moved suite into metrics/gc.
Attachment #443271 - Attachment is obsolete: true
Whiteboard: fixed-in-tracemonkey
Closed: 11 years ago
Resolution: --- → FIXED