Closed Bug 679887 Opened 13 years ago Closed 13 years ago

TI: GC pause regression for many tabs

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: gwagner, Unassigned)

Details

Attachments

(6 files, 1 obsolete file)

One thing I noticed when comparing MC and TI is an average GC pause time regression when all 150 tabs are loaded. I was running my script 4 times with MC and TI and this behavior was always there.

With MC I see that a global GC takes around 700ms - 740ms one example: 717.0ms total (63ms waiting,  533ms marking,   100ms sweeping)

With TI this is between 860 and 900ms. One example: 871.1 total (45.5 waiting,  638.1 marking,  172.1 sweeping)

Brian did you look at your GCTimer output?

I also compared the GCTimer output for the V8 benchmarks. They look much better now. The background finalization regression is gone now!
I still see a marking time regression (~20%) for the splay benchmark. However the overall score for splay increased.
The mark time regression is weird, this may in part be fallout from me trying to use a single TypeObject::trace instead of splitting out into MarkTypeObject and ScanTypeObject versions.

The sweep regression is harder, and is I think mostly due to the need to scan type object properties and observed types in scripts for the objects they contain, which are all weak references.  The best thing for this I think is to wait for incremental GC, which I hope --- I need to understand incremental GC better --- can be used to reduce type sweeping in the same way as normal sweeping.
(In reply to Brian Hackett from comment #1)
> The mark time regression is weird, this may in part be fallout from me
> trying to use a single TypeObject::trace instead of splitting out into
> MarkTypeObject and ScanTypeObject versions.
> 
> The sweep regression is harder, and is I think mostly due to the need to
> scan type object properties and observed types in scripts for the objects
> they contain, which are all weak references.  The best thing for this I
> think is to wait for incremental GC, which I hope --- I need to understand
> incremental GC better --- can be used to reduce type sweeping in the same
> way as normal sweeping.

Making such big changes will always result in some kind of tradeoff between big wins and small regressions. I guess Bill should comment on the incremental GC question. It would be good to know what the critical parameters are for incremental GC. Is it marking, sweeping or maybe even the thread-synchronization time that is also remarkable here with 40ms?

BTW: The overall V8 improvements are great but I also realized that some v8 benchmarks in the browser get a lower score (raytrace and regexp?). I don't see a GC time regression there. Do you have an explanation for that?
(In reply to Gregor Wagner from comment #2)
> BTW: The overall V8 improvements are great but I also realized that some v8
> benchmarks in the browser get a lower score (raytrace and regexp?). I don't
> see a GC time regression there. Do you have an explanation for that?

There are patches in other bugs that should help with raytrace (bug 638794) and regexp (bug 663138), but haven't landed yet.  The general problem for these two is that for operations which aren't completely JIT'ed or are JIT'ed with imprecise information, we incur some overhead to make sure we preserve correct type information for the heap, and also aren't able to leverage the type info to generate faster code.  By improving precision and/or code generation we can solve both these problems.
Incremental GC won't help sweeping times at all; right now only marking is done incrementally. That could change, of course. Many GCs do incremental sweeping. But since we already have background finalization, I don't see an obvious need. The only reason to do incremental sweeping would be if there's some finalizer that takes a long time and must be done on the main thread.
Brian any comments on this bug? Do you want to merge with this GC regression?
I just created a script that opens 50 tabs and remeasured with a new jaeger build.
http://gregor-wagner.com/tmp/mem50

For 50 tabs of the most popular websites I still get a 25% pause time regression.
It seems to me OK to merge with this regression.  Most of the regression is during marking, which will be going away soon with incremental GC.  I'm hoping to investigate the mark regression this week, but I'm not sure if I'll be able to.
(In reply to Brian Hackett from comment #6)
> It seems to me OK to merge with this regression.  Most of the regression is
> during marking, which will be going away soon with incremental GC.  I'm
> hoping to investigate the mark regression this week, but I'm not sure if
> I'll be able to.

Is incremental GC also in the same release version?

I am a little bit worried because this is a regression for the main GC pause time and not only on the background thread. Small devices like netbooks usually suffer much more if we regress sweeping. We should check how it behaves there.
It is not ok to merge with a major GC regression. We need a bandaid, or we have to wait until incremental GC is ready.
This may be news to some, but we have zero regression tolerance as a rule. Exceptions may be when the compensating fix is coming within 24 hours, for sure -- or else the regressing change will be backed out.

Incremental GC doesn't fit that exception model.

/be
Keep in mind we will have the TI-free branch. If it's a regression that most users won't see (so far, it has been described only with >50 tabs open), it seems OK to land, fix on trunk, and revert if needed.

But we at least need more information before landing. Does this bite in more normal scenarios, as well? What do we see on FoTN and things like that? How fixable is it? And we need to know more about the sweep regression--incremental GC won't help there.
(In reply to David Mandelin from comment #10)
> Keep in mind we will have the TI-free branch. If it's a regression that most
> users won't see (so far, it has been described only with >50 tabs open), it
> seems OK to land, fix on trunk, and revert if needed.
> 
> But we at least need more information before landing. Does this bite in more
> normal scenarios, as well? What do we see on FoTN and things like that? How
> fixable is it? And we need to know more about the sweep
> regression--incremental GC won't help there.

I don't think the regression is related to the number of tabs. The problem is that we don't really have a reliable way to measure real web workload. I looked at the beginning of the GCTimer output for opening 50 tabs and even the first GCs show a regression. This means even with 3-4 tabs I see the slowdown.

MC:
     AppTime, Total
           0,   13.1,
        8997,   12.6,
       13958,   20.6,
       17986,   42.1,
       22061,   79.1,
       26144,   91.9,

TI:
     AppTime, Total
           0,   15.9,   
        8239,   26.3,
       12361,   54.7,
       16640,  102.5,
       20865,  130.8,
       24999,  117.6,
Attached patch patch (obsolete) — Splinter Review
Patch reducing mark overhead.
The above patch does a few things to reduce the mark regression: it adds a mark stack for type objects, doesn't collect lazy type objects on GC (this costs memory, and will need some revision so that we only occasionally collect lazy type objects), and tweaks ScanObject and JSObject::scanSlots to help GCC generate better code.  It needs a bit more work to be correct, but no more logic should be needed on GC hot paths.

With this applied, I tried out a 20 tab workload using Gregor's script and also the raytrace demo at http://29a.ch/2010/6/2/realtime-raytracing-in-javascript that allocates tons of short lived objects and is very pause heavy.  I went with the smaller tab workload, and haven't tried FOTN yet, because I am on a slow connection.  I imagine numbers will scale linearly with the number of tabs.  Will maybe check out FOTN tomorrow.

--- 20 tab workload

Methodology: wait until all 20 tabs had finished loading, opened about:memory?verbose, and hit the GC button 5 times.  The attachment has the times for these last five GCs along with the memory output.

mozilla-central average: 99ms
type-inference average: 111ms

The pause time regression here is now 12%.  Sweeps take 7ms longer on TI.  Also, despite m-c's larger 'explicit' the GC heaps are about the same size (the difference is less mjit code with TI), so these are I think valid to compare directly.

I think that comparing GC pause times by doing global collections is pretty lousy for benchmarking (contrast with memory usage measurements, where many-tab workloads are the best way).  These global GCs should be happening mainly when the browser is idle and the user doesn't notice the pauses (if they aren't, that's a bug in the GC design and not a TI issue).  If the user doesn't notice these pauses, why do we care about them?

To that end I tried the raytrace demo above.

--- raytrace demo

Methodology: view the demo with medium detail for a minute or so, get the times for the last five GCs (see attachment).  I get 20fps on m-c and 28fps on TI.

mozilla-central average: 120.8ms
type-inference average: 124.4ms

The pauses take 3% longer on TI, and sweeps take .4ms longer.  Before writing this patch, the TI pauses were about 20% longer, so this patch does in fact do something.

---

Sweep time hasn't changed, though I get a somewhat smaller sweep regression than in comment 0 (extrapolating to 150 tabs it would be about 50ms).  The extra time taken during sweeping is for removing weak references from type sets in type objects and scripts.  There isn't much I think can be done directly to speed this up, but it can be avoided entirely if all scripts and type objects are eagerly marked by the GC (this could of course only be done during certain GCs).  This also allows keeping analysis information and mjit code around.

It would be pretty simple to add this (we already have to do this for GCs which are triggered during analysis/compilation), but I don't know yet if it makes sense to do so, in lieu of the above comments on global GCs.  The raytrace user-visible-pause GCs are per-compartment collections, and when the GC only needs to look at one compartment the type-sweeping overhead should only be 0-2 ms.
Methodology: start the browser (with above patch), go to the FOTN site, watch it and quit once the credits start to roll.  Attached are the last 30 GCs for m-c and TI.  Averaging the first five GCs in this set (the last ones are much shorter, as the browser was shutting down), gives me:

mozilla-central: 104.5ms
type inference: 103.6ms

Differences between the two seem well within the noise for both mark and sweep.
Comment on attachment 556056 [details] [diff] [review]
patch v2

Review of attachment 556056 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks. The results look great.

We talked on IRC about cleaning up JSCompartment::markTypes. Other than that, just some small stuff.

::: js/src/jsgcmark.cpp
@@ +212,5 @@
> +    if (type == &types::emptyTypeObject)
> +        return;
> +    Mark(trc, type);
> +
> +    if (IS_GC_MARKING_TRACER(trc)) {

Why does this only happen when it's a GC tracer?

@@ +675,3 @@
>      Class *clasp = obj->getClass();
>      if (clasp->trace) {
> +        if (clasp == &js_ArrayClass) {

It seems like you could still use isDenseArray here. Is there a reason not to?

@@ +750,5 @@
>      /* If obj has no map, it must be a newborn. */
>      if (obj->isNewborn())
>          return;
>  
> +    MarkTypeObject(trc, obj->gctype(), "type");

I don't like the name gctype. It sounds like it's going to return the GC think kind or something. Could you rename it to something like typeFromGC()? Or else make MarkChildren be a friend of JSObject?
Attachment #556056 - Flags: review?(wmccloskey) → review+
Attached patch followupSplinter Review
Followup adding a template method for iterating the arenas and cells of a specific kind in a compartment.  Places where we iterate over such arenas/cells are replaced with uses of this method.  This differs from IterateCompartmentsArenasCells in that it has different locking requirements and using a template op rather than a function pointer can be inlined by the compiler, avoiding an indirect dispatch.

The IS_GC_MARKING_TRACER test avoids double marking these members for the different control flow path which marking vs. callback tracers take, comment added.

The js_ArrayClass thing may be overwrought, the idea is that compiled naively isDenseArray will need an extra dereference since the clasp is already held in a local variable.  Any reasonable compiler should be able to CSE the two heap accesses, but compiler optimizations involving the heap seem like they can be kind of touch and go, and every dereference counts when processing these mark stacks.
Attachment #556104 - Flags: review?(wmccloskey)
Comment on attachment 556104 [details] [diff] [review]
followup

Review of attachment 556104 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Brian. This is a really nice cleanup.

::: js/src/jsgcinlines.h
@@ +224,5 @@
>  }
>  
> +/*
> + * Invoke ArenaOp and CellOp on every arena and cell in a compartment which
> + * have the specified thing kind.

The wording is a bit contorted here. Maybe "In a given compartment, invoke ArenaOp and CellOp on every arena and cell that have the specified thing kind."?
Attachment #556104 - Flags: review?(wmccloskey) → review+
I don't think we can afford a zero-regression policy. JS performance is now many-dimensional, and it's not practical to insist that no dimension ever regresses, even a little bit. We've done this before: the JM landing regressed some test cases and some benchmark subscores, mostly because it was not possible to make the profiler decide perfectly which JIT to use. But that landing was definitely a step forward.

In this case, with TI, we get 10%+ improvement on V8 and a big improvement on Kraken. That's a nice interim gain while we're waiting for IonMonkey. Also, JM+TI will be the fastest JS engine of all on various interesting microbenchmarks and example real programs. And we will get more data sooner on TI in the real world to use in the future IM+TI. I think that's worth a small-moderate regression on some "unofficial" GC benchmarks. GC *will* improve dramatically as we go forward.

A significant regression in user experience would not be OK.
Has the patch landed?
Yes, this is on trunk.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: