Last Comment Bug 551598 - Investigate and (try to) improve GC performance.
: Investigate and (try to) improve GC performance.
Status: NEW
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- enhancement with 1 vote (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-03-10 16:38 PST by Jacob Bramley [:jbramley]
Modified: 2014-07-24 11:07 PDT (History)
12 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Instrumentation (12.96 KB, patch)
2010-03-10 16:59 PST, Jacob Bramley [:jbramley]
no flags Details | Diff | Review

Description Jacob Bramley [:jbramley] 2010-03-10 16:38:58 PST
By hacking in profiling code using gettimeofday, I can get a break-down of the performance of calls to js_GC(). The following numbers came from a trivial JavaScript loop allocating (and then forgetting) Objects. They were run on an ARM platform:

js_GC() {                                                              
   Time since last invocation: 4642215 us                             
   Time until mark phase: 233 us                                      
   Mark phase time: 506 us                                            
     js_SweepAtomState: 76 us                                         
     CloseNativeIterators: 7 us                                       
     js_SweepWatchPoints: 5 us                                        
         FinalizeArenaList lead-in: 4 us                              
         FinalizeArenaList big loop: 332316 us                        
           16379 iterations were executed.                            
       FinalizeArenaList<JSObject, FinalizeObject>: 332362 us         
         FinalizeArenaList lead-in: 5 us                              
         FinalizeArenaList big loop: 17 us                            
           3 iterations were executed.
       FinalizeArenaList<JSFunction, FinalizeFunction>: 29 us
       FinalizeArenaList<JSXML, FinalizeXML>: 4 us
         FinalizeArenaList lead-in: 4 us
         FinalizeArenaList big loop: 11 us
           1 iterations were executed.
       FinalizeArenaList<JSString, FinalizeString>: 333 us
       Loop: FinalizeArenaList<JSString, FinalizeExternalString>: 7 us
       Metering code (?): 6 us
     Misc arena-handling code: 332744 us
     js_SweepScopeProperties: 32 us
     js_SweepScriptFilenames: 7 us
     DestroyGCArenas: 34486 us
   Sweep phase time: 367388 us
   Post-GC clean-up: 7 us
}

The positions of the mark/sweep divisions (etc) are based on comments in the source (and obvious function calls). The output is awkward to read because it's nested and the name of the block as at the _end_ of the nest. It was a quick & dirty instrumentation hack, so I didn't bother too much about making it look pretty.

----

As Andreas mentioned, sweeping is clearly where we are weak. On x86, the sweep function doesn't stand out as much, but it's still by far the largest component in the profile:

js_GC() {                                                              
    Time since last invocation: 868579 us                              
    Time until mark phase: 44 us                                       
    Mark phase time: 59 us                                             
      js_SweepAtomState: 14 us                                         
      CloseNativeIterators: 2 us                                       
      js_SweepWatchPoints: 2 us                                        
          FinalizeArenaList lead-in: 2 us                              
          FinalizeArenaList big loop: 35948 us                         
            16379 iterations were executed.                            
        FinalizeArenaList<JSObject, FinalizeObject>: 35963 us          
          FinalizeArenaList lead-in: 2 us                              
          FinalizeArenaList big loop: 4 us                             
            3 iterations were executed.
        FinalizeArenaList<JSFunction, FinalizeFunction>: 8 us
        FinalizeArenaList<JSXML, FinalizeXML>: 2 us
          FinalizeArenaList lead-in: 2 us
          FinalizeArenaList big loop: 3 us
            1 iterations were executed.
        FinalizeArenaList<JSString, FinalizeString>: 81 us
        Loop: FinalizeArenaList<JSString, FinalizeExternalString>: 3 us
        Metering code (?): 2 us
      Misc arena-handling code: 36061 us
      js_SweepScopeProperties: 6 us
      js_SweepScriptFilenames: 3 us
      DestroyGCArenas: 11895 us
    Sweep phase time: 47990 us
    Post-GC clean-up: 3 us
}

----

This gives us a clear candidate for optimizations: FinalizeArenaList. DestroyGCArenas is probably worth looking at too.
Comment 1 Jacob Bramley [:jbramley] 2010-03-10 16:59:42 PST
Created attachment 431757 [details] [diff] [review]
Instrumentation

In the interest of reproducibility, here is the instrumentation I used (with a couple of extra things that I added after gathering the figures).
Comment 2 Boris Zbarsky [:bz] 2010-03-10 19:43:37 PST
Jacob, do you mind attaching the JS testcase you used as well?

I wonder how much of this time is calling FinalizeObject....
Comment 3 Jacob Bramley [:jbramley] 2010-03-11 15:37:06 PST
The test-case is trivial:

----

function f() {}

for (var i = 0; i < 10000000; i++) {
    var a = new String("I am a string.");
    var b = "I am also a string.";
    f(a,b);
}

----

It isn't clever, and certainly isn't representative, but the sweep phase is still too slow (at least on ARM).
Comment 4 Jacob Bramley [:jbramley] 2010-03-11 15:39:26 PST
(In reply to comment #2)
> I wonder how much of this time is calling FinalizeObject....

Actually, the instrumentation patch I posted reports this, though that's something I added after gathering the numbers. I think finalize() is a fairly hefty component.
Comment 5 Boris Zbarsky [:bz] 2010-03-11 20:33:19 PST
Hm.  So if shark's not lying to me, then a pretty large fraction of the time under FinalizeObject seems to be the !obj->map test (which also accounts for something like 10% of total L2 misses during GC as well).  The clasp->finalize test is the next hotspot there (1.5% of total L2 misses).  At least on Mac FinalizeObject ends up inlined into js_GC, so the timer percentages are of total self js_GC time, but those are 40% and 17% respectively.

5% of the js_GC time is the !nextFree check after setting nextFree to a->info.freeList in FinalizeArenaList (41% of the L2 misses are here(?)).

I wonder whether it's worth running under something like cachegrind to get a better idea of what's going on here in L2 terms, and whether we can issue some sort of prefetch instructions that would help....
Comment 6 Boris Zbarsky [:bz] 2010-03-11 20:33:56 PST
Oh, and the numbers in comment 5 are for an x86 with a 6MB L2; I can't imagine that the L2 miss situation is any better on ARM.
Comment 7 Igor Bukanov 2010-03-12 08:26:52 PST
(In reply to comment #5)
> Hm.  So if shark's not lying to me, then a pretty large fraction of the time
> under FinalizeObject seems to be the !obj->map test (which also accounts for
> something like 10% of total L2 misses during GC as well).

That we can fix if failed or dead objects would use a special non-null map.


>  The clasp->finalize
> test is the next hotspot there (1.5% of total L2 misses).

That can be also dealt with if would allocate objects with and without finalize hook from separated GC arenas. But that increases inevitably the heap fragmentation and the size of working sets.
Comment 8 Boris Zbarsky [:bz] 2010-03-12 08:41:42 PST
> That we can fix if failed or dead objects would use a special non-null map.

Maybe.  If the time is actually spent in an L2 miss fetching the bits of *obj, then it doesn't matter what the actual value of map is, I'd think.  Unless we think that an unconditional access would get reordered so as to start the fetch much sooner?
Comment 9 Igor Bukanov 2010-03-12 08:53:11 PST
(In reply to comment #0)
> By hacking in profiling code using gettimeofday, I can get a break-down of the
> performance of calls to js_GC(). The following numbers came from a trivial
> JavaScript loop allocating (and then forgetting) Objects.

This test is not what happen on a typical web site, see the reference in bug 548388. That paper shows that objects typically are long-lived. That, of cause, does not mean that we should not optimize their finalization, but it is important that any optimization in that area does not slow down, for example, the marking phase of the GC.
Comment 10 Benoit Jacob [:bjacob] (mostly away) 2010-10-05 08:55:46 PDT
Hi, just letting you know that I have filed a separate bug about a different instrumentation patch: bug 601949.

It does something different, and covers different use cases, than the patch discussed here. It's mostly for non-JS people wanting to know whether GC pauses are responsible for a performance/smoothness issue they are having.

Note You need to log in before you can comment on or make changes to this bug.