551598 - Investigate and (try to) improve GC performance.

Reporter

Description

•

15 years ago

By hacking in profiling code using gettimeofday, I can get a break-down of the performance of calls to js_GC(). The following numbers came from a trivial JavaScript loop allocating (and then forgetting) Objects. They were run on an ARM platform: js_GC() { Time since last invocation: 4642215 us Time until mark phase: 233 us Mark phase time: 506 us js_SweepAtomState: 76 us CloseNativeIterators: 7 us js_SweepWatchPoints: 5 us FinalizeArenaList lead-in: 4 us FinalizeArenaList big loop: 332316 us 16379 iterations were executed. FinalizeArenaList<JSObject, FinalizeObject>: 332362 us FinalizeArenaList lead-in: 5 us FinalizeArenaList big loop: 17 us 3 iterations were executed. FinalizeArenaList<JSFunction, FinalizeFunction>: 29 us FinalizeArenaList<JSXML, FinalizeXML>: 4 us FinalizeArenaList lead-in: 4 us FinalizeArenaList big loop: 11 us 1 iterations were executed. FinalizeArenaList<JSString, FinalizeString>: 333 us Loop: FinalizeArenaList<JSString, FinalizeExternalString>: 7 us Metering code (?): 6 us Misc arena-handling code: 332744 us js_SweepScopeProperties: 32 us js_SweepScriptFilenames: 7 us DestroyGCArenas: 34486 us Sweep phase time: 367388 us Post-GC clean-up: 7 us } The positions of the mark/sweep divisions (etc) are based on comments in the source (and obvious function calls). The output is awkward to read because it's nested and the name of the block as at the _end_ of the nest. It was a quick & dirty instrumentation hack, so I didn't bother too much about making it look pretty. ---- As Andreas mentioned, sweeping is clearly where we are weak. On x86, the sweep function doesn't stand out as much, but it's still by far the largest component in the profile: js_GC() { Time since last invocation: 868579 us Time until mark phase: 44 us Mark phase time: 59 us js_SweepAtomState: 14 us CloseNativeIterators: 2 us js_SweepWatchPoints: 2 us FinalizeArenaList lead-in: 2 us FinalizeArenaList big loop: 35948 us 16379 iterations were executed. FinalizeArenaList<JSObject, FinalizeObject>: 35963 us FinalizeArenaList lead-in: 2 us FinalizeArenaList big loop: 4 us 3 iterations were executed. FinalizeArenaList<JSFunction, FinalizeFunction>: 8 us FinalizeArenaList<JSXML, FinalizeXML>: 2 us FinalizeArenaList lead-in: 2 us FinalizeArenaList big loop: 3 us 1 iterations were executed. FinalizeArenaList<JSString, FinalizeString>: 81 us Loop: FinalizeArenaList<JSString, FinalizeExternalString>: 3 us Metering code (?): 2 us Misc arena-handling code: 36061 us js_SweepScopeProperties: 6 us js_SweepScriptFilenames: 3 us DestroyGCArenas: 11895 us Sweep phase time: 47990 us Post-GC clean-up: 3 us } ---- This gives us a clear candidate for optimizations: FinalizeArenaList. DestroyGCArenas is probably worth looking at too.

Jacob Bramley [:jbramley]

Reporter

Comment 1

•

15 years ago

Attached patch Instrumentation — Details — Splinter Review

In the interest of reproducibility, here is the instrumentation I used (with a couple of extra things that I added after gathering the figures).

Boris Zbarsky [:bzbarsky]

Comment 2

•

15 years ago

Jacob, do you mind attaching the JS testcase you used as well? I wonder how much of this time is calling FinalizeObject....

Jacob Bramley [:jbramley]

Reporter

Comment 3

•

15 years ago

The test-case is trivial: ---- function f() {} for (var i = 0; i < 10000000; i++) { var a = new String("I am a string."); var b = "I am also a string."; f(a,b); } ---- It isn't clever, and certainly isn't representative, but the sweep phase is still too slow (at least on ARM).

Jacob Bramley [:jbramley]

Reporter

Comment 4

•

15 years ago

(In reply to comment #2) > I wonder how much of this time is calling FinalizeObject.... Actually, the instrumentation patch I posted reports this, though that's something I added after gathering the numbers. I think finalize() is a fairly hefty component.

Boris Zbarsky [:bzbarsky]

Comment 5

•

15 years ago

Hm. So if shark's not lying to me, then a pretty large fraction of the time under FinalizeObject seems to be the !obj->map test (which also accounts for something like 10% of total L2 misses during GC as well). The clasp->finalize test is the next hotspot there (1.5% of total L2 misses). At least on Mac FinalizeObject ends up inlined into js_GC, so the timer percentages are of total self js_GC time, but those are 40% and 17% respectively. 5% of the js_GC time is the !nextFree check after setting nextFree to a->info.freeList in FinalizeArenaList (41% of the L2 misses are here(?)). I wonder whether it's worth running under something like cachegrind to get a better idea of what's going on here in L2 terms, and whether we can issue some sort of prefetch instructions that would help....

Boris Zbarsky [:bzbarsky]

Comment 6

•

15 years ago

Oh, and the numbers in comment 5 are for an x86 with a 6MB L2; I can't imagine that the L2 miss situation is any better on ARM.

Igor Bukanov

Comment 7

•

15 years ago

(In reply to comment #5) > Hm. So if shark's not lying to me, then a pretty large fraction of the time > under FinalizeObject seems to be the !obj->map test (which also accounts for > something like 10% of total L2 misses during GC as well). That we can fix if failed or dead objects would use a special non-null map. > The clasp->finalize > test is the next hotspot there (1.5% of total L2 misses). That can be also dealt with if would allocate objects with and without finalize hook from separated GC arenas. But that increases inevitably the heap fragmentation and the size of working sets.

Boris Zbarsky [:bzbarsky]

Comment 8

•

15 years ago

> That we can fix if failed or dead objects would use a special non-null map. Maybe. If the time is actually spent in an L2 miss fetching the bits of *obj, then it doesn't matter what the actual value of map is, I'd think. Unless we think that an unconditional access would get reordered so as to start the fetch much sooner?

Igor Bukanov

Comment 9

•

15 years ago

(In reply to comment #0) > By hacking in profiling code using gettimeofday, I can get a break-down of the > performance of calls to js_GC(). The following numbers came from a trivial > JavaScript loop allocating (and then forgetting) Objects. This test is not what happen on a typical web site, see the reference in bug 548388. That paper shows that objects typically are long-lived. That, of cause, does not mean that we should not optimize their finalization, but it is important that any optimization in that area does not slow down, for example, the marking phase of the GC.

Benoit Jacob [:bjacob] (mostly away)

Comment 10

•

15 years ago

Hi, just letting you know that I have filed a separate bug about a different instrumentation patch: bug 601949. It does something different, and covers different use cases, than the patch discussed here. It's mostly for non-JS people wanting to know whether GC pauses are responsible for a performance/smoothness issue they are having.

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Assignee: general → nobody

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Matthew Gaudet (he/him) [:mgaudet]

Updated

•

2 years ago

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → INCOMPLETE

Bugzilla

Investigate and (try to) improve GC performance.

Categories

(Core :: JavaScript Engine, enhancement)

Tracking

()

People

(Reporter: jbramley, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated

Updated

Updated

Attachment

General

Description

File Name

Content Type