Closed Bug 594008 Opened 14 years ago Closed 14 years ago

Vector storage for non-pointer data must not come out of the FixedMalloc heap

Categories

(Tamarin Graveyard :: Library, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: lhansen, Assigned: lhansen)

Attachments

(1 file)

Currently the Vector storage for int, uint, and double comes out of the FixedMalloc heap. This royally messes with the GC policy, because there is a large volume of memory that is effectively owned by GC-managed objects but is never seen by the GC. As a consequence the collection frequency is higher than it ought to be. While probably an outlier, the jsbench/FFT.as benchmark runs about 30 collections, while jsbench/typed/FFT.as runs about 300. (Those numbers are from memory; accurate numbers will be posted later.) Of course, the latter has a much smaller heap size, but only because it runs the collector absurdly often - a consequence of the amount of live heap appearing to be quite small when it is not.
Note that String follows the pattern of Vector in the sense that it is an RCObject that references pointer-free memory. But for String we use GC memory, and we do so explicitly to make the policy code work the right way.
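The fix, in outline, is to move that storage onto the GC heap as pointer-free memory, the same way String does. A minimal sketch of the idea, not the actual patch; the MMgc entry points and flag names below (FixedMalloc::GetFixedMalloc, GC::Alloc, kZero) are from memory and should be treated as assumptions:

    // Sketch only: where the Vector.<double> backing store comes from.
    // Assumes #include "MMgc.h" and the usual avmplus build environment.

    // Old scheme: FixedMalloc storage, invisible to the GC policy.
    double* allocBackingStoreOld(MMgc::GC* gc, size_t count)
    {
        (void)gc;
        return (double*)MMgc::FixedMalloc::GetFixedMalloc()->Alloc(count * sizeof(double));
    }

    // New scheme: GC storage allocated without kContainsPointers, so the
    // collector never scans the doubles but the policy counts the bytes.
    double* allocBackingStoreNew(MMgc::GC* gc, size_t count)
    {
        return (double*)gc->Alloc(count * sizeof(double), MMgc::GC::kZero);
    }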
See Also: → 593766
Generally, a program using Vector.<Number> rather than Array containing boxed doubles should use about 2/3 of the memory, if that's the dominant data structure. There's some room for noise because of method closures and other baseline memory, and also because the policy is not constant but changes slightly with the heap size. So the Vector case should use somewhere between 2/3 of and the same amount of memory as the Array case (ignoring heap turbulence effects).
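The 2/3 figure is just back-of-the-envelope arithmetic: assume a 32-bit build where each Array element is a 4-byte atom slot plus a separately allocated 8-byte boxed double, while Vector.<Number> stores the 8-byte double inline. The per-element sizes below are assumptions and ignore object headers and allocator rounding, so the real ratio will drift a bit:

    // Hypothetical per-element cost, 32-bit build, headers/rounding ignored.
    #include <cstddef>
    const std::size_t kArrayPerElement  = 4 /* atom slot */ + 8 /* boxed double */;  // 12 bytes
    const std::size_t kVectorPerElement = 8 /* double stored inline */;              //  8 bytes
    // kVectorPerElement : kArrayPerElement == 8 : 12 == 2/3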
Baseline performance data, 5 iterations:

Metric: time
Dir: jsbench/
  test           avm     avg      95%_conf
  Crypt          3404    3408     0.1%
  Euler          6256    6302.4   0.5%
  FFT            6445    6521.2   1.1%
  HeapSort       2956    2963.2   0.3%
  LUFact         6972    7025     0.7%
  Moldyn         7840    7848.8   0.1%
  RayTracer      6442    6462.2   0.2%
  SOR            27964   28134    0.8%
  Series         7249    7288.6   0.5%
  SparseMatmult  11173   11197.4  0.2%
Dir: jsbench/typed/
  test           avm     avg      95%_conf
  Crypt          833     835.2    0.2%
  Euler          7767    7778.2   0.1%
  FFT            3634    3644.6   0.3%
  HeapSort       1407    1408     0.1%
  LUFact         6925    6939.2   0.2%
  Moldyn         3551    3583.2   0.9%
  RayTracer      1042    1051     1.0%
  SOR            19740   19798.4  0.3%
  Series         6637    6639.4   0.0%
  SparseMatmult  2086    2087.8   0.1%
Baseline memory data, 5 iterations:

Metric: memory
Dir: jsbench/
  test           avm     avg     95%_conf
  Crypt          43.8M   43.8M   0.0%
  Euler          4.4M    4.0M    7.2%
  FFT            31.3M   28.2M   20.5%
  HeapSort       5.1M    5.1M    0.0%
  LUFact         32.2M   20.9M   54.3%
  Moldyn         1.3M    993K    21.2%
  RayTracer      1.1M    1.1M    5.8%
  SOR            52.0M   51.5M   0.7%
  Series         11.0M   10.7M   2.5%
  SparseMatmult  13.1M   11.6M   8.0%
Dir: jsbench/typed/
  test           avm     avg     95%_conf
  Crypt          34.7M   34.7M   0.0%
  Euler          1.4M    1.1M    25.9%
  FFT            532K    514K    8.5%
  HeapSort       4.2M    4.2M    0.0%
  LUFact         2.4M    2.2M    5.7%
  Moldyn         1.4M    1.1M    24.7%
  RayTracer      404K    404K    0.0%
  SOR            8.5M    8.4M    0.6%
  Series         1004K   984K    4.8%
  SparseMatmult  4.8M    4.0M    12.4%

The variability in the untyped LUFact results makes my skin crawl, but it could be minor turbulence leading to a large data structure being allocated just before a GC finishes rather than just after, so it need not be a big deal.
Attached patch: Candidate patch
Performance data with patch, 5 iterations:

Metric: time
Dir: jsbench/typed/
  test           avm     avg      95%_conf
  Crypt          836     838.2    0.2%
  Euler          7679    7730.8   0.9%
  FFT            4443    4454     0.3%
  HeapSort       1409    1410.8   0.1%
  LUFact         7669    7709     0.7%
  Moldyn         3542    3550     0.2%
  RayTracer      1035    1036.6   0.4%
  SOR            24122   24146.2  0.1%
  Series         6723    6725.2   0.0%
  SparseMatmult  2093    2096.4   0.2%

Memory data with patch, 5 iterations:

Metric: memory
Dir: jsbench/typed/
  test           avm     avg     95%_conf
  Crypt          34.7M   34.7M   0.0%
  Euler          1.7M    1.1M    40.2%
  FFT            15.2M   10.2M   30.9%
  HeapSort       4.2M    4.2M    0.0%
  LUFact         7.9M    6.1M    30.5%
  Moldyn         1.4M    1.1M    22.0%
  RayTracer      404K    404K    0.0%
  SOR            28.4M   21.5M   35.4%
  Series         1.3M    1.1M    14.8%
  SparseMatmult  5.9M    5.9M    0.0%

So the memory numbers for FFT are up and somewhat more in line with expectation, but performance is down quite a bit, possibly because we're hitting the write barrier more now that the interval between collections is so much greater (the number of collections is down by an order of magnitude). Frankly that's not a great explanation: most assignments should be into a Vector.<Number>, so there should be little or no write barrier activity. LUFact and SOR are also slower, SOR greatly so. Memory numbers are up significantly on both, so it's probably the same effect. Not asking for a review yet; I need to figure out why there is a slowdown when none was expected.
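For context on the write-barrier theory above, here is a sketch (with an invented helper, not VM code) of why element writes into a Vector.<Number> backing store shouldn't generate barrier traffic:

    // Illustration only: the payload of a Vector.<Number> backing store is raw
    // doubles, so an element write is a plain memory store; the block is
    // allocated pointer-free and the collector never scans it.
    #include <stdint.h>

    static inline void storeNumber(double* data, uint32_t i, double v)
    {
        data[i] = v;   // no write barrier
    }

    // By contrast, storing an Atom into a container the GC traces (Array,
    // Vector.<Object>) has to go through MMgc's write barrier (the WB/WBRC
    // style macros) so an in-progress incremental mark sees the new edge.
    // That cost does not apply to Vector.<Number> element writes.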
Blocks: 576307
I guess going from 500KB to 10MB we could also be seeing significant cache effects (FFT). For SOR that's less obvious, but the heap grew from 2.5x to 3.5x so it's not at all out of the question.
Comment on attachment 472624 (Candidate patch): I figure it wouldn't hurt to review while investigating the slowdowns - the patch ought to be pretty much right anyway.
Attachment #472624 - Flags: review?(edwsmith)
Comment on attachment 472624 (Candidate patch): Not related, but I noticed that ObjectVectorArray::grow uses VMPI_memcpy to copy Atoms. Is that safe?
Attachment #472624 - Flags: review?(edwsmith) → review+
(In reply to comment #9)
> Not related, but I noticed that ObjectVectorArray::grow uses VMPI_memcpy to
> copy Atoms. Is that safe?

I believe so. (I went over a bunch of scenarios in my head and couldn't think of anything that would break it.) We use the idiom elsewhere as well.
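For what it's worth, the idiom being discussed is roughly the following (a sketch with invented names, not the actual ObjectVectorArray::grow; the GC::Alloc flag names are assumptions). The usual argument for safety is that MMgc never moves objects and the references migrate rather than get duplicated, so per-slot reference counting doesn't need to run during the copy:

    // Sketch only: grow an Atom array by raw-copying the Atom bits.
    Atom* growAtoms(MMgc::GC* gc, Atom* oldData, uint32_t oldCount, uint32_t newCount)
    {
        // The new block holds object references, so it is allocated scanned.
        Atom* newData = (Atom*)gc->Alloc(newCount * sizeof(Atom),
                                         MMgc::GC::kContainsPointers | MMgc::GC::kZero);
        // Raw copy is fine: the collector does not move the referenced objects,
        // and once oldData is released the references have moved, not doubled,
        // so reference counts stay balanced.
        VMPI_memcpy(newData, oldData, oldCount * sizeof(Atom));
        return newData;   // caller installs newData and releases oldData
    }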
Regarding the slowdown: on my MacBook Pro the new code uses 17% more time than the old code. The new code also has 6% more L2 cache misses than the old code, if Shark is to be believed. The L2 profile shows a lot more misses in the GC code, and tantalizingly a lot in GCAlloc::GetUsageInfo (about 1.5% of all misses). Since the heap is about 20x larger, that's not terribly surprising. This machine has a 6MB L2 cache. The Mac Pro quoted earlier has 256KB of L2 per core and 8MB of L3 per CPU (2 quad-core CPUs); the slowdown there was 22%. The L2 profiling built into Shark does not work on those CPUs.
(In reply to comment #7)
> I guess ... we could also be seeing significant cache effects. For SOR that's
> less obvious, but the heap grew from 2.5x to 3.5x so it's not at all out of
> the question.

In fact, with SOR at 8.5MB it should be barely missing in the 8MB L3 cache. At 20-28MB it should be missing a lot.
After fixing bug #594265 so that the reported memory numbers reflect reality, and sizing the FFT problem so that the heap fits in the L2 cache (3.9MB for the new code; 2.4MB for the old code - not 514K as previously and erroneously reported), the results on the MBP are:

  old code  224ms
  new code  235ms

That's still 5% more time for the new code, and at this point L2 cache effects ought to be ruled out. The L1 cache on this system is small, 64KB split between instruction and data per core (not clear whether that's a shared cache or 32KB x 2). Most non-vector allocation in this program is boxed doubles (a weakness in the optimizer causes this), i.e., not reference-counted storage, so it's highly unlikely that the work stays well within the L1 cache in either configuration.
With the same problem size on the Mac Pro, the new code uses 1.7% more time than the old code. I'm going to call this cache effects and ship the code.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Just for the record, Nehalem has separate L1 instruction and data caches, so 32KB x 2. http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29
Flags: flashplayer-bug+