Closed
Bug 531091
Opened 15 years ago
Closed 6 years ago
In the v8 splay benchmark, a large amount of time is spent in convertDoubleToString
Categories
(Tamarin Graveyard :: Virtual Machine, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Future
People
(Reporter: dschaffe, Unassigned)
References
Details
(Whiteboard: PACMAN)
Attachments
(4 files, 3 obsolete files)
7.68 KB, patch (edwsmith: review+)
7.85 KB, patch (stejohns: review+; edwsmith: superreview+)
364 bytes, patch (lhansen: review+)
2.19 KB, patch (lhansen: review+)
should speed up avmplus::MathUtils::convertDoubleToString()
Tamarin spends much of its time generating the data structure in the splay test. The GeneratePayload function contains:
return {
array : [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ],
string : 'String for key ' + key + ' in leaf node'
};
To optimize, I modified the code to create a payloaddata class instead of returning an Object. Changing it to:
return {
array : [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ],
string : 'String for key in leaf node',
key : key
};
makes a big difference.
v8 ~ 6000
tamarin ~ 750
tamarin(with change) ~ 3300
Comment 1•15 years ago
One assumes that should be 7500, not 750?
Comment 2•15 years ago
Bigger is better in the v8 test suite, so that is probably correct.
Reporter
Comment 3•15 years ago
750 is correct. The v8 measurement is the number of operations for a time interval (bigger is better). Removing the doubleToString conversion took the score from 750 to 3300 (about a 4x speedup).
Reporter
Updated•15 years ago
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Updated•15 years ago
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future
Updated•15 years ago
Whiteboard: PACMAN
Comment 4•15 years ago
Got pacman fever and started looking at this. With a cache-of-1 approach in AvmCore::doubleToString I get a 100% speedup. The benchmark converts 8000 doubles to strings, but each one is done about 30 times in a row before moving on to the next one. After that fix it's all memory management:
16.4% 16.4% avmshell MMgc::GC::MarkItem(MMgc::GCWorkItem&)
5.4% 5.4% avmshell MMgc::GCAlloc::Alloc(int)
5.0% 5.0% avmshell avmplus::AvmCore::atomWriteBarrier(MMgc::GC*, void const*, int*, int)
3.2% 3.2% avmshell avmplus::AvmCore::decrementAtomRegion(int*, int)
This suggests a doubleToString cache might be worthwhile if we can't tune doubleToString itself significantly, although I suspect we can.
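The cache-of-1 approach can be sketched as follows. This is a minimal illustration, not the actual AvmCore code: the struct and field names are invented, and std::to_string stands in for MathUtils::convertDoubleToString.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical single-entry memo in front of the expensive conversion.
struct DoubleToStringCache {
    bool        valid = false;
    double      lastDouble = 0.0;
    std::string lastString;

    std::string convert(double d) {
        // Compare bit patterns so -0.0 and 0.0 (which compare equal as
        // doubles but print differently) never alias each other's entry.
        if (valid && std::memcmp(&d, &lastDouble, sizeof d) == 0)
            return lastString;
        lastDouble = d;
        lastString = std::to_string(d);  // the expensive conversion
        valid = true;
        return lastString;
    }
};
```

Since splay converts each key roughly 30 times in a row, a one-entry cache turns all but the first conversion into a memcmp and a string copy.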
One interesting observation is that an object literal with 2 properties results in a 16-element HT, which seems like overkill, but I saw InlineHashtable grow being called if I made it result in only an 8-element HT. Supposedly our load factor is 0.8, so this didn't make sense. The *2+1 in MethodEnv::op_newobject and the round-up to the next power of two in the HT ctor are how it gets so big.
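One plausible reading of that arithmetic (the constants here are inferred from the comment above, not taken from the Tamarin source): 2 properties gives argc*2+1 = 5 requested entries, the ctor rounds that up to 8 entries, and each entry is a name/value pair, i.e. 16 atoms.

```cpp
#include <cassert>

// Hypothetical reconstruction of the sizing math described above.
unsigned nextPow2(unsigned n) {
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

// argc literal properties -> op_newobject requests argc*2+1 entries,
// the ctor rounds up to a power of two, and each entry holds two atoms.
unsigned htAtoms(unsigned argc) {
    return nextPow2(argc * 2 + 1) * 2;
}
```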
I'm revisiting the memory profiler; I want to get a breakdown of MarkItem time by having the profiler show just "scanned memory".
Comment 5•15 years ago
Profiled our "optimized" case and found a lot of time in setUintProperty for the array literal; this fixes that and results in an ~8% speedup. After that it's just memory management:
9.8% 9.8% avmshell MMgc::GC::privateWriteBarrierRC(void const*, void const*, void const*)
8.9% 8.9% avmshell avmplus::Traits::destroyInstance(avmplus::ScriptObject*) const
8.8% 8.8% avmshell MMgc::GC::MarkItem(MMgc::GCWorkItem&)
7.3% 7.3% avmshell MMgc::GCAlloc::Alloc(int)
2.9% 2.9% avmshell MMgc::GCAlloc::Free(void const*)
writeBarrier traffic is probably a big hit since a lot of this is object initialization. It would be nice to get the JIT and GC to coordinate and avoid ZCT insertion/removal for new objects we know are gonna get ref counted.
Attachment #442394 - Flags: review?(edwsmith)
Comment 6•15 years ago
This benchmark's profile is wildly different depending on whether we look at the original JS one or the AS3'ized optimized one. Which should we focus on?
Comment 8•15 years ago
The AS3-optimized version completely bypasses the string + double + string expression that gave rise to this bug. I'll focus on the optimized version's time spent in privateWriteBarrierRC when I next get a chance to work on this.
Comment 9•15 years ago
privateWriteBarrierRC initCalls 9162212/7601252 (82.963066%)
privateWriteBarrierRC repeatContainer calls 9162212/5731389 (62.554643%)
So 83% of the time the container is the last thing allocated and we can skip the WB check (but not the ref count), and 63% of the time the container was the same as it was the last time it was called.
Thoughts:
- have the JIT'd ctors inline the increment ref call and skip privateWriteBarrierRC altogether
- have a "batch" write barrier API so the JIT can ask the GC to handle privateWriteBarrierRC(container, addr1, value1, addr2, value2, ....)
For #1 we'd need to guarantee the last-alloc property holds, so maybe the ctor and the alloc can be folded together into one operation (for non-natives, obviously). Investigating...
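The second idea could look roughly like this. The function name and signature are hypothetical (no such batch call exists in MMgc), and plain stores stand in for the real ref-count bookkeeping:

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a GC atom; illustrative only.
typedef void* Atom;

// Per-pair work: the real barrier would also adjust ref counts on the
// old and new values; a plain store stands in here.
static inline void storeAndCount(Atom* slot, Atom value) {
    *slot = value;
}

// Hypothetical batch barrier: do the container-level checks (e.g. the
// "is container the most recent alloc?" test) once, then store every
// slot/value pair, instead of repeating the checks per store.
void batchWriteBarrierRC(void* container, Atom** slots, Atom* values, size_t n) {
    (void)container;  // one container check here instead of n of them
    for (size_t i = 0; i < n; ++i)
        storeAndCount(slots[i], values[i]);
}
```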
Comment 10•15 years ago
Comment on attachment 442394 [details] [diff] [review]
Use bulk push
R- only because I think you can do better. If you know that inst will be dense, coming from newArray(argc), then you know it will use the first branch of the if inside AS3_push(). Factor that case out, assert that the array is dense, then call it directly.
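The refactor being asked for has roughly the following shape; DenseArray and its methods are a sketch, not the real ArrayObject API:

```cpp
#include <cassert>
#include <vector>

// Sketch of hoisting the dense branch out of a general push().
struct DenseArray {
    std::vector<int> storage;   // stands in for the dense backing store
    bool dense = true;

    // Fast path: callers such as newArray(argc) already know the array
    // is dense, so they can skip the representation test entirely.
    void pushDense(int v) {
        assert(dense);          // the invariant the reviewer asks us to assert
        storage.push_back(v);
    }

    // General path: tests the representation on every call.
    void push(int v) {
        if (dense)
            pushDense(v);
        // else: sparse/hashtable path elided
    }
};
```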
Attachment #442394 - Flags: review?(edwsmith) → review-
Comment 11•15 years ago
Simplifying to just make a call to increment the ref count barely moves the needle. We need to avoid or drastically reduce the # of calls to privateWriteBarrierRC altogether to make a difference (if we didn't have DRC we could easily make these plain stores for the init case).
Also, nogc gives a 10% speedup; oddly, the profile switches to destroyInstance being the heaviest function. Ha, nogc doesn't disable DRC.
Comment 12•15 years ago
(In reply to comment #4)
> Got pacman fever and started looking at this. With a cache of 1 approach in
> AvmCore::doubleToString I get a %100 speedup. The benchmark converts 8000
> doubles to strings but each one is done like 30 times in a row before moving on
> the next one. After that fix its all memory management:
Is a cache the only option here? If we can find out why we're calling doubleToString in the first place and avoid it, even better. Felix has been looking at several similar situations where we can be using a faster access path.
On the cache: is the repeated access pattern due to poor internal code structure or due to unique behavior in the benchmark? If the latter, then I'd want to see more comprehensive results before signing off on the benefit of the cache.
Generally: post results on the whole suite when possible, even though one test in particular caused the investigation. Or at least run the whole suite and post data on any interesting outliers.
Comment 13•15 years ago
(In reply to comment #12)
> Is a cache the only option here? if we can find out why we're calling
> doubleToString in the first place and avoid it, even better. Felix has been
> looking at several similar sitations where we can be using a faster access
> path.
Evaluating the expression 'strconst + key + strconst', where key is the result of Math.random2; see Dan's first comment for the real code. The expression is evaluated exactly 32 times for each key value (a 5-deep binary tree).
> on the cache: is the repeated access pattern due to poor internal code
> structure or due to unique behavior in the benchmark? if the latter, then i'd
> want to see more comprehensive results before signing off on the benefit of the
> cache.
The cache was mainly meant to show that there's upside in optimizing number-to-String conversion (and maybe String allocation/concatenation too). Maintaining a (size-limited) general double-to-String cache did occur to me, but I thought that should be a last resort after optimizing the existing code.
Comment 14•15 years ago
This one takes the init down to the level of the alloc and uses the optimized _ctor Atom write barriers. So in addition to skipping the dense check we also skip the capacity checking (which was happening on each push, even in AS3_push). We could go further and avoid the actual WB and do some more inlining but this is a good start.
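"Taking the init down to the level of the alloc" amounts to reserving the final capacity once and then doing straight stores, rather than letting each push re-check capacity. A sketch under those assumptions, with std::vector standing in for the array's dense backing store:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: build a dense array literal of known length argc with a single
// capacity reservation. After reserve(argc), no push_back below can
// reallocate, mirroring the patch's "skip per-push capacity growth" effect.
std::vector<int> newArrayBulk(const int* args, size_t argc) {
    std::vector<int> dense;
    dense.reserve(argc);             // one capacity check for the whole literal
    for (size_t i = 0; i < argc; ++i)
        dense.push_back(args[i]);    // guaranteed not to reallocate
    return dense;
}
```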
Attachment #442394 - Attachment is obsolete: true
Attachment #442545 - Flags: review?(edwsmith)
Comment 15•15 years ago
Comment on attachment 442545 [details] [diff] [review]
take it to the limit (almost)
Changing review request to feedback. Ran the complete benchmark suite: mostly noise, a 13% speedup on splay but a consistent 7% slowdown on jsbench/Moldyn.as that I need to investigate.
Attachment #442545 - Flags: review?(edwsmith) → feedback?(edwsmith)
Comment 16•15 years ago
Perf anomalies figured out. AtomArray needs the inline cleanup treatment.
Updated•15 years ago
Attachment #442545 - Flags: feedback?(edwsmith)
Comment 17•15 years ago
Some appreciable speedups, and libavmplus.a and the avmshell release build are smaller; win-win.
Comment 18•15 years ago
(In reply to comment #16)
> perf anomalies figured out. AtomArray needs the inline cleanup treatment.
Nice. Long overdue.
Comment 19•15 years ago
Will be touching this file too, so give it the inline scrub.
Attachment #442743 - Flags: review?(edwsmith)
Updated•15 years ago
Attachment #442702 - Flags: review?(edwsmith)
Comment 20•15 years ago
Comment on attachment 442743 [details] [diff] [review]
ditto for ArrayObject
Withdrawing ArrayObject; I wrongly assumed trying to REALLY_INLINE virtual methods was silly, but it's not.
Attachment #442743 - Flags: review?(edwsmith)
Comment 21•15 years ago
Now virtuals are decl'd REALLY_INLINE too.
Attachment #442743 - Attachment is obsolete: true
Attachment #442787 - Flags: superreview?(edwsmith)
Attachment #442787 - Flags: review?(stejohns)
Updated•15 years ago
Attachment #442787 - Flags: review?(stejohns) → review+
Comment 22•15 years ago
Comment on attachment 442787 [details] [diff] [review]
inline cleanup for ArrayObject
r+ is conditional on our having done a reality check of code size on a credible number of compilers and platforms.
Comment 23•15 years ago
Code size and perf are good on mac32. Haven't looked elsewhere yet. Will verify msvc x86/arm as well.
Comment 24•15 years ago
It'll probably be a couple of days before I get back to this, so anyone not stuck in argo, feel free to jump in.
Updated•15 years ago
Attachment #442787 - Flags: superreview?(edwsmith) → superreview+
Comment 25•15 years ago
Comment on attachment 442702 [details] [diff] [review]
give AtomArray the inline treatment
AtomArray.cpp:96: push(..) got changed to setAtInternal(...); is that intentional, or did it accidentally slip into the -inline patch? From what I can tell it's not broken, but I'd expect the patch to just move code around, without changing it.
Attachment #442702 - Flags: review?(edwsmith) → review+
Comment 26•15 years ago
An intentional change unintentionally included in this patch; I'll remove it and put it back in the final patch, which will contain just the optimizations.
Comment 27•15 years ago
tamarin-redux: changeset: 4662:5e347a983982
Will report back with size report link for delta between 4662/4661 and then submit ArrayObject one (so we see results for each change).
Comment 28•15 years ago
Apparently our size-reporting page is broken, but Dan tells me the AtomArray change resulted in no change for msvc (well, builtin supposedly got 576 bytes bigger and crt got 600 bytes smaller; apparently the sizereport script ain't perfect). So gcc and msvc show no change.
Comment 29•15 years ago
Doing this skips extra unnecessary checkCapacity checks.
Attachment #444807 - Flags: review?(edwsmith)
Comment 30•15 years ago
With all 4 patches we get a 20% speedup in v8.5/optimzed/splay.abc:
~4% due to the first two inlining patches
~2% due to skipping extra checkCapacity calls
the rest due to this bulk push patch
Attachment #444808 - Flags: review?(edwsmith)
Updated•15 years ago
Attachment #444807 - Attachment is patch: true
Attachment #444807 - Attachment mime type: application/octet-stream → text/plain
Updated•15 years ago
Attachment #444807 - Flags: review?(edwsmith) → review?(lhansen)
Updated•15 years ago
Attachment #444808 - Flags: review?(edwsmith) → review?(lhansen)
Comment 31•15 years ago
ArrayObject inlining: tamarin-redux - changeset - 4664:8c63d785aef8
Updated•15 years ago
Attachment #444807 - Flags: review?(lhansen) → review+
Comment 32•15 years ago
Comment on attachment 444808 [details] [diff] [review]
do a bulk push to dense array instead of one at a time
Nice.
Attachment #444808 - Flags: review?(lhansen) → review+
Comment 33•15 years ago
http://hg.mozilla.org/tamarin-redux/rev/320898401b2e
http://hg.mozilla.org/tamarin-redux/rev/795528944a58
Assignee: treilly → nobody
Comment 34•15 years ago
Relinquishing bug; the bug is still valid, since I focused on the optimized test case and not the pure JavaScript test case, which is pounding on convertDoubleToString. Either a faster convertDoubleToString or some kind of cache is required to optimize that.
Updated•15 years ago
Status: ASSIGNED → NEW
Comment 35•14 years ago
I rediscovered this problem yesterday when profiling typed\splay.as. Adding a cache for doubleToString like Tommy tried sped up the test case by 5x on Win32+Xeon but is probably not a valid solution. The BigInteger/D2A code is very slow.
Comment 36•14 years ago
FWIW, Splay was updated in v6 as follows:
Changed the Splay benchmark to avoid converting the same numeric key to a string over and over again and to avoid inserting and removing the same element repeatedly thus increasing pressure on the memory subsystem.
http://v8.googlecode.com/svn/data/benchmarks/v6/README.txt
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX