Closed
Bug 531091
Opened 15 years ago
Closed 6 years ago
In the v8 splay benchmark, a large amount of time is spent in convertDoubleToString
Categories
(Tamarin Graveyard :: Virtual Machine, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Future
People
(Reporter: dschaffe, Unassigned)
References
Details
(Whiteboard: PACMAN)
Attachments
(4 files, 3 obsolete files)
7.68 KB, patch (edwsmith: review+)
7.85 KB, patch (stejohns: review+; edwsmith: superreview+)
364 bytes, patch (lhansen: review+)
2.19 KB, patch (lhansen: review+)
should speed up avmplus::MathUtils::convertDoubleToString()
Tamarin spends much of its time generating the data structure in the splay test. The GeneratePayload function contains:
return {
array : [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ],
string : 'String for key ' + key + ' in leaf node'
};
To optimize, I modified the code to create a payloaddata class instead of returning an Object. Changing it to:
return {
array : [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ],
string : 'String for key in leaf node',
key : key
};
makes a big difference.
v8 ~ 6000
tamarin ~ 750
tamarin(with change) ~ 3300
Comment 1•15 years ago
One assumes that should be 7500, not 750?
Comment 2•15 years ago
Bigger is better in the v8 test suite, so that is probably correct.
Reporter
Comment 3•15 years ago
750 is correct. The v8 measurement is the number of operations for a time interval (bigger is better). Removing the doubleToString conversion took the score from 750 to 3300 (about a 4x speedup).
Reporter
Updated•15 years ago
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Updated•15 years ago
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future
Updated•15 years ago
Whiteboard: PACMAN
Comment 4•15 years ago
Got pacman fever and started looking at this. With a cache-of-1 approach in AvmCore::doubleToString I get a 100% speedup. The benchmark converts 8000 doubles to strings, but each one is done about 30 times in a row before moving on to the next one. After that fix it's all memory management:
16.4% 16.4% avmshell MMgc::GC::MarkItem(MMgc::GCWorkItem&)
5.4% 5.4% avmshell MMgc::GCAlloc::Alloc(int)
5.0% 5.0% avmshell avmplus::AvmCore::atomWriteBarrier(MMgc::GC*, void const*, int*, int)
3.2% 3.2% avmshell avmplus::AvmCore::decrementAtomRegion(int*, int)
This suggests a doubleToString cache might be worthwhile if we can't tune doubleToString itself significantly, although I suspect we can.
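The cache-of-1 approach can be sketched as follows. This is a minimal illustration, not the actual AvmCore code: the struct and field names are invented, and std::to_string stands in for MathUtils::convertDoubleToString.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Hypothetical single-entry memo in front of the expensive conversion.
struct DoubleToStringCache {
    bool        valid = false;
    double      lastDouble = 0.0;
    std::string lastString;

    std::string convert(double d) {
        // Compare bit patterns so -0.0 and 0.0 (which compare equal as
        // doubles but print differently) never alias each other's entry.
        if (valid && std::memcmp(&d, &lastDouble, sizeof d) == 0)
            return lastString;
        lastDouble = d;
        lastString = std::to_string(d);  // the expensive conversion
        valid = true;
        return lastString;
    }
};
```

Since splay converts each key roughly 30 times in a row, a one-entry cache turns all but the first conversion into a memcmp and a string copy.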
One interesting observation is that an object literal with 2 properties results in a 16-element HT, which seems like overkill, but I saw InlineHashtable grow being called if I made it result in only an 8-element HT. Supposedly our load factor is 0.8, so this didn't make sense. The *2+1 in MethodEnv::op_newobject and the round-up to the next power of two in the HT ctor are how it gets so big.
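One plausible reading of that arithmetic (the constants here are inferred from the comment above, not taken from the Tamarin source): 2 properties gives argc*2+1 = 5 requested entries, the ctor rounds that up to 8 entries, and each entry is a name/value pair, i.e. 16 atoms.

```cpp
#include <cassert>

// Hypothetical reconstruction of the sizing math described above.
unsigned nextPow2(unsigned n) {
    unsigned p = 1;
    while (p < n) p <<= 1;
    return p;
}

// argc literal properties -> op_newobject requests argc*2+1 entries,
// the ctor rounds up to a power of two, and each entry holds two atoms.
unsigned htAtoms(unsigned argc) {
    return nextPow2(argc * 2 + 1) * 2;
}
```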
I'm revisiting the memory profiler; I want to get a breakdown of MarkItem time by having the profiler show just "scanned memory".
Comment 5•15 years ago
Profiled our "optimized" case and found a lot of time in setUintProperty for the array literal; this fixes that and results in an ~8% speedup. After that it's just memory management:
9.8% 9.8% avmshell MMgc::GC::privateWriteBarrierRC(void const*, void const*, void const*)
8.9% 8.9% avmshell avmplus::Traits::destroyInstance(avmplus::ScriptObject*) const
8.8% 8.8% avmshell MMgc::GC::MarkItem(MMgc::GCWorkItem&)
7.3% 7.3% avmshell MMgc::GCAlloc::Alloc(int)
2.9% 2.9% avmshell MMgc::GCAlloc::Free(void const*)
writeBarrier traffic is probably a big hit since a lot of this is object initialization. It would be nice to get the JIT and GC to coordinate and avoid ZCT insertion/removal for new objects we know are gonna get ref counted.
Attachment #442394 - Flags: review?(edwsmith)
Comment 6•15 years ago
This benchmark's profile is wildly different depending on whether we look at the original JS one or the AS3'ized optimized one. Which should we focus on?
Comment 8•15 years ago
The AS3-optimized version completely bypasses the string + double + string expression that gave rise to this bug. I'll focus on the optimized version's time spent in privateWriteBarrierRC when I next get a chance to work on this.
Comment 9•15 years ago
privateWriteBarrierRC initCalls 9162212/7601252 (82.963066%)
privateWriteBarrierRC repeatContainer calls 9162212/5731389 (62.554643%)
So 83% of the time the container is the last thing allocated and we can skip the WB check (but not the ref count), and 63% of the time the container was the same as it was the last time it was called.
Thoughts:
- have the JIT'd ctors inline the increment ref call and skip privateWriteBarrierRC altogether
- have a "batch" write barrier API so the JIT can ask the GC to handle privateWriteBarrierRC(container, addr1, value1, addr2, value2, ....)
For #1 we'd need to guarantee the last-alloc property holds, so maybe the ctor and the alloc can be folded together into one operation (for non-natives, obviously). Investigating...
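The second idea could look roughly like this. The function name and signature are hypothetical (no such batch call exists in MMgc), and plain stores stand in for the real ref-count bookkeeping:

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a GC atom; illustrative only.
typedef void* Atom;

// Per-pair work: the real barrier would also adjust ref counts on the
// old and new values; a plain store stands in here.
static inline void storeAndCount(Atom* slot, Atom value) {
    *slot = value;
}

// Hypothetical batch barrier: do the container-level checks (e.g. the
// "is container the most recent alloc?" test) once, then store every
// slot/value pair, instead of repeating the checks per store.
void batchWriteBarrierRC(void* container, Atom** slots, Atom* values, size_t n) {
    (void)container;  // one container check here instead of n of them
    for (size_t i = 0; i < n; ++i)
        storeAndCount(slots[i], values[i]);
}
```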
Comment 10•15 years ago
Comment on attachment 442394 [details] [diff] [review]
Use bulk push
R- only because I think you can do better. If you know that inst will be dense, coming from newArray(argc), then you know it will use the first branch of the if inside AS3_push(). Factor that case out, assert that the array is dense, then call it directly.
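The refactor being asked for has roughly the following shape; DenseArray and its methods are a sketch, not the real ArrayObject API:

```cpp
#include <cassert>
#include <vector>

// Sketch of hoisting the dense branch out of a general push().
struct DenseArray {
    std::vector<int> storage;   // stands in for the dense backing store
    bool dense = true;

    // Fast path: callers such as newArray(argc) already know the array
    // is dense, so they can skip the representation test entirely.
    void pushDense(int v) {
        assert(dense);          // the invariant the reviewer asks us to assert
        storage.push_back(v);
    }

    // General path: tests the representation on every call.
    void push(int v) {
        if (dense)
            pushDense(v);
        // else: sparse/hashtable path elided
    }
};
```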
Attachment #442394 - Flags: review?(edwsmith) → review-
Comment 11•15 years ago
Simplifying to just make a call to increment the ref count barely moves the needle. We need to avoid or drastically reduce the # of calls to privateWriteBarrierRC altogether to make a difference (if we didn't have DRC we could easily make these plain stores for the init case).
Also, nogc gives a 10% speedup; oddly, the profile switches to destroyInstance being the heaviest function. Ha, nogc doesn't disable DRC.
Comment 12•15 years ago
(In reply to comment #4)
> Got pacman fever and started looking at this. With a cache of 1 approach in
> AvmCore::doubleToString I get a %100 speedup. The benchmark converts 8000
> doubles to strings but each one is done like 30 times in a row before moving on
> the next one. After that fix its all memory management:
Is a cache the only option here? If we can find out why we're calling doubleToString in the first place and avoid it, even better. Felix has been looking at several similar situations where we can be using a faster access path.
On the cache: is the repeated access pattern due to poor internal code structure or due to unique behavior in the benchmark? If the latter, then I'd want to see more comprehensive results before signing off on the benefit of the cache.
Generally: post results on the whole suite when possible, even though one test in particular caused the investigation. Or at least run the whole suite and post data on any interesting outliers.
Comment 13•15 years ago
(In reply to comment #12)
> Is a cache the only option here? if we can find out why we're calling
> doubleToString in the first place and avoid it, even better. Felix has been
> looking at several similar sitations where we can be using a faster access
> path.
Evaluating the expression 'strconst + key + strconst', where key is the result of Math.random2; see Dan's first comment for the real code. The expression is evaluated exactly 32 times for each key value (a 5-deep binary tree).
> on the cache: is the repeated access pattern due to poor internal code
> structure or due to unique behavior in the benchmark? if the latter, then i'd
> want to see more comprehensive results before signing off on the benefit of the
> cache.
The cache was mainly meant to show that there's upside in optimizing number-to-String conversion (and maybe String allocation/concatenation too). Maintaining a (size-limited) general double-to-String cache did occur to me, but I thought that should be a last resort after optimizing the existing code.
Comment 14•15 years ago
This one takes the init down to the level of the alloc and uses the optimized _ctor Atom write barriers. So in addition to skipping the dense check we also skip the capacity checking (which was happening on each push, even in AS3_push). We could go further and avoid the actual WB and do some more inlining but this is a good start.
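"Taking the init down to the level of the alloc" amounts to reserving the final capacity once and then doing straight stores, rather than letting each push re-check capacity. A sketch under those assumptions, with std::vector standing in for the array's dense backing store:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: build a dense array literal of known length argc with a single
// capacity reservation. After reserve(argc), no push_back below can
// reallocate, mirroring the patch's "skip per-push capacity growth" effect.
std::vector<int> newArrayBulk(const int* args, size_t argc) {
    std::vector<int> dense;
    dense.reserve(argc);             // one capacity check for the whole literal
    for (size_t i = 0; i < argc; ++i)
        dense.push_back(args[i]);    // guaranteed not to reallocate
    return dense;
}
```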
Attachment #442394 - Attachment is obsolete: true
Attachment #442545 - Flags: review?(edwsmith)
Comment 15•15 years ago
Comment on attachment 442545 [details] [diff] [review]
take it to the limit (almost)
Changing review request to feedback. Ran the complete benchmark suite: mostly noise, a 13% speedup on splay but a consistent 7% slowdown on jsbench/Moldyn.as that I need to investigate.
Attachment #442545 - Flags: review?(edwsmith) → feedback?(edwsmith)
Comment 16•15 years ago
Perf anomalies figured out. AtomArray needs the inline cleanup treatment.
Updated•15 years ago
Attachment #442545 - Flags: feedback?(edwsmith)
Comment 17•15 years ago
Some appreciable speedups, and libavmplus.a and the avmshell release build are smaller; win-win.
Comment 18•15 years ago
(In reply to comment #16)
> perf anomalies figured out. AtomArray needs the inline cleanup treatment.
Nice. Long overdue.
Comment 19•15 years ago
Will be touching this file too, so give it the inline scrub.
Attachment #442743 - Flags: review?(edwsmith)
Updated•15 years ago
Attachment #442702 - Flags: review?(edwsmith)
Comment 20•15 years ago
Comment on attachment 442743 [details] [diff] [review]
ditto for ArrayObject
Withdrawing ArrayObject; I wrongly assumed trying to REALLY_INLINE virtual methods was silly, but it's not.
Attachment #442743 - Flags: review?(edwsmith)
Comment 21•15 years ago
Now virtuals are decl'd REALLY_INLINE too.
Attachment #442743 - Attachment is obsolete: true
Attachment #442787 - Flags: superreview?(edwsmith)
Attachment #442787 - Flags: review?(stejohns)
Updated•15 years ago
Attachment #442787 - Flags: review?(stejohns) → review+
Comment 22•15 years ago
Comment on attachment 442787 [details] [diff] [review]
inline cleanup for ArrayObject
r+ is conditional on our having done a reality check of code size on a credible number of compilers and platforms.
Comment 23•15 years ago
Code size and perf are good on mac32. Haven't looked elsewhere yet. Will verify msvc x86/arm as well.
Comment 24•15 years ago
It'll probably be a couple of days before I get back to this, so anyone not stuck in argo, feel free to jump in.
Updated•15 years ago
Attachment #442787 - Flags: superreview?(edwsmith) → superreview+
Comment 25•15 years ago
Comment on attachment 442702 [details] [diff] [review]
give AtomArray the inline treatment
AtomArray.cpp:96: push(..) got changed to setAtInternal(...); is that intentional, or did it accidentally slip into the -inline patch? From what I can tell it's not broken, but I'd expect the patch to just move code around, without changing it.
Attachment #442702 - Flags: review?(edwsmith) → review+
Comment 26•15 years ago
An intentional change unintentionally included in this patch; I'll remove it and put it back in the final patch, which will contain just the optimizations.
Comment 27•15 years ago
tamarin-redux: changeset: 4662:5e347a983982
Will report back with size report link for delta between 4662/4661 and then submit ArrayObject one (so we see results for each change).
Comment 28•15 years ago
Apparently our size-reporting page is broken, but Dan tells me the AtomArray change resulted in no change for msvc (well, builtin supposedly got 576 bytes bigger and crt got 600 bytes smaller; apparently the sizereport script ain't perfect). So gcc and msvc show no change.
Comment 29•15 years ago
Doing this skips extra unnecessary checkCapacity checks.
Attachment #444807 - Flags: review?(edwsmith)
Comment 30•15 years ago
With all 4 patches we get a 20% speedup in v8.5/optimzed/splay.abc:
~4% due to the first two inlining patches
~2% due to skipping extra checkCapacity calls
the rest due to this bulk push patch
Attachment #444808 - Flags: review?(edwsmith)
Updated•15 years ago
Attachment #444807 - Attachment is patch: true
Attachment #444807 - Attachment mime type: application/octet-stream → text/plain
Updated•15 years ago
Attachment #444807 - Flags: review?(edwsmith) → review?(lhansen)
Updated•15 years ago
Attachment #444808 - Flags: review?(edwsmith) → review?(lhansen)
Comment 31•15 years ago
ArrayObject inlining: tamarin-redux - changeset - 4664:8c63d785aef8
Updated•15 years ago
Attachment #444807 - Flags: review?(lhansen) → review+
Comment 32•15 years ago
Comment on attachment 444808 [details] [diff] [review]
do a bulk push to dense array instead of one at a time
Nice.
Attachment #444808 - Flags: review?(lhansen) → review+
Comment 33•15 years ago
http://hg.mozilla.org/tamarin-redux/rev/320898401b2e
http://hg.mozilla.org/tamarin-redux/rev/795528944a58
Assignee: treilly → nobody
Comment 34•15 years ago
Relinquishing bug; the bug is still valid, since I focused on the optimized test case and not the pure JavaScript test case, which is pounding on convertDoubleToString. Either a faster convertDoubleToString or some kind of cache is required to optimize that.
Updated•15 years ago
Status: ASSIGNED → NEW
Comment 35•14 years ago
I rediscovered this problem yesterday when profiling typed\splay.as. Adding a cache for doubleToString like Tommy tried sped up the test case by 5x on Win32+Xeon but is probably not a valid solution. The BigInteger/D2A code is very slow.
Comment 36•14 years ago
FWIW, Splay was updated in v6 as follows:
Changed the Splay benchmark to avoid converting the same numeric key to a string over and over again and to avoid inserting and removing the same element repeatedly thus increasing pressure on the memory subsystem.
http://v8.googlecode.com/svn/data/benchmarks/v6/README.txt
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX