Bug 708614
Opened 13 years ago
Closed 13 years ago
RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo
Categories
(Tamarin Graveyard :: Virtual Machine, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
Q2 12 - Cyril
People
(Reporter: lhansen, Assigned: lhansen)
References
Details
Attachments
(6 files, 2 obsolete files)
Identical bytecode, compiled with the new ASC without -abcfuture and with the -swf switch without version=16. Results are in frames per second (higher is better):

                  FRR   FR-FLOAT
RayTrace-Num       24         19
RayTrace-Num-V     36         25

In theory this kind of regression should not happen; indeed, since FR-FLOAT has vector optimizations, it should be faster than FRR. Another curious thing is that in FR-FLOAT, RayTrace-float-V (the float code corresponding to RayTrace-Num-V) easily runs at 36 FPS.
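For scale, the frame rates reported here can be turned into a relative slowdown. The following is a minimal Python sketch; the helper name `relative_slowdown` is an assumption made for illustration and is not part of the test programs.

```python
def relative_slowdown(baseline_fps, observed_fps):
    """Fraction by which observed_fps falls short of baseline_fps."""
    return (baseline_fps - observed_fps) / baseline_fps


# (FRR, FR-FLOAT) frame rates as reported in this bug.
results = {"RayTrace-Num": (24, 19), "RayTrace-Num-V": (36, 25)}

for name, (frr, fr_float) in results.items():
    slowdown = relative_slowdown(frr, fr_float)
    print(f"{name}: FR-FLOAT is {slowdown:.1%} slower than FRR")
```

With the reported numbers this comes out to roughly a fifth for RayTrace-Num and closer to a third for RayTrace-Num-V.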
Assignee
Comment 1•13 years ago
Rebuilding the tests with an older ASC against older playerglobal/builtin files does not help. Definitely looks like a VM problem, not a compiler problem (but I need to look at the dumps to be sure).
Assignee
Comment 2•13 years ago
Machine is 2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB.
Assignee
Updated•13 years ago
Assignee: nobody → lhansen
Assignee
Comment 3•13 years ago
(Very preliminary.) Could look like there's a bug with relational operators. In this code:

    dist = closestApproach - Math.sqrt(halfCord2);
    if( dist < closestIntersectionDist )

we're looking at this kind of output:

    subd4 = subd addd15, Math_sqrt3 (in reg? -1)
    ...
    intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )
    compare1 = calli.all #compare ( intToAtom1 ldi79 ) (in reg? -1)

but dist, closestApproach, closestIntersectionDist, and the output of Math.sqrt are all Number. More investigation needed; just posting this in case it triggers memories (I could be looking at the wrong code).
Assignee
Comment 4•13 years ago
Assignee
Comment 5•13 years ago
The issue in comment #3 turned out to be a bug in the test program: an untyped variable is used as a loop limit (in all the programs). With that fixed, there's a speedup across the board, but the speedup is greater on FRR (36->47 FPS for Num-V) than on FR-FLOAT (25->29 FPS). So there's clearly a performance drain here somewhere. (The float cases speed up quite a bit too, and for the float4 program we're up to 57 fps, from below 50 before.) The untyped variable was there in the original program. This illustrates that the compiler MUST have some sort of "-fast" mode where it warns about untyped variables, Array, and so on.
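As a rough illustration of the kind of check such a "-fast" mode might perform, here is a minimal Python sketch that flags `var` declarations with no concrete type annotation in ActionScript source. This is not ASC functionality: the function name, the regular expression, and the sample source are assumptions made for the example, and a real check would work on the parsed program rather than on raw text.

```python
import re

# `var name` optionally followed by `: Type`. Only the first name of a
# multi-variable declaration is inspected, and comments are not skipped.
VAR_DECL = re.compile(r"\bvar\s+([A-Za-z_$][\w$]*)\s*(?::\s*([\w.$<>*]+))?")


def find_untyped_vars(source):
    """Return (line_number, name) for each `var` lacking a concrete type."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in VAR_DECL.finditer(line):
            name, type_name = match.group(1), match.group(2)
            # A missing annotation and an explicit `*` are both untyped.
            if type_name is None or type_name == "*":
                findings.append((lineno, name))
    return findings


# Hypothetical source resembling the bug: an untyped loop limit.
SAMPLE = """\
var numObjects = 8;
var dist:Number = 0;
for (var i:int = 0; i < numObjects; i++) { }
"""

for lineno, name in find_untyped_vars(SAMPLE):
    print(f"line {lineno}: variable '{name}' has no type annotation")
```

On the sample, only `numObjects` is reported; `dist` and `i` carry annotations and pass.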
Assignee
Comment 6•13 years ago
Attachment #580050 -
Attachment is obsolete: true
Assignee
Comment 7•13 years ago
Not really all that relevant, but worth keeping in mind: The performance difference is not observable on my Mac Pro / 2 x 2.93GHz Quad-Core Xeon / 16 GB. On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches 58fps, on both FRR and FR-FLOAT. The Num-V results match the float-V results on FR-FLOAT. Also on this machine the float4-V code is *much* slower than the float-V code (43 fps vs 58 fps), but that will be the subject of another bug.
Assignee
Comment 8•13 years ago
In another experiment, the iterated Num-V kernel (the attached test case) has the same performance in tamarin-redux and tr-float, on both machines. As FR-FLOAT is derived from FRR, there should be limited room for differences between the two, apart from the VM and the build settings.
Assignee
Comment 9•13 years ago
(In reply to Lars T Hansen from comment #7)
> The performance difference is not observable on my Mac Pro / 2 x 2.93GHz
> Quad-Core Xeon / 16 GB.
>
> On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches
> 58fps, on both FRR and FR-FLOAT. The Num-V results match the float-V
> results on FR-FLOAT.
>
> Also on this machine the float4-V code is *much* slower than the float-V
> code (43 fps vs 58 fps), but that will be the subject of another bug.

NOTE: Those were 64-bit runs (because the "Open in 32-bit" setting does not follow the app when it's moved but is stored in the Finder somehow). With 32-bit, the float4-V code is on par with the float-V code on Xeon (whew). The RayTracer-Num and RayTracer-Num-V results are the same for the two players on this platform for 32-bit runs as well, but the RayTracer-Num results are better (41 fps). Of course it remains a mystery why it would be 10fps slower on a 64-bit build... The others all peak at 58fps. I don't know what the maximum frame rate supported by the player is; I compiled the SWF for 100fps.
Assignee
Updated•13 years ago
Summary: RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux → RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo
Assignee
Comment 10•13 years ago
Apart from constants and addresses the machine code generated for the kernel program is identical on the MacBook Pro and the Mac Pro (Debug-Debugger build with -Dnodebugger -Dverbose=jit, textual comparison of the output after replacing all hex constants).
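A textual comparison of that kind can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: it assumes that constants and addresses appear in the JIT output as `0x`-prefixed hex literals, and the function names and the second sample address are made up for the example.

```python
import difflib
import re

# Assumption: constants/addresses are written as 0x-prefixed hex literals.
HEX_CONSTANT = re.compile(r"0x[0-9a-fA-F]+")


def normalize(listing):
    """Replace every 0x-prefixed constant with a fixed placeholder."""
    return [HEX_CONSTANT.sub("0xNNNN", line.rstrip())
            for line in listing.splitlines()]


def listings_differ(a, b):
    """Return unified-diff lines between two normalized listings."""
    return list(difflib.unified_diff(normalize(a), normalize(b), lineterm=""))


# First line is taken from comment 3; the second address is invented.
macbook = "intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )\n"
macpro = "intToAtom1 = calli.none #intToAtom ( immi1/*0x2a44010 core*/ ldi77 )\n"

diff = listings_differ(macbook, macpro)
print("identical after normalization" if not diff else "\n".join(diff))
```

An empty diff means the two listings differ only in the constants that were masked out.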
Assignee
Comment 11•13 years ago
There are quite a few differences in the code generated for the kernel by tamarin-redux and tr-float. The tr code consistently has more restores to merge edges before the calls to the vector getters and also has all the finddefs intact, of course. There are many other small differences, e.g. in the registers used. But I sense I won't find the answer this way.
Assignee
Comment 12•13 years ago
The things I know it *not* to be:

- computing and setting the RGB values
- setPixel
- frame rate interaction (tried compiling with various frame rates)
- square root (replaced with AS3 version, they still differ)

Optimization settings in the projects look to be the same, and the compilers are the same (GCC 4.2).

Once we've removed the square root calls from the program there isn't much else going on that isn't the JIT's doing (one finddef call for Number.POSITIVE_INFINITY, which ASC does not constant-fold because I'm compiling without -abcfuture), and as previously observed the JIT's code looks pretty much the same for the two runtimes.
Assignee
Comment 13•13 years ago
(Of course that instruction dump is from the shell, where no difference is observed, so take it for what it's worth.)
Assignee
Comment 14•13 years ago
Release configs are not the same.
Assignee
Comment 15•13 years ago
Changing the release config appears to remove the performance regression. This is weird, but there you have it. Will investigate a little further but I consider this issue likely closed.
Assignee
Comment 16•13 years ago
With that fix in place:

2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB, fps (higher is better):

            FR-FLOAT   FRR
Num               30    30
Num-V             46    45
float-V           52
float4-V          57

Methodology: start up, run 20 seconds, record frame rate. Verified that all were set to "Open as 32 bit".
Assignee
Comment 17•13 years ago
I should mention: all compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16. Here are the numbers when those programs are compiled with those switches (the resulting code only runs on FR-FLOAT):

            FR-FLOAT
Num          2 [sic]
Num-V             46

The effectively untyped nature of Num (through its use of Array) and the poor code generation for untyped numeric code in FR-FLOAT combine to make the performance very bad. Notably here, "old content" (compiled without -abcfuture) does not suffer a performance regression, as per our design.
Assignee
Comment 18•13 years ago
Assignee
Comment 19•13 years ago
Assignee
Comment 20•13 years ago
Assignee
Comment 21•13 years ago
Assignee
Comment 22•13 years ago
BTW, there are some "ASC bug" annotations in RayTracer-float4-V.as that can probably be removed now.
Assignee
Comment 23•13 years ago
Updated: Removed the now-redundant float4() casts, retested. Performance is still fine.
Attachment #580377 -
Attachment is obsolete: true
Assignee
Comment 24•13 years ago
2009 Mac Pro / 2xQuad-core Xeon / 2.93GHz / 16GB, fps (higher is better):

            FR-FLOAT   FRR
Num               42    42
Num-V             58    58
float-V           58
float4-V          58

Methodology: start up, run 20 seconds, record frame rate. Verified that all were set to "Open as 32 bit". All compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

It's remarkable that they're all approaching 60fps, slowly slowly. I disabled framerate limiting but I don't know for a fact that that works properly in standalone builds. It may be that the thing to do here is to run with larger canvas sizes or many more objects in the scene, on such a fast machine (or indeed on the slower machine too).
Assignee
Updated•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 25•13 years ago
Another methodology might be to change the test to calibrate its own workload to achieve 30fps, with the frame rate still set to 60 or 100fps. As for screen size vs # of objects, I think # of objects is a more interesting metric.
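A self-calibrating test along those lines could look like the following Python sketch. It is a hypothetical illustration rather than anything from the RayTracer programs: `frame_time_for` stands in for "render one frame with n objects and time it", the cost model in the usage example is invented, and frame time is assumed to be non-decreasing in the number of objects.

```python
def calibrate_workload(frame_time_for, target_fps=30.0, start=1,
                       max_objects=100_000):
    """Find the largest object count whose frame time still meets target_fps.

    frame_time_for(n) returns the seconds needed to render one frame with
    n objects (assumed non-decreasing in n).
    """
    budget = 1.0 / target_fps
    if frame_time_for(start) > budget:
        return start  # even the smallest scene misses the target
    lo = hi = start
    # Double the workload until it no longer fits, or the cap is reached.
    while hi < max_objects and frame_time_for(hi) <= budget:
        lo, hi = hi, min(hi * 2, max_objects)
    if frame_time_for(hi) <= budget:
        return hi  # the cap itself still fits in the budget
    # Binary search: lo always fits, hi never does.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if frame_time_for(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def simulated_frame_time(n):
    """Made-up cost model: fixed overhead plus a per-object cost."""
    return 0.001 + 0.0004 * n


print("objects sustained at 30 fps:", calibrate_workload(simulated_frame_time))
```

The reported score is then the number of objects sustained at the target rate, which keeps the result meaningful even when the player caps the frame rate.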